I wrote a program to scrape a website, and because there are so many links to crawl I use Python multiprocessing. When my program starts everything is fine and my exceptions get logged properly, but after 2-3 hours, 2-3 or even all 4 of my child processes sit at 0% CPU. Since I am not using async, the last line of my program, which logs the "Done!" string, is never reached! In the pool's target function I wrapped all of the code in a try/except so my processes would not crash, and if one did crash I should see some output in the nohup.log file (I run this script in the background with nohup my_script.py!). I have no idea what is going on, and it is driving me crazy.
I searched around and someone suggested calling my_pool.close() after the pool statement (he said the child processes do not necessarily shut down after finishing their tasks), but that did not help either :( (the full shutdown pattern he meant is sketched below)
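For context, here is the full shutdown sequence that advice refers to, as a minimal standalone sketch (worker here is a placeholder, not my real target function); map() itself already blocks until every task has returned, so close() and join() mainly guarantee that the worker processes exit cleanly:

import multiprocessing

def worker(task):
    # placeholder for the real target function
    return task * 2

if __name__ == "__main__":
    my_pool = multiprocessing.Pool()
    results = my_pool.map(worker, range(10))  # blocks until all tasks return
    my_pool.close()  # no new tasks may be submitted after this
    my_pool.join()   # wait for every worker process to exit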
My code is around 200 lines, so I cannot post all of it here!
Here is a summary; if you need details about any part, let me know.
from bs4 import BeautifulSoup
import requests
import urllib.request
import multiprocessing
from orator import DatabaseManager
import os
from datetime import datetime
def login():
    requests_session = requests.session()
    login_page = requests_session.get("https://www.example.com/login")
    payload = {
        "username": "XX",
        "password": "X",
    }
    response = requests_session.post("https://www.example.com/auth/eb-login", data=payload, headers=dict(referer="https://www.example.com/login"))
    if response.status_code == 200:
        return requests_session
    else:
        return False
def media_crawler(url_article_id):
    try:
        url = url_article_id[0] + "/images-videos"
        article_id = url_article_id[1]
        requests_session = url_article_id[2]
        db = DatabaseManager(config)  # "config" is defined elsewhere in the full 200-line script
        page = requests_session.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        img_wrapper_list = soup.select("div.example")
        # Check if we are still logged in
        if soup.select_one("div.example").text.strip().lower() != "logout":
            # if we are not, we log in again
            current_session = login()
            # if the login failed, we log it and stop working on this URL
            if current_session == False:
                log = open("media.log", "a+")
                log.write(datetime.now().strftime('%H:%M:%S') + " We are not logged in and can not log in!: "
                          "\nArticle ID: " + str(article_id)
                          + "\n----------------------------\n"
                          )
                log.close()
                print("Error logged!")
                return
            # otherwise we keep the new session
            else:
                requests_session = current_session
        # go into every image wrapper and take all the images
        for img_wrapper in img_wrapper_list:
            if not img_wrapper.has_attr("data-jw"):
                img_source = img_wrapper.select_one("div.image-wrapper.mg > img")["src"]
                image_title = img_wrapper.select_one("div.image-wrapper.mg > img")["alt"]
                file_name_with_extension = img_source.split("/")[-1]
                file_name = file_name_with_extension.split(".")[0]
                file_extension = file_name_with_extension.split(".")[-1]
                try:
                    filename, headers = urllib.request.urlretrieve(img_source, "images/" + str(article_id) + "-" + file_name + "." + file_extension)
                    file_size = int(headers["Content-Length"]) / 1024
                    # Store the file in the database.
                    # If anything goes wrong while downloading or storing it,
                    # we log the error and delete the downloaded file (if it was downloaded).
                except Exception as e:
                    log = open("media.log", "a+")
                    log.write(datetime.now().strftime('%H:%M:%S') + " Problem in fetching media: \nURL: "
                              + img_source + "\nArticle ID: " + str(article_id) + "\n" + str(e)
                              + "\n----------------------------\n"
                              )
                    log.close()
                    print("Error logged!")
                    try:
                        os.remove("images/" + str(article_id) + "-" + file_name + "." + file_extension)
                    except:
                        pass
        # Update the article record so we know which articles' media we have downloaded
        try:
            db.table("articles").where('article_id', article_id).update(image_status=1)
        except Exception as e:
            log = open("media.log", "a+")
            log.write(datetime.now().strftime('%H:%M:%S') + " Problem in updating database record for: "
                      + "\nArticle ID: " + str(article_id) + "\n" + str(e)
                      + "\n----------------------------\n"
                      )
            log.close()
            print("Error logged!")
    # this is the try/except wrapper for the whole function
    except Exception as e:
        log = open("media.log", "a+")
        log.write(datetime.now().strftime('%H:%M:%S') + " Problem in this article media: \nURL: "
                  + "\nArticle ID: " + str(article_id) + "\n" + str(e)
                  + "\n----------------------------\n"
                  )
        log.close()
        print("Error logged!")
    db.disconnect()
db = DatabaseManager(config)
current_session = login()
if current_session:
    log = open("media.log", "w+")
    log.write("Start!\n")
    log.close()
    articles = db.table("articles").skip(0).take(1000).get()
    url_article_id_tuples_list = []
    for article in articles:
        temp = (article["article_link"], article["article_id"], current_session)
        url_article_id_tuples_list.append(temp)
    myPool = multiprocessing.Pool()
    myPool.map(media_crawler, url_article_id_tuples_list)
    myPool.close()
    log = open("media.log", "a+")
    log.write("\nDone!")
    log.close()
else:
    print("Can not login to the site!")
db.disconnect()
After 2-3 hours my processes crash (I think): their CPU usage drops to 0% and the last command,
log.write("\nDone!")
never executes. I do not think I am doing anything unusual, and I have no idea what is actually happening behind the scenes.
The only errors in my log file are connection-related, so here they are :(
Start!
03:20:31 Problem in this article media:
URL:
Article ID: 190830
'alt'
----------------------------
03:50:05 Problem in fetching media:
URL: https://cdn.example.com/30/91430-004-828719A3.jpg
Article ID: 188625
<urlopen error [Errno 104] Connection reset by peer>
----------------------------
06:15:44 Problem in fetching media:
URL: https://cdn.example.com/15/37715-004-AA71C615.jpg
Article ID: 241940
<urlopen error [Errno 104] Connection reset by peer>
----------------------------
06:23:07 Problem in this article media:
URL:
Article ID: 244457
HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: /biography/Dore-Schary/images-videos (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
----------------------------
06:25:14 Problem in this article media:
URL:
Article ID: 248185
('Connection aborted.', OSError("(104, 'ECONNRESET')"))
----------------------------
06:28:30 Problem in fetching media:
URL: https://cdn.example.com/89/77189-004-9D4A3E0B.jpg
Article ID: 244500
<urlopen error [Errno 104] Connection reset by peer>
----------------------------
06:39:29 Problem in fetching media:
URL: https://cdn.example.com/50/175050-004-8ACF8167.jpg
Article ID: 244763
Remote end closed connection without response
----------------------------
06:39:39 Problem in fetching media:
URL: https://cdn.example.com/34/201734-004-D8779144.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:39:49 Problem in fetching media:
URL: https://cdn.example.com/60/93460-004-B2993A85.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:39:59 Problem in fetching media:
URL: https://cdn.example.com/03/174803-004-DE7B5599.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:40:09 Problem in fetching media:
URL: https://cdn.example.com/81/188981-004-75AB37F3.jpg
Article ID: 244763
<urlopen error [Errno -2] Name or service not known>
----------------------------
06:42:42 Problem in this article media:
URL:
Article ID: 248524
HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: /topic/The-Yearling-novel-by-Rawlings/images-videos (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')")))
My stalled processes:
(they are not at exactly 0%, but no new media gets added as time goes on...)
xxxxx 26137 0.1 1.6 589696 134320 ? Sl May07 1:45 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx 26140 0.3 1.4 379392 120064 ? SN May07 4:52 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx 26141 0.5 1.4 380724 121172 ? S May07 8:55 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx 26142 0.7 1.5 382860 123112 ? S May07 10:37 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx 26143 0.4 1.4 379912 120380 ? S May07 6:15 /home/xxxx/anaconda3/envs/xxx/bin/python3.7 MediaCrawler.py
xxxxx 29324 0.0 0.0 21536 1032 pts/1 S+ 04:20 0:00 grep --color=auto MediaCrawler.py
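If it helps with diagnosing this, one thing I can try is making every process dump its stack on a signal, so whatever call is blocking shows up in nohup.log. A minimal sketch using the standard-library faulthandler module (I have not run this yet; it assumes a Unix host, which matches the ps output above, and pool workers forked from the parent inherit the handler as long as this runs at import time):

import faulthandler
import signal

# put this at the very top of MediaCrawler.py: on SIGUSR1, every process
# that has executed this line dumps the tracebacks of all its threads to
# stderr, which nohup redirects into nohup.log
faulthandler.register(signal.SIGUSR1)

Then kill -USR1 26140 (one of the near-0% CPU PIDs above) would show exactly where that worker is stuck.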
Best answer
Thanks for the comments.
I ran some experiments, and here is what I found:
As Sam Mason said, my request rate against the site was too high. I fixed it by waiting 1 second before every request, and after that the program ran to completion (a sketch of the change is below).
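Concretely, the change amounts to pacing every network call. A minimal sketch (throttled_get is a hypothetical helper name, and the explicit timeouts are an extra safeguard on top of the 1-second delay that actually fixed it; requests has no default timeout, and urllib.request.urlretrieve accepts no timeout argument at all, so socket.setdefaulttimeout() is the way to bound it):

import socket
import time

REQUEST_DELAY = 1  # seconds between requests; this pacing is the actual fix
socket.setdefaulttimeout(30)  # extra safeguard, not part of the original fix:
                              # bounds urlretrieve(), which has no timeout
                              # parameter, so a dead connection can no longer
                              # block a worker forever

def throttled_get(session, url):
    # hypothetical helper: wait, then fetch with an explicit timeout
    time.sleep(REQUEST_DELAY)
    return session.get(url, timeout=30)

Every requests_session.get(...) call in media_crawler then goes through throttled_get, and a time.sleep(REQUEST_DELAY) goes in front of the urlretrieve() download as well.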
The original question, "python - My child processes crash silently with no error message even though exceptions are handled", is on Stack Overflow: https://stackoverflow.com/questions/56036505/