Problem description
I am scraping dictionary data from the https://www.dictionary.com/ website. The purpose is to remove unwanted elements from the dictionary pages and save the pages offline for further processing. Because the web pages are somewhat unstructured, the elements that the code below tries to remove may or may not be present, and a missing element raises an exception (in snippet 2). Since the actual code removes many such elements, any of which may be present or absent, wrapping every such statement in try - except would drastically increase the number of lines of code.
I am therefore working on a workaround: a separate function that wraps the try - except (in snippet 3), an idea I got from a referenced solution. However, I cannot get snippet 3 to work, because a command such as soup.find_all('style') returns None, whereas it should return the list of all style tags, as in snippet 2. I cannot apply the referenced solution directly because I sometimes have to reach the element to remove indirectly, through its parent or sibling, as in soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent
Snippet 1 sets up the environment for running the code.
It would be great if you could suggest how to get snippet 3 working.
Snippet 1 (setting up the environment for executing the code):
import urllib.request
import requests
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}
folder = "dictionary_com"
Snippet 2 (working):
def makedefinition(url):
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False
    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})
    # there are many more elements to remove. mentioned only 2 for shortness
    remove = soup.find_all("style") # style tags
    remove.extend(safe_execute(soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent)) # related content in the page
    for x in remove: x.decompose()
    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
Snippet 3 (not working):
soup = None

def safe_execute(command):
    global soup
    try:
        print(soup) # correct soup is printed
        print(exec(command)) # this should print the list of style tags but printing None, and for related content this should throw some exception
        return exec(command) # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []

def makedefinition(url):
    global soup
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False
    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})
    # there are many more elements to remove. mentioned only 2 for shortness
    remove = safe_execute("soup.find_all('style')") # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent")) # related content in the page
    for x in remove: x.decompose()
    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
Recommended answer
In snippet 3 you use the exec builtin, which returns None regardless of what it does with its argument. For details, see this SO thread.
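The difference between the two builtins can be seen in a minimal sketch. A plain list stands in for the soup object here, and the names values, exec_result, and eval_result are made up for illustration:

```python
# exec() runs statements and always returns None;
# eval() evaluates a single expression and returns its value.
values = [3, 1, 2]  # stand-in data, not a real soup object

exec_result = exec("sorted(values)")  # the sorted list is computed, then discarded
eval_result = eval("sorted(values)")  # the sorted list is returned

print(exec_result)
print(eval_result)
```

The first print shows None even though the expression inside exec was evaluated without error, which matches the behaviour seen in snippet 3.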
Remedy:
Use exec to modify a variable and return that variable, instead of returning the output of exec itself.
def safe_execute(command):
    d = {}  # namespace in which the command is executed
    try:
        exec(command, d)
        return d['output']
    except Exception as exc:
        print(repr(exc))  # report the error raised by the command
        return []
Then call it like this:
remove = safe_execute("output = soup.find_all('style')")
Executing this code still does not yield the expected list. While debugging, however, print(soup) inside the try section prints the correct soup value, yet exec(command, d) raises NameError: name 'soup' is not defined. The reason is that exec(command, d) runs the command with d as its global namespace, and d does not contain soup.
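The namespace issue can be reproduced, and worked around, in a small sketch. It assumes the caller passes the needed names in explicitly: the namespace parameter is hypothetical and not part of the original answer, and a list stands in for soup.

```python
def safe_execute(command, namespace):
    """Run `command` with exec() and return the value it binds to `output`."""
    scope = dict(namespace)  # copy so the caller's dict is left untouched
    try:
        exec(command, scope)
        return scope["output"]
    except Exception as exc:
        print(repr(exc))
        return []

soup = ["<style>", "<style>"]  # stand-in for a BeautifulSoup object

# An empty namespace reproduces the NameError: `soup` is not visible inside exec.
missing = safe_execute("output = soup[:]", {})

# Supplying `soup` in the namespace lets the same command succeed.
found = safe_execute("output = soup[:]", {"soup": soup})
```

With the empty namespace the helper falls back to [], while the second call returns a copy of the stand-in list.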
This discrepancy was overcome by using eval() instead of exec(). The function is defined as:
def safe_execute(command):
    global soup
    try:
        output = eval(command)
        return(output)
    except Exception:
        return []
The calls look like:
remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
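As an alternative to passing code as strings, a similar effect can probably be achieved by passing a callable, which avoids eval() and the global lookup altogether. This is a different technique from the fix above. The sketch uses a hypothetical safe_call helper and a plain dict standing in for the parsed page:

```python
def safe_call(func):
    """Call func() and return its result, or an empty list if it raises."""
    try:
        return func()
    except Exception:
        return []

page = {"style": ["s1", "s2"], "h2": None}  # stand-in data, not a real soup

remove = safe_call(lambda: list(page["style"]))
# page["h2"] is None, so .parent raises AttributeError and [] is returned.
remove.extend(safe_call(lambda: page["h2"].parent))

print(remove)
```

With BeautifulSoup the same pattern would read safe_call(lambda: soup.find_all('style')). The lambda captures soup from the enclosing scope, so no global declaration is needed.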