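I am scraping dictionary data from https://www.dictionary.com/. The goal is to remove unwanted elements from the dictionary pages and save the pages offline for further processing. Because the pages are somewhat unstructured, the elements to remove that are mentioned in the code below may or may not be present, and a missing element raises an exception (snippet 2). The real code has many such elements that may or may not exist, so wrapping each of these statements in its own try-except would increase the number of lines drastically.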

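So I am working around this by creating a separate function for the try-except (in snippet 3), an idea I got from here. However, I cannot get the code in snippet 3 to work, because commands such as soup.find_all('style') return None, whereas they should return a list of all the style tags, as in snippet 2. I cannot apply the cited solution directly, because I sometimes have to reach the element I want to remove indirectly, through its parent or sibling (for example soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent).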

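Snippet 1 sets up the environment for running the code.

Any suggestions for getting snippet 3 to work would be much appreciated.

Snippet 1 (sets up the environment for running the code):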

import urllib.request
import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}

folder = "dictionary_com"

Snippet 2 (works):
def makedefinition(url):
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False

    soup = BeautifulSoup(r.text, 'lxml')

    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove. mentioned only 2 for shortness
    remove = soup.find_all("style") # style tags
    remove.extend(safe_execute(soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent)) # related content in the page

    for x in remove: x.decompose()

    return(soup)

# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/browse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)

with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))

Snippet 3 (does not work):
soup = None

def safe_execute(command):
    global soup
    try:
        print(soup) # correct soup is printed
        print(exec(command)) # this should print the list of style tags but printing None, and for related content this should throw some exception
        return exec(command) # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []

def makedefinition(url):
    global soup
    success = False
    while success==False:
        try:
            request=urllib.request.Request(url,headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success=True
        except:
            success=False

    soup = BeautifulSoup(r.text, 'lxml')

    soup = soup.find("section",{'class':'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove. mentioned only 2 for shortness
    remove = safe_execute("soup.find_all('style')") # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent")) # related content in the page

    for x in remove: x.decompose()

    return(soup)

# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/browse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)

with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))

Accepted answer

In snippet 3 you use the built-in exec, which returns None regardless of what the code passed to it does. See this SO thread for details.
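The difference can be checked in isolation. The following is a minimal sketch that uses a made-up list named `items` as a stand-in for the parsed page (it is not part of the original code); it shows that `exec` runs the code but discards the value, while `eval` returns the value of the expression.

```python
# Hypothetical stand-in for the parsed page; not part of the original code.
items = ["<style>a</style>", "<style>b</style>"]

# exec() runs the statement(s) and always returns None.
exec_result = exec("len(items)")

# eval() evaluates a single expression and returns its value.
eval_result = eval("len(items)")

print(exec_result)  # None
print(eval_result)  # 2
```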

Solution:

Use exec to assign to a variable and return that variable, instead of returning the output of exec itself.

import traceback

def safe_execute(command):
    d = {}
    try:
        exec(command, d)
        return d['output']
    except Exception:
        traceback.print_exc()  # print the traceback of the caught exception
        return []

Then call it like this:
remove = safe_execute("output = soup.find_all('style')")

Edit:

Running this code did not return the expected list either. While debugging, print(soup) inside the try block printed the correct value, yet exec(command, d) raised NameError: name 'soup' is not defined.
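A likely explanation for the NameError is that the dictionary passed as the second argument to exec becomes the global namespace for the executed code, so a fresh, empty dictionary hides the caller's `soup`. The sketch below illustrates this with a plain list standing in for the BeautifulSoup object; the helper names `run_isolated` and `run_with_names` are hypothetical and only serve the illustration.

```python
soup = ["style-1", "style-2"]  # hypothetical stand-in for the BeautifulSoup object

def run_isolated(command):
    """Execute command in an empty namespace; module-level names are not visible."""
    d = {}
    exec(command, d)
    return d["output"]

def run_with_names(command, names):
    """Execute command in a namespace pre-filled with the names it needs."""
    d = dict(names)
    exec(command, d)
    return d["output"]

try:
    run_isolated("output = list(soup)")
    isolated_error = None
except NameError as exc:
    isolated_error = str(exc)

shared_result = run_with_names("output = list(soup)", {"soup": soup})

print(isolated_error)  # name 'soup' is not defined
print(shared_result)   # ['style-1', 'style-2']
```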

This discrepancy was resolved by using eval() instead of exec(). The function is defined as:
def safe_execute(command):
    global soup
    try:
        output = eval(command)
        return(output)
    except Exception:
        return []

The calls look like this:
remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
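As a possible alternative that is not part of the original answer, the lookup can be passed as a zero-argument callable (for example a lambda) instead of a source string, which avoids eval and the global declaration. The sketch below is self-contained and uses a hypothetical `Node` class and `find` helper in place of BeautifulSoup; `safe_call` is likewise an illustrative name.

```python
def safe_call(func):
    """Call func() and return its result; return an empty list on any exception."""
    try:
        return func()
    except Exception:
        return []

class Node:
    """Hypothetical stand-in for a parsed element with an optional parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def find(nodes, name):
    """Return the first node with the given name, or None if there is none."""
    return next((n for n in nodes if n.name == name), None)

section = Node("section")
nodes = [Node("h2", parent=section)]

# Element present: the lambda returns a one-element list holding the parent.
found = safe_call(lambda: [find(nodes, "h2").parent])

# Element absent: find() returns None, so .parent raises AttributeError,
# which safe_call catches and turns into an empty list.
missing = safe_call(lambda: [find(nodes, "h3").parent])

print([n.name for n in found])  # ['section']
print(missing)                  # []
```

Under the same assumptions, the calls in the question would become something like remove.extend(safe_call(lambda: [soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent])), with the single parent element wrapped in a list so that extend adds the element itself.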

Regarding python-3.x: making a try-except workaround apply to many single-line statements by creating a separate function, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56916092/
