python - What is the fastest, most error-free way to extract and clean HTML body text in Python?

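I currently have two functions that extract the text of the HTML <body> in Python and return it as a bag of words. They produce equivalent output. I also strip out various tags that would otherwise give me garbage text (e.g. <script> code).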

import re
from bs4 import BeautifulSoup, SoupStrainer

def html_to_bow_bs(text):
    if text is None or len(text)==0:
        return []

    soup = BeautifulSoup(text, "lxml",parse_only=SoupStrainer('body'))

    # Remove all irrelevant tags
    for elem in soup.findAll(['script','style','a']):
        elem.extract()
    body_text = soup.findAll("body")
    if len(body_text) == 0:
        return []

    # Remove extra whitespace and non-ASCII characters.
    # On Python 3 get_text() already returns str, so no encode()/str() round trip is needed.
    the_text = body_text[0].get_text()
    the_text = the_text.strip()
    the_text = re.sub(r'[^\x00-\x7F]+',' ',the_text)
    return [w.lower() for w in the_text.split()]




def html_to_bow_bs_lxml(text):
    if text is None or len(text)==0:
        return []
    body_re = re.findall('<body(.*?)</body>', text, flags=re.DOTALL)
    if len(body_re) == 0:
        return []
    fragment = body_re[0]

    # Remove irrelevant tags
    fragment = re.sub(r'<script.*?</script>', ' ', fragment, flags=re.DOTALL)
    fragment = re.sub(r'<style.*?</style>', ' ', fragment, flags=re.DOTALL)
    text = "<body" + fragment + "</body>"
    soup = BeautifulSoup(text, "lxml")

    if soup is None:
        return []

    # Remove more irrelevant tags
    for elem in soup.findAll(['a']):
        elem.extract()

    # Remove extra whitespace and non-ASCII characters.
    # body_text is not defined in this function; the soup here holds only the rebuilt <body>.
    the_text = soup.get_text()
    the_text = the_text.strip()
    the_text = re.sub(r'[^\x00-\x7F]+',' ',the_text)
    return [w.lower() for w in the_text.split()]

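My main requirement is matching output: the set of words from html_to_bow_bs_lxml(text) must match the set from html_to_bow_bs(text). Currently both have comparable running times; 330 pages take about 20 seconds (slow!). If I remove the final soup.findAll(['a'])...extract() in the second function and replace it with a regex, I save 6 seconds. Replacing BeautifulSoup with lxml.etree saves another 10 seconds, bringing the total running time to about 3-4 seconds. However, when I substitute the regex, the output does not always match. When I replace BeautifulSoup, either the output does not match or my program crashes during processing because of malformed HTML. How can I increase speed while maintaining correctness?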
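I have seen various recommendations for extracting HTML with Python on Stack Overflow, but they date back several years (e.g. 2012). Understandably, these libraries have been updated many times since then.
(I have also tried pyquery, but it does not always extract the body correctly.)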

Best answer

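You have already done a lot to speed things up: a SoupStrainer and the lxml parser are usually the first things to try when optimizing parsing with BeautifulSoup.
Here are some improvements to this particular code.
Remove the body existence check: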

body_text = soup.findAll("body")
if len(body_text) == 0:
    return []

and use find() instead.
Replace if text is None or len(text)==0: with if not text:.
Strip via get_text(strip=True).
The improved code:

import re
from bs4 import BeautifulSoup, SoupStrainer

def html_to_bow_bs(text):
    if not text:
        return []

    soup = BeautifulSoup(text, "lxml", parse_only=SoupStrainer('body'))

    # Remove all irrelevant tags
    for elem in soup.find_all(['script','style','a']):
        elem.extract()

    body = soup.find("body")
    if not body:
        return []

    # On Python 3 get_text() already returns str; encoding to bytes would make re.sub() below fail
    the_text = body.get_text(strip=True)
    the_text = re.sub(r'[^\x00-\x7F]+', ' ', the_text)
    return [w.lower() for w in the_text.split()]
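
One thing worth checking before adopting get_text(strip=True): with the default empty separator it strips each text node and then joins the pieces with nothing in between, so words from adjacent tags can fuse and the bag of words may no longer match the original function. A minimal sketch of the difference, using the built-in html.parser backend on a made-up snippet (the sample markup is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Made-up sample markup, for illustration only.
html = "<html><body><p>Hello</p><p>world</p></body></html>"
body = BeautifulSoup(html, "html.parser").find("body")

# Default separator is "": the stripped strings are concatenated directly.
fused = body.get_text(strip=True)

# An explicit separator keeps text from adjacent tags apart.
spaced = body.get_text(" ", strip=True)

print(fused.split())   # ['Helloworld']
print(spaced.split())  # ['Hello', 'world']
```

If matching output matters more than the small saving, passing a separator (or keeping the plain get_text() followed by split()) is probably the safer choice.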

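These are only minor improvements, and I don't think they will change the overall performance. I would also look into:
Running the script via PyPy (beautifulsoup4 is compatible, but you won't be able to use the lxml parser; try html.parser or html5lib instead). You may gain a lot without modifying the code at all.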
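
Whichever variant is tried (another parser, PyPy, or a regex shortcut), the question's requirement is that the word set stays the same as the reference function's. A small harness along these lines could time each candidate and flag the pages whose output differs. This is only a sketch: compare_extractors and the toy extractors in the usage example are hypothetical names, not part of any library.

```python
import time


def compare_extractors(pages, reference, candidate):
    """Run two extractors over the same pages and collect mismatches.

    Returns (reference_seconds, candidate_seconds, mismatched_indexes).
    A candidate that raises on a page is counted as a mismatch.
    """
    start = time.perf_counter()
    expected = [set(reference(page)) for page in pages]
    reference_seconds = time.perf_counter() - start

    mismatches = []
    start = time.perf_counter()
    for index, page in enumerate(pages):
        try:
            actual = set(candidate(page))
        except Exception:
            # Malformed HTML may crash a stricter parser; record it and go on.
            actual = None
        if actual != expected[index]:
            mismatches.append(index)
    candidate_seconds = time.perf_counter() - start

    return reference_seconds, candidate_seconds, mismatches


# Usage with toy stand-ins for html_to_bow_bs and a faster variant.
def toy_reference(text):
    return text.lower().split()


def toy_candidate(text):
    return text.split()  # forgets to lowercase


pages = ["alpha beta", "Gamma delta"]
ref_s, cand_s, bad = compare_extractors(pages, toy_reference, toy_candidate)
print(bad)  # [1]
```

In practice the two arguments would be html_to_bow_bs and the variant under test, run over the same 330 pages, so that a speed-up is only accepted when the mismatch list is empty.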