This post looks at how to handle character encoding when parsing HTML with lxml.

Problem Description

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:

1.

<!DOCTYPE html>
<html lang='en'>
<head>
   <title>Unicode Chars: 은 —’</title>
   <meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
    <title>Unicode Chars: 은 —’</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title

The results are:

Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’

So, obviously an issue with sample 1 and the missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.

The lxml docs appear conflicted:

From here the example seems to suggest we should use UnicodeDammit to encode the markup as unicode.

import lxml.html
from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

But here it says:

[Y]ou will get errors when you try to [parse] HTML data in a Unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion in the lxml docs, my code is now:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title

I now get the following results:

Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
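
For reference, the failure is easy to reproduce in isolation: lxml refuses to parse an already-decoded (unicode) string that still carries an encoding declaration in its <?xml ...?> prolog. The following minimal sketch is added for illustration (it is not part of the original question) and uses a trimmed-down stand-in for sample 3:

from lxml.html import fromstring

# A decoded (unicode) string that still contains its XML declaration,
# standing in for sample 3.
unicode_markup = u'''<?xml version="1.0" encoding="utf-8"?>
<html><head><title>Unicode Chars</title></head><body></body></html>'''

try:
    fromstring(unicode_markup)
except ValueError as err:
    # "Unicode strings with encoding declaration are not supported. ..."
    print(err)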

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)

Recommended Answer

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()  # read raw bytes, no decoding
        doc = UnicodeDammit(content, is_html=True)  # detect the character encoding

    # pass the original bytes to lxml along with the detected encoding
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

Output

Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
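
The same approach also works for a single page held in memory rather than files on disk. Below is a rough sketch of that idea (an illustration, not part of the original answer); parse_html is a hypothetical helper, and raw_html is assumed to hold the undecoded bytes of the HTTP response, as in the question's original script:

from lxml import html
from bs4 import UnicodeDammit

def parse_html(raw_html):
    # raw_html must be bytes; UnicodeDammit inspects it to guess the encoding
    detected = UnicodeDammit(raw_html, is_html=True)
    # hand the original bytes to lxml together with the detected encoding
    parser = html.HTMLParser(encoding=detected.original_encoding)
    return html.document_fromstring(raw_html, parser=parser)

doc = parse_html(raw_html)  # raw_html: undecoded bytes of the page
title = doc.xpath('//title/text()')[0]
print(title)

Because lxml receives bytes plus an explicit encoding, the restriction on unicode strings with an encoding declaration never comes into play.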
