Unicode strings display incorrectly after parsing a file with lxml, although reading the same file directly works fine

Problem description

I'm attempting to use the lxml module to parse HTML files, but am struggling to get it to work with some UTF-8 encoded data. I'm using Python 2.7 on Windows. For example, consider a UTF-8 encoded file without a byte order mark that contains nothing but the text string Québec. If I just read the contents of the file using a regular file handle and decode the resulting string object, I get a length 6 unicode string that looks good when written back to a file. But if I parse the file with lxml, I get a length 7 unicode string that looks odd when written back to a file. Can someone explain what lxml is doing differently and how to get the original, pretty string?

For example:

import lxml.html as html
from lxml import etree

f = open("output.txt", "w")

# Read the raw bytes and decode them explicitly as UTF-8.
text = open("input.txt").read().decode("utf-8")
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

# Let lxml parse the same file and take the text of the first <p> element.
root = html.parse("input.txt")
text = root.xpath(".//p")[0].text.strip()
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

f.close()

This produces the following in output.txt:

String of type '<type 'unicode'>' with length 6: Québec
String of type '<type 'unicode'>' with length 7: QuÃ©bec

Edit

A partial workaround here seems to be to parse the file using:

etree.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

html.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

However, as far as I know, the base etree library lacks some convenience classes for things like selectors, so a solution that lets me use lxml.html without etree.HTMLParser() would still be useful.

Recommended answer

The function lxml.html.parse already uses an instance of lxml.html.HTMLParser, so there is little reason to avoid passing one explicitly:

html.parse("input.txt", html.HTMLParser(encoding="utf-8"))

This makes the parser handle the UTF-8 data correctly.
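As for why the string grows from 6 to 7 characters, the likely explanation is that the parser, given no encoding declaration, decoded each UTF-8 byte as a separate single-byte character. The sketch below reproduces that effect with plain string methods and no lxml at all. It assumes the fallback encoding was Latin-1 (ISO-8859-1), which is consistent with the length reported in the question but is not confirmed by it; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the length-7 string using only bytes/str methods.

original = u"Qu\u00e9bec"               # the intended 6-character text
raw_bytes = original.encode("utf-8")    # 7 bytes: the e-acute takes two bytes

# Assumption: each byte was treated as one Latin-1 character.
misdecoded = raw_bytes.decode("latin-1")

# Reverse the wrong step, then decode with the right codec.
repaired = misdecoded.encode("latin-1").decode("utf-8")

print(len(original), len(misdecoded), len(repaired))
```

Repairing the text after the fact works only if the wrong decoding was lossless, so telling the parser the encoding up front remains the more reliable fix.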

