Problem description
I'm trying to use an HTML scraper like the one provided here. It works fine for the example they provided. However, when I try it with my own webpage, I get this error: "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
I've tried googling but couldn't find a solution. I'd truly appreciate any help. I'd like to know if there's a way to copy it as HTML using Python.
Edit:
from lxml import html
import requests
page = requests.get('http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=PTEN&ln1=PTEN&start=130&end=140&coords=bp%3AAA&sn=&ss=&hn=&sh=&id=15#')
tree = html.fromstring(page.text)
Thank you.
Short answer: use page.content, not page.text.
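From http://lxml.de/parsing.html#python-unicode-strings :
"the parsers in lxml.etree can handle unicode strings straight away ... This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding"
From http://docs.python-requests.org/en/latest/user/quickstart/#response-content :
"Requests will automatically decode content from the server [as r.text]. ... You can also access the response body as bytes [as r.content]."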
So you see, both requests (through page.text) and lxml.etree want to decode the utf-8 bytes to unicode. But if we let page.text do the decoding, then the encoding declaration inside the XML file becomes a lie: the string has already been decoded, yet the declaration still claims a byte encoding. So use page.content, which does no decoding. That way lxml receives the raw, undecoded bytes and can apply the declared encoding itself.
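To make the difference concrete, here is a minimal offline sketch. It does not call the real site. The raw_bytes value is made-up sample data standing in for page.content, and decoded_text stands in for page.text. It tries both inputs with html.fromstring and records what happens.

```python
from lxml import html

# Hypothetical stand-in for page.content: XHTML bytes that start with an
# XML declaration naming their own encoding.
raw_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<html><head><title>Gene analysis</title></head>"
    b"<body><p>PTEN</p></body></html>"
)

# Stand-in for page.text: the same document, already decoded to str.
decoded_text = raw_bytes.decode("utf-8")

# Passing the decoded string is the case the error message complains about.
try:
    html.fromstring(decoded_text)
    text_error = None
except ValueError as exc:
    text_error = str(exc)

# Passing the undecoded bytes lets lxml deal with the declaration itself.
tree = html.fromstring(raw_bytes)
title = tree.xpath("//title/text()")
paragraphs = tree.xpath("//p/text()")

print("str input error:", text_error)
print("title:", title)
print("paragraphs:", paragraphs)
```

With requests, the same fix is to change the last line of the question's code to tree = html.fromstring(page.content).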