Problem description
I have a list of html pages which may contain certain encoded characters. Some examples are as below -
<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>
I would like to decode (unescape; I'm unsure of the correct terminology) these strings to -
<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
Note that the HTML pages are held as strings. Also, I do NOT want to use any external library such as BeautifulSoup or lxml; only native Python libraries are OK.
Edit -
The solution below isn't perfect. Unescaping with HTMLParser combined with urllib2 throws the following error in some cases:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)
You need to unescape the HTML entities and URL-unquote the percent-escapes.
The standard library has HTMLParser and urllib2 (both Python 2 module names) to help with those tasks.
import HTMLParser, urllib2
markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''
# URL-unquote first (%20 -> space), then unescape HTML entities (&#x40; -> @)
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)
Result:
<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
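HTMLParser and urllib2 exist under those names only in Python 2. As a minimal sketch of the same two steps on Python 3, assuming the input is already a text string: html.unescape and urllib.parse.unquote are the standard-library counterparts, while the helper name decode_markup is hypothetical and used only for illustration.

```python
# Python 3 sketch of the same two-step decode, standard library only.
from html import unescape          # counterpart of HTMLParser.HTMLParser().unescape
from urllib.parse import unquote   # counterpart of urllib2.unquote

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''


def decode_markup(text):
    """URL-unquote percent-escapes, then resolve HTML character references.

    Hypothetical helper name; the order mirrors the answer above.
    """
    return unescape(unquote(text))


result = decode_markup(markup)
for line in result.split("\n"):
    print(line)
```

Note that unquoting the whole page also rewrites any percent-escape that appears outside a URL, so this is best treated as a starting point rather than a general-purpose cleaner.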
Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has its charset set to cp-1252, so let's try decoding from that to Unicode:
import codecs
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
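To illustrate why an ASCII decode fails on byte 0x94 while a cp1252 decode succeeds, here is a small Python 3 sketch of the decode-on-input, encode-on-output round trip. The page bytes, the temporary file names and the variable names are all made up for the example; in cp1252 the bytes 0x93 and 0x94 are the left and right curly double quotes.

```python
import os
import tempfile
from html import unescape
from urllib.parse import unquote

# Hypothetical page content: 0x93 / 0x94 are curly quotes in cp1252.
raw = b'<p>\x93lad%20at%20maestro\x94 &amp; ada&#x40;graphics.maestro.com</p>'

# Decoding as ASCII fails on the first byte above 0x7F.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.html")    # stand-in for the input file
    dst = os.path.join(tmp, "out.html")   # stand-in for the output file
    with open(src, "wb") as f:
        f.write(raw)

    # Decode on input: bytes -> text using the page's declared charset.
    with open(src, encoding="cp1252") as fin:
        decoded = fin.read()

    result = unescape(unquote(decoded))

    # Encode on output: text -> bytes in the same charset.
    with open(dst, "w", encoding="cp1252") as fou:
        fou.write(result)

    with open(dst, "rb") as f:
        written = f.read()

print(ascii_failed)
print(result)
```

In Python 3 the built-in open accepts an encoding argument directly, so codecs.open is not needed for this.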
Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:
with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...
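As a rough Python 3 illustration of what the 'ignore' error handler does, the sketch below decodes a made-up byte string directly instead of reading a file; bytes outside the ASCII range are silently dropped, so information is lost.

```python
from html import unescape
from urllib.parse import unquote

# Hypothetical input: the cp1252 curly quotes (0x93, 0x94) are not ASCII.
raw = b'<em>\x93ada&#x40;graphics.maestro.com\x94</em>'

# 'ignore' discards every byte that cannot be decoded as ASCII.
decoded = raw.decode("ascii", "ignore")

result = unescape(unquote(decoded))
print(decoded)
print(result)
```

When reading from a file in Python 3, a comparable effect should be available with open(filename, encoding="ascii", errors="ignore").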