Decoding encoded strings in Python


Problem description

I have a list of html pages which may contain certain encoded characters. Some examples are as below -

<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>

I would like to decode (escape, I'm unsure of the current terminology) these strings to -

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>

Note, the HTML pages are in a string format. Also, I DO NOT want to use any external library like a BeautifulSoup or lxml, only native python libraries are ok.

Edit -

The below solution isn't perfect. HTML Parser unescaping with urllib2 throws a

UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)

error in some cases.

Solution

You need to unescape HTML entities, and URL-unquote.
The standard library has HTMLParser and urllib2 to help with those tasks.

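# Python 2 only: both modules were reorganized in Python 3.
import HTMLParser, urllib2

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

# Percent-decode first (%20 -> space), then resolve HTML entities (&#x40; -> @).
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)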

Result:

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
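
The code above targets Python 2. For anyone on Python 3, the same two steps can probably be sketched with html.unescape and urllib.parse.unquote from the standard library. The helper name decode_markup is made up for this illustration and is not part of the original answer.

```python
# Python 3 sketch of the same idea, standard library only.
from html import unescape
from urllib.parse import unquote

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''


def decode_markup(text):
    """Percent-decode first, then resolve HTML entities (hypothetical helper)."""
    return unescape(unquote(text))


result = decode_markup(markup)
for line in result.split("\n"):
    print(line)
```

Note that this unquotes the whole string, as the original answer does, so any literal %XX sequence in the page text is decoded as well, not only those inside href attributes.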

Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:

import codecs
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
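
On Python 3 the built-in open() accepts an encoding argument, so codecs.open is not required. A minimal sketch of the same read, decode, write round trip follows. The function name decode_html_file and its parameters are assumptions made for illustration, and cp1252 is only the default because that is what the sample file declares.

```python
# Python 3 sketch: decode a file on input, unescape, encode on output.
from html import unescape
from urllib.parse import unquote


def decode_html_file(src_path, dst_path, encoding="cp1252"):
    """Read src_path as `encoding`, undo URL and HTML escapes, write dst_path.

    Hypothetical helper; the encoding default is an assumption taken from
    the sample file discussed above.
    """
    # Decode bytes to text while reading.
    with open(src_path, encoding=encoding) as fin:
        decoded = fin.read()
    result = unescape(unquote(decoded))
    # Encode text back to bytes while writing.
    with open(dst_path, "w", encoding=encoding) as fou:
        fou.write(result)
    return result


# Example call, using the names from the answer above:
# decode_html_file(filename, '/output/file.html')
```

One thing to keep in mind: urllib.parse.unquote interprets percent-escapes as UTF-8 by default, so a non-UTF-8 sequence such as a lone %94 would come out as the replacement character unless a different encoding is passed to unquote.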

Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:

with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...
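
To make the trade-off concrete, here is a small Python 3 sketch of what happens to the 0x94 byte from the error message under each choice. The sample bytes are invented for illustration and are not taken from the uploaded file.

```python
# Made-up sample: ASCII markup plus two cp1252 curly-quote bytes (0x94).
raw = b'<em>ada&#x40;graphics.maestro.com</em> \x94quoted\x94'

# Strict ASCII decoding fails on 0x94, which is the error from the question.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops every byte outside the ASCII range.
ascii_only = raw.decode('ascii', 'ignore')

# Decoding as cp1252 keeps the character: 0x94 maps to U+201D,
# the right double quotation mark.
as_cp1252 = raw.decode('cp1252')

print(strict_failed)
print(ascii_only)
print(as_cp1252)
```

When reading from a file on Python 3, open(filename, encoding='ascii', errors='ignore') should give the same dropping behavior as the read-then-decode line above.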
