Problem description
I'm downloading and parsing a web page via a Python script. I need it to be encoded into 7-bit ASCII for further processing. I am using the requests library (http://docs.python-requests.org/en/master/) in a virtualenv based upon whatever Ubuntu 16.04 LTS has.
I would like the requests package, or some package, to handle the translation into ASCII, without requiring me to do further translation of encoded characters, because I know I am going to miss some characters. Details are as follows:
My current Python script, shown below, uses an encoding of ISO-8859-1 in an attempt to force the result data to be converted to 7-bit ASCII, with some partial success. But I have to set the result encoding and also encode the text when it comes out. That seems odd, and in fact, downright wrong. But even if I live with that, I have the main issue, which is as follows:
Even after the encoding, I see dashes encoded in what seems to be some non-ASCII character set. It is as if the dash characters slipped through the requests encoding. The script below hacks around this by searching for and replacing the multi-byte dash encoding with an ASCII dash character. This is not a big deal if it is one multi-byte character, but I suspect that there are other characters that will need to be translated in other web pages I wish to process. Do I simply need to use some encoding other than 'ISO-8859-1' with the requests object?
Here is my script (using Python 2.7.11 on Ubuntu 16.04 LTS on x86_64):
#!/usr/bin/env python
import sys
import os
import string
import re
import requests

url = "https://system76.com/laptops/kudu"
r = requests.get(url)
#
# Why do I have to BOTH set r.encoding AND call r.text.encode
# in order to avoid the errors?:
#
encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)
#
# Split the lines out, find the offending line,
# and translate the multi-byte characters:
#
lines = data.splitlines()
for line in lines:
    m = re.search(r'2.6 up to 3.5 GHz', line)
    if m:
        print "line: {}".format(line)
        m = re.search(r'\xe2\x80\x93', line)
        # The '-' in the next line is an ASCII dash character:
        fixed_line = re.sub(r'\xe2\x80\x93', '-', line)
        print "fixed_line {}".format(fixed_line)
Invoking simple_wget.py within the virtualenv shows:
theuser@thesystem:~$ simple_wget.py
line: <td>2.6 up to 3.5 GHz – 6 MB cache – 4 cores – 8 threads</td>
fixed_line <td>2.6 up to 3.5 GHz - 6 MB cache - 4 cores - 8 threads</td>
Passing that output through od -cb to see the octal values ("342 200 223") of the dash characters corresponding to the '\xe2\x80\x93' in the script above:
theuser@thesystem:~$ simple_wget.py | od -cb
0000000 l i n e :
154 151 156 145 072 040 040 040 040 040 040 011 011 011 011 011
0000020 < t d > 2 . 6 u p t o 3
011 074 164 144 076 062 056 066 040 165 160 040 164 157 040 063
0000040 . 5 G H z 342 200 223 6 M B
056 065 040 107 110 172 040 342 200 223 040 066 040 115 102 040
0000060 c a c h e 342 200 223 4 c o r e
143 141 143 150 145 040 342 200 223 040 064 040 143 157 162 145
0000100 s 342 200 223 8 t h r e a d s <
163 040 342 200 223 040 070 040 164 150 162 145 141 144 163 074
0000120 / t d >
f i x e d _ l i n e
057 164 144 076 012 146 151 170 145 144 137 154 151 156 145 040
0000140 < t d > 2 . 6 u p
011 011 011 011 011 011 074 164 144 076 062 056 066 040 165 160
0000160 t o 3 . 5 G H z - 6
040 164 157 040 063 056 065 040 107 110 172 040 055 040 066 040
0000200 M B c a c h e - 4 c o r
115 102 040 143 141 143 150 145 040 055 040 064 040 143 157 162
0000220 e s - 8 t h r e a d s < /
145 163 040 055 040 070 040 164 150 162 145 141 144 163 074 057
0000240 t d >
164 144 076 012
0000244
theuser@thesystem:~$
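(As a sanity check, not part of the script: those octal values correspond to the bytes in the escape sequence used by the regular expression above. In a Python 2.7 interpreter:)

>>> [oct(ord(c)) for c in '\xe2\x80\x93']
['0342', '0200', '0223']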
Things I have tried:

https://stackoverflow.com/a/19645137/257924 implies using an encoding of ascii, but it chokes inside the requests library. Changing the script to be:
#encoding = 'ISO-8859-1'
encoding = 'ascii' # try https://stackoverflow.com/a/19645137/257924
r.encoding = encoding
data = r.text.encode(encoding)
yields:
theuser@thesystem:~$ ./simple_wget
Traceback (most recent call last):
File "./simple_wget.py", line 18, in <module>
data = r.text.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10166-10168: ordinal not in range(128)
Changing the last line above to
data = r.text.encode(encoding, "ignore")
results in the dashes just being removed, not translated, which is not what I want.
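A quick interpreter check (using a fragment of the affected line as a sample) shows what the 'ignore' error handler does: any character with no ASCII equivalent is silently dropped, so the en dash vanishes instead of being translated:

>>> u'2.6 up to 3.5 GHz \u2013 6 MB cache'.encode('ascii', 'ignore')
'2.6 up to 3.5 GHz  6 MB cache'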
This also did not work at all:
encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)
charmap = {
    0x2014: u'-', # em dash
    0x201D: u'"', # comma quotation mark, double
    # etc.
}
data = data.translate(charmap)
because it gives this error:
Traceback (most recent call last):
File "./simple_wget.py", line 30, in <module>
data = tmp2.translate(charmap)
TypeError: expected a string or other character buffer object
which is, as far as I can understand from https://stackoverflow.com/a/10385520/257924, due to "data" not being a unicode string. A 256-character translation table is not going to do what I need anyhow. And besides that, it is overkill: something inside Python should translate these multi-byte characters without requiring hack code at my script level.
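For illustration, a sketch of why a 256-character table cannot help here: string.maketrans builds a byte-for-byte table, so at best each of the three bytes of the dash sequence gets replaced individually, which is not the single ASCII dash I want:

import string

# Hypothetical example: map each of the bytes 0xE2, 0x80, 0x93 to '-' individually
table = string.maketrans('\xe2\x80\x93', '---')
print '2.6 GHz \xe2\x80\x93 6 MB'.translate(table)   # prints: 2.6 GHz --- 6 MB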
By the way, I'm not interested in multi-lingual page translation. All pages translated are expected to be in US or British English.
Recommended answer
Python has everything you need to cleanly process non-ASCII characters... provided you declare the proper encoding. Your input file is UTF-8 encoded, not ISO-8859-1, because '\xe2\x80\x93' is the UTF-8 encoding of the EN DASH character, Unicode U+2013.
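This is easy to verify in a Python 2.7 interpreter:

>>> '\xe2\x80\x93'.decode('utf-8')
u'\u2013'
>>> import unicodedata
>>> unicodedata.name(u'\u2013')
'EN DASH'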
So you should:

Load the text from requests as a true unicode string:
url = "https://system76.com/laptops/kudu"
r = requests.get(url)
r.encoding = "UTF-8"
data = r.text # ok, data is a true unicode string
Translate the offending characters while the data is still a unicode string:
charmap = {
    0x2013: u'-', # en dash
    0x2014: u'-', # em dash
    0x201D: u'"', # right double quotation mark
    # etc.
}
data = data.translate(charmap)
It will work now, because the translate map is different for byte and unicode strings. For byte strings, the translation table must be a string of length 256, whereas for unicode strings it must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None (ref: Python Standard Library Reference Manual).
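A minimal illustration of the difference, using made-up sample strings:

>>> # unicode.translate accepts a dict of ordinals; unmapped characters pass through
>>> u'cache \u2013 4 cores'.translate({0x2013: u'-'})
u'cache - 4 cores'
>>> # str.translate on a byte string needs a 256-character table instead
>>> import string
>>> 'abc'.translate(string.maketrans('a', 'x'))
'xbc'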
Then you can safely encode the data to an ASCII byte string:
tdata = data.encode('ascii')
The above command will throw an exception if any untranslated non-ASCII characters remain in the data unicode string. You can see that as a help to be sure that everything has been successfully converted.
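Putting the pieces together, a minimal end-to-end sketch of the approach described above (Python 2.7, same URL as in the question; the exact set of charmap entries depends on the pages you process):

#!/usr/bin/env python
import requests

url = "https://system76.com/laptops/kudu"
r = requests.get(url)
r.encoding = "UTF-8"   # the page is UTF-8, not ISO-8859-1
data = r.text          # a true unicode string

# Map the offending Unicode code points to ASCII equivalents:
charmap = {
    0x2013: u'-',  # en dash
    0x2014: u'-',  # em dash
    0x201D: u'"',  # right double quotation mark
    # etc.
}
data = data.translate(charmap)

# Raises UnicodeEncodeError if any untranslated non-ASCII character remains:
tdata = data.encode('ascii')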