Problem description
I have an encoding problem in Python (IPython notebook). Problems of this kind are very common and look simple, but I still cannot fix this one.
I have a CSV file here; as you can see, it contains many '\xa0' and '\n' character sequences.
I used:

import io

with io.open(train_fname) as f:
    for line in f:
        # Replace every non-ASCII character with "?"
        line = line.encode("ascii", "replace")
But it does not work; I always get the following output.
I tried other methods like
line.replace(u"\xa0", " ")
That does not work either. I also tried opening this CSV file in my text editor, Sublime Text, with all kinds of encodings: windows-1252, utf-8, and others. Whichever one I pick, the editor still shows \xa0 when I view the file.
Does this mean that \xa0 is already written into this CSV file as literal text, so this is not a Python encoding problem? If so, why can't I simply remove it with the replace method? Which encoding does \xa0 indicate the file was written in? Does it mean the file is written in utf-8 but I am trying to open it as ascii or something else?
I searched many questions, but they don't seem to provide much help. Please ask me if my question is not clear. Thank you very much!
Recommended answer
The \xa0 that you see is a sequence of 4 characters: \, x, a, 0. All these characters are plain ASCII, so there is no character set problem here.
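One way to check this claim against a file is to look at its raw bytes. The following is a minimal sketch, assuming Python 3; `describe_nbsp` is a hypothetical helper name, and the sample bytes are made up for illustration rather than taken from the CSV file in the question.

```python
def describe_nbsp(data):
    """Count how a non-breaking space shows up in raw bytes (hypothetical helper)."""
    return {
        # Four ASCII bytes: backslash, x, a, 0 (escape sequence stored as text)
        "literal_escape": data.count(b"\\xa0"),
        # One byte with value 0xA0 (a real encoded non-breaking space)
        "raw_byte": data.count(b"\xa0"),
    }

# Made-up sample containing one of each form
sample = b"price\\xa0list,caf\xc2\xa0e"
print(describe_nbsp(sample))
```

For a real file, the bytes could be read with `open(path, "rb")` and passed to the same function. A non-zero `literal_escape` count means the backslash sequences are stored as text, so changing the encoding used to open the file will not make them disappear.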
Apparently, you are supposed to interpret these escape sequences yourself. Your idea of replacing them with a space is good, but you have to be careful about the backslash character: when it appears in a string literal, it has to be written \\. So try this:
line.replace("\\xa0", " ")
Or:
line.replace(r"\xa0", " ")
The r in front of the string makes it a raw string literal: every character is taken literally, even a backslash.
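The difference between the three spellings can be seen in a short sketch, assuming Python 3. The sample line and the helper name `replace_literal_nbsp` are made up for illustration. Note also that `str.replace` returns a new string, so the result has to be assigned; calling it without using the return value leaves `line` unchanged.

```python
# A line as it might appear in the file: backslash, x, a, 0 stored as four characters
line = "hello\\xa0world"

# "\xa0" in a normal literal is ONE character: a real non-breaking space
assert len("\xa0") == 1
# "\\xa0" and r"\xa0" are the same FOUR characters: \ x a 0
assert len("\\xa0") == 4
assert "\\xa0" == r"\xa0"

def replace_literal_nbsp(text):
    """Replace the four-character sequence \\xa0 with a plain space (hypothetical helper)."""
    return text.replace(r"\xa0", " ")

# Strings are immutable, so keep the returned value
cleaned = replace_literal_nbsp(line)
print(cleaned)
```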
Note that the data in the CSV file is full of inconsistencies. Examples:
- \n probably means a line break.
- \\n also appears, and probably also means a line break.
- \xa0 is a non-breaking space, encoded in ISO-8859-1.
- \xc2\xa0 is a non-breaking space, encoded in UTF-8.
- \\xc2\\xa0 also appears, with the same meaning.
- \\\\n also appears.
So to get meaningful content out of that file, you should repeatedly interpret the escape sequences until nothing changes. After that, try to interpret the resulting byte sequence as UTF-8. If that works, fine. If not, interpret it as code page 1252 (which is a superset of ISO-8859-1).
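The procedure described above could be sketched roughly as follows, assuming Python 3 and assuming the only escapes that matter are \xHH, \n, \t, \r and \\. The function names and the `max_rounds` limit are hypothetical choices for illustration, and the approach is a heuristic: any backslash that was genuinely meant as data would be consumed as well.

```python
import re

# Matches \xHH, \n, \t, \r and \\ written out as ASCII text
_ESCAPE = re.compile(rb"\\(x[0-9a-fA-F]{2}|n|t|r|\\)")
_SIMPLE = {b"n": b"\n", b"t": b"\t", b"r": b"\r", b"\\": b"\\"}

def _unescape_once(data):
    """Interpret one level of escape sequences in a bytes object."""
    def repl(match):
        body = match.group(1)
        if body[:1] == b"x":
            # \xHH becomes the single byte with that hex value
            return bytes([int(body[1:], 16)])
        return _SIMPLE[body]
    return _ESCAPE.sub(repl, data)

def unescape_fully(data, max_rounds=10):
    """Repeat the unescaping until nothing changes (or max_rounds is reached)."""
    for _ in range(max_rounds):
        new = _unescape_once(data)
        if new == data:
            break
        data = new
    return data

def decode_text(data):
    """Try UTF-8 first, then fall back to code page 1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence errors="replace"
        return data.decode("cp1252", errors="replace")

# Doubly escaped UTF-8 non-breaking space, as in the \\xc2\\xa0 example
raw = b"a\\\\xc2\\\\xa0b"
print(repr(decode_text(unescape_fully(raw))))
```

Each line of the file would be read in binary mode, passed through `unescape_fully`, and then through `decode_text`. After that, real non-breaking spaces can be replaced with `text.replace("\xa0", " ")`.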