Python: Got \xa0 instead of a space in CSV and cannot remove or convert it

Problem description

I have an encoding problem in Python (IPython notebook). This kind of problem is very common and looks simple, but I still cannot fix it.

I have a CSV file here. As you can see, it contains many '\xa0' and '\n' sequences.

I used:

import io

with io.open(train_fname) as f:
    for line in f:
        line = line.encode("ascii", "replace")

But it does not work; I always get the following output.

I tried other methods, such as

line.replace(u"\xa0", " ")

It does not work either. I also tried opening this CSV file in my text editor (Sublime Text) with all kinds of encodings: windows-1252, utf-8, and others. Whatever encoding I choose, I still see \xa0 in the editor when viewing the file.

Does this mean that \xa0 is already written in this CSV file as input text, and that this is not a Python encoding problem? If so, why can't I simply use the replace method to replace this string? Which encoding does the \xa0 indicate the file is using? Does it mean the file was written in UTF-8 but I am trying to open it as ASCII or something else?

I searched many questions, but they don't seem to provide much help. Please ask me if my question is not clear. Thank you very much!

Recommended answer

The \xa0 that you see is a sequence of 4 characters: \ x a 0. All of these characters are plain ASCII, so there is no character set problem here.
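To make the distinction concrete, here is a small sketch comparing the real no-break space character with the four-character text. The sample line is hypothetical and only stands in for a line read from the CSV file.

```python
# Two different strings that look alike in source code.
nbsp = u"\xa0"      # one character: U+00A0 NO-BREAK SPACE
literal = r"\xa0"   # four characters: backslash, x, a, 0

# Hypothetical sample line, standing in for a line read from the CSV file.
line = r"price\xa0100"

print(len(nbsp), len(literal))          # 1 4
print(nbsp in line)                     # False: no real no-break space here
print(literal in line)                  # True: the four-character text is there
print(line.replace(nbsp, " ") == line)  # True: nothing was replaced
```

This would also explain why encoding to ASCII changed nothing: under this reading, every character in the line is already ASCII.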

Apparently, you are supposed to interpret these escape sequences yourself. Your idea of replacing them with a space is good, but you have to be careful about the backslash character. When it appears in a string literal, it has to be written \\. So try this:

line.replace("\\xa0", " ")

Or:

line.replace(r"\xa0", " ")

The r in front of the string makes it a raw string literal: every character is taken literally, including the backslash, so no escape sequence is interpreted.
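As a minimal sketch of the two spellings in action (the sample line is hypothetical), note that str.replace returns a new string rather than modifying the original, so the result has to be assigned:

```python
# Hypothetical sample line containing the literal four-character text \xa0.
line = r"hello\xa0world"

# Both spellings denote the same four-character string.
assert "\\xa0" == r"\xa0"

# str.replace returns a new string; the original is left unchanged.
cleaned = line.replace(r"\xa0", " ")
print(cleaned)  # hello world
```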

Note that the data in the CSV file is full of inconsistencies. Examples:

\n probably means a linebreak.
\\n also appears, and it probably means a linebreak as well.
\xa0 is a nonbreaking space, encoded in ISO-8859-1.
\xc2\xa0 is a nonbreaking space, encoded in UTF-8.
\\xc2\\xa0 also appears, with the same meaning.
\\\\n also appears.

So to get meaningful content out of that file, you should repeatedly interpret the escape sequences until nothing changes. After that, try to interpret the resulting byte sequence as UTF-8. If that works, fine. If not, interpret it as code page 1252 (which is a superset of ISO-8859-1 for all printable characters).
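The procedure above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names (unescape_once, unescape_fully, decode_best_effort) are made up for this example, and only the escapes \\, \n, \t, \r and \xNN are handled, since those are the kinds listed above.

```python
import re

# Matches one escape sequence: \\, \n, \t, \r, or \xNN (two hex digits).
_ESCAPE = re.compile(rb"\\(x[0-9a-fA-F]{2}|[\\ntr])")
_SIMPLE = {b"\\": b"\\", b"n": b"\n", b"t": b"\t", b"r": b"\r"}


def unescape_once(data: bytes) -> bytes:
    """Interpret each escape sequence in data exactly once."""
    def repl(match):
        body = match.group(1)
        if body[:1] == b"x":
            # \xNN becomes the single byte with that hexadecimal value.
            return bytes([int(body[1:].decode("ascii"), 16)])
        return _SIMPLE[body]
    return _ESCAPE.sub(repl, data)


def unescape_fully(data: bytes) -> bytes:
    """Repeat unescape_once until nothing changes.

    Every replacement shortens the data, so the loop terminates.
    """
    while True:
        new = unescape_once(data)
        if new == data:
            return data
        data = new


def decode_best_effort(data: bytes) -> str:
    """Try UTF-8 first, then fall back to code page 1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252; substitute them.
        return data.decode("cp1252", errors="replace")


# Hypothetical sample mixing doubly and quadruply escaped sequences.
sample = rb"a\\xc2\\xa0b\\\\n"
text = decode_best_effort(unescape_fully(sample))
print(repr(text))
```

To apply this to a file, one option is to open it in binary mode and pass each line through these helpers. Be aware that any backslash that was meant literally in the data will also be interpreted, and that a line mixing UTF-8 and ISO-8859-1 sequences will fall back to code page 1252 as a whole.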
