python - Splitting \xef\xbb\xbf in a list read from a file

This question already has answers here:

Split function add: \xef\xbb\xbf…\n to my list

(3 answers)

Closed 5 years ago.

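I am trying to read a large data file, file.txt, and strip out all commas, periods, and similar punctuation, so I read the file in Python with the following code: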

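import re
from nltk.corpus import stopwords  # assumed: stopwords comes from NLTK

spanish_stopwords = set(stopwords.words('spanish'))  # build once, not per word
importantWords = []
f = open("file.txt", "r")
for i in f.readlines():
    line = i[:-1].split(" ")
    for word in line:
        # strip punctuation; the hyphen is escaped so it is not read as a range
        word = re.sub(r'[!@#$%^&*\-/,.;:]', '', word)
        word = word.lower()  # str.lower() returns a new string
        if word not in spanish_stopwords:
            importantWords.append(word)
f.close()
print importantWords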

It prints ['\xef\xbb\xbfdataText1', 'dataText2' .. 'dataTextn'].

How can I get rid of that \xef\xbb\xbf? I am using Python 2.7.

Best answer

That is the UTF-8 encoded BOM (byte order mark).

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
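
To see why those three bytes end up attached to the first word, here is a minimal sketch that checks a byte string for the BOM and strips it by hand. It is not part of the original answer: the helper name strip_utf8_bom and the sample data are made up for illustration.

```python
import codecs

def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 BOM (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

# Sample bytes standing in for the first line of a file saved "with BOM".
first_line = codecs.BOM_UTF8 + b"dataText1 dataText2"

# Splitting the raw line leaves the BOM glued to the first word;
# splitting the stripped line does not.
raw_words = first_line.split(b" ")
clean_words = strip_utf8_bom(first_line).split(b" ")
```

Stripping by hand is mainly useful when the bytes have already been read and the way the file is opened cannot be changed.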

You can use codecs.open with encoding='utf-8-sig' to skip the BOM sequence:

with codecs.open("file.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        ...
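
As a self-contained sketch of the difference, the following writes a small temporary file that starts with a BOM and reads it back with both codecs. The file contents and variable names are illustrative assumptions, not taken from the question's data.

```python
import codecs
import os
import tempfile

# Create a small sample file that starts with a UTF-8 BOM (illustrative data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + b"dataText1 dataText2\n")

# Plain "utf-8" keeps the BOM as the character U+FEFF on the first word.
with codecs.open(path, "r", encoding="utf-8") as f:
    with_bom = [line.rstrip("\n").split(" ") for line in f]

# "utf-8-sig" drops the BOM while decoding.
with codecs.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = [line.rstrip("\n").split(" ") for line in f]

os.remove(path)
```

Note that codecs.open returns decoded text (unicode objects in Python 2.7) rather than byte strings, so the words that come out are unicode as well.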

Side note: don't use file.readlines; just iterate over the file. If all you want is to iterate over the file, file.readlines creates an unnecessary temporary list.
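
The side note can be sketched as follows. Here io.StringIO stands in for a real open file, and the sample text is invented for the example.

```python
import io

# In-memory stand-in for an open file (illustrative data).
f = io.StringIO(u"uno dos\ntres cuatro\n")

words = []
for line in f:  # the file object yields one line at a time
    # split() with no argument splits on any whitespace,
    # which also discards the trailing newline.
    words.extend(line.split())
```

Only the current line is held in memory at any point, which matters for a large file like the one in the question.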

Regarding python - Splitting \xef\xbb\xbf in a list read from a file, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34304945/