Problem description
I am trying to parse a log file, but the file is always saved in "Unicode" format. My usual process that I would like to automate:
So this is the process I would like to automate in Python 3.4: pretty much just change the file to UTF-8, or use something like open(filename,'r',encoding='utf-8'), although that exact line threw this error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing as UTF-8, so that I don't have to call str.encode (or something like that) every time I analyze a string.
Has anybody been through this and knows which method I should use and how to do it?
In the Python 3 REPL, I did:
>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now the Python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regexes over this file, though, and they are not matching anything (I think because it isn't UTF-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom
Recommended answer
What Notepad calls Unicode is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why decoding the file as utf8 fails with the UnicodeDecodeError on byte 0xff at position 0 shown in the question.
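As a rough illustration of the BOM behavior described here, the sketch below builds the bytes of a small little-endian UTF-16 file in memory and decodes them three ways. The sample log line and the regex are made up for the example; they are not from the original log file. It also suggests why regexes find nothing when the file is opened as cp1252: every ASCII character ends up followed by a NUL character.

```python
import codecs
import re

# Hypothetical sample content; the real log file is not shown in the question.
text = "ERROR 42: disk full\n"

# Bytes as Notepad's "Unicode" format would store them: FF FE, then UTF-16-LE.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(raw[:2].hex())  # fffe

# 1) Decoding as UTF-8 fails on the very first byte (0xff).
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)
print(utf8_error)

# 2) Decoding as cp1252 "succeeds", but the text is interleaved with NULs,
#    so an ordinary pattern no longer matches.
as_cp1252 = raw.decode("cp1252")
cp1252_match = re.search(r"ERROR \d+", as_cp1252)
print(repr(as_cp1252[:8]), cp1252_match)

# 3) Decoding as UTF-16 consumes the BOM and yields the real text.
as_utf16 = raw.decode("utf-16")
utf16_match = re.search(r"ERROR \d+", as_utf16)
print(utf16_match.group())
```

Once the text is decoded with the right codec it is an ordinary Python str, so the regexes work on it directly without any further encode or decode calls.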
To convert to UTF-8, you could use:
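# Read the whole file, decoding it as UTF-16 (the BOM is consumed automatically).
with open('log.txt', encoding='utf16') as f:
    data = f.read()

# Write the same text back out encoded as UTF-8.
with open('utf8.txt', 'w', encoding='utf8') as f:
    f.write(data)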
Note that many Windows editors like a UTF-8 signature at the beginning of the file, and may assume ANSI otherwise. ANSI is really the code page of the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
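A minimal sketch of the conversion with an optional signature, assuming the input really is UTF-16 with a BOM. The helper name convert_utf16_to_utf8 and its with_signature parameter are made up for this example, and the demo writes its own small file into a temporary directory rather than touching a real log. Passing newline='' on both sides is a choice made here to keep the original line endings unchanged.

```python
import codecs
import os
import tempfile


def convert_utf16_to_utf8(src, dst, with_signature=False):
    """Copy src (UTF-16 with BOM) to dst as UTF-8, line by line.

    Hypothetical helper: with_signature=True writes the UTF-8 signature
    (EF BB BF) that some Windows editors look for.
    """
    out_encoding = "utf-8-sig" if with_signature else "utf-8"
    # newline="" disables newline translation so \r\n stays \r\n.
    with open(src, encoding="utf-16", newline="") as fin, \
            open(dst, "w", encoding=out_encoding, newline="") as fout:
        for line in fin:
            fout.write(line)


# Demo on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "log.txt")
    dst = os.path.join(tmp, "utf8.txt")

    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write("first line\r\nsecond line\r\n")

    convert_utf16_to_utf8(src, dst, with_signature=True)

    with open(dst, "rb") as f:
        converted = f.read()

print(converted.startswith(codecs.BOM_UTF8))  # True
```

When reading such a file back in Python, opening it with encoding='utf-8-sig' strips the signature if present and also works when it is absent.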