Problem description
I have a file containing Portuguese text in UTF-8. Whoever produced the file somehow selected the wrong encoding, and the text is full of mojibake:
IDENTIFICAÌàÌÄO instead of identificação
AndrÃ© instead of André
Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all the incorrect characters manually?
Recommended answer
"AndrÃ©" instead of "André" is what you get when UTF-8 encoded bytes are interpreted as Latin-1. You can fix it by reversing the encoding/decoding:
>>> 'AndrÃ©'.encode('latin-1').decode('utf-8')
'André'
All cases that follow this pattern can be fixed the same way.
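To apply the same round trip to a whole file, something like the following might work. This is only a sketch: the helper names `fix_mojibake` and `fix_file` are made up for illustration, and it assumes the file itself is valid UTF-8 in which some lines contain Latin-1 mojibake. Lines that do not fit the pattern are left untouched rather than raising an error.

```python
from pathlib import Path


def fix_mojibake(text, wrong_codec="latin-1"):
    """Undo one round of 'UTF-8 bytes decoded with wrong_codec'.

    Hypothetical helper: returns the text unchanged if it does not
    fit the pattern (for example, if it is already correct).
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_file(src, dst, wrong_codec="latin-1"):
    """Repair a file line by line, so one odd line does not block the rest."""
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    fixed = "".join(fix_mojibake(line, wrong_codec) for line in lines)
    Path(dst).write_text(fixed, encoding="utf-8")
```

Working line by line is a design choice, not a requirement: if a single line mixes correct and damaged text, that line stays as it was, so it is worth checking the output for leftovers.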
However, I can't explain the other case ("Ìà" for "ç" and "ÌÄ" for "ã"), so I can't offer a solution for it. The UTF-8 bytes of "ç" are C3 A7 and those of "ã" are C3 A3, so if you can find a codec in which "Ì", "à", and "Ä" are encoded as the bytes C3, A7, and A3, respectively, you can use that codec in place of Latin-1 to fix the text.
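One way to look for such a codec is to try every encoding that ships with Python and keep those that map each character to the expected byte. The sketch below assumes the codec, if it exists, is among Python's built-in encodings. The function name `find_codecs` is hypothetical, and the search may well return an empty list.

```python
from encodings.aliases import aliases


def find_codecs(expected):
    """Return names of built-in codecs that encode each character in
    `expected` (a dict of str -> bytes) to exactly the given bytes."""
    matches = []
    for codec in sorted(set(aliases.values())):
        try:
            if all(ch.encode(codec) == raw for ch, raw in expected.items()):
                matches.append(codec)
        except (LookupError, UnicodeError):
            # Codec is unavailable on this platform, is not a text
            # encoding, or cannot represent one of the characters.
            continue
    return matches


# The mapping described in the answer above:
candidates = find_codecs({"Ì": b"\xc3", "à": b"\xa7", "Ä": b"\xa3"})
```

If `candidates` is not empty, each name in it can be tried in place of `'latin-1'` in the round trip shown earlier.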