Unbaking mojibake: how to recover the original text after characters were decoded with the wrong codec.
Problem description
When characters have been decoded incorrectly, how can you identify likely candidates for the original string?
Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png
I know for a fact that this image filename should have contained Japanese characters. But despite various guesses involving urllib quoting/unquoting and encoding and decoding with iso8859-1 and utf8, I haven't been able to unmunge it and recover the original filename.
Is the corruption reversible?
Recommended answer
You could use chardet (install it with pip) to guess the encoding of the underlying bytes:
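# Python 2: a plain string literal is a byte string, which is what chardet.detect() expects.
# The literal must hold the original bytes, which presumably means the source file is saved (and declared) as cp850.
import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]
try:
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")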
Result: 時間試験観点(アニメパス)_10秒 (no idea whether this is correct or not)
For Python 3 (source file encoded as utf8):
import chardet

falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

# Undo the wrong decoding: re-encode with the codec that was wrongly applied (cp850 here).
try:
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    # Let chardet guess the real encoding of the recovered bytes.
    detected_encoding = chardet.detect(encoded_str)["encoding"]
    try:
        correct_str = encoded_str.decode(detected_encoding)
    except (UnicodeDecodeError, TypeError):  # TypeError if chardet returned None
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # utf-8-sig writes a BOM so that editors recognize the file as UTF-8.
        with open("output.txt", "w", encoding="utf-8-sig") as out:
            out.write(correct_str)
Summary:
>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'
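The one-liner above hard-codes cp850 and Shift-JIS. When the codec pair is unknown, a rough way to find likely candidates is to try every combination from two short lists and rank the results by how Japanese they look. This is only a sketch: the codec lists, the scoring heuristic, and the helper names `japanese_ratio` and `candidates` are assumptions made for illustration, not part of the original answer.

```python
import itertools

# Codecs that might have been wrongly used to decode the bytes (assumed list).
WRONG_CODECS = ["cp437", "cp850", "cp1252", "latin-1"]
# Codecs the text might really have been written in (assumed list).
RIGHT_CODECS = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8"]


def japanese_ratio(text):
    """Fraction of characters in the hiragana, katakana, or CJK ideograph blocks."""
    if not text:
        return 0.0
    hits = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
    )
    return hits / len(text)


def candidates(garbled):
    """Return (score, wrong_codec, right_codec, text) tuples, best first."""
    found = []
    for wrong, right in itertools.product(WRONG_CODECS, RIGHT_CODECS):
        try:
            text = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled string
        found.append((japanese_ratio(text), wrong, right, text))
    return sorted(found, key=lambda c: c[0], reverse=True)


garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
for score, wrong, right, text in candidates(garbled)[:3]:
    print(f"{score:.2f}  {wrong} -> {right}: {text}")
```

Pairs that raise an encode or decode error are discarded, which already rules out most combinations. The score only orders the survivors, so a human still has to judge whether the top candidate makes sense.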