Problem description
I recently wrote a zip file I/O library called zipzap, but I'm struggling to correctly decode zip entry file names from arbitrary zip files.
Now, the PKWARE app note states:
D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification...
This means that conforming zip files encode file names as CP437 unless the EFS bit (general purpose bit 11) is set, in which case the file names are UTF-8.
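As a minimal sketch of the spec-conforming rule just described, assuming the raw name bytes and the general purpose flags have already been read from the entry header (the function and constant names here are hypothetical, not part of zipzap):

```python
# General purpose bit 11, the "EFS" / language encoding flag.
EFS_FLAG = 0x0800


def decode_entry_name(raw_name: bytes, general_purpose_flags: int) -> str:
    """Decode a raw zip entry name strictly by the rule quoted above."""
    if general_purpose_flags & EFS_FLAG:
        # Bit 11 set: the name must be UTF-8.
        return raw_name.decode("utf-8")
    # Bit 11 unset: fall back to the original ZIP encoding, CP437.
    return raw_name.decode("cp437")


# Byte 0x82 is "é" in CP437, so both calls yield the same name.
print(decode_entry_name(b"caf\x82.txt", 0))
print(decode_entry_name("café.txt".encode("utf-8"), EFS_FLAG))
```

This only covers archives that follow the spec; it will produce mojibake for tools that ignore the flag.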
Unfortunately, it seems that a lot of zip tools either don't set the EFS bit correctly (e.g. Mac CLI, GUI zip) or use some other encoding, typically the default system one (e.g. WinZip?). If you know how WinZip, 7-Zip, Info-Zip, PKZIP, Java JAR/Zip, .NET zip, dotnetzip, etc. encode file names and what they set their "version made by" field to when zipping, please tell me.
In particular, Info-Zip tries this when unzipping:
- File system = MS-DOS (0) => CP437
- Except: version = 2.5, 2.6, 4.0 => ISO 8859-1
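The rule listed above could be sketched roughly as follows. This assumes the usual layout of the 16-bit "version made by" field, where the upper byte is the host file system and the lower byte is the version times ten (so 25 means 2.5). The function name is hypothetical, and returning `None` for other host systems is my own choice because the list above says nothing about them:

```python
from typing import Optional

HOST_MSDOS = 0  # host system code for MS-DOS in the "version made by" field


def guess_encoding_from_version_made_by(version_made_by: int) -> Optional[str]:
    """Apply the MS-DOS rule listed above; return None when it does not apply."""
    host_system = version_made_by >> 8   # upper byte: host file system
    version = version_made_by & 0xFF     # lower byte: 25 means version 2.5

    if host_system != HOST_MSDOS:
        return None  # the list above gives no rule for other file systems
    if version in (25, 26, 40):
        return "iso-8859-1"  # the exception for versions 2.5, 2.6 and 4.0
    return "cp437"


print(guess_encoding_from_version_made_by(20))             # MS-DOS, version 2.0
print(guess_encoding_from_version_made_by(25))             # MS-DOS, version 2.5
print(guess_encoding_from_version_made_by((3 << 8) | 30))  # non-MS-DOS host
```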
If I want to support inspecting or extracting from arbitrary zip files and make a reasonable attempt at the file name encoding when the EFS flag is not set, what can I look for?
Recommended answer
Without the EFS flag, the only way to tell whether a file name is encoded as UTF-8 is to check whether any of its bytes has the high-order bit set. That may mean the name is UTF-8 encoded. However, it is not conclusive, because CP437 also has characters with the high-order bit set that aren't meant to be decoded as UTF-8.
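For illustration, here is one way the high-bit check could look in practice. Note that this goes a step further than the check described above: it also tries a strict UTF-8 decode, which is my own addition and not something the answer prescribes. The function name is hypothetical, and the result is only a guess, since some CP437 byte sequences happen to be valid UTF-8 as well:

```python
def guess_name_encoding(raw_name: bytes) -> str:
    """Guess the encoding of a raw entry name when the EFS flag is unset."""
    # No byte has the high-order bit set: plain ASCII, which decodes
    # identically under CP437 and UTF-8.
    if all(byte < 0x80 for byte in raw_name):
        return "ascii"
    try:
        # High-bit bytes present: see whether they form valid UTF-8.
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the spec default.
        return "cp437"
    # Valid UTF-8, but still ambiguous: the same bytes could be CP437.
    return "utf-8"


print(guess_name_encoding(b"readme.txt"))
print(guess_name_encoding("café.txt".encode("utf-8")))
print(guess_name_encoding(b"caf\x82.txt"))
```

A lone byte such as 0x82 is a UTF-8 continuation byte with no lead byte, which is why the last name falls back to CP437.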
I would stick to the PKWARE app note specification rather than hack in a solution that tries to conform to every known zip application in existence.