Problem description
I recently wrote a zip file I/O library called zipzap, but I'm struggling with correctly decoding zip entry file names from arbitrary zip files.
Now, the PKWARE spec states:
D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437...
D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification...
This means that conforming zip files encode file names as CP437 unless the EFS bit (general purpose bit 11) is set, in which case the file names are UTF-8.
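As a point of reference, the spec-only rule can be sketched in a few lines of Python. This is a minimal illustration, not code from zipzap; the names `EFS_FLAG` and `decode_entry_name` are made up for the example, and the raw name bytes and general purpose flags are assumed to have been read from the entry header already.

```python
EFS_FLAG = 1 << 11  # general purpose bit 11, the language encoding (EFS) flag


def decode_entry_name(raw_name: bytes, general_purpose_flags: int) -> str:
    """Decode a zip entry name using only the APPNOTE rule (hypothetical helper)."""
    if general_purpose_flags & EFS_FLAG:
        # Bit 11 set: the name is declared to be UTF-8.
        return raw_name.decode("utf-8")
    # Bit 11 unset: the name is declared to be in the original ZIP encoding, CP437.
    return raw_name.decode("cp437")
```

For example, `decode_entry_name(b"caf\x82.txt", 0)` yields `café.txt`, because byte 0x82 is `é` in CP437.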
Unfortunately it seems that a lot of zip tools either don't set the EFS bit correctly (e.g. Mac CLI, GUI zip) or use some other encoding, typically the default system one (e.g. WinZip?). If you know how WinZip, 7-Zip, Info-Zip, PKZIP, Java JAR/Zip, .NET zip, dotnetzip, etc. encode file names and what they set their "version made by" field to when zipping, please tell me.
In particular, Info-Zip tries this when unzipping:
- File system = MS-DOS (0) => CP437
- except: version = 2.5, 2.6, 4.0 => ISO 8859-1
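The two rules listed above could be expressed roughly as follows. This is only a sketch of the listed cases, not Info-Zip's actual code. It assumes the usual layout of the "version made by" field (upper byte is the host file system, lower byte is the version times ten, so 25 means 2.5), and the name `infozip_style_encoding` is invented for the example. Host systems not covered by the list return `None` rather than a guess.

```python
from typing import Optional

FS_MSDOS = 0  # host system code for MS-DOS in the "version made by" field


def infozip_style_encoding(version_made_by: int) -> Optional[str]:
    """Pick a codec name from the 16-bit "version made by" field (hypothetical helper)."""
    host_system = (version_made_by >> 8) & 0xFF  # upper byte: originating file system
    version = version_made_by & 0xFF             # lower byte: 25 means version 2.5

    if host_system == FS_MSDOS:
        # Exception from the list: versions 2.5, 2.6 and 4.0 are treated as ISO 8859-1.
        if version in (25, 26, 40):
            return "iso-8859-1"
        return "cp437"
    # Other host systems are not covered by the rules listed above.
    return None
```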
If I want to support inspecting or extracting from arbitrary zip files and make a reasonable attempt at the file name encoding without the EFS flag, what can I look for?
Recommended answer
The only way to determine if the filename is encoded as UTF-8 without using the EFS flag is to check to see if the high order bit is set in one of the characters. That could possibly mean that the character is UTF-8 encoded. However, it could still be the other way as there are some characters in CP437 that have the high order bit set and aren't meant to be decoded as UTF-8.
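A rough sketch of that high-bit check in Python follows. It goes one step further than the check described above by also testing whether the bytes form valid UTF-8, which is an added assumption rather than part of the answer, and the name `guess_entry_name` is hypothetical. As noted, the guess can still be wrong.

```python
from typing import Tuple


def guess_entry_name(raw_name: bytes) -> Tuple[str, str]:
    """Return (decoded name, guessed codec) for a name without the EFS flag."""
    # No high-order bit anywhere: plain ASCII, identical in CP437 and UTF-8.
    if all(byte < 0x80 for byte in raw_name):
        return raw_name.decode("ascii"), "ascii"
    try:
        # At least one high-order bit is set; see whether the bytes are valid UTF-8.
        return raw_name.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the spec default.
        return raw_name.decode("cp437"), "cp437"
```

A CP437 name whose high bytes happen to form a valid UTF-8 sequence (for instance the byte pair 0xC3 0xA9) would be reported as UTF-8 here, so the result is a best guess and not a determination.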
I would stick to the PKWARE app note specification and not hack in a solution that tries to conform to every known zip application in existence.