Problem description
Here is a Perl script that I expected to print found when executed:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Encode;
use constant filename => 'Bärlauch';
open (my $out, '>', filename) or die;
close $out;
opendir(my $dir, '.') or die;
while (my $filename_read = readdir($dir)) {
    # $filename_read = encode('utf8', $filename_read);
    print "found\n" if $filename_read eq filename;
}
The script first creates a file named after the constant filename. (After running the script, I can verify with ls that the file exists and that it was not created with "funny" characters.)
Then the script iterates over the files in the current working directory and prints found if there is a file whose name is equal to that of the file just created. This should obviously be the case.
However, it doesn't (Ubuntu, bash, LANG=en_US.UTF8).
If I change the constant to Barlauch, it works as expected and prints found.
Uncommenting $filename_read = encode('utf8', $filename_read); does not change the behavior.
Is there an explanation for this, and what do I have to do in order to recognize a filename with umlauts in it?
Recommended answer
The question rephrased (as I interpret it) is:
Why doesn't readdir return the newly created filename? (Here, the name is given by the constant filename, which is set to Bärlauch.)
(Note: filename is a Perl constant, which is why it lacks the $ sigil in front.)
Background:
First note: due to the use utf8 statement at the beginning of your program, filename will be upgraded to a Unicode string at compile time, since it contains non-ASCII characters. From the documentation of the utf8 pragma:
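Enabling the utf8 pragma has the following effect: Bytes in the source text that are not in the ASCII character set will be treated as being part of a literal UTF-8 sequence. This includes most literals such as identifier names, string constants, and constant regular expression patterns.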
and also, according to perluniintro, section "Perl's Unicode Model":
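The general principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data is transparently upgraded to Unicode.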
...
Internally, Perl currently uses whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings.
The non-ASCII character in filename is the letter ä. In the ISO 8859-1 extended ASCII encoding (Latin-1), it is encoded as the byte value 0xE4; see this table at ascii-code.com. However, if you removed the ä character from filename, it would contain only ASCII characters, and therefore it would not be internally upgraded to Unicode, even if you used the utf8 pragma.
So filename is now a Unicode string with the internal UTF-8 flag set (see the utf8 pragma for more information on the UTF-8 flag). Note that the letter ä is encoded in UTF-8 as the two bytes 0xC3 0xA4.
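To make the difference between the character string and its UTF-8 bytes concrete, here is a minimal sketch. It assumes the script itself is saved as UTF-8; the variable names are illustrative and not part of the original script.

```perl
use strict;
use warnings;
use utf8;                        # source literals are decoded from UTF-8
use Encode qw(encode_utf8);

my $name  = 'Bärlauch';          # character string, as in the constant filename
my $bytes = encode_utf8($name);  # byte string holding the UTF-8 encoding

# utf8::is_utf8 reports the internal UTF-8 flag discussed above.
my $name_flag  = utf8::is_utf8($name)  ? 1 : 0;
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;

# Hex dump of the encoded bytes; ä shows up as C3 A4.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;

printf "characters: %d, UTF-8 flag: %d\n", length($name),  $name_flag;
printf "bytes:      %d, UTF-8 flag: %d\n", length($bytes), $bytes_flag;
print  "encoded:    $hex\n";
```

The character string has 8 characters while the encoded form has 9 bytes, because ä takes two bytes in UTF-8.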
Writing the file:
When writing the file, what happens to the filename? If filename is a Unicode string, it will be encoded as UTF-8. Note, however, that it is not necessary to encode filename first (encode_utf8( filename )). See "Creating filenames with unicode characters" for more information. So the filename is written to disk as UTF-8 encoded bytes.
Reading the filename back:
When you read the filename back from disk, readdir does not return Unicode strings (strings with the UTF-8 flag set), even if the filename contains bytes encoded in UTF-8. It returns binary (byte) strings; see perlunitut for a discussion of byte strings vs. character (Unicode) strings.
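One way to observe this is to create the file in a scratch directory and inspect what readdir hands back. This is only a sketch: the temporary directory (via File::Temp) is an assumption made to keep the example self-contained, and the exact bytes you see depend on your filesystem.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempdir);

# Scratch directory (an assumption of this sketch), removed at exit.
my $tmp  = tempdir(CLEANUP => 1);
my $name = 'Bärlauch';

open(my $out, '>', "$tmp/$name") or die "open failed: $!";
close $out;

opendir(my $dir, $tmp) or die "opendir failed: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dir);
closedir $dir;

my $flags_seen = 0;   # counts entries that come back with the UTF-8 flag set
for my $entry (@entries) {
    my $flag = utf8::is_utf8($entry) ? 1 : 0;
    $flags_seen += $flag;
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $entry;
    printf "UTF-8 flag: %d, length: %d, bytes: %s\n", $flag, length($entry), $hex;
}
```

On a typical Linux setup the entry comes back without the UTF-8 flag and with the bytes C3 A4 in place of ä.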
Why doesn't readdir return Unicode strings? First, according to perlunicode, section "When Unicode Does Not Happen":
There are still many places where Unicode (in some encoding or another) could be given as arguments or received as results, or both in Perl, but it is not. (...)
The following are such interfaces. For all of these interfaces Perl currently (as of v5.16.0) simply assumes byte strings both as arguments and results. (...)
One reason that Perl does not attempt to resolve the role of Unicode in these situations is that the answers are highly dependent on the operating system and the file system(s). For example, whether filenames can be in Unicode and in exactly what kind of encoding, is not exactly a portable concept. (...)
- chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, -X
- %ENV
- glob (aka the <*>)
- open, opendir, sysopen
- qx (aka the backtick operator), system
- readdir, readlink
So readdir returns byte strings, since it is in general impossible to know the encoding of a filename a priori. For background information about why this is impossible, see for example:
- filename in Wikipedia, sub section "Encoding interoperability",
- Understanding Unix file name encoding on unix.stackexchange.com
String comparison:
Now, finally, you try to compare the filename you read, $filename_read, with the constant filename:
print "found\n" if $filename_read eq filename;
In this case, the only difference between $filename_read and filename is that $filename_read does not have the UTF-8 flag set (it is not what Perl internally recognizes as a "Unicode string").
The interesting thing now is that the result of the eq operator depends on whether the bytes in $filename_read are pure ASCII or not. According to the documentation of the Encode module:
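Before the introduction of Unicode support in Perl, the eq operator just compared the strings represented by two scalars. Beginning with Perl 5.8, eq compares two strings with simultaneous consideration of the UTF8 flag.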
...
When you decode, the resulting UTF8 flag is on--unless you can unambiguously represent data.
So in your case, eq will take the UTF-8 flag into account, since $filename_read does not contain pure ASCII, and as a result it will consider the two strings not equal. Concretely, the byte string is read as the two characters 0xC3 0xA4 where filename holds the single character ä (U+00E4). If $filename_read and filename were identical and contained only pure ASCII bytes (with filename still having the UTF-8 flag set and $filename_read not), then eq would consider the two strings equal. See the discussion in the documentation for Encode for more information on the background for this behavior.
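The comparison behavior can be reproduced without touching the filesystem. In this sketch, $bytes stands in for what readdir would return on a UTF-8 filesystem; that stand-in and all variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8 decode_utf8);

my $chars = 'Bärlauch';            # character string with the UTF-8 flag set
my $bytes = encode_utf8($chars);   # stand-in for the byte string from readdir

# Byte string vs. character string: different lengths, so not equal.
my $raw_equal     = ($bytes eq $chars) ? 1 : 0;
# After decoding, both sides are character strings and compare equal.
my $decoded_equal = (decode_utf8($bytes) eq $chars) ? 1 : 0;

# Pure ASCII: the flag differs, but the characters are the same.
my $ascii_flagged = 'Barlauch';
utf8::upgrade($ascii_flagged);     # sets the UTF-8 flag, characters unchanged
my $ascii_bytes   = 'Barlauch';
my $ascii_equal   = ($ascii_flagged eq $ascii_bytes) ? 1 : 0;

print "raw bytes eq chars:     $raw_equal\n";
print "decoded bytes eq chars: $decoded_equal\n";
print "ASCII, flags differ:    $ascii_equal\n";
```

This matches the observation in the question: Barlauch is found even without decoding, while Bärlauch is not.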
Conclusion:
So if you are relatively confident that all your filenames are UTF-8 encoded, you could solve the issue in your question by decoding the byte string returned from readdir into a Unicode string (forcing the UTF-8 flag to be set):
$filename_read = Encode::decode_utf8( $filename_read );
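Put together, a corrected version of the script from the question might look like the sketch below. Two things are assumptions rather than part of the original: it works in a temporary directory instead of the current one, and it uses strict decoding (Encode::FB_CROAK inside eval) so that names that are not valid UTF-8 are skipped instead of being silently repaired.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode);
use File::Temp qw(tempdir);

use constant filename => 'Bärlauch';

# Scratch directory (an assumption of this sketch), removed at exit.
my $tmp = tempdir(CLEANUP => 1);
open(my $out, '>', $tmp . '/' . filename) or die "open failed: $!";
close $out;

opendir(my $dir, $tmp) or die "opendir failed: $!";
my $found = 0;
while (defined(my $raw = readdir($dir))) {
    # decode dies under FB_CROAK if $raw is not valid UTF-8; eval catches
    # that, and LEAVE_SRC keeps $raw itself untouched.
    my $decoded = eval {
        decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    next unless defined $decoded;
    $found++ if $decoded eq filename;
}
closedir $dir;

print "found\n" if $found;
```

Testing readdir's result with defined also avoids stopping early on a file named 0, which the original while condition would treat as false.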
More details
Note: since Unicode allows multiple representations of the same character, there exist two forms of the ä (LATIN SMALL LETTER A WITH DIAERESIS) in Bärlauch. For example,
- U+00E4 is the NFC (Normalization Form Canonical Composition) form,
- U+0061 U+0308 is the NFD (Normalization Form Canonical Decomposition) form.
On my platform (Linux), UTF-8 encoded filenames are stored using NFC form, but on Mac OS they use NFD form. See Encode::UTF8Mac for more information. This means that if you work on a Linux machine and, for example, clone a Git repository that was created by a Mac user, you can easily get NFD encoded filenames on your Linux machine. The Linux filesystem does not care what encoding a filename is in; it just treats it as a sequence of bytes. Hence, I could easily write a script that creates an ISO-Latin-1 encoded filename, even though my locale is "en_US.UTF-8". The current locale settings are just guidelines for applications; if an application ignores them, nothing stops it from doing so.
So if you are unsure whether filenames returned from readdir use NFC or NFD, you should always decompose them after you have decoded them:
use Unicode::Normalize;
print "found\n" if NFD( $filename_read ) eq NFD( filename );
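As a small illustration of why normalizing both sides matters, the following sketch compares the two forms of Bärlauch directly. The strings are built with escapes so the example does not depend on how your editor stores ä; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "B\x{E4}rlauch";     # NFC: single code point U+00E4
my $decomposed = "Ba\x{308}rlauch";   # NFD: U+0061 followed by U+0308

# Without normalization the strings differ (8 vs. 9 characters).
my $plain_equal = ($composed eq $decomposed) ? 1 : 0;
# Normalizing both sides to the same form makes them compare equal.
my $nfd_equal   = (NFD($composed) eq NFD($decomposed)) ? 1 : 0;
my $nfc_equal   = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;

printf "lengths: %d vs %d\n", length($composed), length($decomposed);
print  "plain eq: $plain_equal, NFD eq: $nfd_equal, NFC eq: $nfc_equal\n";
```

Either form works for the comparison, as long as both operands are normalized to the same one.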
See also the Perl Unicode Cookbook, section "Always Decompose and Recompose".
Finally, to understand more about how the locale works together with Unicode in Perl, you could have a look at:
- perllocale, section "Unicode and UTF-8", and
- Encode::Locale.