Problem description
Update
As @ikegami suggested, I reported this as a bug.
Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output
Consider the following C and Perl programs, both of which write the UTF-8 encoding of the string "αβγ" to standard output:
C version:
    #include <stdio.h>

    int main(void) {
        /* UTF-8 encoded alpha, beta, gamma */
        char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
        puts(x);
        return 0;
    }
Output:
    C:…> chcp 65001
    Active code page: 65001
    C:…> cttt.exe
    αβγ
Perl version:
    C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
    αβγ
    �
From what I can tell, the last octet, 0xb3, is being output again on another line, where it is rendered as U+FFFD.
Note that redirecting output eliminates this effect.
I can also verify that it is the last octet being repeated:
    C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
    αβγxyz
    z
On the other hand, syswrite avoids this problem.
    C:…> perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
    αβγxyz
I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.
I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.
Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:
    C:…> chcp 437
    Active code page: 437
    C:…> cttt.exe
    ╬▒╬▓╬│
    C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
    ╬▒╬▓╬│
What is causing the last octet to be output twice when printing from a Perl program in cmd.exe when the code page is set to 65001?
PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.
PPS: Leaving out the trailing \n results in something even more interesting:
    C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
    αβγxyzxyz
    C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
    αβγ�γ�
The following program produces the correct output:
    use utf8;
    use strict;
    use warnings;
    use warnings qw(FATAL utf8);

    binmode(STDOUT, ":unix:encoding(utf8):crlf");
    print 'αβγxyz', "\n";
Output:
    C:…> chcp 65001
    Active code page: 65001
    C:…> perl pttt.pl
    αβγxyz
which seems to indicate to me that there is some funkiness with the :crlf layer. I do not understand the internals well enough to comment intelligently about this at this point.
After many experiments, I have come to the conclusion that, if the console is already set to code page 65001, binmode(STDOUT, ":unix:encoding(utf8):crlf"); will "work". However, note the following:
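    use YAML;  # assumption: Dump comes from a YAML module; the snippet omits its imports

    binmode(STDOUT, ":unix:encoding(utf8):crlf");
    print Dump [
        map {
            # Show purely numeric fields (the flags) as zero-padded hex.
            my $x = defined($_) ? $_ : '';
            $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
            $x;
        } PerlIO::get_layers(STDOUT, details => 1)
    ];
    print "αβγxyz\n";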
gives me:
    ---
    - unix
    - ''
    - 0x01205200
    - crlf
    - ''
    - 0x00c85200
    - unix
    - ''
    - 0x01201200
    - encoding
    - utf8
    - 0x00c89200
    - crlf
    - ''
    - 0x00c8d200
    αβγxyz
As before, I do not know enough to understand the full consequences of this. I do intend to build a debug perl at some point to diagnose this further.
I examined this a little further. Here are some observations from that post:
The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know enough about the internals to understand this.
However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.
The flags for the first crlf layer are 0x00c85200 = CANWRITE | TRUNCATE | CRLF | LINEBUF | FASTGETS | TTY. The flags for the second crlf layer, which I push after the :encoding(utf8) layer, are 0x00c8d200 = 0x00c85200 | UTF8.
Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:
    ---
    - unix
    - ''
    - 0x00201200
    - crlf
    - ''
    - 0x00405200
    - encoding
    - utf8
    - 0x00409200
As expected, the unix layer does not set the CRLF flag.