Why is the last octet repeated when my Perl program outputs a UTF-8 encoded string in cmd.exe?


Problem description


Update

As @ikegami suggested, I reported this as a bug.

Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpected output

Consider the following C and Perl programs, which both output the UTF-8 encoding of the string "αβγ" on standard output:

C version:

#include <stdio.h>

int main(void) {
    /* UTF-8 encoded alpha, beta, gamma */
    char x[] = { 0xce, 0xb1, 0xce, 0xb2, 0xce, 0xb3, 0x00 };
    puts(x);
    return 0;
}

Output:

C:…> chcp 65001
Active code page: 65001

C:…> cttt.exe
αβγ

Perl version:

C:…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
αβγ
�

From what I can tell, the last octet, 0xb3, is being output a second time on a new line, where the console renders it as U+FFFD.

Note that redirecting output eliminates this effect.

I can also verify that it is the last octet being repeated:

C:…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz
z

On the other hand, syswrite avoids this problem.

C:…>  perl -e "syswrite STDOUT, qq{\xce\xb1\xce\xb2\xce\xb3xyz\n}"
αβγxyz

I have observed this in cmd.exe windows on Windows 8.1 Pro 64-bit and Windows Vista Home 32-bit using both self-built perl 5.18.2 and ActiveState's 5.16.3.

I do not see the problem in Cygwin, Linux, or Mac OS X environments. Also, Cygwin's perl 5.14.4 produces correct output in cmd.exe.

Also, when the code page is set to 437, the output from both the C and the Perl versions is identical:

C:…> chcp 437
Active code page: 437

C:…> cttt.exe
╬▒╬▓╬│

C:…>  perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3\n}"
╬▒╬▓╬│

What is causing the last octet to be output twice when printing from a Perl program in cmd.exe with the code page set to 65001?

PS: I have some more information and screenshots on my blog. For this question, I have tried to distill everything to the simplest possible cases.

PPS: Leaving out the \n results in something even more interesting:

C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3xyz}"
αβγxyzxyz

C:…> perl -e "print qq{\xce\xb1\xce\xb2\xce\xb3}"
αβγ�γ�

Solution

The following program produces the correct output:

use utf8;
use strict;
use warnings;
use warnings qw(FATAL utf8);

binmode(STDOUT, ":unix:encoding(utf8):crlf");

print 'αβγxyz', "\n";

Output:

C:…> chcp 65001
Active code page: 65001
C:…> perl pttt.pl
αβγxyz

This seems to indicate that there is some funkiness with the :crlf layer. I do not understand the internals well enough to comment intelligently on this at this point.

After many experiments, I have come to the conclusion that, if the console is already set to code page 65001, binmode(STDOUT, ":unix:encoding(utf8):crlf"); will "work". However, note the following:

# Dump is assumed to come from a YAML module, e.g. "use YAML;"
binmode(STDOUT, ":unix:encoding(utf8):crlf");
print Dump [
    map {
        my $x = defined($_) ? $_ : '';
        $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
        $x;
    } PerlIO::get_layers(STDOUT, details => 1)
];
print "αβγxyz\n";

gives me:

---
- unix
- ''
- 0x01205200
- crlf
- ''
- 0x00c85200
- unix
- ''
- 0x01201200
- encoding
- utf8
- 0x00c89200
- crlf
- ''
- 0x00c8d200
αβγxyz

As before, I do not know enough to know the full consequences of this. I do intend to build a debug perl at some point to further diagnose this.

I examined this a little further. Here are some observations from that post:

The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know about the internals enough to understand this.

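However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.

The flags for the first crlf layer are 0x00c85200 = CANWRITE | TRUNCATE | CRLF | LINEBUF | FASTGETS | TTY. The flags for the second crlf layer, which I pushed after the :encoding(utf8) layer, are 0x00c8d200 = 0x00c85200 | UTF8.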

Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:

---
- unix
- ''
- 0x00201200
- crlf
- ''
- 0x00405200
- encoding
- utf8
- 0x00409200

As expected, the unix layer does not set the CRLF flag.
