问题描述
I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}
.
So my script should be callable in these two ways, as usual, giving the same output:
./script.pl utf8.txt
cat utf8.txt | ./script.pl
But the outputs differ! Only the second call (using cat
) seems to work as designed, reading UTF-8 properly. Here is the script:
#!/usr/bin/perl -w
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
while(<>){
my @chars = split //, $_;
print "$_
" foreach(@chars);
}
How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <>
for reading, if possible.
EDIT:
I realized I should probably describe the different outputs. My input file contains this sequence: axCAxA7b
. The method with cat
correctly outputs:
a
xCAxA7
b
But the other method gives me this:
a
xC3x8A
xC2xA7
b
Try to use the pragma open instead:
use strict;
use warnings;
use open qw(:std :utf8);
while(<>){
my @chars = split //, $_;
print "$_" foreach(@chars);
}
You need to do this because the <> operator is magical. As you know it will read from STDIN or from the files in @ARGV. Reading from STDIN causes no problem as STDIN is already open thus binmode works well on it. The problem is when reading from the files in @ARGV, when your script starts and calls binmode the files are not open. This causes STDIN to be set to UTF-8, but this IO channel is not used when @ARGV has files. In this case the <> operator opens a new file handle for each file in @ARGV. Each file handle gets reset and loses it's UTF-8 attribute. By using the pragma open you force each new STDIN to be in UTF-8.
这篇关于如何使用菱形运算符 (<>) 读取 UTF-8?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!