Question
I have a Unicode string and don't know what its encoding is. When a Perl program reads this string, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to strip non-ASCII characters from the input. I found this snippet on a forum that is supposed to do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How does the above work when no input encoding is specified? Should the encoding be specified as follows?
my $line = encode('ascii', normalize('KD', decode('input-encoding', $myutf)), sub {$_[0] = ''});
Recommended answer
To find out which encoding unknown data uses, you just have to try candidates and look at the result. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
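Encode::Guess, which ships with Encode, can be used in a similar way. The following is only a minimal sketch: the sample bytes are an assumption chosen for illustration (the UTF-8 encoding of 北京), and with no extra suspects given the module only considers ASCII, UTF-8 and BOM-marked UTF-16/32.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Raw bytes as they might arrive from a file or socket.
# Assumed sample data: "北京" encoded as UTF-8.
my $bytes = "\xe5\x8c\x97\xe4\xba\xac";

# guess_encoding returns an encoding object on success,
# or a plain error string when it cannot decide.
my $decoder = guess_encoding($bytes);
die "Cannot guess encoding: $decoder" unless ref $decoder;

# Decode the bytes into a Perl character string.
my $text = $decoder->decode($bytes);
printf "%s, %d characters\n", $decoder->name, length $text;
```

Additional suspects can be passed after the data, for example guess_encoding($bytes, qw/euc-jp shiftjis/). The more suspects there are, the more likely the guess is reported as ambiguous, so it helps to list only encodings you actually expect.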
I find encode 'ascii' a lame solution for getting rid of non-ASCII characters. Every character that has no ASCII equivalent is replaced with a question mark; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
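For comparison, here is a rough sketch of what the snippet from the question does once it is given a decoded character string. It uses NFKD from Unicode::Normalize, which should be equivalent to normalize('KD', ...), and a callback that returns an empty string for every character ASCII cannot represent. The sample text is an assumption made up for this illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use Unicode::Normalize qw(NFKD);

# Assumed sample input, already a character string (not raw bytes).
my $string = 'Café in 北京';

# NFKD splits "é" into "e" plus a combining accent (U+0301).
# The callback is called for each unmappable character and
# returns the replacement, here an empty string.
my $ascii = encode('ascii', NFKD($string), sub { '' });

print "[$ascii]\n";    # [Cafe in ]
```

Accented Latin letters keep their base letter, but anything without an ASCII decomposition, such as 北京, disappears without a trace. The input has to be decoded first; normalizing raw bytes of an unknown encoding would not give meaningful results.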
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as the plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. If you want to be able to reverse the operation later, pick one of the PERLQQ or XMLCREF fallbacks.
use utf8;
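use Encode qw(encode FB_PERLQQ FB_XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
# FB_PERLQQ and FB_XMLCREF are PERLQQ and XMLCREF combined with LEAVE_SRC,
# so the first call does not empty $string.
print encode('ascii', $string, FB_PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, FB_XMLCREF); # This year I went to &#x5317;&#x4eac; Perl workshop.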
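Reversing these escapes later could look roughly like the sketch below. The regular expressions are my own assumption about a reasonable way to undo the escaping; they are not part of Encode. They would also misread input that already contained literal text such as \x{41} or &#x41; before encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_PERLQQ FB_XMLCREF);

my $string = "This year I went to \x{5317}\x{4eac} Perl workshop.";

# The FB_ constants include LEAVE_SRC, so $string stays intact.
my $perlqq = encode('ascii', $string, FB_PERLQQ);
my $xmlref = encode('ascii', $string, FB_XMLCREF);

# Decode the ASCII bytes, then turn each escape back into its character.
(my $from_perlqq = decode('ascii', $perlqq)) =~ s/\\x\{([0-9A-Fa-f]+)\}/chr hex $1/ge;
(my $from_xmlref = decode('ascii', $xmlref)) =~ s/&#x([0-9A-Fa-f]+);/chr hex $1/ge;

print "round trip ok\n"
    if $from_perlqq eq $string && $from_xmlref eq $string;
```

For XMLCREF output that ends up in an XML or HTML document, a real parser will resolve the character references for you, so the manual substitution is only needed for plain text.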