Problem description
I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").
The file looks like this in a hex viewer: EF BB BF 43 6F 6E 66 65 72 65 6E 63 65
This translates to "Conference" when printed. I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it messes up a string comparison that I undertake later).
So I tried to remove it using the following code, but I fail miserably:
$line =~ s/^\xEF\xBB\xBF//;
Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?
Thanks!
Recommended answer
EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:
s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//; # Convenient alias
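To see why the byte-oriented pattern from the question cannot match, here is a minimal sketch that decodes the same bytes by hand with the core Encode module, which is roughly what the :encoding(UTF-8) layer does on read. The variable names $bytes and $chars are illustrative only and do not come from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes from the hex dump: the UTF-8 BOM followed by "Conference".
my $bytes = "\xEF\xBB\xBFConference";

# Decode the bytes into characters, as the input layer would.
my $chars = decode('UTF-8', $bytes);

printf "length before decoding: %d\n", length $bytes;   # 13 bytes
printf "length after decoding:  %d\n", length $chars;   # 11 characters
printf "first character: U+%04X\n", ord $chars;         # U+FEFF

# The three-byte pattern does not match the decoded string...
my $byte_pattern_matches = $chars =~ /^\xEF\xBB\xBF/ ? 1 : 0;

# ...but the single-character pattern does, and removes the BOM.
$chars =~ s/^\x{FEFF}//;

print "byte pattern matched: $byte_pattern_matches\n";
print "string comparison now succeeds\n" if $chars eq 'Conference';
```

After decoding, the three bytes have collapsed into the single character U+FEFF, so a pattern has to target that character rather than its encoded bytes.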
You're getting the "wide character" warning because you forgot to add an :encoding layer on your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT, and STDERR, and makes it the default for open():
use open ':std', ':encoding(UTF-8)';
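Putting the pieces together, the following is a rough end-to-end sketch rather than a definitive recipe. It assumes a throwaway file created with the core File::Temp module to stand in for the real input, and the names $tmp, $path, $in, and $line are hypothetical.

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # default layer for STDIN/STDOUT/STDERR and open()
use File::Temp qw(tempfile);

# Create a sample file containing the bytes from the question (assumption:
# a temporary file stands in for the real input file).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';                  # write the bytes exactly as given
print {$tmp} "\xEF\xBB\xBFConference\n";
close $tmp or die "Cannot close $path: $!";

# No explicit layer here, so the default from "use open" decodes the input.
open my $in, '<', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;

chomp $line;
$line =~ s/^\x{FEFF}//;                # strip the decoded BOM, if present

print "Line matches\n" if $line eq 'Conference';
```

Because the pragma also covers STDOUT, printing a string that still contains U+FEFF would no longer trigger the "wide character" warning, but stripping the BOM is still needed for the string comparison to succeed.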