How to convert UCS-2 little endian to UTF-8 with Perl or PowerShell, or do an in-place one-liner regex search and replace on a UCS-2 file

Question

I'm using Windows ActivePerl and I can never seem to get a UCS-2 little endian file to convert properly to UTF-8. The best I could muster is what seems to be a proper conversion, except that the first line, which is 4 characters long, is mangled into strange Chinese/Japanese characters; the rest of the file seems OK.

What I really want is the usual one-liner search/replace Perl regex:

perl -pi.bak -e 's/replacethis/withthat/g;' my_ucs2file.txt

That won't work, so I first tried to see whether perl can do a proper conversion, and I'm stuck. I'm using:

perl -i.BAKS -MEncode -p -e "Encode::from_to($_, 'UCS-2', 'UTF-8')" My_UCS2file.txt

I tried using UCS2 or UCS-2LE but still can't get a proper conversion.

I recall somewhere someone had to delete a couple bits or something at the beginning of a UCS2 file to get conversion working but I can't remember...

When I tried PowerShell it complained it didn't know UCS2 / UCS-2 ...??

Appreciate any ideas. I noticed Notepad++ does open it and recognize it fine, and I can edit and resave in notepad, but there's no command-line ability...

Recommended answer

The one liner way is to avoid perl entirely and just use iconv -f UCS-2LE -t UTF-8 infile > outfile, but I'm not sure if that's available on Windows.

So, with perl as a one liner:

$ perl -Mopen="IN,:encoding(UCS-2LE),:std" -C2 -0777 -pe 1 infile > outfile

-0777 combined with -p reads an entire file at a time instead of a line at a time, which is one place where you were going wrong: when your code points are 16 bits wide but you treat them as 8-bit ones, finding the line separators is going to be problematic.

-C2 says to use UTF-8 for standard output.

-Mopen="IN,:encoding(UCS-2LE),:std" says that the default encoding for input streams, including standard input (so it works with redirected input, not just files), is UCS-2LE. See the open pragma for details (in a script it would be use open IN => ':encoding(UCS-2LE)', ':std';). Speaking of encodings, another issue you're having is that UCS-2 is a synonym for UCS-2BE. See Encode::Unicode for details.

So that just reads a file at a time, converting from UCS-2LE to perl's internal encoding, and prints it back out again as UTF-8.

If you didn't have to worry about Windows line ending conversion,

$ perl -MEncode -0777 -pe 'Encode::from_to($_, "UCS-2LE", "UTF-8")' infile > outfile

would work too.

If you want the output file to be in UCS-2LE too, and not just convert between encodings:

$ perl -Mopen="IO,:encoding(UCS-2LE),:std" -pe 's/what/ever/' infile > outfile
