Newline control characters in multi-byte character sets

Question

I have some Perl code that translates carriage returns and line feeds to a normalized form. The input text is Japanese, so it will contain multi-byte characters.

Is it still possible to do this transformation on a byte-by-byte basis (which I think the code currently does), or do I have to detect the character set and enable Unicode support? In other words, do the popular encodings (Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) use bytes inside their multi-byte characters that could be mistaken for ASCII control characters?

I only need CR and LF to work.

Update: Added ISO-2022-JP. That is the one that looks the most troublesome, with its funky escape sequences ...

Recommended answer

None of the four encodings that you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) use the CR or LF byte inside Japanese characters. For UTF-8 and EUC-JP, there is no overlap whatsoever between ASCII bytes (0x00 - 0x7F) and the bytes used inside Japanese characters. For Shift-JIS and ISO-2022-JP there is some overlap with ASCII, but not in the range where CR (0x0D) and LF (0x0A) are found.

For ISO-2022-JP,
First-byte range: 0x21 - 0x7E
Second-byte range: 0x21 - 0x7E

The bytes that make up the escape sequences used to switch back and forth between the various character sets are:

0x1B, 0x28, 0x24, 0x40, 0x42, and 0x4A
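
As a quick way to see these ranges in practice, here is a minimal sketch in Python (chosen for illustration only; the question itself concerns Perl). It assumes Python's built-in "iso2022_jp" codec and an arbitrary sample string, and simply inspects the bytes the codec produces.

```python
# Illustrative sketch: inspect the bytes Python's "iso2022_jp" codec emits
# for a short Japanese sample string (the sample is an arbitrary choice).
CR, LF = 0x0D, 0x0A

encoded = "日本語".encode("iso2022_jp")

# The codec wraps the two-byte data in escape sequences:
#   ESC $ B  switches to the two-byte JIS X 0208 set
#   ESC ( B  switches back to ASCII
starts_with_switch_to_kanji = encoded.startswith(b"\x1b$B")
ends_with_switch_to_ascii = encoded.endswith(b"\x1b(B")

# Everything between the two 3-byte escape sequences is character data.
payload = encoded[3:-3]
payload_in_range = all(0x21 <= b <= 0x7E for b in payload)

# Neither CR nor LF appears anywhere in the encoded bytes.
contains_cr_or_lf = CR in encoded or LF in encoded

print(encoded)
print(payload_in_range, contains_cr_or_lf)
```

The character data looks like printable ASCII, which is why this encoding seems troublesome at first glance, but every byte stays within 0x21 - 0x7E and so cannot collide with CR or LF.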

For Shift-JIS,
First-byte range: 0x81 - 0x9F, 0xE0 - 0xEF
Second-byte range: 0x40 - 0x7E, 0x80 - 0xFC
Half-width katakana: 0xA1 - 0xDF
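
The Shift-JIS ranges listed above can be checked mechanically. The following Python sketch is an illustration under the assumption that those ranges are complete; the helper names are hypothetical. It also shows one well-known case of the overlap with ASCII: the trail byte of the katakana ソ is 0x5C, the ASCII backslash.

```python
# Byte ranges copied from the list above (inclusive bounds).
SJIS_LEAD = [(0x81, 0x9F), (0xE0, 0xEF)]
SJIS_TRAIL = [(0x40, 0x7E), (0x80, 0xFC)]
SJIS_HALFWIDTH_KATAKANA = [(0xA1, 0xDF)]

CR, LF = 0x0D, 0x0A


def in_ranges(byte, ranges):
    """Return True if byte falls inside any (low, high) range."""
    return any(low <= byte <= high for low, high in ranges)


def can_appear_in_sjis_character(byte):
    """Hypothetical helper: could this byte be part of a Japanese character?"""
    return (
        in_ranges(byte, SJIS_LEAD)
        or in_ranges(byte, SJIS_TRAIL)
        or in_ranges(byte, SJIS_HALFWIDTH_KATAKANA)
    )


# ASCII bytes (0x00 - 0x7F) that can also occur inside a Japanese character.
ascii_overlap = [b for b in range(0x80) if can_appear_in_sjis_character(b)]

# A concrete instance of the overlap: the second byte of "ソ" is 0x5C.
so_bytes = "ソ".encode("shift_jis")

print(hex(ascii_overlap[0]), hex(ascii_overlap[-1]))
print(can_appear_in_sjis_character(CR), can_appear_in_sjis_character(LF))
```

The overlap is confined to trail bytes in 0x40 - 0x7E, well above the control-character range.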

Again, there is no overlap with CR and LF.
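
Putting this together, a byte-by-byte normalization could look like the sketch below. It is written in Python purely for illustration (the asker's actual Perl code is not shown), the function name and sample text are assumptions, and the target form is assumed to be a bare LF.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR to LF, operating purely on bytes."""
    # Replace CRLF first so a CRLF pair does not turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Python codec names for the four encodings discussed above.
ENCODINGS = ["shift_jis", "euc_jp", "utf-8", "iso2022_jp"]

# Sample text mixing CRLF, lone CR, and LF line endings.
text = "日本語\r\nテスト\rおわり\n"
expected = "日本語\nテスト\nおわり\n"

# For each encoding: encode, normalize the raw bytes without decoding,
# then decode and compare against the expected text.
results = {}
for encoding in ENCODINGS:
    raw = text.encode(encoding)
    results[encoding] = normalize_newlines(raw).decode(encoding) == expected

print(results)
```

The round trip succeeds without the normalizer ever knowing which encoding it was given. This applies only to the four encodings discussed here; it is not meant for encodings such as UTF-16, where 0x0A or 0x0D can occur as one half of a 16-bit code unit.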
