Question
I have some Perl code that translates carriage returns and line feeds to a normalized form. The input text is Japanese, so there will be multi-byte characters.
Is it still possible to do this transformation on a byte-by-byte basis (which I think it currently does), or do I have to detect the character set and enable Unicode support? In other words, do the popular encodings (Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) use bytes inside their multi-byte characters that could be mistaken for ASCII control characters?
I need only CR and LF to work.
Update: Added ISO-2022-JP. And that is the one that looks the most troublesome with its funky escape sequences ...
Recommended answer
None of the four encodings that you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) uses the CR (0x0D) or LF (0x0A) byte inside Japanese characters. For UTF-8 and EUC-JP, there is no overlap whatsoever between ASCII-range bytes and the bytes inside Japanese characters. For Shift-JIS and ISO-2022-JP there is overlap with the ASCII range, but not in the range where you find CR and LF.
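As a rough illustration of what this means in practice, here is a minimal sketch in Python (not the asker's Perl) that normalizes line endings purely at the byte level and then checks that the result still decodes correctly in each of the four encodings. The names `normalize_newlines`, `survives_round_trip`, `ENCODINGS`, and `SAMPLE` are made up for this example, and it assumes the input really is in one of the encodings listed above.

```python
import re

# CRLF is listed first so a CR+LF pair becomes one newline, not two.
_NEWLINE = re.compile(rb"\r\n|\r|\n")

# Hypothetical sample data and codec names (Python's spellings of the encodings).
ENCODINGS = ("shift_jis", "euc_jp", "utf-8", "iso2022_jp")
SAMPLE = "日本語のテキスト\r\nカタカナ\rひらがな\n終わり"


def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Replace every CRLF, lone CR, or lone LF in raw bytes with `newline`."""
    # A callable replacement avoids any backslash processing of `newline`.
    return _NEWLINE.sub(lambda _match: newline, data)


def survives_round_trip(text: str, encoding: str) -> bool:
    """Compare byte-level normalization against character-level normalization."""
    raw = text.encode(encoding)
    expected = re.sub(r"\r\n|\r|\n", "\n", text)
    return normalize_newlines(raw).decode(encoding) == expected


if __name__ == "__main__":
    for enc in ENCODINGS:
        print(enc, survives_round_trip(SAMPLE, enc))
```

The same idea should carry over to a Perl substitution applied to undecoded byte strings, since the check depends only on the byte values 0x0D and 0x0A.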
For ISO-2022-JP,
First-byte range: 0x21 - 0x7E
Second-byte range: 0x21 - 0x7E
And the escape sequence characters to switch back and forth between various character sets are:
0x1B, 0x28, 0x24, 0x40, 0x42, and 0x4A
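To make the ISO-2022-JP layout concrete, the following small Python sketch encodes a string and inspects the resulting bytes. It relies on Python's built-in `iso2022_jp` codec; `encode_jis` and `only_expected_bytes` are hypothetical helper names used only here.

```python
ESC, CR, LF = 0x1B, 0x0D, 0x0A


def encode_jis(text: str) -> bytes:
    """Encode text with Python's ISO-2022-JP codec."""
    return text.encode("iso2022_jp")


def only_expected_bytes(data: bytes) -> bool:
    """True if every byte is ESC, CR, LF, or falls in 0x21-0x7E."""
    return all(b in (ESC, CR, LF) or 0x21 <= b <= 0x7E for b in data)


if __name__ == "__main__":
    data = encode_jis("日本語\r\n日本語")
    # The codec switches to the two-byte set with ESC $ B and back to
    # ASCII with ESC ( B before it writes the CR and LF bytes.
    print(data)
    print(only_expected_bytes(data))
```

Because the stream returns to ASCII before any control character is written, a 0x0D or 0x0A byte in the output always stands for a real line break.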
For Shift-JIS,
First-byte range: 0x81 - 0x9F, 0xE0 - 0xEF
Second-byte range: 0x40 - 0x7E, 0x80 - 0xFC
Half-width katakana: 0xA1 - 0xDF
Again, there is no overlap with CR and LF.
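The Shift-JIS ranges listed above can be checked mechanically. This is a minimal Python sketch that takes those ranges at face value (vendor extensions to Shift-JIS are not considered); `LEAD`, `TRAIL`, `KANA`, `ranges_contain_newline`, and `ascii_overlap` are illustrative names.

```python
CR, LF = 0x0D, 0x0A

# Byte ranges copied from the answer (upper bounds are inclusive there).
LEAD = list(range(0x81, 0xA0)) + list(range(0xE0, 0xF0))
TRAIL = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))
KANA = list(range(0xA1, 0xE0))


def ranges_contain_newline() -> bool:
    """True if CR or LF can appear as part of a Shift-JIS character."""
    return any(b in (CR, LF) for b in LEAD + TRAIL + KANA)


def ascii_overlap() -> list:
    """Trail bytes that collide with 7-bit ASCII values."""
    return [b for b in TRAIL if b < 0x80]


if __name__ == "__main__":
    print(ranges_contain_newline())
    print([hex(b) for b in ascii_overlap()][:5])
    # A well-known collision: the second byte of this katakana is 0x5C,
    # the ASCII backslash, which is why other byte-wise edits can be risky.
    print("ソ".encode("shift_jis"))
```

So the overlap with ASCII is real (0x40 to 0x7E in the second byte), but it starts well above the control-character range, which is why CR and LF handling stays safe while, for example, byte-wise backslash handling would not.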