Detecting UTF-8 double-byte characters

Problem description

I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has this content:

Gȯrecki   John
12345678901234

The first name "John" starts at position 10 (zero-based). (Here the 2nd character is U+022F, read from a UTF-8 file.)
In code I need to do

LineRead.Substring(11,4)

to get "John", where it should be

LineRead.Substring(10,4)

with normal characters.

My question is of course: how can I detect that I need a Substring starting at 11 instead of 10 in this case?
I tried things like

If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then

but that is also true for "à", which counts as only 1 in String.Length but takes 2 bytes in UTF-8...

How do I handle common cases like this?
How do I prevent the bytes of one character from being split into several wrong characters, so that I can step through the string character by character and count them?
Thanks in advance!

http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
Thanks anyway!!

Solution
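No answer text survives on this page, so what follows is only a sketch built on the StringInfo class the asker linked to. It assumes the extra position comes from the file storing "ȯ" in decomposed form ("o" followed by the combining mark U+0307), which takes two Char values in a .NET string. The precomposed U+022F takes only one, so it would not shift the index. The byte-count comparison from the question cannot tell these cases apart, because it is true for every non-ASCII character. Comparing String.Length with StringInfo.LengthInTextElements is closer to what is being asked. The module name and the hard-coded sample line are illustrative, not from the original post.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text

Module TextElementDemo
    Sub Main()
        ' Assumption: "ȯ" is stored as "o" + U+0307 (COMBINING DOT ABOVE),
        ' so it occupies two Char values in the string.
        Dim lineRead As String = "Go" & ChrW(&H307) & "recki   John"

        ' If Length exceeds LengthInTextElements, the string contains
        ' combining sequences or surrogate pairs.
        Dim info As New StringInfo(lineRead)
        Console.WriteLine(lineRead.Length)             ' 15 UTF-16 code units
        Console.WriteLine(info.LengthInTextElements)   ' 14 text elements

        ' Option 1: index by text element instead of by Char.
        Console.WriteLine(info.SubstringByTextElements(10, 4))   ' John

        ' Option 2: compose to Normalization Form C first, then use Substring as usual.
        Dim composed As String = lineRead.Normalize(NormalizationForm.FormC)
        Console.WriteLine(composed.Substring(10, 4))             ' John
    End Sub
End Module
```

Option 2 only helps when a precomposed code point exists for the sequence. Option 1 also covers sequences with no composed form, at the cost of going through StringInfo for every fixed-width field.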




07-20 23:41