Problem description
StringInfo documentation: http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
Thanks anyway!!
I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has content:
Gȯrecki John 12345678901234
The first name "John" starts at the 10th position. (Here the 2nd character is U+022F, encoded as UTF-8.)
In code I need to do
LineRead.Substring(11,4)
to get "John", where it should be
LineRead.Substring(10,4)
with normal characters.
My question is of course how to detect that I need to do a Substring of 11 instead of 10, in this case?
I tried things like
If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then
but that's also the case for "à", which counts as only 1 in String.Length but has 2 bytes in UTF-8...
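To illustrate why that check cannot tell the two situations apart, here is a minimal sketch. The sample strings are assumptions, not the real file contents, and `BytesDifferFromChars` is a hypothetical helper that mirrors the condition above. It compares a precomposed U+022F, an "o" followed by the combining mark U+0307 (one possible, unconfirmed explanation for the shifted index), and "à".

```vb
Imports System
Imports System.Text

Module ByteCountDemo
    ' Mirrors the check from the question: True when the UTF-8 byte count
    ' differs from String.Length (the number of UTF-16 Char values).
    Function BytesDifferFromChars(ByVal s As String) As Boolean
        Return Encoding.UTF8.GetByteCount(s) <> s.Length
    End Function

    Sub Main()
        ' Assumed sample data, not taken from the original file.
        Dim precomposed As String = "G" & ChrW(&H22F) & "recki"  ' U+022F as a single Char
        Dim decomposed As String = "Go" & ChrW(&H307) & "recki"  ' "o" + combining dot above
        Dim aGrave As String = ChrW(&HE0)                        ' "à" as a single Char

        For Each s As String In New String() {precomposed, decomposed, aGrave}
            Console.WriteLine("Length={0}, UTF-8 bytes={1}, check fires={2}",
                              s.Length, Encoding.UTF8.GetByteCount(s), BytesDifferFromChars(s))
        Next
    End Sub
End Module
```

The check fires for all three strings, because every non-ASCII character takes more than one byte in UTF-8. Only the decomposed string has a String.Length larger than the number of visible characters, so the byte count by itself says nothing about whether a Substring index needs to move.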
How should common cases like this be handled?
How can I avoid splitting one character into several wrong characters, so that I can step through the string character by character and count them?
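One way to step through a string by visible character rather than by Char is System.Globalization.StringInfo, which groups a base character with its combining marks into one "text element". The sketch below is only an illustration under assumptions: the sample line is made up (an "o" followed by U+0307), so its offsets differ from the 10 and 11 in the question, and `SubstringByElements` is a hypothetical helper name.

```vb
Imports System
Imports System.Globalization

Module TextElementDemo
    ' Hypothetical helper: like Substring, but startIndex and length are measured
    ' in text elements (base character plus combining marks) instead of Chars.
    Function SubstringByElements(ByVal text As String, ByVal startIndex As Integer, ByVal length As Integer) As String
        Return New StringInfo(text).SubstringByTextElements(startIndex, length)
    End Function

    Sub Main()
        ' Assumed sample: "o" followed by U+0307 (combining dot above).
        Dim lineRead As String = "Go" & ChrW(&H307) & "recki John"

        ' Walk the string one text element at a time and count the elements.
        Dim elements As TextElementEnumerator = StringInfo.GetTextElementEnumerator(lineRead)
        Dim elementCount As Integer = 0
        While elements.MoveNext()
            elementCount += 1
        End While

        Console.WriteLine("Chars: {0}, text elements: {1}", lineRead.Length, elementCount)
        Console.WriteLine(lineRead.Substring(9, 4))             ' Char-based index
        Console.WriteLine(SubstringByElements(lineRead, 8, 4))  ' element-based index
    End Sub
End Module
```

With this sample, the string holds 13 Chars but 12 text elements, and both calls return "John": the Char-based index has to be one higher than the element-based one. If the mismatch really does come from decomposed characters, calling LineRead.Normalize(System.Text.NormalizationForm.FormC) before Substring may be another option, though it only helps for characters that have a precomposed form.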
Thanks in advance!