Unknown characters in UTF-8 encoded text

Problem description

I have a file which contains some data. That data is encoded in UTF-8 (without a BOM).

Those bytes are usually no problem to handle. Yet now in that file there is a byte sequence whose meaning I don't know (nor could I find any information about it).

To examine the data I opened the file in a hex editor. There were UTF-8 byte sequences which were pretty normal (C3 BC for ü, C3 B6 for ö, etc.).

Yet then there was the following sequence, and I don't know how to get from it to the expected character:

C3 83 EF BF BF

From the context I can gather that it should represent the character ü. Yet I have no idea how you could possibly arrive at that sequence...


Example of how this looks in the file (hex view):

54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 BC 20 0D 0A 54 65 73 74
20 77 69 74 68 20 63 68 61 72 20 22 6F 65 22 20
2D 3E 20 C3 B6 0A 0D 0A 4E 6F 77 20 74 68 61 74
20 73 74 72 61 6E 67 65 20 73 65 71 75 65 6E 63
65 3A 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3
BC





Actual text (UTF-8):

Test with char "ue" -> ü
Test with char "oe" -> ö

Now that strange sequence: Ã it should probably represent the char ü
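
As a cross-check of the two views, the bytes from the hex view can be decoded and the non-ASCII characters listed code point by code point. This is only a minimal Python sketch; the names HEX_VIEW, raw, text and non_ascii are illustrative and not part of the original file.

```python
import unicodedata

# The hex view from the question, copied as-is.
HEX_VIEW = """
54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 BC 20 0D 0A 54 65 73 74
20 77 69 74 68 20 63 68 61 72 20 22 6F 65 22 20
2D 3E 20 C3 B6 0A 0D 0A 4E 6F 77 20 74 68 61 74
20 73 74 72 61 6E 67 65 20 73 65 71 75 65 6E 63
65 3A 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3
BC
"""

# Drop all whitespace before parsing so line breaks do not matter.
raw = bytes.fromhex("".join(HEX_VIEW.split()))

# Strict decoding: this would raise UnicodeDecodeError on malformed UTF-8.
text = raw.decode("utf-8")

# Collect every non-ASCII character with its code point and Unicode name.
# U+FFFF has no character name, so a fallback string is supplied.
non_ascii = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
    for ch in text
    if ord(ch) > 0x7F
]

for code_point, name in non_ascii:
    print(code_point, name)
```

The byte stream is well-formed UTF-8, so decoding succeeds; the odd sequence simply decodes to two code points, U+00C3 followed by U+FFFF.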



(Well, it looks like CP won't let me display the decoded value of EF BF BF ;) )

I've highlighted the corresponding sections in the hex view and in the text view.

Now the question:

What should C3 83 EF BF BF represent? I suppose C3 83 translates fine to Ã, but what is EF BF BF? The only thing I found was that if you convert the character 0xFFFF to UTF-8, EF BF BF is the byte sequence that you get. But still: what exactly should it represent?
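
The observation about 0xFFFF can be checked directly. In addition, one well-known way to end up with a stray C3 83 is text that was encoded as UTF-8, misread as Latin-1, and then encoded as UTF-8 again. The Python sketch below shows both. The double-encoding route is only a hypothesis for the first two bytes (nothing in the file confirms it), and the helper hex_str is a made-up name for illustration.

```python
def hex_str(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# U+FFFF is a Unicode noncharacter; UTF-8 still has a three-byte form for it.
ffff_bytes = "\uffff".encode("utf-8")

# The expected character, U+00FC (ü), correctly encoded as UTF-8.
u_umlaut = "\u00fc".encode("utf-8")

# Hypothetical mistake: the two UTF-8 bytes are misread as Latin-1,
# which yields the two characters U+00C3 (Ã) and U+00BC (¼) ...
misread = u_umlaut.decode("latin-1")

# ... and that two-character string is then encoded as UTF-8 again.
double_encoded = misread.encode("utf-8")

print(hex_str(ffff_bytes))      # EF BF BF
print(hex_str(u_umlaut))        # C3 BC
print(hex_str(double_encoded))  # C3 83 C2 BC
```

Plain double encoding would give C3 83 C2 BC. The file has EF BF BF where C2 BC would be, so under this hypothesis the second character (U+00BC) was additionally replaced by U+FFFF somewhere along the way; the sketch does not explain that step.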

Recommended answer



08-22 21:19