Problem description
I am trying to detect the encoding of a given string so that I can later convert it to UTF-8 using iconv. I want to restrict the set of source encodings to UTF-8, ISO-8859-1, Windows-1251, and CP437.
//...
$acceptedEncodings = array(
    'utf-8',
    'iso-8859-1',
    'windows-1251'
);
$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);
if ($srcEncoding)
{
    $content = iconv($srcEncoding, 'UTF-8', $content);
}
//...
The problem is that mb_detect_encoding does not seem to accept CP437 as a supported encoding. When I give it a CP437-encoded string, the string is classified as ISO-8859-1, which causes iconv to ignore characters like ü.
My question is: is there a way to detect CP437 encoding beforehand? The conversion from CP437 to UTF-8 using iconv works fine, but I just cannot find a proper way to detect CP437.
Thank you very much.
Recommended answer
As has been discussed countless times before: it is fundamentally impossible to distinguish any single-byte encoding from any other single-byte encoding. What you get is a bunch of bytes. In encoding A the byte x42 may map to character X, and in encoding B the same byte may map to character Y. But nothing about the blob of bytes you have tells you which one applies, because you only have the bytes. They can mean anything. They are equally valid in all encodings. It is possible to identify more complex multi-byte encodings like UTF-8, since they have to follow stricter internal rules. So it is possible to say definitively "This is not valid UTF-8". However, it is impossible to say with 100% certainty "This is definitely UTF-8, not ISO-8859".
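To make the point concrete, here is a minimal sketch. It assumes the local iconv build knows the names CP437, ISO-8859-1 and WINDOWS-1251 (not every build does) and that the mbstring extension is loaded; the helper name readings() is made up for this illustration. It decodes one and the same byte under several single-byte encodings, then shows that UTF-8, unlike those, can at least be checked for well-formedness.

```php
<?php
/**
 * Hypothetical helper: decode the same bytes under several encodings.
 * Returns an array of encoding name => UTF-8 string (false if iconv fails).
 */
function readings(string $bytes, array $encodings): array
{
    $result = [];
    foreach ($encodings as $encoding) {
        // "@" silences the notice iconv emits for names it does not know.
        $result[$encoding] = @iconv($encoding, 'UTF-8', $bytes);
    }
    return $result;
}

// One byte, three readings. Nothing in the byte itself says which is "right".
foreach (readings("\x81", ['CP437', 'ISO-8859-1', 'WINDOWS-1251']) as $name => $utf8) {
    printf("%-12s -> %s\n", $name, $utf8 === false ? '(not supported)' : bin2hex($utf8));
}

// UTF-8 has structural rules, so malformed input can be rejected.
var_dump(mb_check_encoding("\xC3\xBC", 'UTF-8')); // a well-formed two-byte sequence
var_dump(mb_check_encoding("\x81", 'UTF-8'));     // a lone continuation byte
```

Each single-byte reading succeeds and simply yields a different character, which is why mb_detect_encoding has nothing to go on. The UTF-8 check can only ever rule UTF-8 out; passing it does not prove the bytes were meant as UTF-8.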
You need metadata about the content you receive that tells you what encoding the content is in. It is not practical to identify the encoding after the fact. To do so, you would need to employ actual content analysis to figure out in which encoding a piece of text makes the most sense.
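If no metadata is available, content analysis could look roughly like the following. This is only a sketch under stated assumptions, not a real detector: guessEncoding() is a hypothetical helper, the scoring rule (share of characters that are letters, digits, punctuation or whitespace) is an arbitrary choice made for this example, and the candidate names must be ones the local iconv build accepts.

```php
<?php
/**
 * Hypothetical helper: guess which candidate encoding makes the bytes look
 * most like text. A heuristic only; it can and will guess wrong.
 */
function guessEncoding(string $bytes, array $singleByteCandidates): ?string
{
    // Well-formed UTF-8 is strong (not conclusive) evidence, so test it first.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    $best = null;
    $bestScore = -1.0;
    foreach ($singleByteCandidates as $encoding) {
        $utf8 = @iconv($encoding, 'UTF-8', $bytes);
        if ($utf8 === false || $utf8 === '') {
            continue; // iconv could not decode the input with this name
        }
        // Assumed scoring rule: fraction of "text-like" characters.
        $textual = preg_match_all('/[\p{L}\p{N}\p{P}\s]/u', $utf8);
        $score = $textual / mb_strlen($utf8, 'UTF-8');
        if ($score > $bestScore) { // strict ">" keeps the earlier candidate on ties
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

// "für" as CP437 bytes: 0x81 is a letter in CP437 but a control code in ISO-8859-1.
echo guessEncoding("f\x81r", ['ISO-8859-1', 'CP437']), "\n";
```

Ties are resolved in favour of the candidate listed first, so the caller should order candidates by how likely they are for the data source. The same bytes score equally well as Windows-1251, where 0x81 is also a letter, and CP437 box-drawing characters are not counted as text-like at all, so this rule only narrows the guess. It does not replace knowing the encoding.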