Is there a list of encodings that extend ASCII?

Problem description

I need to decide when (not) to convert a text file, based on the known file encoding and the desired output encoding.

If the text is US-ASCII, I don't need to convert it when the output encoding is ASCII, UTF-8, Latin1, ... Obviously, I do need to convert a US-ASCII file to UTF-16 or UTF-32.

A list of standard encodings exists at http://www.iana.org/assignments/character-sets/character-sets.xml

A conversion is necessary if:

the minimal character size is > 1 byte, or
the first 128 code points (0x00 - 0x7F) are not the same as US-ASCII.

I'd like to know: Is there a similar list with details (byte length, ASCII compatibility) about the implementation of each encoding?

I'd be happy with a list containing only the codecs supported by Qt5.

EDIT

I already found an answer to the question "Are all 8-bit or variable-width 8-bit-based codecs a superset of ASCII?" (in other words: can US-ASCII be interpreted as any 8-bit or variable-width 8-bit-based encoding?) here: "Character set that is not a superset of ASCII".

Instead, it would be helpful to know: Is there a list of character sets which
are supersets of ASCII?

This looks promising: "mime.charsets - list of character sets which are ASCII supersets", but I couldn't find an actual mime.charsets file.

Solution

An alternative approach is to decode the bytes 0x00 - 0x7F in the given encoding and check that the characters match ASCII. For example, in Python 3.x:

```python
def is_ascii_superset(encoding):
    # Decode each single byte 0x00-0x7F and compare it with the ASCII character.
    for codepoint in range(128):
        if bytes([codepoint]).decode(encoding, 'ignore') != chr(codepoint):
            return False
    return True
```

This gives:

```
>>> is_ascii_superset('US-ASCII')
True
>>> is_ascii_superset('windows-1252')
True
>>> is_ascii_superset('ISO-8859-15')
True
>>> is_ascii_superset('UTF-8')
True
>>> is_ascii_superset('UTF-16')
False
>>> is_ascii_superset('IBM500')  # a variant of EBCDIC
False
```

EDIT: Get US-ASCII compatibility for each encoding supported by your Qt version in C++:

```cpp
#include <QByteArray>
#include <QList>
#include <QMap>
#include <QString>
#include <QTextCodec>
#include <QtDebug>

typedef enum
{
    eQtCodecUndefined,
    eQtCodecAsciiIncompatible,
    eQtCodecAsciiCompatible,
} tQtCodecType;

QMap<QByteArray, tQtCodecType> QtCodecTypes()
{
    QMap<QByteArray, tQtCodecType> CodecTypes;

    // How to test Qt's interpretation of ASCII data?
    QList<QByteArray> available = QTextCodec::availableCodecs();

    // UTF-8 is the reference because Qt has no US-ASCII codec, but we only
    // test bytes 0-127 and UTF-8 is a superset of US-ASCII.
    QTextCodec *referenceCodec = QTextCodec::codecForName("UTF-8");
    if (referenceCodec == nullptr)
    {
        qDebug("Unable to get reference codec 'UTF-8'");
        return CodecTypes;
    }

    for (int i = 0; i < available.count(); i++)
    {
        const QByteArray name = available.at(i);
        QTextCodec *currCodec = QTextCodec::codecForName(name);
        if (currCodec == nullptr)
        {
            qDebug("Unable to get codec for '%s'", name.constData());
            CodecTypes.insert(name, eQtCodecUndefined);
            continue;
        }

        tQtCodecType type = eQtCodecAsciiCompatible;
        for (uchar j = 0; j < 128; j++) // UTF-8 == US-ASCII in the lower 7 bits
        {
            const char c = (char)j; // character to test, always < 2^7
            // Convert the byte to UTF-16 (QString's internal form), once
            // assuming ASCII (via UTF-8) and once assuming [currCodec].
            const QString sRef = referenceCodec->toUnicode(&c, 1);
            const QString sTest = currCodec->toUnicode(&c, 1);
            // If both UTF-16 representations are equal for every byte,
            // the codecs are interchangeable for ASCII data in Qt.
            if (sRef != sTest)
            {
                type = eQtCodecAsciiIncompatible;
                break;
            }
        }
        CodecTypes.insert(name, type);
    }
    return CodecTypes;
}
```
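The decision the question actually asks for (does a pure US-ASCII file have to be converted for a given output encoding?) can also be checked from the encoding side. The following is only a minimal sketch, not part of the original answer: the function name `ascii_needs_conversion` is hypothetical, and it assumes the target encoding is known to Python's codec registry (an unknown name raises `LookupError`). It encodes all 128 ASCII characters into the target encoding and reports whether the resulting bytes differ from the original ASCII bytes.

```python
def ascii_needs_conversion(target_encoding: str) -> bool:
    """Return True if pure US-ASCII data must be converted for target_encoding.

    Hypothetical helper: re-encodes the 128 ASCII characters and compares
    the result byte-for-byte with the original ASCII bytes.
    """
    ascii_bytes = bytes(range(128))      # every US-ASCII byte, 0x00-0x7F
    text = ascii_bytes.decode("ascii")   # the same data as a str
    try:
        encoded = text.encode(target_encoding)
    except UnicodeEncodeError:
        # The target cannot even represent some ASCII character.
        return True
    # Any difference (wider code units, a BOM, remapped bytes) means
    # the file cannot be passed through unchanged.
    return encoded != ascii_bytes


if __name__ == "__main__":
    for name in ("ascii", "utf-8", "latin-1", "utf-16", "utf-32", "cp500"):
        print(name, ascii_needs_conversion(name))
```

Because this compares whole byte sequences, it also catches encodings whose encoder prepends a byte order mark, which a byte-by-byte decode test does not look at. It says nothing about files that contain non-ASCII characters.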
08-14 19:39