Is it safe to use `strstr` to search for multibyte UTF-8 characters in a string?

Problem description

Following on from my previous question: I should clarify that I do not want to use the wchar_t type, and that the strings I handle are UTF-8 encoded (I am aware this choice can be debated, but that debate is irrelevant here).

Recommended answer

No, strstr is not suitable in general for strings containing multi-byte characters in an arbitrary encoding.

If you search for a string that contains no multi-byte characters inside a string that does contain multi-byte characters, strstr may give a false positive. For example, with Shift-JIS encoding in a Japanese locale, strstr("掘something", "@some") may give a false positive.

+---------+----+----+----+
|   c1    | c2 | c3 | c4 |  <--- string
+---------+----+----+----+

     +----+----+----+
     | c5 | c2 | c3 |  <--- string to search
     +----+----+----+

If the trailing part of c1 (accidentally) matches c5, you may get an incorrect result. I would suggest using Unicode with a Unicode substring function, or a multibyte substring function (_mbsstr, for example).
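The Shift-JIS case can be reproduced at the byte level. The sketch below is only an illustration: it assumes 掘 is encoded in Shift-JIS as the two bytes 0x8C 0x40 (a value taken from the Shift-JIS code table, not from the answer itself), and the names `sjis_haystack` and `find_offset` are hypothetical. In the diagram's terms, c1 is 掘 and c5 is `@`, whose ASCII code is also 0x40.

```c
#include <string.h>

/* "掘something" in Shift-JIS, written as raw bytes so that the encoding
   of this source file does not matter. Assumption: 掘 is 0x8C 0x40. */
const char sjis_haystack[] = "\x8C\x40" "something";

/* Hypothetical helper: byte offset of needle in haystack, or -1 if
   strstr reports no match. */
long find_offset(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

Under that assumption, `find_offset(sjis_haystack, "@some")` reports a match at byte offset 1, that is, in the middle of 掘, even though the text contains no `@` character.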

Edit
Based on the OP's updated question, "can such a false positive exist in a UTF-8 context?": no. UTF-8 is designed so that it is immune to the partial character match shown above: ASCII bytes, lead bytes, and continuation bytes occupy disjoint ranges, so a valid UTF-8 needle can only match at a character boundary of a valid UTF-8 haystack. It is therefore completely safe to use strstr with UTF-8 encoded multi-byte characters.
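As a rough check of this property, the following sketch runs the same search on the UTF-8 encoding of 掘 (U+6398, bytes 0xE6 0x8E 0x98). The names `utf8_haystack`, `utf8_find_offset`, and `utf8_is_boundary` are hypothetical, and both strings are assumed to be valid UTF-8.

```c
#include <stdbool.h>
#include <string.h>

/* "掘something" in UTF-8, written as raw bytes. */
const char utf8_haystack[] = "\xE6\x8E\x98" "something";

/* Hypothetical helper: byte offset of needle in haystack, or -1 if
   strstr reports no match. */
long utf8_find_offset(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}

/* True if p points at the first byte of a UTF-8 sequence (or at the
   terminating NUL), i.e. the byte is not a continuation byte 10xxxxxx. */
bool utf8_is_boundary(const char *p)
{
    return ((unsigned char)*p & 0xC0) != 0x80;
}
```

Here `utf8_find_offset(utf8_haystack, "@some")` finds nothing, because every byte of a multi-byte UTF-8 sequence has its high bit set and so cannot equal an ASCII byte. A search for `"some"` matches at byte offset 3, which `utf8_is_boundary` confirms is the start of a character.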


08-22 21:38