Problem description
People search on my website, and some of the searches look like this:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question: is there any way to detect strings similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution is welcome :)
Edit: I mean "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", or "iwepr wepr ow" in my search engine, and I want to detect those gibberish searches.
It doesn't matter whether the search returns 0 results or anything else; I can't use that logic.
Some new brands or products would be ignored if I only considered "regular words".
Thanks for your help.
Recommended answer
You could build a model of character-to-character transitions from a bunch of English text. For example, you find out how common it is for an 'h' to follow a 't' (pretty common). In English, you expect a 'u' after a 'q'. A 'q' followed by something other than a 'u' happens with very low probability, so it should be pretty alarming. Normalize the counts in your table so that each entry is a probability. Then, for a query, walk through the matrix and compute the product of the probabilities of the transitions you take, and normalize by the length of the query. When that number is low, you likely have a gibberish query (or something in a different language).
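A minimal sketch of the idea above, not the code from the linked repository. The names `train` and `avg_transition_prob`, the 27-character alphabet, the add-one smoothing, and the toy corpus are all assumptions made for illustration. It sums log probabilities instead of multiplying raw probabilities (same result, but it avoids underflow on long queries) and normalizes by the number of transitions, which gives the geometric mean of the transition probabilities.

```python
import math

# Assumption: lowercase letters plus space; every other character is dropped.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and keep only characters in ALPHABET."""
    return "".join(c for c in text.lower() if c in ALPHABET)


def train(corpus_lines, smoothing=1.0):
    """Count character-to-character transitions and turn them into log probabilities.

    `smoothing` is an assumed add-one style prior so that unseen transitions
    get a small non-zero probability instead of zero.
    """
    counts = {a: {b: smoothing for b in ALPHABET} for a in ALPHABET}
    for line in corpus_lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())  # normalize each row so it sums to 1
        log_probs[a] = {b: math.log(n / total) for b, n in row.items()}
    return log_probs


def avg_transition_prob(text, log_probs):
    """Geometric mean of the transition probabilities along the text."""
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0  # nothing to score (empty or single-character input)
    total = sum(log_probs[a][b] for a, b in pairs)
    return math.exp(total / len(pairs))


# Toy corpus for demonstration only; a real model needs far more text.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "this is a simple english sentence",
    "search for shoes and shirts",
] * 50
model = train(corpus)
for query in ["this is the", "qzxkjv"]:
    print(query, avg_transition_prob(query, model))
```

A query would be flagged as gibberish when its score falls below some threshold. The threshold is not something this sketch can fix in advance; you would pick it by scoring a sample of known-good and known-bad queries from your own logs.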
If you have a bunch of query logs, you might first build a model of general English text, and then heavily weight your own queries in the model training phase.
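One way this weighting could look, as a rough sketch under stated assumptions: each training source contributes to the same transition table, but every transition seen in your own query log counts several times. The function names `weighted_counts` and `to_log_probs`, the weight of 5.0, and the sample lines are all hypothetical; the right weight depends on how much log data you have relative to the general text.

```python
import math

# Assumption: same 27-character alphabet as a plain English model.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def weighted_counts(sources, smoothing=1.0):
    """Build a transition count table from (lines, weight) pairs.

    A transition seen in a source with weight w adds w to the count,
    so a heavily weighted query log dominates the general text.
    """
    counts = {a: {b: smoothing for b in ALPHABET} for a in ALPHABET}
    for lines, weight in sources:
        for line in lines:
            chars = [c for c in line.lower() if c in ALPHABET]
            for a, b in zip(chars, chars[1:]):
                counts[a][b] += weight
    return counts


def to_log_probs(counts):
    """Normalize each row of the count table into log probabilities."""
    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(n / total) for b, n in row.items()}
    return log_probs


# Hypothetical data: general English text plus queries from your own logs.
general_text = ["the quick brown fox jumps over the lazy dog"]
query_log = ["xbox controller", "ikea kallax shelf"]

counts = weighted_counts([(general_text, 1.0), (query_log, 5.0)])
model = to_log_probs(counts)
```

Because brand and product names from real queries feed into the table, transitions that are rare in ordinary English but common on your site become less surprising, which speaks to the concern about new brands being ignored.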
For background, read about Markov chains.
Edit: I implemented this in Python here:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
Sample output (True means the input looks like real text, False means gibberish):
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True