How to do fuzzy string matching over a bigger-than-memory dictionary in an ordered key-value store?

Problem description

I am looking for an algorithm and storage schema to do string matching over a bigger-than-memory dictionary.

My initial attempt, inspired by https://swtch.com/~rsc/regexp/regexp4.html, was to store trigrams of every word of the dictionary. For instance, the word apple is split into $ap, app, ppl, ple and le$ at index time. All of those trigrams are associated with the word they came from.

Then, at query time, I do the same for the input string that must be matched. I look up every one of those trigrams in the database and store the candidate words in a mapping associated with the number of matching trigrams. Then, I proceed to compute the Levenshtein distance between the query and every candidate and apply the following formula:

score(query, candidate) = common_trigram_number(query, candidate) - abs(levenshtein(query, candidate))
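
In code, a minimal sketch of that scheme looks like the following; a plain dict stands in for the ordered key-value store, and trigrams, build_index, levenshtein and search are illustrative names only:

# Minimal sketch of the trigram scheme described above.  A plain dict
# stands in for the ordered key-value store and the helper names are
# just for illustration.
from collections import defaultdict


def trigrams(word):
    """All trigrams of the word padded with '$' on both sides."""
    padded = "$" + word + "$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(dictionary):
    """Map every trigram to the set of words it appears in."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def search(index, query):
    """Count shared trigrams per candidate, then apply the score formula."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for word in index.get(gram, ()):
            counts[word] += 1
    scored = [(common - abs(levenshtein(query, word)), word)
              for word, common in counts.items()]
    return sorted(scored, reverse=True)


index = build_index(["apple", "apply", "ample", "maple"])
print(search(index, "aple"))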

There are two problems with this approach: first, the candidate selection is too broad; second, the Levenshtein distance is too slow to compute.

Fixing the first could make optimizing the second unnecessary.

I thought about another approach: at index time, instead of storing trigrams, I would store words (possibly associated with their frequency). At query time, I could look up successive prefixes of the query string and score candidates using Levenshtein distance and frequency.
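
A rough sketch of that prefix idea, with a sorted list and bisect emulating prefix range scans over an ordered key-value store; FREQUENCIES, candidates_by_prefix and rank are made-up names for illustration:

# Rough sketch of the prefix idea.  A sorted list plus bisect emulates
# prefix range scans over an ordered key-value store.
import bisect

FREQUENCIES = {"france": 500, "franche": 20, "frane": 1, "frame": 80}
WORDS = sorted(FREQUENCIES)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def candidates_by_prefix(prefix):
    """Range scan: yield every word starting with prefix, in key order."""
    start = bisect.bisect_left(WORDS, prefix)
    for word in WORDS[start:]:
        if not word.startswith(prefix):
            break
        yield word


def rank(query, limit=10):
    """Try successive prefixes of the query, longest first, and score
    candidates by edit distance, then by frequency."""
    seen = {}
    for size in range(len(query), 0, -1):
        for word in candidates_by_prefix(query[:size]):
            if word not in seen:
                seen[word] = (levenshtein(query, word), -FREQUENCIES[word])
        if len(seen) >= limit:
            break
    return sorted(seen, key=seen.get)[:limit]


print(rank("frnace"))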

In particular, I am not looking for an algorithm that gives me strings at a distance of 1, 2, etc. I would just like to have a paginated list of more-or-less relevant words from the dictionary. The actual selection is made by the user.

Also, it must be possible to represent it in terms of an ordered key-value store like RocksDB or WiredTiger.

Recommended answer

simhash captures similarity between (small) strings. But it does not really solve the problem of querying for the most similar strings in a bigger-than-RAM dataset. I think the original paper recommends indexing some permutations; that requires a lot of memory and does not take advantage of the ordered nature of an OKVS.

I think I found a hash that captures similarity inside the prefix of the hash:

In [1]: import fuzz

In [2]: hello = fuzz.bbkh("hello")

In [3]: helo = fuzz.bbkh("helo")

In [4]: hellooo = fuzz.bbkh("hellooo")

In [5]: salut = fuzz.bbkh("salut")

In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213

In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12

In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0

After a small test over Wikidata labels, it seems to give good results:

$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france    0
** farrance      -2
** freande   -2
** defrance      -2

real    0m0.054s

$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace   -1
** france    -2
** fernacre      -2

real    0m0.060s

$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium   -2

real    0m0.047s


$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium   0
** ajbelgium     -2

real    0m0.059s

$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium   -1
** beijum    -2

real    0m0.047s

Here is an implementation:

from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

# All possible bigrams over lowercase letters plus the padding character.
chars = ascii_lowercase + "$"
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])


def ngram(string, n):
    return [string[i:i+n] for i in range(len(string)-n+1)]


def integer2booleans(integer):
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def merkletree(booleans):
    """Fold the bit vector with a binary OR tree; coarser levels end up at
    the front of the output, so similar bit sets share a long prefix."""
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    return out


def bbkh(string):
    """Similarity-capturing hash: one-hot encode the padded string's
    bigrams, then fold the bits with the OR tree above."""
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash


def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
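
For completeness, here is one way the hash could be wired to an ordered key-value store. This is only a guess at the query side, not the code from typofix.py: a sorted Python list stands in for RocksDB/WiredTiger, index_words, query and the window parameter are made-up names, and it reuses bbkh, lcp and BBKH_LENGTH from above:

import bisect


def index_words(words):
    # Emulate the ordered key space: keys are the hash followed by the word.
    return sorted(bbkh(word) + word.encode() for word in words)


def query(keys, string, limit=10, window=64):
    # Seek to where the query's hash would land, look at the keys around it,
    # and rank candidates by the length of the shared hash prefix.
    target = bbkh(string)
    position = bisect.bisect_left(keys, target)
    neighborhood = keys[max(0, position - window):position + window]
    scored = []
    for key in neighborhood:
        word = key[BBKH_LENGTH:].decode()
        common = len(lcp(key[:BBKH_LENGTH].hex(), target.hex()))
        scored.append((-common, word))
    return [word for _, word in sorted(scored)[:limit]]


keys = index_words(["france", "belgium", "germany", "spain", "italy"])
print(query(keys, "beglium"))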

Evaluation: Using Wikipedia's list of common misspelled words, there are around 8k words. Considering the top 10 nearest words yields 77% success, with each query taking around 20 ms. Considering the top 100 yields 94% success, with each query taking less than 200 ms. Most mistakes come from joined words (e.g. "abouta" instead of "about a") or words separated by a dash.

Check out the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py

Note: computing a simhash of the input string instead only works with a bag of lemmas or stems; indeed, it finds similar documents.
