查找词干和词尾的组合

查找词干和词尾的组合

我有“stems”和“ends”(可能不是正确的词)的映射,如下所示:

all_endings = {
 'birth': set(['place', 'day', 'mark']),
 'snow': set(['plow', 'storm', 'flake', 'man']),
 'shoe': set(['lace', 'string', 'maker']),
 'lock': set(['down', 'up', 'smith']),
 'crack': set(['down', 'up',]),
 'arm': set(['chair']),
 'high': set(['chair']),
 'over': set(['charge']),
 'under': set(['charge']),
}

但是,当然要长得多。我还用另一种方法制作了相应的字典:
all_stems = {
 'chair': set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'up': set(['lock', 'crack', 'vote']),
 'down': set(['lock', 'crack', 'fall']),
 'smith': set(['lock']),
 'place': set(['birth']),
 'day': set(['birth']),
 'mark': set(['birth']),
 'plow': set(['snow']),
 'storm': set(['snow']),
 'flake': set(['snow']),
 'man': set(['snow']),
 'lace': set(['shoe']),
 'string': set(['shoe']),
 'maker': set(['shoe']),
}

我现在尝试提出一种算法,以找到两个或多个“词根”匹配两个或多个“结尾”的任何匹配项。例如,在上方,它会与锁和裂纹上下匹配,从而导致
lockdown
lockup
crackdown
crackup

但是不包括'upvote', 'downfall' or 'locksmith'(这是导致我最大的问题)。我得到如下的误报:
pancake
cupcake
cupboard

但是我只是在“循环”中转转。 (双关语意),我似乎一无所获。我会向正确的方向踢任何脚步。

到目前为止,代码困惑而无用,您可能应该忽略它们:
findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:

MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set() # Stems that have endings in common with this_stem
        other_endings = set() # Endings this stem have in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)

最佳答案

它不是特别漂亮,但是如果将字典分为两个列表并使用显式索引,则这非常简单:

all_stems = {
 'chair' : set(['high', 'arm']),
 'charge': set(['over', 'under']),
 'fall'  : set(['down', 'water', 'night']),
 'up'    : set(['lock', 'crack', 'vote']),
 'down'  : set(['lock', 'crack', 'fall']),
}

endings     = all_stems.keys()
stem_sets   = all_stems.values()

i = 0
for target_stem_set in stem_sets:
    i += 1
    j  = 0

    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        union = target_stem_set & remaining_stem_set
        if len(union) > 1:
            print "%d matches found" % len(union)
            for stem in union:
                print "%s%s" % (stem, endings[i-1])
                print "%s%s" % (stem, endings[j+i-1])

输出:
$ python stems_and_endings.py
2 matches found
lockdown
lockup
crackdown
crackup

基本上,我们要做的就是依次遍历每个集合,并将其与其余每个集合进行比较,以查看是否存在两个以上的匹配项。我们永远不必尝试早于当前设置的设置,因为它们已经在先前的迭代中进行过比较。其余的(索引等)只是记账。

关于python - 查找词干和词尾的组合,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/4749418/

10-10 06:30