I have mappings of "stems" and "endings" (these may not be the correct terms) that look like this:
all_endings = {
    'birth': set(['place', 'day', 'mark']),
    'snow': set(['plow', 'storm', 'flake', 'man']),
    'shoe': set(['lace', 'string', 'maker']),
    'lock': set(['down', 'up', 'smith']),
    'crack': set(['down', 'up']),
    'arm': set(['chair']),
    'high': set(['chair']),
    'over': set(['charge']),
    'under': set(['charge']),
}
But much longer, of course. I also built the corresponding dictionary in the other direction:
all_stems = {
    'chair': set(['high', 'arm']),
    'charge': set(['over', 'under']),
    'up': set(['lock', 'crack', 'vote']),
    'down': set(['lock', 'crack', 'fall']),
    'smith': set(['lock']),
    'place': set(['birth']),
    'day': set(['birth']),
    'mark': set(['birth']),
    'plow': set(['snow']),
    'storm': set(['snow']),
    'flake': set(['snow']),
    'man': set(['snow']),
    'lace': set(['shoe']),
    'string': set(['shoe']),
    'maker': set(['shoe']),
}
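For reference, a reverse mapping of this kind can be derived mechanically from the forward one. The following is a minimal sketch; the helper name `invert` and the shortened `sample_endings` data are assumptions for illustration, not part of the original code. It only produces entries for stems that appear in its input, so it would not reproduce entries such as 'vote' or 'fall' that come from the longer, unabridged data.

```python
from collections import defaultdict

def invert(mapping):
    """Turn {key: set(values)} into {value: set(keys)}."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)

# Shortened, hypothetical sample of the stem -> endings mapping.
sample_endings = {
    'lock': {'down', 'up', 'smith'},
    'crack': {'down', 'up'},
    'arm': {'chair'},
    'high': {'chair'},
}

# ending -> set of stems that can precede it
sample_stems = invert(sample_endings)
```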
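I am now trying to come up with an algorithm that finds every case where two or more "stems" match two or more "endings". In the data above, for example, it would match down and up with lock and crack, resulting in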
lockdown
lockup
crackdown
crackup
But not including
'upvote', 'downfall' or 'locksmith'
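(and this is what causes me the biggest problem). I get false positives like these:
pancake
cupcake
cupboard
But I'm just going round in "loops" (pun intended) and I don't seem to be getting anywhere. I'd appreciate any kick in the right direction.
The code so far is confused and useless, and you should probably just ignore it: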
from collections import defaultdict

findings = defaultdict(set)
for stem, endings in all_endings.items():
    # What stems have matching endings:
    for ending in endings:
        otherstems = all_stems[ending]
        if not otherstems:
            continue
        for otherstem in otherstems:
            # Find endings that also exist for other stems
            otherendings = all_endings[otherstem].intersection(endings)
            if otherendings:
                # Some kind of match
                findings[stem].add(otherstem)

# Go through this in order of what is the most stems that match:
MINMATCH = 2
for match in sorted(findings.values(), key=len, reverse=True):
    for this_stem in match:
        other_stems = set()  # Stems that have endings in common with this_stem
        other_endings = set()  # Endings this stem has in common with other stems
        this_endings = all_endings[this_stem]
        for this_ending in this_endings:
            for other_stem in all_stems[this_ending] - set([this_stem]):
                matching_endings = this_endings.intersection(all_endings[other_stem])
                if matching_endings:
                    other_endings.add(this_ending)
                    other_stems.add(other_stem)

        stem_matches = all_stems[other_endings.pop()]
        for other in other_endings:
            stem_matches = stem_matches.intersection(all_stems[other])

        if len(stem_matches) >= MINMATCH:
            for m in stem_matches:
                for e in all_endings[m]:
                    print(m+e)
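Best answer
It's not particularly pretty, but this is quite straightforward if you break the dictionary into two lists and use explicit indices: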
all_stems = {
    'chair' : set(['high', 'arm']),
    'charge': set(['over', 'under']),
    'fall'  : set(['down', 'water', 'night']),
    'up'    : set(['lock', 'crack', 'vote']),
    'down'  : set(['lock', 'crack', 'fall']),
}

# Split the dictionary into two parallel lists so both can be indexed.
endings = list(all_stems.keys())
stem_sets = list(all_stems.values())

i = 0
for target_stem_set in stem_sets:
    i += 1
    j = 0
    # Only compare against the sets that come after the current one.
    remaining_stems = stem_sets[i:]
    for remaining_stem_set in remaining_stems:
        j += 1
        # Stems shared by both endings (a set intersection).
        common = target_stem_set & remaining_stem_set
        if len(common) > 1:
            print("%d matches found" % len(common))
            for stem in common:
                print("%s%s" % (stem, endings[i-1]))
                print("%s%s" % (stem, endings[j+i-1]))
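Output (the order of the lines can differ between Python versions and runs, since it depends on dictionary and set iteration order):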
$ python stems_and_endings.py
2 matches found
lockdown
lockup
crackdown
crackup
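Basically, all we're doing is iterating through each set in turn and comparing it with every remaining set to see whether they share at least two stems. We never have to look at sets that come earlier than the current one, because they have already been compared in a previous iteration. The rest (indexing and so on) is just bookkeeping.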
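The same pairwise comparison can also be written without manual index bookkeeping. The sketch below is one possible variant, not the answer's original code: it uses `itertools.combinations` to visit each pair of endings exactly once, and the function name `find_shared_compounds` and its `min_stems` parameter are assumptions introduced for illustration. Keys and stems are sorted only to make the result order predictable.

```python
from itertools import combinations

def find_shared_compounds(stems_by_ending, min_stems=2):
    """Return (ending_a, ending_b, common_stems) for every pair of
    endings that share at least min_stems stems."""
    results = []
    # combinations() yields each unordered pair of endings exactly once.
    for end_a, end_b in combinations(sorted(stems_by_ending), 2):
        common = stems_by_ending[end_a] & stems_by_ending[end_b]
        if len(common) >= min_stems:
            results.append((end_a, end_b, sorted(common)))
    return results

sample = {
    'chair': {'high', 'arm'},
    'charge': {'over', 'under'},
    'fall': {'down', 'water', 'night'},
    'up': {'lock', 'crack', 'vote'},
    'down': {'lock', 'crack', 'fall'},
}

for end_a, end_b, stems in find_shared_compounds(sample):
    for stem in stems:
        print(stem + end_a)
        print(stem + end_b)
```

Like the loop above, this only compares endings two at a time, so three endings that all share the same stems would be reported as three separate pairs rather than as one group.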
Regarding "python - finding combinations of stems and endings", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4749418/