Question
I'm familiar with word stemming and completion from the tm package in R.
I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte".
If I had to do it right now, I would probably just go with something like:
```r
library(tm)
library(RWeka)

data("crude")  # example corpus shipped with tm

# Build a vocabulary from the corpus, then grep for words
# that begin with the (aggressively) stemmed query.
dictionary <- unique(unlist(lapply(crude, words)))
grep(pattern = LovinsStemmer("company"),
     ignore.case = TRUE, x = dictionary, value = TRUE)
```
I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
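The grep-on-a-stem idea above translates directly to Python: stem the query, then keep every vocabulary entry that starts with that stem, case-insensitively. A minimal sketch, using a hypothetical mini-vocabulary in place of a real corpus dictionary and a pre-stemmed query:

```python
# Hypothetical stand-in for a real corpus vocabulary.
vocabulary = ["Leukocyte", "leukocytes", "leukocytic", "lymphocyte", "company"]

def variants(stem, vocab):
    """Return vocabulary entries beginning with the (already stemmed) query,
    mirroring grep(pattern = ..., ignore.case = TRUE) from the R snippet."""
    prefix = stem.lower()
    return [w for w in vocab if w.lower().startswith(prefix)]

print(variants("leukocyt", vocabulary))
# ['Leukocyte', 'leukocytes', 'leukocytic']
```

Like the R version, this is O(vocabulary size) per query; the answer below trades that for a one-time preprocessing pass.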
Answer
This solution requires preprocessing your corpus. But once that is done, it is a very quick dictionary lookup.
```python
from collections import defaultdict
from stemming.porter2 import stem

with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Reverse index: map each stem to every word that produces it.
stems = defaultdict(list)
for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])
```
For the /usr/share/dict/words corpus, this produces the result:

```
['leukocyte', "leukocyte's", 'leukocytes']
```
It uses the stemming module, which can be installed with:

```shell
pip install stemming
```