Problem description
Is there some way to recognize that a word is likely to be/is not likely to be a person's name?
So if I see the word "understanding" I would get a probability of 0.01, whereas the word "Johnson" would return a probability of 0.99, while a word like "Smith" would return 0.75 and a word like "Apple" 0.15.
Is there any way to do this?
The goal is that if someone searches for, say, "Charles Darwin galapagos", the search engine guesses that it should search the author field for "Charles" and "Darwin" and the title and abstract fields for "galapagos".
Recommended answer
My quick hack would be:
Get the list of names in order of popularity from the Census Bureau; it's freely available. Give each name a normalized popularity score (1.0 = most popular, 0.0 = least).
Then get an open-source dictionary and do some research to pull together a frequency score for every word. You can find a frequency list at Wiktionary. Assign every word a popularity score from 1.0 to 0.0. The convenient thing is that if you can't find a word on the frequency list, you get to assume it's a pretty uncommon word.
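As a rough illustration of the normalization step, here is a minimal sketch that assumes each list is already sorted from most to least popular and that the score is derived from rank alone. The function name normalize_by_rank and the toy lists are made up for this example; real Census or Wiktionary data would need its own loading code, and you may prefer to normalize by raw counts instead of rank.

```python
def normalize_by_rank(ranked_items):
    """Map items ordered most-popular-first to scores from 1.0 down to 0.0.

    Assumption: the score depends only on position in the list,
    not on the underlying counts.
    """
    n = len(ranked_items)
    if n == 1:
        # A single entry is trivially the most popular one.
        return {ranked_items[0].lower(): 1.0}
    return {
        item.lower(): 1.0 - i / (n - 1)
        for i, item in enumerate(ranked_items)
    }


# Toy data, purely hypothetical; not real popularity rankings.
name_scores = normalize_by_rank(["Smith", "Johnson", "Darwin"])
word_scores = normalize_by_rank(["the", "understanding", "apple"])
```

Keys are lowercased so that later lookups do not depend on how the user capitalized the query.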
Look up the word on both lists. If it's on just one or the other, you're done. If it's on both, use a formula to compute a weighted probability, something like (Name Popularity) / (Name Popularity + Other Popularity). If it's not on either list, it's probably a name.
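The lookup rule above could be sketched roughly as follows. This is only one reading of the answer: treating "on just one list" as probability 1.0 or 0.0, and using 0.9 for "on neither list", are assumptions of this sketch, as are the function name name_probability and the toy scores.

```python
def name_probability(word, name_scores, word_scores, unknown=0.9):
    """Estimate how likely `word` is a person's name.

    name_scores / word_scores map lowercase strings to popularity
    scores in [0.0, 1.0]. `unknown` is an assumed value for words
    found on neither list ("probably a name").
    """
    key = word.lower()
    name_pop = name_scores.get(key)
    word_pop = word_scores.get(key)

    if name_pop is None and word_pop is None:
        return unknown      # on neither list
    if word_pop is None:
        return 1.0          # only on the name list
    if name_pop is None:
        return 0.0          # only on the dictionary list

    total = name_pop + word_pop
    if total == 0:
        return 0.5          # both scores are 0.0: no evidence either way
    # Weighted probability from the answer:
    # name popularity / (name popularity + other popularity)
    return name_pop / total


# Toy scores, hypothetical and not derived from real data.
names = {"johnson": 0.9, "smith": 0.75, "apple": 0.05}
words = {"understanding": 0.4, "smith": 0.25, "apple": 0.45}

for w in ["understanding", "Johnson", "Smith", "Apple", "Galapagos"]:
    print(w, round(name_probability(w, names, words), 2))
```

For the search use case, a query term could then be routed to the author field when its score is above some threshold and to the title and abstract fields otherwise; where to put that threshold is a tuning decision the answer leaves open.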