Problem description
I have written the code below using the Stanford NLP packages.
GenderAnnotator myGenderAnnotation = new GenderAnnotator();
myGenderAnnotation.annotate(annotation);
But for the sentence "Annie goes to school", it fails to identify Annie's gender.
The output of the application is:
[Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON]
[Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O]
[Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O]
[Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O]
[Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]
What is the correct approach to get the gender?
Recommended answer
If your named entity recognizer outputs PERSON for a token, you could use (or build, if you don't have one) a gender classifier based on first names. As an example, see the Gender Identification section of the NLTK library tutorial pages. They use the following features:
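- Last letter of the first name.
- First letter of the first name.
- Length of the name (number of characters).
- Character unigram presence (a boolean for whether each character occurs in the name).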
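The features listed above could be extracted roughly as follows. This is only a sketch: the class name GenderFeatures, the method name extract, and the feature keys are made up for illustration, and it assumes lowercase ASCII letters are the only characters of interest. The resulting map would still need to be fed to a classifier trained on a labeled list of first names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Stanford CoreNLP or NLTK.
public class GenderFeatures {

    /** Builds the four feature groups from the list above for one first name. */
    public static Map<String, Object> extract(String firstName) {
        String name = firstName.trim().toLowerCase(Locale.ROOT);
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        Map<String, Object> features = new LinkedHashMap<>();
        features.put("last_letter", name.charAt(name.length() - 1));
        features.put("first_letter", name.charAt(0));
        features.put("length", name.length());
        // Unigram presence: one boolean per letter a-z (assumption: ASCII only).
        for (char c = 'a'; c <= 'z'; c++) {
            features.put("has(" + c + ")", name.indexOf(c) >= 0);
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(extract("Annie"));
    }
}
```

You would call extract only on tokens whose NamedEntityTag is PERSON, such as "Annie" in the output above.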
That said, I have a hunch that using character n-gram frequencies, possibly up to character trigrams, would give you pretty good results.
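A character n-gram feature extractor along those lines might look like the sketch below. The class name CharNgrams and the "^" and "$" boundary markers are my own assumptions, added so that name prefixes and suffixes end up as distinct n-grams. The counts are raw per-name counts, which a classifier could use directly or after normalization.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Stanford CoreNLP or NLTK.
public class CharNgrams {

    /** Counts all character n-grams of length 1..maxN in a padded, lowercased name. */
    public static Map<String, Integer> count(String firstName, int maxN) {
        // "^" and "$" mark the start and end of the name (an assumption of this sketch).
        // They are also counted as unigrams themselves, which is harmless
        // because they occur exactly once in every name.
        String padded = "^" + firstName.trim().toLowerCase(Locale.ROOT) + "$";
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("Annie", 3));
    }
}
```

With maxN set to 3, a name ending in "ie" contributes the trigram "ie$", which is the kind of suffix cue the last-letter feature only partly captures.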