Spanish POS tagging with Stanford NLP - is it possible to get person/number/gender?


Problem description

I'm using Stanford NLP to do POS tagging for Spanish texts. I can get a POS tag for each word, but I notice that I am only given the first four positions of the AnCora tag; the last three positions, for person, number and gender, are missing.

Why does Stanford NLP only use a reduced version of the AnCora tag?
Is it possible to get the entire tag using Stanford NLP?

Here is my code (please excuse the JRuby...):

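require 'java'
# Assumes the Stanford CoreNLP jar and the Spanish models jar are on the classpath.
java_import 'edu.stanford.nlp.pipeline.StanfordCoreNLP'
java_import 'edu.stanford.nlp.pipeline.Annotation'

# Configure a Spanish pipeline: tokenizer language plus one model per annotator.
props = java.util.Properties.new
props.put("tokenize.language", "es")
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")

pipeline = StanfordCoreNLP.new(props)
annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
pipeline.annotate(annotation) # run every configured annotator over the text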

I am getting this as the output:

(Never mind, I see that Stanford NLP does not support Spanish lemmatization.)

Solution

The reduced tag set was a practical decision made to ensure high tagging accuracy. (Retaining morphological information in the tags caused the entire tagger to suffer from data sparsity and to do worse across the board, not only on morphological annotation.)
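To make the reduction concrete, here is a minimal Ruby sketch that compares a full verb tag with a reduced one. It assumes the seven-position EAGLES-style verb layout (category, type, mood, tense, person, number, gender) that AnCora tags follow; the helper `decode_verb_tag` and the two example tags are illustrative and are not part of Stanford NLP.

```ruby
# Field names for a seven-character verb tag, assuming the EAGLES-style
# layout used by AnCora. This mapping is an illustration, not Stanford code.
VERB_FIELDS = %i[category type mood tense person number gender].freeze

# Split a verb tag into named fields; "0" means the field is unspecified.
def decode_verb_tag(tag)
  chars = tag.upcase.chars
  unless chars.length == 7 && chars.first == "V"
    raise ArgumentError, "expected a 7-character verb tag, got #{tag.inspect}"
  end
  VERB_FIELDS.zip(chars).to_h
end

full    = decode_verb_tag("VMIP1S0") # assumed full tag for "sé" (1st person singular)
reduced = decode_verb_tag("vmip000") # assumed reduced tag of the kind the question describes

# Fields that carry information in the full tag but are zeroed in the reduced one.
lost = VERB_FIELDS.select { |field| full[field] != "0" && reduced[field] == "0" }
puts lost.inspect # => [:person, :number]
```

Every extra field that stays in the tag multiplies the number of distinct labels the tagger has to learn, which is the sparsity problem described above.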

No, it is not possible to get the full tag. You could get quite far with a simple rule-based system, though, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you pick either path!)
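As a rough idea of what the rule-based route might look like, the Ruby sketch below guesses gender and number for nouns, adjectives and determiners from their endings. The suffix lists and the function name `guess_gender_number` are assumptions made for illustration; they are heuristics with plenty of exceptions ("mano" is feminine, "problema" is masculine, "lunes" is singular), so treat this as a starting point rather than a finished annotator.

```ruby
# Suffix heuristics for Spanish gender. These lists are illustrative
# assumptions and deliberately small; real coverage needs an exception lexicon.
FEMININE_ENDINGS  = %w[a ción sión dad tad].freeze
MASCULINE_ENDINGS = %w[o aje].freeze

# Guess gender and number for a noun, adjective, or determiner.
# Returns a hash such as { gender: :feminine, number: :plural }.
def guess_gender_number(word)
  w = word.downcase
  # Treat a final "s" as a plural marker (wrong for words like "lunes").
  plural = w.length > 2 && w.end_with?("s")
  # Strip "-s" or "-es" to recover an approximate singular form.
  singular = plural ? w.sub(/e?s\z/, "") : w

  gender =
    if singular.end_with?(*FEMININE_ENDINGS)
      :feminine
    elsif singular.end_with?(*MASCULINE_ENDINGS)
      :masculine
    else
      :unknown
    end

  { gender: gender, number: plural ? :plural : :singular }
end

%w[casas libro ciudades].each do |word|
  puts "#{word}: #{guess_gender_number(word)}"
end
```

In practice you would apply this only to tokens whose reduced POS tag marks them as nouns, adjectives or determiners, and handle verbs separately, since person and number on verbs depend on conjugation endings rather than on these suffixes.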
