Question
Spacy's POS tagger is really convenient: it can tag a raw sentence directly.
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")
But I'm using the tokenizer from nltk. So how can I use a tokenized sentence like ['I', 'am', 'eating'] with Spacy's tagger, rather than the raw string 'I am eating'?
BTW, where can I find detailed Spacy documentation? I can only find an overview on the official website.
Thanks.
Answer
There are two options:
You write a wrapper around the nltk tokenizer and use it to convert text to spaCy's Doc format. Then overwrite nlp.tokenizer with that new custom function. More info here: https://spacy.io/usage/linguistic-features#custom-tokenizer.
Generate a Doc directly from a list of strings, like so:
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[u"I", u"am", u"eating", u"."], spaces=[True, True, False, False])
Defining the spaces is optional - if you leave it out, each word will be followed by a space by default. This matters when using e.g. doc.text afterwards. More info here: https://spacy.io/usage/linguistic-features#own-annotations
[edit]: note that nlp and doc are sort of 'standard' variable names in spaCy; they correspond to the variables sp and sen respectively in your code.