Problem Description
spaCy's documentation has some information on adding new slang terms here. However, I'd like to know:
(1) When should I call the following function?
lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS)
According to the introduction guide here, the typical usage of spaCy is as follows:
import spacy
nlp = spacy.load('en')
# Should I call the function add_lookups(...) here?
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
(2) When in the processing pipeline are norm exceptions handled? I'm assuming a typical pipeline like this: tokenizer -> tagger -> parser -> ner. Are norm exceptions handled right before the tokenizer? Also, how is the norm exceptions component organized relative to the other preprocessing components such as stop words and the lemmatizer (see the full list of components here)? What comes before what?
I'm new to spaCy, so any help would be much appreciated. Thanks!
Recommended Answer
The norm exceptions are part of the language data, and the attribute getter (the function that takes a text and returns the norm) is initialised with the language class, e.g. English. You can see an example of this here. This all happens before the pipeline is even constructed.
The assumption here is that norm exceptions are usually language-specific and should thus be defined in the language data, independent of the processing pipeline. Norms are also lexical attributes, so their getters live on the underlying lexeme, the context-insensitive entry in the vocabulary (as opposed to a token, which is the word in context).
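To make this concrete, here is a minimal sketch of that wiring, assuming spaCy v2.x. CustomEnglish and MY_NORM_EXCEPTIONS are hypothetical names; add_lookups, BASE_NORMS, and lex_attr_getters are the v2 hooks that the quoted line in the question uses:
from spacy.attrs import NORM
from spacy.lang.en import English
from spacy.lang.norm_exceptions import BASE_NORMS
from spacy.util import add_lookups

# Hypothetical slang -> norm mapping for illustration only.
MY_NORM_EXCEPTIONS = {"mcr": "manchester"}

class CustomEnglishDefaults(English.Defaults):
    lex_attr_getters = dict(English.Defaults.lex_attr_getters)
    lex_attr_getters[NORM] = add_lookups(
        English.Defaults.lex_attr_getters[NORM],  # fallback: the default getter
        MY_NORM_EXCEPTIONS,                       # lookups are checked in order
        BASE_NORMS,                               # shared base norms
    )

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

nlp = CustomEnglish()  # the getter is attached to the vocab at initialisation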
However, the nice thing about token.norm_ is that it's writeable – so you can easily add a custom pipeline component that looks up the token's text in your own dictionary and overwrites the norm if necessary:
def add_custom_norms(doc):
for token in doc:
if token.text in YOUR_NORM_DICT:
token.norm_ = YOUR_NORM_DICT[token.text]
return doc
nlp.add_pipe(add_custom_norms, last=True)
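For completeness, here is a hypothetical end-to-end run of the component above, assuming the spaCy v2.x API (in v3, components are added by registered name instead); YOUR_NORM_DICT is a stand-in for your own mapping:
import spacy

# Hypothetical slang -> norm mapping.
YOUR_NORM_DICT = {"lol": "laughing out loud", "tbh": "to be honest"}

nlp = spacy.load('en')
nlp.add_pipe(add_custom_norms, last=True)  # the component defined above

doc = nlp(u"tbh that was funny lol")
print([(t.text, t.norm_) for t in doc])
# e.g. [('tbh', 'to be honest'), ('that', 'that'), ('was', 'was'),
#       ('funny', 'funny'), ('lol', 'laughing out loud')]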
Keep in mind that the NORM attribute is also used as a feature in the model, so depending on the norms you want to add or overwrite, you might want to apply your custom component only after the tagger, parser, or entity recognizer has run.
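If that's a concern, the v2 add_pipe API also lets you position the component explicitly, for example after the entity recognizer, so the statistical components still see the original norms (a sketch, same hypothetical component as above):
# Run the custom norms only after the NER has used the default NORM features.
nlp.add_pipe(add_custom_norms, after='ner')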
For example, by default spaCy normalises all currency symbols to "$" to ensure that they all receive similar representations, even if one of them is less frequent in the training data. If your custom component now overwrites "€" with "Euro", this will also have an impact on the model's predictions, so you might see less accurate predictions for MONEY entities.
If you're planning to train your own model that takes your custom norms into account, you might want to consider implementing a custom language subclass. Alternatively, if you think the slang terms you want to add should be included in spaCy by default, you can always submit a pull request, for example to the English norm_exceptions.py.