This article explains how to get spacy to tokenize hashtags as single tokens. It should be a useful reference for anyone hitting the same problem.

Problem description


In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens:

import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]

output:

[This, is, a, #, sentence, .]

I'd like to have hashtags tokenized as follows; is that possible?

[This, is, a, #sentence, .]
Solution
  1. You can do some pre- and post-processing string manipulation, which bypasses the '#'-based tokenization and is easy to implement. For example:

>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ', u'#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
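The same idea can be wrapped in a small helper. This is a minimal sketch of that pre/post trick; the function name tokenize_with_hashtags and the placeholder default are illustrative, not part of spacy:

import re

def tokenize_with_hashtags(nlp, text, placeholder=u'ZZZPLACEHOLDERZZZ'):
    # Shield each '#tag' from the tokenizer by swapping '#' for a placeholder,
    # tokenize, then swap the placeholder back in the token texts.
    protected = re.sub(r'#(\w+)', placeholder + r'\1', text)
    return [t.text.replace(placeholder, u'#') for t in nlp(protected)]

# tokenize_with_hashtags(nlp, u'This is my twitter update #MyTopic')
# -> [u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']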
  2. You can try setting custom separators in spacy's tokenizer. I am not aware of a built-in setting for this, but one possible approach is sketched below.
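A minimal sketch of that approach, assuming spacy 2.x, where '#' appears in the tokenizer's default prefix rules; removing it stops hashtags from being split at the '#'. The exact contents of nlp.Defaults.prefixes vary by version, so treat this as an assumption to verify:

import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.load('en_core_web_sm')  # assumes this model is installed

# Drop the '#' rule from the default prefixes and rebuild the prefix regex.
prefixes = [p for p in nlp.Defaults.prefixes if p != '#']
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

print([t.text for t in nlp(u'This is a #sentence.')])
# expected: ['This', 'is', 'a', '#sentence', '.']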

UPDATE: You can use a regex to find the spans you want to keep as single tokens, and retokenize them using the span.merge method as mentioned here: https://spacy.io/docs/api/span#merge

Merge example:

>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer(r'#\w+', my_str, flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
...
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>
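Note that Doc.merge and Span.merge belong to older spacy releases and were removed in spacy 3. A minimal sketch of the same merge using the newer Doc.retokenize API (spacy 2.1+), assuming the en_core_web_sm model is installed:

import re
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Tweet hashtags #MyHashOne #MyHashTwo')

with doc.retokenize() as retokenizer:
    for match in re.finditer(r'#\w+', doc.text):
        # char_span returns None if the characters don't align with token boundaries
        span = doc.char_span(match.start(), match.end())
        if span is not None:
            retokenizer.merge(span)

print([(t.text, t.pos_) for t in doc])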

That concludes this article on getting spacy to tokenize hashtags as single tokens. We hope the answer above is helpful.
