I have a set of user queries. Some of the queries also contain garbage characters, for example: I work in Google asdasb asnlkasn
I only need I work in Google
import nltk
import spacy
import truecase

words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(word):
    # Collect the surface text of every entity spaCy detects in the input.
    doc = nlp(word)
    ner_list = []
    for token in doc.ents:
        ner_list.append(token.text)
    return ner_list

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)
final_sent = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                      if w.lower() in words or not w.isalpha() or w in ner_list)
I tried this, but it does not remove the garbage characters, because NER detects google asdasb asnlkasn as Work_of_Art, or sometimes detects asdasb asnlkasn as Person. I have to include NER because
words = set(nltk.corpus.words.words())
does not contain Google, Microsoft, Apple, etc., or any other NER values in its corpus.

Best answer

You can use this to identify your non-words:
words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if w.lower() in words or not w.isalpha())
Try this; thanks to @DYZ's answer.
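For example (a minimal sketch; the exact result depends on the words corpus installed on your machine), the gibberish is dropped, but so is google, which is why NER comes back into the picture:

import nltk
nltk.download('words', quiet=True)   # assumption: the corpus may not be installed yet

words = set(nltk.corpus.words.words())
sent = "I work in google asdasb asnlkasn"
cleaned = " ".join(w for w in nltk.wordpunct_tokenize(sent)
                   if w.lower() in words or not w.isalpha())
print(cleaned)               # 'asdasb' and 'asnlkasn' are gone, but so is 'google'
print('google' in words)     # expected False: the corpus has no proper nouns like Google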
However, since you said you need NER for Google, Apple, etc., and that it causes wrong identifications, what you can do is use beam parsing to compute scores for these NER predictions. You can then use those scores to set an acceptable threshold for the NER and drop everything that falls below it. I believe the nonsense words will get low probability scores under classes such as Person, and if you do not need them at all you can use the scores to drop whole categories such as Work_of_Art.
Example of scoring with beam parse:
import spacy
from collections import defaultdict

output_dir = 'en_core_web_lg'   # path or name of the model to load; using the question's model here
nlp = spacy.load(output_dir)
print("Loaded model '%s'" % output_dir)
text = u'I work in Google asdasb asnlkasn'

# Run the pipeline without the standard NER pass, then beam-parse to get
# scored entity hypotheses (spaCy v2 API).
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for key in entity_scores:
    start, end, label = key
    score = entity_scores[key]
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))
This worked in my test, and NER no longer recognised the garbage words.
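To tie the two parts together, here is a rough sketch (my own assumption, not part of the original answer) of a clean_query helper: a token is kept if it is a dictionary word, non-alphabetic, or covered by an entity whose beam score clears the threshold and whose label is not in a skip list such as WORK_OF_ART. The beam-parse calls are the same spaCy v2 API used in the example above.

import nltk
import spacy
from collections import defaultdict

nltk.download('words', quiet=True)   # assumption: the corpus may not be installed yet
words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def clean_query(sent, threshold=0.2, skip_labels=('WORK_OF_ART',)):
    # Beam-parse the sentence to get scored entity hypotheses (spaCy v2 API).
    with nlp.disable_pipes('ner'):
        doc = nlp(sent)
    beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)
    entity_scores = defaultdict(float)
    for beam in beams:
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score
    # Keep the individual tokens of confident entities, ignoring noisy labels.
    keep = set()
    for (start, end, label), score in entity_scores.items():
        if score > threshold and label not in skip_labels:
            keep.update(tok.text for tok in doc[start:end])
    return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha() or w in keep)

print(clean_query("I work in Google asdasb asnlkasn"))

The 0.2 threshold and the skip_labels default are only placeholders; tune both against your own queries.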
Related: python - I want to remove non-English words from a sentence in Python 3.x. A similar question on Stack Overflow: https://stackoverflow.com/questions/59301446/