python - I want to remove non-English words from a sentence in Python 3.x

I have a bunch of user queries. Some of them also contain junk characters, e.g. I work in Google asdasb asnlkasn
I only need I work in Google

import nltk
import spacy
import truecase
words = set(nltk.corpus.words.words())
nlp = spacy.load('en_core_web_lg')

def check_ner(word):
    doc = nlp(word)
    ner_list = []
    for token in doc.ents:
        ner_list.append(token.text)
    return ner_list



sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)

final_sent = " ".join(
    w for w in nltk.wordpunct_tokenize(sent)
    if w.lower() in words or not w.isalpha() or w in ner_list
)

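I tried this, but it does not remove the junk characters, because NER detects google asdasb asnlkasn as Work_of_Art, or sometimes asdasb asnlkasn as Person.
I have to include NER because words = set(nltk.corpus.words.words()) has no entries for Google, Microsoft, Apple, etc., or for any other NER values.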

Best answer

You can use this to identify your non-words.

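import nltk

# Requires the NLTK word list: nltk.download('words')
words = set(nltk.corpus.words.words())

sent = "I work in google asdasb asnlkasn"
# Keep tokens that are known English words or are not purely alphabetic
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())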

Try using this. Thanks to @DYZ's answer.
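
To see why this filter alone is not enough for the question, here is a small self-contained sketch that needs no NLTK data. The function name `drop_unknown_words` and the tiny `vocab` set are hypothetical stand-ins: `vocab` plays the role of `nltk.corpus.words`, which (as the question notes) has no entry for names like Google, and the regular expression is meant to approximate what `nltk.wordpunct_tokenize` does.

```python
import re

def drop_unknown_words(sentence, vocab):
    """Keep tokens that are in vocab (case-insensitive) or are not alphabetic."""
    # Split into runs of word characters and runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    return " ".join(t for t in tokens if t.lower() in vocab or not t.isalpha())

# Hypothetical miniature vocabulary; note that it does not contain "google"
vocab = {"i", "work", "in"}
result = drop_unknown_words("I work in google asdasb asnlkasn", vocab)
print(result)  # I work in
```

The junk words are removed, but so is the company name, which is why the asker wanted NER in the first place.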

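However, since you said you need NER for Google, Apple, etc., and that leads to wrong detections, what you can do is use beam parse to compute scores for these NER predictions. You can then use those scores to set an acceptable threshold for NER and discard entities that fall below it. I believe these nonsense words will get low probability scores for categories such as Person, and you can drop categories such as Work of Art altogether if you don't need them.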

An example of scoring with beam parse:

import spacy
from collections import defaultdict

# Assumption: output_dir was not defined in the original snippet. It should be
# the path or name of the spaCy model to load, e.g. the model from the question.
output_dir = 'en_core_web_lg'
nlp = spacy.load(output_dir)
print("Loaded model '%s'" % output_dir)
text = 'I work in Google asdasb asnlkasn'

# Run the pipeline without NER; entities are scored separately below
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# Note: nlp.entity is the spaCy 2.x accessor for the NER component
beams = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# Sum the scores of all beam candidates that propose the same entity span
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for (start, end, label), score in entity_scores.items():
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

This worked in my tests, and the NER did not recognize the junk text.
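
One way to feed the scores back into the original filter, as a minimal sketch: keep a token if it is in the word list, is not alphabetic, or lies inside an entity span whose score clears the threshold and whose label is not one you chose to drop. The function name `filter_tokens`, the `drop_labels` parameter, the tiny vocabulary, and the scores below are all hypothetical values for illustration. In practice `entity_scores` would come from the beam-parse loop in the example above and `vocab` from `nltk.corpus.words`.

```python
def filter_tokens(tokens, vocab, entity_scores, threshold=0.2, drop_labels=()):
    """Keep dictionary words, non-alphabetic tokens, and confident entities.

    tokens:        list of token strings, indexed like the NER spans
    vocab:         set of lowercase known words
    entity_scores: {(start, end, label): score}, with end exclusive
    """
    trusted = set()
    for (start, end, label), score in entity_scores.items():
        # Trust a span only if it is confident enough and its label is wanted
        if score > threshold and label not in drop_labels:
            trusted.update(range(start, end))
    return [
        tok for i, tok in enumerate(tokens)
        if tok.lower() in vocab or not tok.isalpha() or i in trusted
    ]

# Hypothetical inputs: made-up scores, not real model output
tokens = ["I", "work", "in", "Google", "asdasb", "asnlkasn"]
vocab = {"i", "work", "in"}
entity_scores = {
    (3, 4, "ORG"): 0.91,
    (3, 6, "WORK_OF_ART"): 0.12,
    (4, 6, "PERSON"): 0.08,
}
cleaned = " ".join(
    filter_tokens(tokens, vocab, entity_scores, drop_labels={"WORK_OF_ART"})
)
print(cleaned)  # I work in Google
```

This assumes the token list uses the same indices as the `doc` that produced the spans, for example `[t.text for t in doc]`. A different tokenizer may split the sentence differently and misalign the spans.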

Regarding python - I want to remove non-English words from a sentence in Python 3.x, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59301446/