python - Formatting a whole text with pattern.en?

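I need to analyze some text for machine learning. A data scientist I know suggested that I use pattern.en for my project.

I will give my program a keyword (for example: pizza), and it has to pick out some "trends" from the texts I give it. (Example: I feed it articles about peanut butter on pizza, so the program should determine that peanut butter is a growing trend.)

So, first of all, I have to "clean" the text. I know that pattern.en can identify words as nouns, verbs, adverbs, etc. I would like to remove all determiners, articles and other "meaningless" words before the analysis, but I don't know how to do that. I tried parse() and got: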

from pattern.en import parse

s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"
parsedS = parse(s)
print(parsedS)

Output:

Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?
I/PRP/i am/VBP/be tired/VBN/tire actually/RB/actually ,/,/, did/VBD/do not/RB/not sleep/VB/sleep enough/RB/enough .../:/...
That/DT/that is/VBZ/be bad/JJ/bad for/IN/for work/NN/work ,/,/, definitely/RB/definitely

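So I would like to remove the words tagged "UH", ",", "PRP", etc., but I don't know how to do that without messing up the sentences (for the analysis, I will ignore sentences that do not contain the keyword, "pizza" in my example).

I don't know whether my explanation is clear, so feel free to ask if something is hard to follow.

EDIT - UPDATE: After canyon289's answer, I would like to do this sentence by sentence rather than on the whole text. I tried: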

for sentence in Text(s):
    sentence = sentence.split(" ")
    print("SENTENCE :")
    for word in sentence:
        if not any(tag in word for tag in dont_want):
            print(word)

But I get the following error:

AttributeError: 'Sentence' object has no attribute 'split'

How can I fix this?

Best answer

This should work for you:

from pattern.en import parse

s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"

# Create a list of all the tags you don't want
dont_want = ["UH", "PRP"]

# parse() returns the tagged text as a string, so parse it once and split it on spaces
sentence = parse(s).split(" ")

# Go through all the tokens and drop any token that contains a tag you don't want
# (a list comprehension with a nested generator expression)
[word for word in sentence if not any(tag in word for tag in dont_want)]

  [u',/,/O/O', u'how/WRB/O/O', u'is/VBZ/B-VP/O', u'going/VBG/B-VP/O',
  u'am/VBP/B-VP/O', u'tired/VBN/I-VP/O', u'actually/RB/B-ADVP/O',
  u',/,/O/O', u'did/VBD/B-VP/O', u'not/RB/I-VP/O', u'sleep/VB/I-VP/O',
  u'enough/RB/B-ADVP/O', u'.../:/O/O\nThat/DT/O/O', u'is/VBZ/B-VP/O',
  u'bad/JJ/B-ADJP/O', u'for/IN/B-PP/B-PNP', u'work/NN/B-NP/I-PNP',
  u',/,/O/O', u'definitely/RB/B-ADVP/O']
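
For the sentence-by-sentence version asked about in the update, here is a minimal sketch that works on the tagged string directly instead of on Sentence objects (which are not strings, hence the AttributeError). It assumes the layout shown in the question's output: one sentence per line, tokens separated by spaces, and the tag as the second "/"-separated field. The tagged text is pasted in from the question so the sketch runs without pattern installed; with the library you would pass in the result of parse(s) instead. The helper name clean_sentences and its keyword argument are made up for illustration. It compares tags exactly rather than with a substring test, because `tag in word` also matches things like "PRP" inside "PRP$" or tag letters that happen to appear in the word itself.

```python
# Tagged text copied from the question's output: one sentence per line,
# each token written as word/TAG/lemma.
TAGGED = """\
Hello/UH/hello ,/,/, how/WRB/how is/VBZ/be it/PRP/it going/VBG/go ?/./?
I/PRP/i am/VBP/be tired/VBN/tire actually/RB/actually ,/,/, did/VBD/do not/RB/not sleep/VB/sleep enough/RB/enough .../:/...
That/DT/that is/VBZ/be bad/JJ/bad for/IN/for work/NN/work ,/,/, definitely/RB/definitely"""

# Tags to drop; extend this set as needed (an illustrative choice).
DONT_WANT = {"UH", "PRP", "DT", ",", ".", ":"}


def clean_sentences(tagged, dont_want, keyword=None):
    """Return one list of kept words per sentence (hypothetical helper)."""
    cleaned = []
    for line in tagged.split("\n"):
        # Each token is "word/TAG/..."; the tag is the second field.
        tokens = [token.split("/") for token in line.split(" ") if token]
        if keyword is not None and not any(t[0].lower() == keyword for t in tokens):
            continue  # skip sentences that never mention the keyword
        # Exact tag comparison, so "PRP" does not also remove "PRP$".
        cleaned.append([t[0] for t in tokens if t[1] not in dont_want])
    return cleaned


for words in clean_sentences(TAGGED, DONT_WANT):
    print("SENTENCE :", " ".join(words))
```

Calling clean_sentences(TAGGED, DONT_WANT, keyword="work") keeps only the third sentence, which is the same mechanism you would use to ignore sentences that do not mention "pizza".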

Regarding "python - Formatting a whole text with pattern.en?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30054101/