I plan to use spaCy and textacy to identify sentence structure in English.

For example:
The cat sat on the mat. - SVO
The cat jumped and picked up the biscuit. - SVVO
The cat ate biscuit and cookies. - SVOO

The program should read a paragraph and return, for each sentence, an output of SVO, SVOO, SVVO, or some other custom structure.

Efforts so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"]
VERB = ["ROOT"]
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)

Output:
(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])
  • Question 1: The SVO triples are being overwritten. Why?
  • Question 2: How can a sentence be identified as SVOO, SVO, SVVO, and so on?

  • Edit 1:

    An approach I am conceptualizing.
    from __future__ import unicode_literals
    import spacy,en_core_web_sm
    import textacy
    nlp = en_core_web_sm.load()
    sentence = 'I will go to the mall.'
    doc = nlp(sentence)
    chk_set = set(['PRP','MD','NN'])
    result = chk_set.issubset(t.tag_ for t in doc)
    if result == False:
        print("SVO not identified")
    elif result == True: # shouldn't do this
        print("SVO")
    else:
        print("Others...")
    

    Edit 2:

    Made some further progress.
    from __future__ import unicode_literals
    import spacy,en_core_web_sm
    import textacy
    nlp = en_core_web_sm.load()
    sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
    doc = nlp(sentence)
    print(" ".join([token.dep_ for token in doc]))
    

    Current output:

    det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct

    Expected output:
    SVO SVVO SVOO
    

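    The idea is to break the dependency labels down into a simple subject-verb-object model.

    If there is no other option, I could implement this with regular expressions, but that is my last resort.

    Edit 3:

    After studying this link, I made some improvement.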
    def testSVOs():
        nlp = en_core_web_sm.load()
        tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
        svos = findSVOs(tok)
        print(svos)
    

    Current output:
    [(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]
    

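    Expected output:

    I am expecting a notation for each sentence. I am able to extract the SVO triples, but how do I convert them into SVO notation? This is more about recognizing the pattern than about the sentence content itself.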
    SVO SVO SVOO
    

    Best answer



    This is a textacy issue. That part of it does not work very well; see this blog.



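    You should parse the dependency tree yourself. spaCy provides that information; you only need to write a set of rules that extract it using the .head, .lefts, .rights and .children attributes.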

    >>for word in text:
        print('%10s %5s %10s %10s %s'%(word.text, word.tag_, word.dep_, word.pos_, word.head.text))
    
            The    DT        det        DET cat
            cat    NN      nsubj       NOUN sat
            sat   VBD       ROOT       VERB sat
             on    IN       prep        ADP sat
            the    DT        det        DET mat
            mat    NN       pobj       NOUN on
              .     .      punct      PUNCT sat
             of    IN       ROOT        ADP of
            the    DT        det        DET lab
            art    NN   compound       NOUN lab
            lab    NN       pobj       NOUN of
              .     .      punct      PUNCT of
            The    DT        det        DET cat
            cat    NN      nsubj       NOUN jumped
         jumped   VBD       ROOT       VERB jumped
            and    CC         cc      CCONJ jumped
         picked   VBD       conj       VERB jumped
             up    RP        prt       PART picked
            the    DT        det        DET biscuit
        biscuit    NN       dobj       NOUN picked
              .     .      punct      PUNCT jumped
            The    DT        det        DET cat
            cat    NN      nsubj       NOUN ate
            ate   VBD       ROOT       VERB ate
        biscuit    NN       dobj       NOUN ate
            and    CC         cc      CCONJ biscuit
        cookies   NNS       conj       NOUN biscuit
              .     .      punct      PUNCT ate
    

    I suggest you look at this code: just add pobj to the OBJECTS list and SVO and SVOO will be covered. With a little fiddling you can get SVVO as well.
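
As a rough illustration of the kind of rule set described above (this is not the linked code), here is a minimal sketch that turns one parsed sentence into a pattern string. To stay runnable without a model, it works on hand-copied `(text, dep, head_index)` tuples taken from the table above. The tuple layout, the `sentence_pattern` helper and the `SUBJECTS` set are assumptions made for this example. It also assumes that the ROOT token is a verb and that a `conj` token plays the same role as the token it attaches to.

```python
# Hypothetical stand-in for spaCy tokens: (text, dep_label, head_index),
# where head_index is the position of the head within the same sentence.
SUBJECTS = {"nsubj", "nsubjpass"}
OBJECTS = {"dobj", "pobj"}


def sentence_pattern(tokens):
    """Return a pattern string such as 'SVO' for one parsed sentence."""
    roles = {}  # token index -> "S", "V" or "O"
    for i, (_text, dep, head) in enumerate(tokens):
        if dep in SUBJECTS:
            roles[i] = "S"
        elif dep == "ROOT":
            roles[i] = "V"  # assumption: the root of the sentence is a verb
        elif dep in OBJECTS:
            roles[i] = "O"
        elif dep == "conj" and head in roles:
            # A conjunct inherits the role of the token it is attached to.
            roles[i] = roles[head]
    return "".join(roles[i] for i in sorted(roles))


# Dependency labels and heads copied from the table above.
SENTENCES = [
    [("The", "det", 1), ("cat", "nsubj", 2), ("sat", "ROOT", 2),
     ("on", "prep", 2), ("the", "det", 5), ("mat", "pobj", 3),
     (".", "punct", 2)],
    [("The", "det", 1), ("cat", "nsubj", 2), ("jumped", "ROOT", 2),
     ("and", "cc", 2), ("picked", "conj", 2), ("up", "prt", 4),
     ("the", "det", 7), ("biscuit", "dobj", 4), (".", "punct", 2)],
    [("The", "det", 1), ("cat", "nsubj", 2), ("ate", "ROOT", 2),
     ("biscuit", "dobj", 2), ("and", "cc", 3), ("cookies", "conj", 3),
     (".", "punct", 2)],
]

patterns = [sentence_pattern(sentence) for sentence in SENTENCES]
print(" ".join(patterns))  # SVO SVVO SVOO
```

With a real spaCy `Doc`, the tuples for each sentence could probably be built as `[(t.text, t.dep_, t.head.i - sent.start) for t in sent]` while iterating over `doc.sents`. A fragment whose root is not a verb, such as the "of the art lab" lines in the table, would still be labelled V by this sketch, so a check on `pos_` may be needed in practice.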
