


I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have only found a cumbersome multistep solution:

  1. Read the files into a plain-text corpus:

    import nltk
    from nltk.corpus.reader import PlaintextCorpusReader
    my_corp = PlaintextCorpusReader(".", r".*\.txt")

  2. Tag the corpus with the built-in Penn POS tagger:

    my_tagged_corp = nltk.batch_pos_tag(my_corp.sents())

    (By the way, at this point Python threw an error: NameError: name 'batch' is not defined)

  3. Write the tagged sentences out to a file:

    # Write one sentence per line as space-separated word/tag tokens.
    with open("output.txt", "w") as taggedfile:
        for sent in my_tagged_corp:
            line = " ".join(w + "/" + t for (w, t) in sent)
            taggedfile.write(line + "\n")

  4. Finally, read this output back in as a tagged corpus:

    from nltk.corpus.reader import TaggedCorpusReader
    my_corpus2 = TaggedCorpusReader(".", r"output\.txt")
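
The write-out and read-back steps above rely on a simple line format: one sentence per line, tokens separated by spaces, each token written as word/tag. A minimal plain-Python sketch of that round trip follows. The helper names `format_tagged_sent` and `parse_tagged_line` are hypothetical and not part of NLTK, and splitting on the last slash is an assumption made here so that words which themselves contain a slash survive the round trip.

```python
def format_tagged_sent(sent):
    """Render [(word, tag), ...] as a single 'word/tag word/tag' line."""
    return " ".join(f"{word}/{tag}" for word, tag in sent)


def parse_tagged_line(line):
    """Parse a 'word/tag word/tag' line back into [(word, tag), ...]."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so a word like "1/2" keeps its own slash.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: keep the token as the word, with no tag.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Hypothetical sample sentence, used only to exercise the two helpers.
tagged = [("The", "DT"), ("ratio", "NN"), ("is", "VBZ"), ("1/2", "CD")]
line = format_tagged_sent(tagged)
restored = parse_tagged_line(line)
```

Here `line` holds the text that would be written to the output file for one sentence, and `restored` is the list of (word, tag) pairs recovered from it.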

    That is all very inconvenient for quite a common task (chunking always requires a tagged corpus). My question is: is there a more compact and elegant way to do this? For instance, a corpus reader that takes raw input files and a tagger at the same time?



    I found a working solution for this. Kindly refer to the link for the step-by-step procedure.


    Once you have run the commands from 1, a pickle file will be generated; this is the tagger you will use to tag your corpus.


    Once the pickle file has been generated, you can check whether your tagger is working correctly by running the following piece of code:

    import nltk.data
    tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
    tagger.tag(['some', 'words', 'in', 'a', 'sentence'])
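
Once a tagger object is loaded, the sentences can be tagged directly in memory, without writing an intermediate file. The following is only a minimal sketch, assuming nothing more than that the tagger exposes a `tag(tokens)` method as in the snippet above. `tag_sentences` and `LookupTagger` are hypothetical names, and `LookupTagger` is a toy stand-in so that the example runs without any trained model.

```python
def tag_sentences(tagger, sentences):
    """Tag each tokenized sentence with any object that has .tag(tokens)."""
    return [tagger.tag(list(sent)) for sent in sentences]


class LookupTagger:
    """Toy stand-in for a real tagger: looks words up in a small dict."""

    def __init__(self, lexicon, default="NN"):
        self.lexicon = lexicon
        self.default = default

    def tag(self, tokens):
        # Unknown words fall back to the default tag.
        return [(tok, self.lexicon.get(tok.lower(), self.default)) for tok in tokens]


# Hypothetical two-sentence input, standing in for the sentences of a corpus.
toy = LookupTagger({"the": "DT", "barks": "VBZ"})
result = tag_sentences(toy, [["The", "dog", "barks"], ["Cats"]])
```

In practice the pickled tagger and the sentences from a corpus reader (for example `my_corp.sents()` from the question) would be passed in place of the toy objects. NLTK 3 also provides `nltk.pos_tag_sents` for tagging a list of sentences with its default tagger.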


    09-15 03:36