python - Creating a corpus with PlaintextCorpusReader and analyzing it

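I am still fairly new to Python, and I am interested in learning how to create a corpus with NLTK's PlaintextCorpusReader. I can import all of the documents. However, when I run code to tokenize the text across the whole corpus, it returns an error. I apologize if this question is a duplicate, but I would like some insight into this.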

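Here is the code that imports the documents. I have a bunch of documents related to the 2016 DNC on my computer (for reproducibility, get some or all of the text files from https://github.com/lin-jennifer/2016NCtranscripts).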

import os
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import stopwords

corpus_root = '/Users/JenniferLin/Desktop/Data/DNCtexts'
DNClist = PlaintextCorpusReader(corpus_root, '.*')

DNClist.fileids()

#Print the words of one of the texts to make sure everything is loaded
DNClist.words('dnc.giffords.txt')

type(DNClist)

str(DNClist)

When I go to tokenize the text, here is the code and the output.

Code:

from nltk.tokenize import sent_tokenize, word_tokenize

DNCtokens = sent_tokenize(DNClist)

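Output: TypeError: expected string or bytes-like object

Even if I do something like DNClist.paras(), I get an error that reads UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9b in position 7: invalid start byte

I am wondering whether the error is in how I load the documents or in the tokenizing step.

Thank you so much!

Best answer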

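It looks like what you want to do is tokenize the plain-text documents in a folder. If so, you can do that by asking the PlaintextCorpusReader for the tokens, rather than by passing the PlaintextCorpusReader to the sentence tokenizer. sent_tokenize expects a string, which is why handing it the reader object raises the TypeError. So instead of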

DNCtokens = sent_tokenize(DNClist)

consider

DNCtokens = DNClist.sents() to get sentences, or DNCtokens = DNClist.paras() to get paragraphs.
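As a minimal sketch of the difference, the snippet below builds a tiny throwaway corpus in a temporary folder and asks the reader for sentences and paragraphs. The file name speech.txt and its contents are made up for illustration. It also passes an untrained PunktSentenceTokenizer as sent_tokenizer, which is an assumption made only so the example runs without downloading the Punkt model data; the reader's default sentence tokenizer needs that data to be installed.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import PunktSentenceTokenizer

# Build a tiny throwaway corpus so the example runs anywhere.
# 'speech.txt' is a hypothetical file, not one of the DNC transcripts.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, 'speech.txt'), 'w', encoding='utf-8') as f:
    f.write('Hello world. This is a test.\n\nSecond paragraph here.\n')

# Assumption: an untrained Punkt tokenizer is passed in so that no model
# download is needed for this sketch.
reader = PlaintextCorpusReader(
    corpus_root, r'.*\.txt', sent_tokenizer=PunktSentenceTokenizer()
)

sents = reader.sents()  # each sentence is a list of word tokens
paras = reader.paras()  # each paragraph is a list of sentences

# A sentence tokenizer works on a string, not on the reader itself.
# raw() returns the text of a file as a string.
raw_text = reader.raw('speech.txt')
raw_sents = PunktSentenceTokenizer().tokenize(raw_text)

print(list(sents))
print(len(paras))
print(raw_sents)
```

If you do want to call a sentence tokenizer yourself, reader.raw(fileid) gives you the string it expects; otherwise sents() and paras() already do the tokenizing for you.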

The source code for the reader shows that it holds a word tokenizer and a sentence tokenizer, and that it calls them to do the kind of tokenization you seem to want.
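The UnicodeDecodeError from the question looks like a separate problem: the reader decodes files as UTF-8 by default, and at least one file it opened contains bytes that are not valid UTF-8. The sketch below is one possible way to handle that, under two assumptions that may not match your files. First, it assumes the file was saved in a single-byte encoding; 'latin-1' is only an example, and the right value depends on how your files were written. Second, it narrows the file pattern from '.*' to r'.*\.txt', since '.*' matches every file in the folder, including any that are not text. The file name and contents are made up.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical corpus: one file saved as Latin-1 rather than UTF-8.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, 'speech.txt'), 'wb') as f:
    f.write('The café was full of delegates tonight.\n'.encode('latin-1'))

# With the default UTF-8 encoding, the 0xe9 byte for 'é' is expected
# to fail to decode once the words are actually read.
utf8_reader = PlaintextCorpusReader(corpus_root, r'.*\.txt')
try:
    list(utf8_reader.words())
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print('UTF-8 read failed:', err)

# Passing the encoding the file was really saved in lets the reader decode it.
latin1_reader = PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding='latin-1')
words = list(latin1_reader.words())
print(words)
```

If only some of the files are affected, checking DNClist.fileids() for unexpected entries is a quick way to see whether a stray non-text file was picked up.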

Regarding python - Creating a corpus with PlaintextCorpusReader and analyzing it, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57395685/