I. Accessing Text Corpora
A text corpus is a large body of text. It usually contains many individual texts, but for convenience of processing we concatenate them end to end and treat them as a single text.
1. The Gutenberg Corpus
nltk includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and then try nltk.corpus.gutenberg.fileids(). Example:
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt'
, 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-a
lice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.t
xt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', '
shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'w
hitman-leaves.txt']
>>>
The output lists which texts from this corpus nltk includes. We can operate on any of these texts.
1) Counting words. Example:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>>
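Besides the total token count, it is often useful to know how many distinct words a text uses. A minimal sketch along the same lines (outputs omitted; the exact figures depend on your NLTK data):
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(set(w.lower() for w in emma))   # vocabulary size, ignoring case
>>> len(emma) / len(set(emma))          # average uses per (case-sensitive) word type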
2) Concordancing a text. Example:
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
that Emma could not but feel some surprise , and a little displeasure , on he
>>>
3) Getting the raw text, words, and sentences of each file. Example:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)
...     num_sents = len(sents)
...     vocab = set([w.lower() for w in gutenberg.words(fileid)])
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt
>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses fo
r my Body let us write, (for we are one,)\nThat should I after return,\nOr, long
, long hence, in other spheres,\nThere to some group of mates the chants resumin
g,\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd
smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and
now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BO
OK I. INSCRIPTIONS]\n\n} One's-Self I Sing\n\nOne's-self I sing, a simple sepa
rate person,\nYet utter the word Democratic, the word En-Masse.\n\nOf physiology
from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for th
e Muse, I say\n the Form complete is worthier far,\nThe Female equally with t
he Male I sing.\n\nOf Life immense in passion, pulse, and power,\nCheerful, for
freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n} As
I Ponder'd in Silence\n\nAs I ponder'd in silence,\nReturning upon my poems, c"
>>>
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'], ['Come',
',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body', 'let', 'u
s', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That', 'should', 'I', '
after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence', ',', 'in', 'othe
r', 'spheres', ',', 'There', 'to', 'some', 'group', 'of', 'mates', 'the', 'chant
s', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's', 'soil', ',', 'trees', '
,', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever', 'with', 'pleas', "'", 'd'
, 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever', 'and', 'ever', 'yet', 'the', '
verses', 'owning', '--', 'as', ',', 'first', ',', 'I', 'here', 'and', 'now', 'Si
gning', 'for', 'Soul', 'and', 'Body', ',', 'set', 'to', 'them', 'my', 'name', ',
'], ...]
Here raw is the original character content of the file, words is the list of word tokens, and sents is the list of sentences, each of which is itself stored as a list of words. Besides words(), raw(), and sents(), most nltk corpus readers provide a variety of other access methods.
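Putting these access methods together, a small variation on the loop above reports average word length, average sentence length, and how many times each vocabulary item is used on average (a sketch; the three ratios are rounded):
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     # average word length, average sentence length, lexical diversity score
...     print(round(num_chars/num_words), round(num_words/num_sents),
...           round(num_words/num_vocab), fileid)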
2. Web and Chat Text
Project Gutenberg contains many thousands of books; they are relatively formal and represent established literature. In addition, nltk includes a small collection of web text, whose content includes a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal ads, and wine reviews. An example of accessing these texts:
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print("%s %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se
...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr
...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun
...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>>
3. Instant Messaging Chat Sessions Corpus
This corpus was originally collected by the US Naval Postgraduate School for research on automatically detecting Internet predators. It contains over 10,000 posts, organized into 15 files, where each file contains several hundred posts collected from an age-specific chat room on a given date. The file names encode the date, the chat room, and the number of posts.
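A minimal sketch of reading posts from one of these files with the nps_chat reader (the file name follows the date/age/post-count convention just described; each post is returned as a list of tokens):
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]    # one post, as a tokenized chat line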
4. The Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English. It contains 500 texts from different sources, categorized by genre, such as news and editorial. It is mainly used to study systematic differences between genres (a kind of linguistic inquiry known as stylistics). We can access the corpus as a list of words or as a list of sentences.
1) Reading by a particular category or file
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investiga
tion', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no
', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['T
he', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the',
'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'o
f', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks',
'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which
', 'the', 'election', 'was', 'conducted', '.'], ...]
>>>
2) Comparing the usage of modal verbs across genres
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist=nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
... print("%s:%d" %(m, fdist[m]))
...
can:94
could:87
may:93
might:38
must:53
will:389
>>>
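To compare the same modal verbs across several genres at once, a conditional frequency distribution can be tabulated. A short sketch, continuing the session above so that brown, nltk, and modals are already defined (the genre selection is illustrative):
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> cfd.tabulate(conditions=genres, samples=modals)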
5. The Reuters Corpus
The Reuters Corpus contains 10,788 news documents totalling 1.3 million words. The documents are classified into 90 topics and split into a training set and a test set; this split is intended for training and testing algorithms that automatically detect the topic of a document. Unlike the Brown Corpus, the categories in the Reuters Corpus overlap, because a news story often covers more than one topic. We can look up the topics covered by one or more documents, or the documents included in one or more categories. Examples:
>>> from nltk.corpus import reuters
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', ...]
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/158
75', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/152
87', 'test/15341', 'test/15618', 'test/15648', 'test/15649', ...]
>>>
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', '
operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]
>>>
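The training/test split mentioned above is encoded directly in the file ids: documents whose ids begin with 'training/' form the training set and those beginning with 'test/' form the test set. A minimal sketch for separating the two groups:
>>> train_docs = [f for f in reuters.fileids() if f.startswith('training/')]
>>> test_docs = [f for f in reuters.fileids() if f.startswith('test/')]
>>> len(train_docs) + len(test_docs)    # together they cover all 10788 documents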
6. The Inaugural Address Corpus
This corpus is actually a collection of 56 texts, one for each presidential inaugural address. A notable property of this collection is its time dimension.
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson
.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monro
e.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.t
xt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt
', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt
', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1
885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.tx
t', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt
', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt'
, '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosev
elt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961
-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Car
ter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.t
xt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825',
'1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865',
'1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905',
'1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945',
'1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985',
'1989', '1993', '1997', '2001', '2005', '2009']
>>>
Note that the year of each text appears in its file name. To extract the year from the file name, use fileid[:4] to take the first four characters.
>>> import nltk
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
The example above plots how the words america and citizen have been used over time. Every word in the inaugural corpus that begins with america or citizen is counted. Each address is counted separately and plotted, so we can observe how these usages have evolved over time. The counts are not normalized for document length.
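If normalized figures are wanted, one simple option is to divide each year's count by the number of words in the corresponding address. A sketch of that idea, reusing the cfd built above (the names lengths and normalized are illustrative, not part of NLTK):
>>> lengths = {fileid[:4]: len(inaugural.words(fileid))
...            for fileid in inaugural.fileids()}
>>> normalized = {year: cfd['america'][year] / lengths[year]
...               for year in lengths}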
7. Annotated Text Corpora
Many text corpora contain linguistic annotations: part-of-speech tags, named entities, syntactic structures, semantic roles, and so on. nltk provides convenient ways to access several of these corpora, and it distributes data packages containing corpora and corpus samples that can be downloaded free of charge for teaching and research.
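For example, the Brown Corpus is also distributed with part-of-speech tags, which its reader exposes through an additional access method. A brief sketch:
>>> from nltk.corpus import brown
>>> brown.tagged_words(categories='news')[:5]    # (word, tag) pairs such as ('The', 'AT')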
8. Corpora in Other Languages
nltk also includes corpora in many languages. For example, udhr contains the Universal Declaration of Human Rights in more than 300 languages. The fileids of this corpus include information about the character encoding used in each file, such as UTF8 or Latin1. We can use a conditional frequency distribution to examine differences in word length across different language versions of the Universal Declaration of Human Rights (udhr). Example:
>>> from nltk.corpus import udhr
>>> languages=['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut'
, 'Hungarian_Magyar', 'Ibibio_Efik']
>>>
>>> cfd=nltk.ConditionalFreqDist(
... (lang, len(word))
... for lang in languages
... for word in udhr.words(lang+'-Latin1'))
>>> cfd.plot(cumulative=True)
>>>
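The same conditional frequency distribution can also be tabulated instead of plotted, for example cumulative counts of words up to length 9 for two of the languages (a sketch; any conditions from the languages list can be chosen):
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)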
9. Basic corpus functions defined in nltk
Example | Description
fileids() | The files of the corpus
fileids([categories]) | The files of the corpus corresponding to these categories
categories() | The categories of the corpus
categories([fileids]) | The categories of the corpus corresponding to these files
raw() | The raw content of the corpus
raw(fileids=[f1, f2, f3]) | The raw content of the specified files
raw(categories=[c1, c2]) | The raw content of the specified categories
words() | The words of the whole corpus
words(fileids=[f1, f2, f3]) | The words of the specified files
words(categories=[c1, c2]) | The words of the specified categories
sents() | The sentences of the whole corpus
sents(fileids=[f1, f2, f3]) | The sentences of the specified files
sents(categories=[c1, c2]) | The sentences of the specified categories
abspath(fileid) | The location of the given file on disk
encoding(fileid) | The encoding of the file (if known)
open(fileid) | Open a stream for reading the given corpus file
root() | The path to the root of the locally installed corpus
readme() | The contents of the corpus's README file
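A quick illustration of a few of these methods on the Gutenberg corpus (a sketch; the outputs depend on your local installation, so they are left out):
>>> from nltk.corpus import gutenberg
>>> gutenberg.encoding('austen-emma.txt')    # character encoding of the file, if known
>>> gutenberg.abspath('austen-emma.txt')     # location of the file on disk
>>> gutenberg.raw('austen-emma.txt')[:75]    # first 75 characters of the raw text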
10. Loading your own corpus
>>> from nltk.corpus import *
>>> corpus_root = r"E:\corpora"    # local directory holding your texts; the original nltk data directory is D:\
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()    # list the files that were found
['README', 'aaaaaaaaaaa.txt', 'austen-emma.txt', 'austen-persuasion.txt', 'auste
n-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess
-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown
.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'luo.txt', 'melville-
moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-ha
mlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>>
Note that aaaaaaaaaaa.txt in the listing above is a file added by the user. Once your own corpus has been loaded successfully, you can use the various corpus functions to work with its texts.
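For instance, the reader created above supports the same access methods listed in the table in section 9 (a sketch; aaaaaaaaaaa.txt stands for whichever file you placed in the directory):
>>> wordlists.words('aaaaaaaaaaa.txt')     # tokens of one of your own files
>>> wordlists.sents('aaaaaaaaaaa.txt')     # its sentences, as lists of tokens
>>> len(wordlists.raw('aaaaaaaaaaa.txt'))  # number of characters in the file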