This article looks at Python re.split() vs nltk word_tokenize and sent_tokenize, and should be a useful reference for anyone running into the same question.

Problem description


I was going through this question.

I am just wondering whether NLTK would be faster than regexes for word/sentence tokenization.

Solution

The default nltk.word_tokenize() uses the Treebank tokenizer, which emulates the tokenizer of the Penn Treebank.

Do note that str.split() doesn't produce tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence."
>>> sent.split()
['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
>>> from nltk import word_tokenize
>>> word_tokenize(sent)
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']

It is usually used to separate strings with a specified delimiter, e.g. str.split('\t') for a tab-separated file, or str.split('\n') when your text file has one sentence per line.
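For instance, the delimiter-based uses of str.split() mentioned above look like this (a minimal illustration with made-up data):

```python
# str.split() with an explicit delimiter: fields of a tab-separated record
row = "word\tPOS\tlemma"
fields = row.split('\t')
# → ['word', 'POS', 'lemma']

# str.split('\n') when the file has one sentence per line
text = "One sentence per line.\nAnother sentence."
lines = text.split('\n')
# → ['One sentence per line.', 'Another sentence.']
```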

And let's do some benchmarking in Python 3:

import time
from nltk import word_tokenize

import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print('str.split():\t', time.time() - start)

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print('word_tokenize():\t', time.time() - start)

[out]:

str.split():     0.05451083183288574
str.split():     0.054320573806762695
str.split():     0.05368804931640625
str.split():     0.05416440963745117
str.split():     0.05299568176269531
str.split():     0.05304527282714844
str.split():     0.05356955528259277
str.split():     0.05473494529724121
str.split():     0.053118228912353516
str.split():     0.05236077308654785
word_tokenize():     4.056122779846191
word_tokenize():     4.052812337875366
word_tokenize():     4.042144775390625
word_tokenize():     4.101543664932251
word_tokenize():     4.213029146194458
word_tokenize():     4.411528587341309
word_tokenize():     4.162556886672974
word_tokenize():     4.225975036621094
word_tokenize():     4.22914719581604
word_tokenize():     4.203172445297241

If we try another tokenizer from bleeding-edge NLTK, ported from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:

import time
from nltk.tokenize import ToktokTokenizer

import urllib.request
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

toktok = ToktokTokenizer().tokenize

for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print('toktok:\t', time.time() - start)

[out]:

toktok:  1.5902607440948486
toktok:  1.5347232818603516
toktok:  1.4993178844451904
toktok:  1.5635688304901123
toktok:  1.5779635906219482
toktok:  1.8177132606506348
toktok:  1.4538452625274658
toktok:  1.5094449520111084
toktok:  1.4871931076049805
toktok:  1.4584410190582275

(Note: the source of the text file is from https://github.com/Simdiva/DSL-Task)


If we look at the native Perl implementation, the Python vs. Perl times for the ToktokTokenizer are comparable. Do note, though, that in the Python implementation the regexes are pre-compiled while in Perl they aren't, but the proof is still in the pudding:

alvas@ubi:~$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36--  https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690 (2.6K) [text/plain]
Saving to: ‘tok-tok.pl’

100%[===============================================================================================================================>] 2,690       --.-K/s   in 0s

2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690]

alvas@ubi:~$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38--  https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3483550 (3.3M) [text/plain]
Saving to: ‘test.txt’

100%[===============================================================================================================================>] 3,483,550    363KB/s   in 7.4s

2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550]

alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.703s
user    0m1.693s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.715s
user    0m1.704s
sys 0m0.008s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.700s
user    0m1.686s
sys 0m0.012s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.727s
user    0m1.700s
sys 0m0.024s
alvas@ubi:~$ time perl tok-tok.pl < test.txt > /tmp/null

real    0m1.734s
user    0m1.724s
sys 0m0.008s

(Note: When timing tok-tok.pl, we had to pipe the output into a file, so the timing here includes the time the machine takes to write the output to a file, whereas the nltk.tokenize.ToktokTokenizer timing doesn't include the time to output to a file.)
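As a side note, the regex pre-compilation mentioned above just means building the pattern object once with re.compile() and reusing it across calls, rather than handing a pattern string to the re functions every time. This is a generic sketch of the idea, not the actual ToktokTokenizer code:

```python
import re

# Compile once; reuse across many calls. In a loop over thousands of
# lines this avoids repeatedly looking up/parsing the pattern string.
WHITESPACE = re.compile(r'\s+')

def fast_split(line):
    return WHITESPACE.split(line.strip())

tokens = fast_split("This  is a\tfoo bar sentence .")
# → ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
```

In practice Python's re module also caches recently used patterns internally, so the saving is mostly in skipping that cache lookup on every call.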


With regards to sent_tokenize(), it's a little different, and comparing speed benchmarks without considering accuracy is a little quirky.

Consider this:

  • If a regex splits a textfile/paragraph into just 1 sentence, then the speed is almost instantaneous, i.e. 0 work done. But that would be a horrible sentence tokenizer...

  • If the sentences in a file are already separated by newlines, then it is simply a case of comparing str.split('\n') vs re.split('\n'), and nltk would have nothing to do with the sentence tokenization ;P
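The second point can be checked directly: on newline-separated sentences, str.split('\n') and re.split('\n') produce identical output, so any speed difference between them says nothing about tokenization quality:

```python
import re

doc = "First sentence.\nSecond sentence.\nThird sentence."
# Both calls yield the same list of sentences.
assert doc.split('\n') == re.split('\n', doc)
# → ['First sentence.', 'Second sentence.', 'Third sentence.']
```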

For information on how sent_tokenize() works in NLTK, see:

So to effectively compare sent_tokenize() vs other regex-based methods (not str.split('\n')), one would also have to evaluate accuracy, and have a dataset with human-evaluated sentences in a tokenized format.
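A minimal sketch of what that kind of evaluation could look like, with a hypothetical evaluate() helper that scores predicted sentences by exact string match against the gold sentences, ignoring order:

```python
def evaluate(predicted, gold):
    # Count predicted sentences that exactly match a gold sentence.
    tp = sum(1 for sent in predicted if sent in gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = ["No!", "We are very far from that."]
pred = ["No!", "We are very far", "from that."]
p, r = evaluate(pred, gold)
# → p = 1/3 (1 of 3 predictions correct), r = 1/2 (1 of 2 gold found)
```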

Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

Given the text:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

We want to get this:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
Such were Willarski and even the Grand Master of the principal lodge.
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
Pierre began to feel dissatisfied with what he was doing.
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
What is to be done in these circumstances?
To favor revolutions, overthrow everything, repel force by force?
No!
We are very far from that.
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
"But what is there in running across it like that?" said Ilagin's groom.
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you will get 0 positive results:

>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance.
... Such were Willarski and even the Grand Master of the principal lodge.
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined.
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.
... Pierre began to feel dissatisfied with what he was doing.
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals.
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles.
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.
... What is to be done in these circumstances?
... To favor revolutions, overthrow everything, repel force by force?
... No!
... We are very far from that.
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence.
... "But what is there in running across it like that?" said Ilagin's groom.
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement."""
>>>
>>> output = text.split('\n')
>>> sum(1 for sent in text.split('\n') if sent in answer)
0
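For comparison, even a naive regex-based sentence splitter, splitting on whitespace that follows ., ! or ?, recovers sentences where str.split('\n') recovers none (a toy sketch; the pattern below is not from the original answer):

```python
import re

sample = ("What is to be done in these circumstances? To favor revolutions, "
          "overthrow everything, repel force by force? No! We are very far from that.")
# Split at whitespace preceded by sentence-final punctuation.
sents = re.split(r'(?<=[.!?])\s+', sample)
# → ['What is to be done in these circumstances?',
#    'To favor revolutions, overthrow everything, repel force by force?',
#    'No!',
#    'We are very far from that.']
```

On the full passage above, this naive pattern would still stumble on the quoted dialogue and run-together punctuation, which is exactly why accuracy has to be part of any comparison with sent_tokenize().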

That wraps up this article on Python re.split() vs nltk word_tokenize and sent_tokenize. We hope the answer above is helpful, and thanks for reading!
