Problem Description
I tried to implement a regular expression tokenizer with nltk in Python, but the result is this:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
But the desired result is this:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why? Where is the mistake?
Recommended Answer
You should turn all the capturing groups into non-capturing ones:
- ([A-Z]\.)+ -> (?:[A-Z]\.)+
- \w+(-\w+)* -> \w+(?:-\w+)*
- \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?
The issue is that regexp_tokenize seems to use re.findall, which returns lists of capture-group tuples when multiple capturing groups are defined in the pattern. See the nltk.tokenize package reference: the pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.
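A minimal sketch of that behaviour with plain re (the shortened pattern and test string here are just for illustration, not the exact NLTK internals):

>>> import re
>>> # with capturing groups, findall returns tuples of the group captures
>>> re.findall(r'([A-Z]\.)+|\w+(-\w+)*', 'U.S.A. poster-print')
[('A.', ''), ('', '-print')]
>>> # with non-capturing groups, findall returns the whole matches
>>> re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*', 'U.S.A. poster-print')
['U.S.A.', 'poster-print']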
Also, I am not sure you wanted to use :-_, which matches a range including all the uppercase letters; put the - at the end of the character class instead.
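A quick way to see the range problem (the test string is just an illustration): in ASCII, ':' is code 58 and '_' is 95, so the :-_ range covers codes 58 through 95, which include A-Z.

>>> import re
>>> re.findall(r'[:-_]', 'A Z a z')   # the :-_ range also matches uppercase letters
['A', 'Z']
>>> re.findall(r'[:_-]', 'A Z a z')   # with '-' at the end it is literal, so nothing matches here
[]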
So, use
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
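With the non-capturing groups in place, re-running the original call should now yield the desired tokens (assuming text and nltk from the question are still defined):

>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']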