Update:
Despite aggressive cleaning, some words with periods are still being tokenized with the period attached, including strings where a space sits between the period and a quotation mark. I've created a public link to a Jupyter Notebook with examples of the problem: https://drive.google.com/file/d/0B90qb2J7ZLYrZmItME5RRlhsVWM/view?usp=sharing
Or, a shorter example:
word_tokenize('This is a test. "')
['This', 'is', 'a', 'test.', '``']
But the problem disappears when a different kind of double quote is used:
word_tokenize('This is a test. ”')
['This', 'is', 'a', 'test', '.', '”']
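If it helps to experiment, one workaround suggested by the two examples above is to map ASCII double quotes to the typographic form before tokenizing. The normalize_quotes helper below is hypothetical (not from the original code), and it loses the open/close quote distinction; it is only a sketch of the idea.
from nltk import word_tokenize

def normalize_quotes(text):
    # Replace ASCII double quotes with a typographic closing quote,
    # which the curly-quote example above shows the tokenizer splits cleanly.
    return text.replace('"', '”')

word_tokenize(normalize_quotes('This is a test. "'))
# Based on the curly-quote example above, this should yield:
# ['This', 'is', 'a', 'test', '.', '”']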
Original:
I've pulled a large amount of text and created a Counter to look at the counts of each word, then transferred that Counter into a DataFrame for easier handling. Each piece of text is a large string of between 100 and 5,000 words. The DataFrame with the word counts looks like this, for example, taking only the words with a count of 11:
allwordsdf[(allwordsdf['count'] == 11)]
words count
551 throughlin 11
1921 rampd 11
1956 pinhol 11
2476 reckhow 11
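For context, a minimal sketch of the Counter-to-DataFrame step described above might look like the following (this is a reconstruction, assuming a 'stems' column of per-article token lists like the one built by the function under the first Edit below; the exact construction in the original code isn't shown):
from collections import Counter

import pandas as pd

# Count every stem across all articles, then move the counts into a DataFrame.
counts = Counter()
for stems in articles['stems']:
    counts.update(stems)

allwordsdf = pd.DataFrame({'words': list(counts.keys()),
                           'count': list(counts.values())})
allwordsdf[allwordsdf['count'] == 11]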
What I've noticed is that there are a lot of words that haven't been fully stemmed and have a period attached to the end. For example:
4233 activist. 11
9243 storyline. 11
I'm not sure what's causing this. I know that it's generally stemming periods separately, since the row for the period is at:
23 . 5702880
Also, it doesn't seem to be doing this for every instance of, say, 'activist':
len(articles[articles['content'].str.contains('activist.')])
9600
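One thing to keep in mind when reading that number: str.contains treats the pattern as a regular expression by default, so the unescaped period matches any character (e.g. "activists", "activist,"). To count only literal occurrences of 'activist.' the period could be escaped, as in this hypothetical check:
len(articles[articles['content'].str.contains(r'activist\.')])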
I don't know if I'm overlooking something. Yesterday I ran into a problem with the NLTK stemmer that turned out to be a bug, and I don't know whether this is that or something I'm doing (the latter always being more likely).
Thanks for any guidance.
Edit:
Here is the function I'm using:
progress = 0
start = time.time()

def stem(x):
    end = time.time()
    tokens = word_tokenize(x)
    global start
    global progress
    progress += 1
    sys.stdout.write('\r {} percent, {} position, {} per second '.format(
        str(float(progress / len(articles))), str(progress), (1 / (end - start))))
    stems = [stemmer.stem(e) for e in tokens]
    start = time.time()
    return stems

articles['stems'] = articles.content.apply(lambda x: stem(x))
Edit 2:
Here is a JSON of some of the data: all the strings, tokens and stems.
Here is a snapshot of everything that, after tokenizing and stemming all the words, still has a period in it:
allwordsdf[allwordsdf['words'].str.contains('\.')] #dataframe made from the counter dict
words count
23 . 5702875
63 years. 1231
497 was. 281
798 lost. 157
817 jie. 1
819 teacher. 24
858 domains. 1
875 fallout. 3
884 net. 23
889 option. 89
895 step. 67
927 pool. 30
936 that. 4245
954 compute. 2
1001 dr. 11007
1010 decisions. 159
The length of that slice is around 49,000.
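If the goal is simply to clean up the counts after the fact, one option (a sketch, not from the original code) is to strip the trailing period from these entries and re-aggregate; note this would also merge rows for legitimate abbreviations such as 'dr.':
# Fold 'word.' entries back into their bare stems, then re-sum the counts.
cleaned = allwordsdf.copy()
cleaned['words'] = cleaned['words'].str.rstrip('.').replace('', '.')  # keep the lone '.' row intact
cleaned = cleaned.groupby('words', as_index=False)['count'].sum()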
Edit 3:
Alvas's answer cut the number of periods roughly in half, down to about 24,000 unique words with a total count of 518,980, which is a lot. The problem, as I've found, is that it happens every time there is a period followed by a quotation mark. For example, take the string 'sickened.', which shows up once among the tokenized words.
If I search the corpus:
articles[articles['content'].str.contains(r'sickened\.[^\s]')]
the only place it shows up in the entire corpus is here:
...said he was “sickened.” Trump's running mate...
This isn't an isolated incident; it's what I've found over and over when searching for these terms. They're followed by a quotation mark every time. The tokenizer fails not only on words followed by period-quote-character, but also on period-quote-space.
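A quick, hedged way to check how widespread the pattern is in the corpus (the regex here is an assumption: a word character, a period, then a straight or curly closing quote):
articles['content'].str.count(r'\w\.["”]').sum()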
Best Answer
You need to tokenize the string before stemming:
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> text = 'This is a foo bar sentence, that contains punctuations.'
>>> porter = PorterStemmer()
>>> [porter.stem(word) for word in text.split()]
[u'thi', 'is', 'a', 'foo', 'bar', 'sentence,', 'that', u'contain', 'punctuations.']
>>> [porter.stem(word) for word in word_tokenize(text)]
[u'thi', 'is', 'a', 'foo', 'bar', u'sentenc', ',', 'that', u'contain', u'punctuat', '.']
In a DataFrame:
porter = PorterStemmer()
articles['tokens'] = articles['content'].apply(word_tokenize)
articles['stem'] = articles['tokens'].apply(lambda x: [porter.stem(word) for word in x])
>>> import pandas as pd
>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize
>>> sents = ['This is a foo bar, sentence.', 'Yet another, foo bar!']
>>> df = pd.DataFrame(sents, columns=['content'])
>>> df
content
0 This is a foo bar, sentence.
1 Yet another, foo bar!
# Apply tokenizer.
>>> df['tokens'] = df['content'].apply(word_tokenize)
>>> df
content tokens
0 This is a foo bar, sentence. [This, is, a, foo, bar, ,, sentence, .]
1 Yet another, foo bar! [Yet, another, ,, foo, bar, !]
# Without DataFrame.apply
>>> df['tokens'][0]
['This', 'is', 'a', 'foo', 'bar', ',', 'sentence', '.']
>>> [porter.stem(word) for word in df['tokens'][0]]
[u'thi', 'is', 'a', 'foo', 'bar', ',', u'sentenc', '.']
# With DataFrame.apply
>>> df['tokens'].apply(lambda row: [porter.stem(word) for word in row])
0 [thi, is, a, foo, bar, ,, sentenc, .]
1 [yet, anoth, ,, foo, bar, !]
# Or if you like nested lambdas.
>>> df['tokens'].apply(lambda x: map(lambda y: porter.stem(y), x))
0 [thi, is, a, foo, bar, ,, sentenc, .]
1 [yet, anoth, ,, foo, bar, !]
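One caveat if running the nested-lambda variant above on Python 3 (the u'' prefixes in the outputs suggest the answer was written on Python 2): map returns a lazy iterator there, so the column would hold map objects rather than lists. Wrapping it in list() keeps materialized lists:
>>> df['tokens'].apply(lambda x: list(map(porter.stem, x)))
0 [thi, is, a, foo, bar, ,, sentenc, .]
1 [yet, anoth, ,, foo, bar, !]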
Regarding "python - NLTK stemming occasionally includes punctuation in the stemmed words", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/45091297/