I am trying to apply a RegexpTokenizer to a column of a dataframe.

Dataframe:

    all_cols
0   who is your hero and why
1   what do you do to relax
2   can't stop to eat
4   how many hours of sleep do you get a night
5   describe the last time you were relax


Script:

import re
import nltk
import pandas as pd
from nltk import RegexpTokenizer

# tokenize the data and drop missing (NA) values
df['all_cols'].dropna(inplace=True)

tokenizer = RegexpTokenizer("[\w']+")
df['all_cols'] = df['all_cols'].apply(tokenizer)


Error:


  TypeError: 'RegexpTokenizer' object is not callable


But I don't understand why. When I use another nltk tokenizer, word_tokenize, it works fine...

Best Answer

Note that when you call RegexpTokenizer(...), you are only creating an instance of the class with a given set of arguments (i.e. invoking its __init__ method); the instance itself is not callable.
To actually tokenize the dataframe column with the specified pattern, you have to call its RegexpTokenizer.tokenize method:

tokenizer = RegexpTokenizer("[\w']+")
df['all_cols'] = df['all_cols'].map(tokenizer.tokenize)

       all_cols
0  [who, is, your, hero, and, why]
1   [what, do, you, do, to, relax]
...
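
As an aside, the reason word_tokenize can be applied directly is that nltk exposes it as a plain function rather than a class, so it is callable as-is. A minimal sketch of the difference (word_tokenize may additionally require the 'punkt' models via nltk.download('punkt')):

from nltk import RegexpTokenizer
from nltk.tokenize import word_tokenize

# word_tokenize is a function, so it can be passed to map/apply directly;
# note that it splits contractions
print(word_tokenize("can't stop to eat"))        # ['ca', "n't", 'stop', 'to', 'eat']

# RegexpTokenizer is a class: instantiate it first, then use its .tokenize method
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("can't stop to eat"))   # ["can't", 'stop', 'to', 'eat']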
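
Separately, note that the df['all_cols'].dropna(inplace=True) line in the question operates on a column slice and will not actually remove the corresponding rows from df itself (recent pandas versions may also warn about this). A minimal sketch of the usual idiom, assuming the intent was to discard rows with missing text:

import pandas as pd

df = pd.DataFrame({'all_cols': ["who is your hero and why", None, "can't stop to eat"]})

# drop rows whose 'all_cols' value is missing at the dataframe level,
# rather than calling dropna(inplace=True) on the column slice
df = df.dropna(subset=['all_cols'])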

On python - Using a RegexpTokenizer in a Pandas dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57039945/
