实验对比了一下三种切分方式:
1,2 : nltk.word_tokenize : 分离缩略词,(“Don't” =>'Do', "n't") 表句子切分的“,” "." 单独成词。
3 : TreebankWordTokenizer: 分离缩略词, 表句子切分的 “,"单独成词,句号“.”被删去。
4 : PunktWordTokenizer: 报错: cannot import name 'PunktWordTokenizer'
5 : WordPunctTokenizer: 将标点转化为全新标识符实现切分。(“Don't” =>'Don', "'", 't')
import nltk
text = "We're excited to let you know that. Harry, 18 years old, will join us on Nov. 29. Don't tell him." text_tokenized = nltk.word_tokenize(text)
print("1: word_tokenize:", text_tokenized)
print("length: ", len(text_tokenized)) from nltk import word_tokenize
text_tokenized_2 = word_tokenize(text)
print("2: word_tokenize:", text_tokenized_2)
print("length: ", len(text_tokenized_2)) from nltk.tokenize import TreebankWordTokenizer
tokenizer3 = TreebankWordTokenizer()
text_tokenized_3 = tokenizer3.tokenize(text)
print("3: TreebankWordTokenizer", text_tokenized_3)
print("length: ", len(text_tokenized_3)) # from nltk.tokenize import PunktWordTokenizer
# tokenizer4 = PunktWordTokenizer()
# text_tokenized_4 = tokenizer4.tokenize(text)
# print("4: PunktWordTokenizer", text_tokenized_4)
# print("length: ", len(text_tokenized_4)) from nltk.tokenize import WordPunctTokenizer
tokenizer5 = WordPunctTokenizer()
text_tokenized_5 = tokenizer5.tokenize(text)
print("5: WordPunctTokenizer", text_tokenized_5)
print("length: ", len(text_tokenized_5))
输出:
: word_tokenize: ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '', '.', 'Do', "n't", 'tell', 'him', '.']
length:
: word_tokenize: ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '', '.', 'Do', "n't", 'tell', 'him', '.']
length:
: TreebankWordTokenizer ['We', "'re", 'excited', 'to', 'let', 'you', 'know', 'that.', 'Harry', ',', '', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov.', '29.', 'Do', "n't", 'tell', 'him', '.']
length:
: WordPunctTokenizer ['We', "'", 're', 'excited', 'to', 'let', 'you', 'know', 'that', '.', 'Harry', ',', '', 'years', 'old', ',', 'will', 'join', 'us', 'on', 'Nov', '.', '', '.', 'Don', "'", 't', 'tell', 'him', '.']
length: