Problem description
I've read a paper that uses n-gram counts as features for a classifier, and I was wondering what exactly this means.
Example text: "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"
I can create unigrams, bigrams, trigrams, etc. out of this text, but I have to define the "level" at which to create these n-grams. The "level" can be character, syllable, word, ...
So creating word-level unigrams out of the sentence above would simply produce a list of all its words?
And creating bigrams would produce word pairs, each made of two words that directly follow each other?
So if the paper talks about n-gram counts, it simply creates unigrams, bigrams, trigrams, etc. out of the text and counts how often each n-gram occurs?
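To make the idea concrete, here is a minimal sketch in plain Python (no nltk) that counts n-grams at the word level and at the character level. The helper name `ngram_counts` and the naive whitespace tokenization are assumptions made for illustration only; a real tokenizer would treat punctuation more carefully.

```python
from collections import Counter

def ngram_counts(items, n):
    """Count every run of n consecutive items (words, characters, ...)."""
    # zip over n shifted copies of the sequence yields each n-gram as a tuple
    grams = zip(*(items[i:] for i in range(n)))
    return Counter(grams)

text = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"

# Word level: naive whitespace split, with commas and periods stripped (an assumption)
words = [w.strip(",.").lower() for w in text.split()]
word_unigrams = ngram_counts(words, 1)
word_bigrams = ngram_counts(words, 2)

# Character level: the same function applied to the raw lowercased string
char_trigrams = ngram_counts(text.lower(), 3)

print(word_bigrams.most_common(3))
print(char_trigrams.most_common(3))
```

The only thing that changes between levels is what the sequence is made of: a list of words or a string of characters.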
Is there an existing method for this in Python's nltk package, or do I have to implement my own version?
Recommended answer
I found some old code of mine; maybe it's useful.
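import nltk
from nltk import bigrams
from nltk import trigrams

# word_tokenize relies on NLTK's Punkt tokenizer data, installed separately via nltk.download()
text="""Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam tempus vitae. Morbi justo mauris,
congue sit amet imperdiet ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam"""
# split the text into tokens
tokens = nltk.word_tokenize(text)
# lowercase and drop one-character tokens such as punctuation; this list is the unigrams
tokens = [token.lower() for token in tokens if len(token) > 1]
# bigrams() and trigrams() return generators in NLTK 3, so materialize them as lists
bi_tokens = list(bigrams(tokens))
tri_tokens = list(trigrams(tokens))
# print each distinct trigram with its count
# (the output below is from an older NLTK; newer versions may split "elit." into "elit" and ".")
print([(item, tri_tokens.count(item)) for item in sorted(set(tri_tokens))])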
>>>
[(('adipiscing', 'elit.', 'nullam'), 2), (('amet', 'consectetur', 'adipiscing'), 2),(('amet', 'imperdiet', 'ipsum'), 1), (('congue', 'sit', 'amet'), 1), (('consectetur', 'adipiscing', 'elit.'), 2), (('diam', 'tempus', 'vitae.'), 1), (('dolor', 'sit', 'amet'), 2), (('elit.', 'nullam', 'ornare'), 2), (('imperdiet', 'ipsum', 'dolor'), 1), (('ipsum', 'dolor', 'sit'), 2), (('justo', 'mauris', 'congue'), 1), (('lacus', 'quis', 'pellentesque'), 2), (('lorem', 'ipsum', 'dolor'), 1), (('mauris', 'congue', 'sit'), 1), (('morbi', 'justo', 'mauris'), 1), (('nullam', 'ornare', 'tempor'), 2), (('ornare', 'tempor', 'lacus'), 2), (('pellentesque', 'diam', 'tempus'), 1), (('quis', 'pellentesque', 'diam'), 2), (('sit', 'amet', 'consectetur'), 2), (('sit', 'amet', 'imperdiet'), 1), (('tempor', 'lacus', 'quis'), 2), (('tempus', 'vitae.', 'morbi'), 1), (('vitae.', 'morbi', 'justo'), 1)]
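As for how such counts end up as classifier features, one common approach (not necessarily what the paper in the question does) is to fix a vocabulary of n-grams and represent each document as a vector of counts over that vocabulary. A rough sketch in plain Python follows; the toy documents and the helper names `bigram_counts` and `to_vector` are made up for illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Two toy, already tokenized documents (hypothetical data)
docs = [
    "lorem ipsum dolor sit amet".split(),
    "dolor sit amet dolor sit".split(),
]

counts = [bigram_counts(doc) for doc in docs]

# The vocabulary is every bigram seen in any document, in a fixed order
vocabulary = sorted(set().union(*counts))
index = {gram: i for i, gram in enumerate(vocabulary)}

def to_vector(count):
    """Map a bigram Counter to a count vector aligned with the vocabulary."""
    vec = [0] * len(vocabulary)
    for gram, c in count.items():
        if gram in index:  # bigrams outside the vocabulary are ignored
            vec[index[gram]] = c
    return vec

features = [to_vector(c) for c in counts]
for gram, column in zip(vocabulary, zip(*features)):
    print(gram, column)
```

Each row of `features` can then be passed to a classifier as one sample. In practice the vocabulary is built from the training set only, so n-grams that first appear at test time are simply dropped.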