Question
So I have a list of keywords as follows:
[u"ALZHEIMER'S DISEASE, OLFACTORY, AGING",
u"EEG, COGNITIVE CONTROL, FATIGUE",
u"AGING, OBESITY, GENDER",
u"AGING, COGNITIVE CONTROL, BRAIN IMAGING"]
Then I want to use CountVectorizer to tokenize, so that my model has the following dictionary:
[{'ALZHEIMER\'S DISEASE': 0, 'OLFACTORY': 1, 'AGING': 2, 'BRAIN IMAGING': 3, ...}]
Basically, I want to treat the comma as my tokenization pattern (except for the last term in each string, which has no trailing comma). However, feel free to append a , at the end of each string. Here is the code snippet I have right now:
from sklearn.feature_extraction.text import CountVectorizer

ls = [u"ALZHEIMER'S DISEASE, OLFACTORY, AGING",
      u"EEG, COGNITIVE CONTROL, FATIGUE",
      u"AGING, OBESITY, GENDER",
      u"AGING, COGNITIVE CONTROL, BRAIN IMAGING"]

tfidf_model = CountVectorizer(min_df=1, max_df=1, token_pattern=r'(\w{1,}),')
tfidf_model.fit_transform(ls)
print(tfidf_model.vocabulary_.keys())
>>> [u'obesity', u'eeg', u'olfactory', u'disease']
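The problem is the token_pattern itself: when the pattern contains a capture group, CountVectorizer uses only the captured group as the token, and \w matches neither spaces nor apostrophes, so multi-word terms get chopped up and the last term in each string (no trailing comma) is dropped entirely. A standalone sketch of what the regex actually matches:

import re

print(re.findall(r'(\w{1,}),', u"ALZHEIMER'S DISEASE, OLFACTORY, AGING"))
# [u'DISEASE', u'OLFACTORY'] -- "ALZHEIMER'S" breaks at the apostrophe,
# and "AGING" has no trailing comma, so both are lost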
Feel free to comment if you want more information.
Answer
Here is the answer I came up with. First, I transform each document into a list of terms:
docs = list(map(lambda s: s.lower().split(', '), ls)) # list of list
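For reference, docs now looks like this:

[["alzheimer's disease", 'olfactory', 'aging'],
 ['eeg', 'cognitive control', 'fatigue'],
 ['aging', 'obesity', 'gender'],
 ['aging', 'cognitive control', 'brain imaging']]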
Then I created a function that generates a vocabulary from the words in the lists and transforms the lists of words into a sparse matrix:
import collections
from itertools import chain

import scipy.sparse as sp

def tag_to_sparse(docs):
    # flatten the list of lists into a single list of terms
    docs_list = list(chain(*docs))
    # build the vocabulary, most frequent term first
    counter = collections.Counter(docs_list)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    vocabulary = dict([(c[0], i) for i, c in enumerate(count_pairs)])
    # one (row, col) index pair per term occurrence
    row_ind = list()
    col_ind = list()
    for i, doc in enumerate(docs):
        for w in doc:
            row_ind.append(i)
            col_ind.append(vocabulary[w])
    value = [1] * len(row_ind)
    X = sp.csr_matrix((value, (row_ind, col_ind)))
    X.sum_duplicates()  # merge repeated (row, col) entries into counts
    return X, vocabulary
Now I can just call X, vocabulary = tag_to_sparse(docs) to get the sparse matrix and the vocabulary dictionary.
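A quick sanity check (the exact indices of the count-1 terms depend on how ties are ordered, so treat the details below as illustrative):

X, vocabulary = tag_to_sparse(docs)
print(vocabulary['aging'])   # 0 -- the most frequent term (3 occurrences)
print(X.shape)               # (4, 9) -- 4 documents, 9 unique terms
print(X.toarray()[0])        # term counts for the first document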
I also just found a way to trick scikit-learn into recognizing , by supplying a custom tokenizer:
import numpy as np
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer

# build the fixed vocabulary from the unique comma-separated terms
vocabulary = list(map(lambda x: x.lower().split(', '), ls))
vocabulary = list(np.unique(list(chain(*vocabulary))))

# split each (already lowercased) document on ', ' instead of the default token pattern
model = CountVectorizer(vocabulary=vocabulary, tokenizer=lambda x: x.split(', '))
X = model.fit_transform(ls)
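To confirm that the multi-word terms now survive as single tokens, something like this should work (np.unique returns the terms in sorted order, which fixes the column indices):

print(sorted(model.vocabulary_.items(), key=lambda kv: kv[1]))
# [(u'aging', 0), (u"alzheimer's disease", 1), (u'brain imaging', 2), ...]
print(X.toarray()[0])  # term counts for the first document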