Creating a TfidfVectorizer over a text column of a huge pandas dataframe

Problem description

I need to get a matrix of TF-IDF features from the text stored in a column of a huge dataframe, loaded from a CSV file that cannot fit in memory. I am trying to iterate over the dataframe in chunks, but my generator yields objects that are not the input type TfidfVectorizer expects. I guess I am doing something wrong in the generator function ChunkIterator shown below.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


# Works only for a small dataset that fits in memory
csvfilename = 'data_elements.csv'
df = pd.read_csv(csvfilename)
vectorizer = TfidfVectorizer()
corpus = df['text_column'].values
vectorizer.fit_transform(corpus)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
print(vectorizer.get_feature_names_out())



# Trying to use a generator to iterate over a huge dataframe
def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        yield chunk['text_column'].values

corpus = ChunkIterator(csvfilename)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

Can anybody please advise how to modify the ChunkIterator method above, or suggest any other approach using a dataframe? I would like to avoid creating separate text files for each row in the dataframe. The following is some dummy CSV data for recreating the scenario.

id,text_column,tags
001, This is the first document .,"['sports','entertainment']"
002, This document is the second document .,"['politics', 'asia']"
003, And this is the third one .,"['europe','nato']"
004, Is this the first document ?,"['sports', 'soccer']"
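
To recreate the scenario on disk, the dummy rows above could be written out with the standard csv module. This is only a minimal sketch: the helper name write_dummy_csv and the ROWS constant are made up for illustration, and the default file name is taken from the code in the question. Letting csv.writer do the quoting keeps the commas inside the tags column from being read as field separators.

```python
import csv

# Rows copied from the dummy data above; the tags column is kept as plain text.
ROWS = [
    ("001", " This is the first document .", "['sports','entertainment']"),
    ("002", " This document is the second document .", "['politics', 'asia']"),
    ("003", " And this is the third one .", "['europe','nato']"),
    ("004", " Is this the first document ?", "['sports', 'soccer']"),
]


def write_dummy_csv(path="data_elements.csv"):
    """Write the header and the four dummy rows to `path` (hypothetical helper)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)  # quotes any field that contains a comma
        writer.writerow(["id", "text_column", "tags"])
        writer.writerows(ROWS)
    return path
```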

Recommended answer

The method accepts generators just fine. But it requires an iterable of raw documents, i.e. strings. Your generator is an iterable of numpy.ndarray objects. So try something like:

def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        # Yield each document string on its own instead of the whole array
        for document in chunk['text_column'].values:
            yield document
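
If you do want to stay with pandas, reading one row per chunk is likely to be slow. The following is a rough sketch of the same idea with larger chunks. The function name iter_documents, the default chunk size of 10,000 rows, and the choice to skip empty cells are assumptions for illustration, not something from the question.

```python
import pandas as pd


def iter_documents(filename, column="text_column", chunksize=10_000):
    """Yield one string per row while reading the CSV chunk by chunk.

    `iter_documents` and the default chunk size are illustrative assumptions.
    """
    # usecols limits parsing to the text column, so the other columns are never loaded.
    for chunk in pd.read_csv(filename, usecols=[column], chunksize=chunksize):
        # dropna() skips empty cells; astype(str) guards against non-string values.
        for document in chunk[column].dropna().astype(str):
            yield document
```

It would then be passed in the same way, e.g. vectorizer.fit_transform(iter_documents('data_elements.csv')). Only the reading of the CSV is streamed here: the vocabulary and the resulting sparse matrix are still built in memory.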

Note, I don't really understand why you are using pandas here. Just use the regular csv module, something like:

import csv
def doc_generator(filepath, textcol=0, skipheader=True):
    # newline='' lets the csv module handle line endings itself
    with open(filepath, newline='') as f:
        reader = csv.reader(f)
        if skipheader:
            next(reader, None)  # discard the header row
        for row in reader:
            yield row[textcol]

So, in your case, pass 1 to textcol, for example:

In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: import csv
   ...: def doc_generator(filepath, textcol=0, skipheader=True):
   ...:     with open(filepath) as f:
   ...:         reader = csv.reader(f)
   ...:         if skipheader:
   ...:             next(reader, None)
   ...:         for row in reader:
   ...:             yield row[textcol]
   ...:

In [3]: vectorizer = TfidfVectorizer()

In [4]: result = vectorizer.fit_transform(doc_generator('testing.csv', textcol=1))

In [5]: result
Out[5]:
<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>

In [6]: result.todense()
Out[6]:
matrix([[ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524],
        [ 0.        ,  0.6876236 ,  0.        ,  0.28108867,  0.        ,
          0.53864762,  0.28108867,  0.        ,  0.28108867],
        [ 0.51184851,  0.        ,  0.        ,  0.26710379,  0.51184851,
          0.        ,  0.26710379,  0.51184851,  0.26710379],
        [ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524]])
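
Selecting the column by position works, but it breaks silently if the column order ever changes. As a possible variation, csv.DictReader can pick the column by its header name. This is a sketch only: the function name doc_generator_by_name and the UTF-8 encoding are assumptions, not part of the answer above.

```python
import csv


def doc_generator_by_name(filepath, column="text_column"):
    """Yield the text of `column` for every data row (hypothetical variant)."""
    # encoding="utf-8" is an assumption; adjust it to match the actual file.
    with open(filepath, newline="", encoding="utf-8") as f:
        # DictReader consumes the header row and keys each row by column name.
        for row in csv.DictReader(f):
            yield row[column]
```

Note that a generator is exhausted after a single pass, so call the function again to get a fresh iterator if the same documents need to be fed to the vectorizer a second time, for example in a later transform call.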

08-01 20:35