This article describes how to convert a large csv into a sparse matrix for use in sklearn, which should be a useful reference for anyone facing the same problem.

Problem description

I have a ~30 GB (~1.7 GB compressed | 180K rows x 32K columns) matrix saved in csv format. I would like to convert this matrix to a sparse format so that the full dataset can be loaded into memory for machine learning with sklearn. The populated cells contain floats less than 1. One caveat of the large matrix is that the target variable is stored as the last column. What is the best way to make this large matrix usable in sklearn? I.e., how can you convert the ~30 GB csv into a scipy sparse format without loading the original matrix into memory?

Pseudocode

  1. Remove the target variable (keeping row order intact)
  2. Convert the ~30 GB matrix to a sparse format (help!!)
  3. Load the sparse format and the target variable into memory to run the machine-learning pipeline (how do I do that?)

Recommended answer

You can build a sparse matrix row by row in memory pretty easily:

import numpy as np
import scipy.sparse as sps

input_file_name = "something.csv"
sep = "\t"

def _process_data(row_array):
    # Hook for any per-row preprocessing you need; a no-op by default.
    return row_array

sp_data = []
with open(input_file_name) as csv_file:
    for row in csv_file:
        # Parse one line of text into a 1-D float array.
        data = np.fromstring(row, sep=sep)
        data = _process_data(data)
        # Keep only the non-zero entries of this row.
        data = sps.coo_matrix(data)
        sp_data.append(data)

# Stack the per-row sparse matrices into one big sparse matrix.
sp_data = sps.vstack(sp_data)
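Since the question notes that the target variable is stored in the last column, you can slice it off after stacking and feed the result to sklearn, covering steps 1 and 3 of the pseudocode. A minimal sketch, assuming `sp_data` from above; the SGDClassifier is just a placeholder estimator, not part of the original answer:

from sklearn.linear_model import SGDClassifier

# COO does not support slicing, so convert to CSR first.
sp_data = sp_data.tocsr()
X = sp_data[:, :-1]                     # all feature columns
y = sp_data[:, -1].toarray().ravel()    # last column as a dense 1-D target vector

# Placeholder estimator -- substitute whatever sklearn pipeline you intend to run.
clf = SGDClassifier()
clf.fit(X, y)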

The stacked sparse matrix will also be easy to write out to HDF5, which is a far better way to store numbers at this scale than a text file.
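If you do persist the matrix, one common pattern is to store the CSR components (data, indices, indptr, shape) in an HDF5 file with h5py and rebuild the matrix on load. A sketch, not from the original answer; the file name "matrix.h5" and the dataset names are made up for illustration:

import h5py

# Save: a CSR matrix is fully described by data, indices, indptr and its shape.
m = sp_data.tocsr()
with h5py.File("matrix.h5", "w") as f:
    f.create_dataset("data", data=m.data)
    f.create_dataset("indices", data=m.indices)
    f.create_dataset("indptr", data=m.indptr)
    f.attrs["shape"] = m.shape

# Load: rebuild the CSR matrix from the stored components.
with h5py.File("matrix.h5", "r") as f:
    m = sps.csr_matrix(
        (f["data"][:], f["indices"][:], f["indptr"][:]),
        shape=tuple(f.attrs["shape"]),
    )

If HDF5 is not a hard requirement, scipy.sparse.save_npz / load_npz is an even simpler way to persist a single sparse matrix.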

