This article looks at why a pd.get_dummies dataframe is the same size with sparse=True as with sparse=False, and at a workaround.

Problem description

I have a dataframe with several string columns that I want to convert to categorical data so that I can run some models and extract important features from them.

However, due to the number of unique values, the one-hot encoded data expands into a large number of columns, which causes performance issues.

To combat this, I'm experimenting with the sparse=True parameter in get_dummies.

test1 = pd.get_dummies(X.loc[:, ['col1','col2','col3','col4']].head(10000))
test2 = pd.get_dummies(X.loc[:, ['col1','col2','col3','col4']].head(10000), sparse=True)

However, when I check info() for my two comparison objects, they take up the same amount of memory. It doesn't seem like sparse=True uses any less space. Why is that?

test1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries,...
dtypes: uint8(2253)
memory usage: 21.6 MB

test2.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries, ...
dtypes: uint8(2253)
memory usage: 21.9 MB

Answer

I looked at the pandas get_dummies source but could not spot an error so far. Here is a small experiment I did below (the first half reproduces your problem with real data).

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: a = ['a', 'b'] * 100000
   ...: A = ['A', 'B'] * 100000
   ...: 
   ...: df1 = pd.DataFrame({'a': a, 'A': A})
   ...: df1 = pd.get_dummies(df1)
   ...: df1.info()
   ...:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB

In [2]: df2 = pd.DataFrame({'a': a, 'A': A})
   ...: df2 = pd.get_dummies(df2, sparse=True)
   ...: df2.info()
   ...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB

So far the result is the same as yours (the size of df1 equals that of df2), but if I explicitly convert df2 to sparse using to_sparse with fill_value=0:

In [3]: df2 = df2.to_sparse(fill_value=0)
   ...: df2.info()
   ...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 390.7 KB

Now the memory usage is halved, since half of the data is 0.
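Note that SparseDataFrame and to_sparse were deprecated in pandas 0.25 and removed in 1.0; on those and later versions, the equivalent workaround is to cast the dummy columns to a SparseDtype. A minimal sketch of that conversion (the column name and category count are illustrative, not from the question):

```python
import pandas as pd

# Recreate a dense dummy frame similar to the experiment above, but with
# 10 categories so that each dummy column is 90% zeros.
raw = pd.DataFrame({'cat': [f'c{i % 10}' for i in range(100000)]})
dense = pd.get_dummies(raw).astype('uint8')

# Cast every column to a sparse dtype that treats 0 as the fill value;
# this replaces the removed SparseDataFrame / to_sparse API.
sparse = dense.astype(pd.SparseDtype('uint8', 0))

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)
```

The sparse frame stores only the nonzero entries plus their integer positions, so the savings grow as the data gets sparser (i.e. as the number of categories grows).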

In conclusion, I'm not sure why get_dummies(sparse=True) does not compress the dataframe even though it is converted to a SparseDataFrame, but there is a workaround. A related discussion was going on in the GitHub issue "get_dummies with sparse doesn't convert numeric to sparse", but the conclusion still seems to be up in the air.
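On newer pandas versions (0.25 and later) this behavior appears to have been resolved: get_dummies(sparse=True) returns columns backed by SparseDtype, and the memory savings show up directly without any extra conversion. A quick check, using illustrative data:

```python
import pandas as pd

# A column with many distinct values, so each dummy column is mostly zeros.
df = pd.DataFrame({'col': [f'v{i % 100}' for i in range(10000)]})

dense = pd.get_dummies(df)
sparse = pd.get_dummies(df, sparse=True)

# Each column of the sparse result carries a SparseDtype, and the frame
# uses far less memory than its dense counterpart.
print(sparse.dtypes.iloc[0])
print(dense.memory_usage(deep=True).sum(), sparse.memory_usage(deep=True).sum())
```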
