Problem Description
I have a dataframe with several string columns that I want to convert to categorical data so that I can run some models and extract important features from them.
However, due to the number of unique values, the one-hot encoded data expands into a large number of columns, which is causing performance issues.
To combat this, I'm experimenting with the sparse=True parameter of get_dummies.
test1 = pd.get_dummies(X.loc[:,['col1','col2','col3','col4']].head(10000))
test2 = pd.get_dummies(X.loc[:,['col1','col2','col3','col4']].head(10000),sparse = True)
However, when I check info() on my two comparison objects, they take up the same amount of memory. It doesn't seem like sparse=True uses any less space. Why is that?
test1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries,...
dtypes: uint8(2253)
memory usage: 21.6 MB
test2.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries, ...
dtypes: uint8(2253)
memory usage: 21.9 MB
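For a finer-grained comparison than the single summary line printed by info(), per-column byte counts can be inspected directly. A minimal sketch (the column names and values here are illustrative stand-ins, not the asker's data):

```python
import pandas as pd

# Toy stand-ins for the asker's string columns (names and values are illustrative).
df = pd.DataFrame({'col1': list('abcab'), 'col2': list('xyzxy')})

dummies = pd.get_dummies(df)
# memory_usage(deep=True) reports actual bytes per column (plus the index),
# which is more precise than the summary line printed by info().
per_column = dummies.memory_usage(deep=True)
total_bytes = per_column.sum()
print(per_column)
print(total_bytes)
```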
Answer
I looked at the pandas get_dummies source but could not spot an error so far. Here is a small experiment I did below (the first half reproduces your problem with real data).
In [1]: import numpy as np
...: import pandas as pd
...:
...: a = ['a', 'b'] * 100000
...: A = ['A', 'B'] * 100000
...:
...: df1 = pd.DataFrame({'a': a, 'A': A})
...: df1 = pd.get_dummies(df1)
...: df1.info()
...:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A 200000 non-null uint8
A_B 200000 non-null uint8
a_a 200000 non-null uint8
a_b 200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB
In [2]: df2 = pd.DataFrame({'a': a, 'A': A})
...: df2 = pd.get_dummies(df2, sparse=True)
...: df2.info()
...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A 200000 non-null uint8
A_B 200000 non-null uint8
a_a 200000 non-null uint8
a_b 200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB
So far the result is the same as yours (the size of df1 is equal to that of df2), but if I explicitly convert df2 to sparse using to_sparse with fill_value=0:
In [3]: df2 = df2.to_sparse(fill_value=0)
...: df2.info()
...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A 200000 non-null uint8
A_B 200000 non-null uint8
a_a 200000 non-null uint8
a_b 200000 non-null uint8
dtypes: uint8(4)
memory usage: 390.7 KB
Now the memory usage is halved, since half of the data is 0.
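Note that to_sparse and SparseDataFrame were deprecated and later removed from pandas (around the 1.0 release); in newer versions the equivalent is converting columns to a sparse extension dtype via astype. A rough sketch of the same fill_value=0 conversion, assuming a recent pandas and a high-cardinality column closer to the asker's case (where sparsity actually pays off):

```python
import numpy as np
import pandas as pd

# Simulate one high-cardinality string column: 10,000 rows, ~1,000 categories.
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 1000, size=10_000).astype(str))

dense = pd.get_dummies(col, dtype='uint8')
# astype(pd.SparseDtype(...)) is the modern replacement for the removed
# DataFrame.to_sparse(fill_value=0); only the non-zero cells are stored.
sparse = dense.astype(pd.SparseDtype('uint8', 0))

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)
```

With roughly 1,000 dummy columns and only one non-zero cell per row, the sparse representation is dramatically smaller than the dense one.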
In conclusion, I'm not sure why get_dummies(sparse=True) does not compress the dataframe even though it is converted to a SparseDataFrame, but there is a workaround. A related discussion was going on in the GitHub issue "get_dummies with sparse doesn't convert numeric to sparse", but the conclusion still seems to be up in the air.
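For what it's worth, this behavior appears to have been fixed in later pandas releases: with a recent pandas, get_dummies(sparse=True) returns columns with a Sparse extension dtype directly, so the explicit conversion step should no longer be needed. A quick sketch to check, under that assumption:

```python
import numpy as np
import pandas as pd

# Same simulated high-cardinality column as above (illustrative data).
rng = np.random.default_rng(0)
col = pd.Series(rng.integers(0, 1000, size=10_000).astype(str))

dense = pd.get_dummies(col)
sparse = pd.get_dummies(col, sparse=True)

# Each column should carry a Sparse extension dtype and use far less memory.
print(sparse.dtypes.iloc[0])
print(dense.memory_usage(deep=True).sum(), sparse.memory_usage(deep=True).sum())
```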