python - How to convert a Pandas group into a SparseDataFrame

I have a tall DataFrame (2,743,470 rows, 2 columns), call it df, with an integer index and the following columns:

| item | user |
| 1    | abc  |
| 15   | abc  |
| 3    | def  |

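I know there are 35,605 possible item IDs and 53,690 users in total. What I want is to convert this into a SparseDataFrame with one row per user and one column per item, where the value is 1 wherever the user is associated with the item in the original table.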

I tried a groupby, but from that point I couldn't work out how to vectorize the rest. The best I have is:

ids = pandas.Index(df.item.drop_duplicates())
g = df.groupby('user')
arr = []
arr_i = []
for name, group in g:
    arr_i.append(name)
    s = pandas.Series({val: 1 for val in group.item}, index=ids).to_sparse()
    arr.append(s)
book_reads = pandas.SparseDataFrame(arr, index=arr_i)

But even this fails:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

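I tried dropping the index argument to SparseDataFrame, and also setting it to a set of integers instead of strings, to no avail. The only thing that works is building a regular DataFrame first and then calling to_sparse on it, but that consumes far too much memory.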

Is there a way to vectorize this operation while using only sparse data structures?

Update

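I also tried adding a dummy value column of all 1s and pivoting, but that hits a MemoryError almost immediately, presumably because the pivot produces a dense DataFrame.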
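A rough back-of-the-envelope estimate is consistent with that. It assumes the dense result stores 8 bytes per cell (float64) and that no (user, item) pair repeats in df; both are assumptions, not measurements:

```python
# Sizes quoted in the question.
n_users = 53690
n_items = 35605
n_pairs = 2743470  # rows in df; assumes no duplicate (user, item) pairs

cells = n_users * n_items   # cells in a full user x item table
dense_bytes = cells * 8     # assumption: 8 bytes per cell (float64)
density = n_pairs / cells   # fraction of cells that would hold a 1

print(f"{cells:,} cells in the dense table")
print(f"about {dense_bytes / 1e9:.1f} GB if stored densely")
print(f"{density:.4%} of the cells are non-empty")
```

Under those assumptions the dense table runs to many gigabytes while only a tiny fraction of its cells are set, which is why I want to stay sparse end to end.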

Best answer

I don't think you'll run into a memory problem with this, since the end result is not that big (so the unstack step shouldn't blow up):

In [14]: df.groupby('user')['item'].apply(lambda x: pandas.Series(1,index=x)).unstack()
Out[14]:
      1   3   15
user
abc    1 NaN   1
def  NaN   1 NaN

[2 rows x 3 columns]
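If the unstacked result does turn out to be too large at full scale, one alternative (not part of the answer above) is to skip the dense intermediate and build a SciPy sparse matrix straight from integer codes. The sketch below assumes SciPy is installed and uses a toy df with the same layout as in the question. Note that SparseDataFrame and to_sparse were removed in pandas 1.0, so the final wrapping step uses DataFrame.sparse.from_spmatrix, available from pandas 0.25.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy frame with the same layout as the question's df.
df = pd.DataFrame({"item": [1, 15, 3], "user": ["abc", "abc", "def"]})

# factorize maps each distinct label to a 0-based integer code
# and returns the sorted unique labels alongside.
row_codes, users = pd.factorize(df["user"], sort=True)
col_codes, items = pd.factorize(df["item"], sort=True)

# One stored entry per (user, item) row; all other cells are implicitly 0.
data = np.ones(len(df), dtype=np.int8)
mat = sparse.coo_matrix(
    (data, (row_codes, col_codes)),
    shape=(len(users), len(items)),
).tocsr()

# Converting to CSR sums duplicate pairs; reset stored values to 1
# so the matrix stays a 0/1 indicator.
mat.data[:] = 1

# Optional: wrap the matrix in a DataFrame whose columns are sparse.
book_reads = pd.DataFrame.sparse.from_spmatrix(mat, index=users, columns=items)
```

Here mat is the user-by-item indicator matrix, and users and items hold the row and column labels. If downstream code only needs the matrix, the last step can be skipped.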

Regarding python - How to convert a Pandas group into a SparseDataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20976736/