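I have a Pandas dataframe with the following columns:

user_id user_agent_id requests

All columns contain integers. I want to perform some operations on them and run those operations using a dask dataframe. This is what I do: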
user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
    .groupby(['user_id', 'user_agent_id']) \
    .size().to_frame(name='appearances') \
    .reset_index() # I am not sure I can run this on dask dataframe

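import dask.dataframe as df  # assumed import: `df` is the dask.dataframe module here, not a DataFrame

user_profile_ddf = df.from_pandas(user_profile, npartitions=4)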
user_profile_ddf['percent'] = user_profile_ddf.groupby('user_id')['appearances'] \
    .apply(lambda x: x / x.sum(), meta=float) #Percentage of appearance for each user group

But I get the following error:
raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

Am I doing something wrong? In pure Pandas it works fine, but it gets slow for many rows (even though they fit in memory), so I want to parallelize the computation.
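
For reference, the intended result (each row's share of its user's total appearances) can be sketched in plain Pandas on a tiny sample. The sample values below are hypothetical and only the column names come from the question. This sketch uses `transform("sum")` in place of the `apply` with a lambda; that is a substitution made here for illustration, not something the question used.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
cache_records_dataframe = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "user_agent_id": [10, 10, 11, 20, 20],
    "requests":      [5, 3, 7, 1, 2],
})

# Count how often each (user_id, user_agent_id) pair appears.
user_profile = (
    cache_records_dataframe
    .groupby(["user_id", "user_agent_id"])
    .size()
    .to_frame(name="appearances")
    .reset_index()
)

# Divide each count by the total for its user. transform("sum") returns a
# series aligned with the original rows, so no per-group lambda is needed.
user_profile["percent"] = (
    user_profile["appearances"]
    / user_profile.groupby("user_id")["appearances"].transform("sum")
)

print(user_profile)
```

Within each `user_id`, the `percent` values add up to 1.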

Best answer

When creating the dask dataframe, add reset_index():

user_profile_ddf = df.from_pandas(user_profile, npartitions=4).reset_index()

Source: the Stack Overflow question "python - ValueError: Not all divisions are known, can't align partitions error on dask dataframe": https://stackoverflow.com/questions/45030651/
