python - Data manipulation in Pandas/Python

This looks like a simple data manipulation task, but I am stuck on it.

I have a recommendation dataset for a campaign.

Masteruserid content

1             100
1             101
1             102
2             100
2             101
2             110

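Now, we recommend at least 5 pieces of content to each user. For example, Masteruserid 1 has three recommendations, so I want to randomly pick the remaining two from the globally viewed content, which is a separate dataset (a list). I also have to check for duplicates, in case a randomly picked item already exists for that user in the original dataset.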

global_content
100
300
301
101

In reality, I have more than 4000 Masteruserids. I need help with how to start solving this problem.
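
As a starting point, the per-user step described above (exclude what the user already has, then sample the shortfall) can be sketched in plain Python. The helper name `fill_for_user` and its parameters are hypothetical, and taking fewer items when the global list is too small is an assumption, not something the question specifies.

```python
import random


def fill_for_user(existing, global_content, k=5, rng=random):
    """Return the existing items plus random unseen global items, up to k total."""
    missing = k - len(existing)
    if missing <= 0:
        # The user already has at least k recommendations.
        return list(existing)
    seen = set(existing)
    # Drop items the user already has; dict.fromkeys also removes
    # repeats inside the global list while keeping its order.
    candidates = [c for c in dict.fromkeys(global_content) if c not in seen]
    # Assumption: if too few candidates exist, take all of them.
    picked = rng.sample(candidates, min(missing, len(candidates)))
    return list(existing) + picked


# Masteruserid 1 from the sample data: three items, two to fill.
print(fill_for_user([100, 101, 102], [100, 300, 301, 101]))
```

With the sample data only 300 and 301 are eligible for Masteruserid 1, so both are added; their order may vary between runs.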

Best answer

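import numpy as np
import pandas as pd


def add_content(df, gc, k=5):
    """Pad one user's rows with random global content until there are k rows."""
    n = len(df)
    if n >= k:
        # Already enough recommendations: return the group unchanged
        # (without this, the function returns None for such users).
        return df
    # gc is assumed to be a one-column DataFrame (or a Series) of content ids.
    gcs = set(gc.squeeze())
    # Candidates are global items this user does not already have.
    choices = list(gcs.difference(df.content))
    # Raises ValueError if fewer than k - n candidates are available.
    mc = np.random.choice(choices, k - n, replace=False)
    ids = np.repeat(df.Masteruserid.iloc[-1], k - n)
    data = dict(Masteruserid=ids, content=mc)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
    return pd.concat([df, pd.DataFrame(data)], ignore_index=True)


# df holds the Masteruserid/content rows and gc the global_content list.
gb = df.groupby('Masteruserid', group_keys=False)
gb.apply(add_content, gc).reset_index(drop=True)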
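
For a version that can be run end to end, here is a minimal sketch built on the sample data from the question. It loops over the groups explicitly instead of using `groupby.apply`, whose handling of the grouping column has changed across pandas releases. The function name `pad_recommendations`, the `seed` parameter, the column name `global_content` for the second frame, and the choice to add fewer items when the pool runs short are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Masteruserid": [1, 1, 1, 2, 2, 2],
    "content": [100, 101, 102, 100, 101, 110],
})
gc = pd.DataFrame({"global_content": [100, 300, 301, 101]})


def pad_recommendations(df, gc, k=5, seed=None):
    """Return df with each user padded toward k rows from global content."""
    rng = np.random.default_rng(seed)
    pool = set(gc["global_content"])
    parts = []
    for uid, group in df.groupby("Masteruserid"):
        parts.append(group)
        missing = k - len(group)
        if missing <= 0:
            continue  # user already has at least k recommendations
        # Exclude content the user already has, so no duplicates are added.
        candidates = sorted(pool.difference(group["content"]))
        # Assumption: if the pool is too small, add as many as are available.
        take = min(missing, len(candidates))
        if take == 0:
            continue
        picked = rng.choice(candidates, size=take, replace=False)
        parts.append(pd.DataFrame({"Masteruserid": uid, "content": picked}))
    return pd.concat(parts, ignore_index=True)


result = pad_recommendations(df, gc, seed=0)
print(result)
```

With roughly 4000 users a plain loop like this should still be quick, since each iteration only does a small set difference and one random draw.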

Regarding python - data manipulation in Pandas/Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39103090/