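Question 1:
I have aggregated two files that are large (for me), going back to January 2014. One file has 17 million rows and the other has 3 million rows. I aggregated them on the Date, PersonID and CustomerID fields, summing a single column in each.
File 1 (I dropped duplicates, so a PersonID can visit each CustomerID only once per date):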
Date | PersonID | CustomerID | Sum of Visits
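File 2 (this does not have a complete record of CustomerID, so I did not include it; I want all the chats first, so that I do not lose such a large chunk of data):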
Date | PersonID | Sum of Chats
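When I run pd.merge(file1, file2, how='left'), I end up severely inflating the chat counts from file 2. This happens because a PersonID can have multiple CustomerIDs on the same date, so if that person had several chats, the chat total is attached to every one of those rows. The result does not work out well when I load the data into Tableau and aggregate it. (The end result I am looking for is total visits divided by total chats per PersonID, to create a ratio.) What is the best way to do this?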
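The inflation described here can be reproduced on a small scale. The following is a minimal sketch with made-up toy data; the frames `visits` and `chats` are hypothetical stand-ins for file 1 and file 2, and the explicit `on=` and `validate=` arguments are additions for illustration, not part of the original call.

```python
import pandas as pd

# Hypothetical toy data: one person visits two customers on the same date.
visits = pd.DataFrame({
    "Date": pd.to_datetime(["2014-01-01", "2014-01-01"]),
    "PersonID": ["Foo", "Foo"],
    "CustomerID": ["aaa", "bbb"],
    "Sum of Visits": [1, 1],
})
# File 2 has one row per Date + PersonID.
chats = pd.DataFrame({
    "Date": pd.to_datetime(["2014-01-01"]),
    "PersonID": ["Foo"],
    "Sum of Chats": [5],
})

# A left merge on Date + PersonID copies the chat total onto every
# matching visit row, so summing afterwards counts it twice.
merged = pd.merge(visits, chats, how="left", on=["Date", "PersonID"])
true_chats = chats["Sum of Chats"].sum()
inflated_chats = merged["Sum of Chats"].sum()

# Aggregating visits to Date + PersonID first makes the key unique on
# both sides; validate= raises an error if it is not.
visits_per_day = visits.groupby(["Date", "PersonID"], as_index=False)["Sum of Visits"].sum()
fixed = pd.merge(visits_per_day, chats, how="left",
                 on=["Date", "PersonID"], validate="one_to_one")
print(true_chats, inflated_chats, fixed["Sum of Chats"].sum())
```

Bringing both sides to the same grain before merging is what keeps the chat totals from being repeated.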
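Question 2:
Once I am done with the aggregated files, I would like to merge the two files again at the granular row level. My problem is that file 2 can actually have multiple chats for the same PersonID on the same date. Is there a way to merge it with file 1, which has only one record per PersonID + Date + CustomerID, without creating duplicate Visits from the first file?
File 2: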
Date | PersonID | CustomerID | Count of Chat
Best answer
Assuming my view of your data is close, here is my approach.
First, reproducible data.
In [1]: import pandas as pd
In [2]: d1 = {'Date': {0: pd.Timestamp('2010-01-01 00:00:00'), 1: pd.Timestamp('2010-01-02 00:00:00'), 2: pd.Timestamp('2010-01-03 00:00:00'), 3: pd.Timestamp('2010-01-03 00:00:00'), 4: pd.Timestamp('2010-01-03 00:00:00'), 5: pd.Timestamp('2010-01-06 00:00:00'), 6: pd.Timestamp('2010-01-06 00:00:00'), 7: pd.Timestamp('2010-01-06 00:00:00'), 8: pd.Timestamp('2010-01-09 00:00:00'), 9: pd.Timestamp('2010-01-10 00:00:00'), 10: pd.Timestamp('2010-01-11 00:00:00'), 11: pd.Timestamp('2010-01-12 00:00:00'), 12: pd.Timestamp('2010-01-12 00:00:00'), 13: pd.Timestamp('2010-01-12 00:00:00'), 14: pd.Timestamp('2010-01-12 00:00:00'), 15: pd.Timestamp('2010-01-12 00:00:00'), 16: pd.Timestamp('2010-01-17 00:00:00'), 17: pd.Timestamp('2010-01-17 00:00:00'), 18: pd.Timestamp('2010-01-17 00:00:00'), 19: pd.Timestamp('2010-01-17 00:00:00')}, 'PersonID': {0: 'Foo', 1: 'Bar', 2: 'Foo', 3: 'Bar', 4: 'Foo', 5: 'Bar', 6: 'Foo', 7: 'Bar', 8: 'Foo', 9: 'Bar', 10: 'Foo', 11: 'Bar', 12: 'Foo', 13: 'Bar', 14: 'Foo', 15: 'Bar', 16: 'Foo', 17: 'Bar', 18: 'Foo', 19: 'Bar'}, 'CustomerID': {0: 'aaa', 1: 'bbb', 2: 'ccc', 3: 'ddd', 4: 'eee', 5: 'fff', 6: 'ggg', 7: 'hhh', 8: 'iii', 9: 'jjj', 10: 'kkk', 11: 'lll', 12: 'mmm', 13: 'nnn', 14: 'ooo', 15: 'ppp', 16: 'qqq', 17: 'rrr', 18: 'sss', 19: 'ttt'}}
...:
...: d2 = {'Date': {0: pd.Timestamp('2010-01-01 00:00:00'), 1: pd.Timestamp('2010-01-02 00:00:00'), 2: pd.Timestamp('2010-01-03 00:00:00'), 3: pd.Timestamp('2010-01-06 00:00:00'), 4: pd.Timestamp('2010-01-09 00:00:00'), 5: pd.Timestamp('2010-01-10 00:00:00'), 6: pd.Timestamp('2010-01-11 00:00:00'), 7: pd.Timestamp('2010-01-12 00:00:00'), 8: pd.Timestamp('2010-01-17 00:00:00'), 9: pd.Timestamp('2010-01-01 00:00:00'), 10: pd.Timestamp('2010-01-02 00:00:00'), 11: pd.Timestamp('2010-01-03 00:00:00'), 12: pd.Timestamp('2010-01-06 00:00:00'), 13: pd.Timestamp('2010-01-09 00:00:00'), 14: pd.Timestamp('2010-01-10 00:00:00'), 15: pd.Timestamp('2010-01-11 00:00:00'), 16: pd.Timestamp('2010-01-12 00:00:00'), 17: pd.Timestamp('2010-01-17 00:00:00')}, 'PersonID': {0: 'Foo', 1: 'Foo', 2: 'Foo', 3: 'Foo', 4: 'Foo', 5: 'Foo', 6: 'Foo', 7: 'Foo', 8: 'Foo', 9: 'Bar', 10: 'Bar', 11: 'Bar', 12: 'Bar', 13: 'Bar', 14: 'Bar', 15: 'Bar', 16: 'Bar', 17: 'Bar'}, 'Sum of Chats': {0: 5.0, 1: 3.0, 2: 24.0, 3: 7.0, 4: 15.0, 5: 9.0, 6: 16.0, 7: 22.0, 8: 14.0, 9: 8.0, 10: 15.0, 11: 14.0, 12: 29.0, 13: 11.0, 14: 6.0, 15: 14.0, 16: 30.0, 17: 12.0}}
In [3]: df1 = pd.DataFrame.from_dict(d1)
...: df2 = pd.DataFrame.from_dict(d2)
The above produces the following DataFrames.
# File 1
CustomerID Date PersonID
0 aaa 2010-01-01 Foo
1 bbb 2010-01-02 Bar
2 ccc 2010-01-03 Foo
3 ddd 2010-01-03 Bar
4 eee 2010-01-03 Foo
5 fff 2010-01-06 Bar
6 ggg 2010-01-06 Foo
7 hhh 2010-01-06 Bar
8 iii 2010-01-09 Foo
9 jjj 2010-01-10 Bar
10 kkk 2010-01-11 Foo
11 lll 2010-01-12 Bar
12 mmm 2010-01-12 Foo
13 nnn 2010-01-12 Bar
14 ooo 2010-01-12 Foo
15 ppp 2010-01-12 Bar
16 qqq 2010-01-17 Foo
17 rrr 2010-01-17 Bar
18 sss 2010-01-17 Foo
19 ttt 2010-01-17 Bar
# File 2
Date PersonID Sum of Chats
0 2010-01-01 Foo 5
1 2010-01-02 Foo 3
2 2010-01-03 Foo 24
3 2010-01-06 Foo 7
4 2010-01-09 Foo 15
5 2010-01-10 Foo 9
6 2010-01-11 Foo 16
7 2010-01-12 Foo 22
8 2010-01-17 Foo 14
9 2010-01-01 Bar 8
10 2010-01-02 Bar 15
11 2010-01-03 Bar 14
12 2010-01-06 Bar 29
13 2010-01-09 Bar 11
14 2010-01-10 Bar 6
15 2010-01-11 Bar 14
16 2010-01-12 Bar 30
17 2010-01-17 Bar 12
If, as seems likely, you want to use CustomerID to count the visits, you can use pivot_table to aggregate it quickly.
In [4]: df1 = df1.pivot_table(index=['Date','PersonID'], values='CustomerID', aggfunc=len)
...: print(df1)
Date PersonID
2010-01-01 Foo 1
2010-01-02 Bar 1
2010-01-03 Bar 1
Foo 2
2010-01-06 Bar 2
Foo 1
2010-01-09 Foo 1
2010-01-10 Bar 1
2010-01-11 Foo 1
2010-01-12 Bar 3
Foo 2
2010-01-17 Bar 2
Foo 2
Name: CustomerID, dtype: int64
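Depending on the pandas version, the pivot_table call may return a DataFrame rather than the Series shown above. A groupby-based count is one alternative that gives the same numbers; the sketch below uses a hypothetical miniature of File 1 (`file1`), and the column name `Visits` is my own choice rather than something from the original data.

```python
import pandas as pd

# Hypothetical miniature of File 1: one row per Date + PersonID + CustomerID.
file1 = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-03", "2010-01-03", "2010-01-03", "2010-01-06"]),
    "PersonID": ["Foo", "Bar", "Foo", "Bar"],
    "CustomerID": ["ccc", "ddd", "eee", "fff"],
})

# Count rows per Date + PersonID; size() returns a Series with a MultiIndex.
visits = file1.groupby(["Date", "PersonID"]).size()

# Flatten the index into columns and give the count a descriptive name.
visits = visits.reset_index(name="Visits")
print(visits)
```

Naming the count column explicitly avoids ending up with a column called CustomerID that actually holds a number of visits.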
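When aggregating, I prefer combining pivot_table with reset_index over other approaches, because the pivot itself gives me a meaningful preview of the data, like the one above.
In [5]: df1 = df1.reset_index(); print(df1)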
Date PersonID CustomerID
0 2010-01-01 Foo 1
1 2010-01-02 Bar 1
2 2010-01-03 Bar 1
3 2010-01-03 Foo 2
4 2010-01-06 Bar 2
5 2010-01-06 Foo 1
6 2010-01-09 Foo 1
7 2010-01-10 Bar 1
8 2010-01-11 Foo 1
9 2010-01-12 Bar 3
10 2010-01-12 Foo 2
11 2010-01-17 Bar 2
12 2010-01-17 Foo 2
So, we are almost there. The remaining step is to merge this with the second DataFrame to get the chats per person per date.
In [6]: df = pd.merge(df1, df2, how='outer', on=['Date', 'PersonID'], sort=True)
...: print(df)
Date PersonID CustomerID Sum of Chats
0 2010-01-01 Bar NaN 8
1 2010-01-01 Foo 1 5
2 2010-01-02 Bar 1 15
3 2010-01-02 Foo NaN 3
4 2010-01-03 Bar 1 14
5 2010-01-03 Foo 2 24
6 2010-01-06 Bar 2 29
7 2010-01-06 Foo 1 7
8 2010-01-09 Bar NaN 11
9 2010-01-09 Foo 1 15
10 2010-01-10 Bar 1 6
11 2010-01-10 Foo NaN 9
12 2010-01-11 Bar NaN 14
13 2010-01-11 Foo 1 16
14 2010-01-12 Bar 3 30
15 2010-01-12 Foo 2 22
16 2010-01-17 Bar 2 12
17 2010-01-17 Foo 2 14
Of course, the NaNs are an artifact of my imperfect mock data setup. From here on, it is just simple arithmetic.
Let me know if this helps.
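As a rough sketch of that last step, the visits-to-chats ratio per PersonID could be computed as follows. The frame below is a hypothetical stand-in holding the first six rows of the merged output, and treating NaN as zero visits and renaming CustomerID to `Visits` are assumptions of mine.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame: first six rows of the output above.
df = pd.DataFrame({
    "PersonID": ["Bar", "Foo", "Bar", "Foo", "Bar", "Foo"],
    "CustomerID": [float("nan"), 1.0, 1.0, float("nan"), 1.0, 2.0],  # visit counts
    "Sum of Chats": [8.0, 5.0, 15.0, 3.0, 14.0, 24.0],
})

totals = (
    df.rename(columns={"CustomerID": "Visits"})  # the column holds counts, not IDs
      .fillna({"Visits": 0})                     # assumption: NaN means no visits that day
      .groupby("PersonID")[["Visits", "Sum of Chats"]]
      .sum()
)

# Total visits divided by total chats, one row per PersonID.
totals["Visits per Chat"] = totals["Visits"] / totals["Sum of Chats"]
print(totals)
```

Summing both columns before dividing gives the ratio of totals that the question asks for, rather than an average of daily ratios.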
Regarding python - avoiding double counting in a pandas merge, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/29932473/