可以使用以下示例数据:
df_1
DATE AMOUNT
20170101 50000
20170102 55000
20170103 60000
20170104 65000
... ...
20171231 150000
df_2
DATE INVNR AMOUNT
20170101 ... ???
20170101 ... ???
20170101 ... ???
20170101 ... ???
20170102 ... ???
20170102 ... ???
20170103 ... ???
... ... ???
20171231 ... ???
20171231 ... ???
20171231 ... ???
20171231 ... ???
这是预期的示例输出:
df_ramdomly_distributed_amounts
DATE INVNR AMOUNT
20170101 ... 14879
20170101 ... 6523
20170101 ... 8596
20170101 ... 20002
20170102 ... 31548
20170102 ... 23452
20170103 ... 60000
... ... ???
20171231 ... 26428
20171231 ... 72658
20171231 ... 326
20171231 ... 50593
INVNR列保存发票编号,但是这些发票编号是系统导出的,并且不一致。 DATE始终采用给定的格式,并且可以根据需要进行操作(df_1 + df_2目前将其保存为字符串列)。
因此,任务是按照df_2中的发票以1:m的关系随机分配df_1中存储的每个唯一DATE的AMOUNT。每天的发票金额随时间而变化。 df_2目前拥有约200,000张发票。我现在无法解决问题。非常感谢您提供的任何提示!
最佳答案
试试这个:
import pandas as pd
from random import randint
# intialise data of lists & Create DataFrame
data1 = {'DATE': ['20170101', '20170102', '20170103', '20170104'],
'AMOUNT': ['500', '600', '400', '800']}
df_1 = pd.DataFrame(data1)
data2 = {'DATE': ['20170101', '20170101', '20170101', '20170102', '20170102', '20170103', '20170103', '20170104'],
'INVNR': ['a1', 'a2', 'a3', 'b1', 'b2', 'c1', 'c2', 'd1']}
df_2 = pd.DataFrame(data2)
print("------------- DF 1 --------------------")
print(df_1)
print("------------- DF 2 --------------------")
print(df_2)
# process each date group and split the amount among them
def splitAndAddAmount(gr, df_am):
amount = int(df_am.loc[df_am['DATE'] == gr['DATE'].iloc[0], 'AMOUNT'].iloc[0])
grpLength = int(len(gr))
assert amount >= grpLength >= 1
pieces = []
for idx in range(grpLength - 1):
pieces.append(randint(1, amount - sum(pieces) - grpLength + idx))
pieces.append(amount - sum(pieces))
gr['AMNT'] = pieces
return gr
print("------------- DF FINAL --------------------")
df_final = df_2.groupby('DATE').apply(splitAndAddAmount, df_1)
print(df_final)
输出:-
------------- DF 1 --------------------
DATE AMOUNT
0 20170101 500
1 20170102 600
2 20170103 400
3 20170104 800
------------- DF 2 --------------------
DATE INVNR
0 20170101 a1
1 20170101 a2
2 20170101 a3
3 20170102 b1
4 20170102 b2
5 20170103 c1
6 20170103 c2
7 20170104 d1
------------- DF FINAL --------------------
DATE INVNR AMNT
0 20170101 a1 29
1 20170101 a2 148
2 20170101 a3 323
3 20170102 b1 23
4 20170102 b2 577
5 20170103 c1 261
6 20170103 c2 139
7 20170104 d1 800