可以使用以下示例数据:

df_1

DATE      AMOUNT
20170101  50000
20170102  55000
20170103  60000
20170104  65000
...       ...
20171231  150000


df_2

DATE      INVNR   AMOUNT
20170101  ...     ???
20170101  ...     ???
20170101  ...     ???
20170101  ...     ???
20170102  ...     ???
20170102  ...     ???
20170103  ...     ???
...       ...     ???
20171231  ...     ???
20171231  ...     ???
20171231  ...     ???
20171231  ...     ???


这是预期的示例输出:

df_ramdomly_distributed_amounts

DATE      INVNR   AMOUNT
20170101  ...     14879
20170101  ...     6523
20170101  ...     8596
20170101  ...     20002
20170102  ...     31548
20170102  ...     23452
20170103  ...     60000
...       ...     ???
20171231  ...     26428
20171231  ...     72658
20171231  ...     326
20171231  ...     50593


INVNR列保存发票编号,但是这些发票编号是系统导出的,并且不一致。 DATE始终采用给定的格式,并且可以根据需要进行操作(df_1 + df_2目前将其保存为字符串列)。

因此,任务是按照df_2中的发票以1:m的关系随机分配df_1中存储的每个唯一DATE的AMOUNT。每天的发票金额随时间而变化。 df_2目前拥有约200,000张发票。我现在无法解决问题。非常感谢您提供的任何提示!

最佳答案

试试这个:

import pandas as pd
from random import randint

# intialise data of lists & Create DataFrame
data1 = {'DATE': ['20170101', '20170102', '20170103', '20170104'],
         'AMOUNT': ['500', '600', '400', '800']}
df_1 = pd.DataFrame(data1)
data2 = {'DATE': ['20170101', '20170101', '20170101', '20170102', '20170102', '20170103', '20170103', '20170104'],
         'INVNR': ['a1', 'a2', 'a3', 'b1', 'b2', 'c1', 'c2', 'd1']}

df_2 = pd.DataFrame(data2)

print("------------- DF 1 --------------------")
print(df_1)
print("------------- DF 2 --------------------")
print(df_2)

# process each date group and split the amount among them
def splitAndAddAmount(gr, df_am):
    amount = int(df_am.loc[df_am['DATE'] == gr['DATE'].iloc[0], 'AMOUNT'].iloc[0])
    grpLength = int(len(gr))
    assert amount >= grpLength >= 1
    pieces = []
    for idx in range(grpLength - 1):
        pieces.append(randint(1, amount - sum(pieces) - grpLength + idx))
    pieces.append(amount - sum(pieces))
    gr['AMNT'] = pieces
    return gr


print("------------- DF FINAL --------------------")
df_final = df_2.groupby('DATE').apply(splitAndAddAmount, df_1)
print(df_final)


输出:-

------------- DF 1 --------------------
       DATE AMOUNT
0  20170101    500
1  20170102    600
2  20170103    400
3  20170104    800
------------- DF 2 --------------------
       DATE INVNR
0  20170101    a1
1  20170101    a2
2  20170101    a3
3  20170102    b1
4  20170102    b2
5  20170103    c1
6  20170103    c2
7  20170104    d1
------------- DF FINAL --------------------
       DATE INVNR  AMNT
0  20170101    a1    29
1  20170101    a2   148
2  20170101    a3   323
3  20170102    b1    23
4  20170102    b2   577
5  20170103    c1   261
6  20170103    c2   139
7  20170104    d1   800

09-26 02:06