我有2个熊猫DataFrames
(df1
和df2
),其标题如下:
DF1: cnpj num_doc bc_icms
DF2: cnpj num_doc vlr_item
我的问题:
如何比较每个
sum()
的BC_ICMS
的NUM_DOC
来自
CNPJ
的DF1
,相对于每次出现的sum()
的VLR_ITEM
来自同一
NUM_DOC
和CNPJ
来自DF2
吗?我的想法:
我相信这就像
"subtract the row that contains the result of aggfunc="sum" of df1 from df2"
或者也许找到一种创建类似的方法:
cnpj num_doc sum_bc_icms_df1 sum_vlr_item_df2 diff
其中
diff
是包含vlr_item_df1 - vlr_item_df2
结果的列我的资料:
DF1:
cnpj num_doc bc_icms
0 02817342000124 0000010154 17827.07
1 54921580000189 0000112428 108000.00
2 08953538000122 0000012865 232.00
3 08953538000122 0000012865 239.00
4 08953538000122 0000012865 215.00
5 07374346000107 0000014224 320.12
6 07374346000107 0000014231 385.04
7 07374346000107 0000014263 401.28
8 07374346000107 0000014279 391.26
9 02364118000124 0000015263 37353.10
10 02364118000124 0000015264 56214.14
11 02364118000124 0000015265 2115.92
12 07720786000160 0000020128 355.53
13 57101370000132 0000002042 200.00
14 07720786000160 0000020437 461.32
15 07720786000160 0000020438 168.04
16 07720786000160 0000020439 385.23
17 07720786000160 0000020444 378.80
18 07720786000160 0000020446 289.67
19 07720786000160 0000020451 223.36
20 07720786000160 0000020461 305.31
DF2:
cnpj num_doc vlr_item
0 02817342000124 0000010154 1391,04
1 02817342000124 0000010154 100,0
2 02817342000124 0000010154 1587,09
3 02817342000124 0000010154 200,0
4 02817342000124 0000010154 1430,0
5 02817342000124 0000010154 592,0
6 02817342000124 0000010154 7508,99
7 02817342000124 0000010154 443,03
8 02817342000124 0000010154 29,9
9 02817342000124 0000010154 343,26
10 07374346000107 0000014224 320,12
11 07374346000107 0000014231 385,04
12 07374346000107 0000014263 401,28
13 07374346000107 0000014279 391,26
14 02364118000124 0000015263 29725,96
15 02364118000124 0000015263 6590,31
16 02364118000124 0000015263 1036,83
17 02364118000124 0000015264 47348,71
18 02364118000124 0000015264 7641,97
19 02364118000124 0000015264 1223,46
20 02364118000124 0000015265 2115,92
规则:
1
CNPJ
发出N NUM_DOC
。CNPJ A
可能会发出NUM_DOC
1,而CNPJ B
也可能会发出NUM_DOC
1。每个
CNPJ
可以出现多次。每个
NUM_DOC
可以出现多次。我的代码:
从df1:
g1 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()
Output:
cnpj num_doc
00903175000109 0000059253 100.00
01114678000168 0000004705 358.00
01116428000167 0000006883 165.00
02295440000149 0000051845 72.65
0000051866 69.90
02364118000124 0000015263 37353.10
0000015264 56214.14
0000015265 2115.92
0000015571 39873.74
0000015572 52611.06
0000015573 1121.64
02817342000124 0000010154 17827.07
0000010299 407.02
03509978001143 0000145539 17000.00
0000145744 108000.00
0000145745 108000.00
0000145746 108000.00
0000145749 15000.00
0000145750 15000.00
0000145753 15000.00
03572854000130 0000007298 7190.54
03688838000108 0000006286 155.77
04967974000280 0000025175 388.18
0000025230 384.07
0000025273 363.35
0000025356 351.88
0000025455 434.77
0000025487 352.18
0000025527 382.65
0000025630 383.61
从df2:
g2 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()
Output:
cnpj num_doc
02364118000124 0000015263 37353.10
0000015264 56214.14
0000015265 2115.92
0000015571 39873.74
0000015572 52611.06
0000015573 1121.64
02817342000124 0000010154 13625.31
0000010299 407.02
03572854000130 0000007298 7170.54
04967974000280 0000025175 388.18
0000025230 384.07
0000025273 363.35
0000025356 351.88
0000025455 434.77
0000025487 352.18
0000025527 382.65
0000025630 383.61
0000025660 400.15
0000025663 325.72
0000025785 342.14
0000025860 287.19
0000025933 378.77
0000026016 334.84
0000026083 357.55
0000026149 367.37
0000026197 388.14
05413136000147 0000005704 199.85
0000005709 187.22
0000005710 157.40
0000005711 81.30
所以..如何创建以下数据框?
cnpj num_doc sum_bc_icms_df1 sum_vlr_item_df2 diff
0001 0000001 100.000,00 98.000,00 2.000,00
0001 0000002 50.000,00 50.000,00 0,00
0002 0000001 10.000,00 12.000,00 -2.000,00
0003 0000001 60.000,00 55.000,00 5.000,00
或者也许通过创建一种使用AND条件有条件地对值求和的方法来实现我想要的(得到值的总和之间的差)……但我不知道如何……。
最佳答案
您可以这样操作:
In [48]: g1 = df1.groupby(['cnpj','num_doc']).sum()
In [49]: g2 = df2.groupby(['cnpj','num_doc']).sum()
In [50]: joined = g1.join(g2).fillna(0)
In [51]: joined
Out[51]:
bc_icms vlr_item
cnpj num_doc
2364118000124 15263 37353.10 37353.10
15264 56214.14 56214.14
15265 2115.92 2115.92
2817342000124 10154 17827.07 13625.31
7374346000107 14224 320.12 320.12
14231 385.04 385.04
14263 401.28 401.28
14279 391.26 391.26
7720786000160 20128 355.53 0.00
20437 461.32 0.00
20438 168.04 0.00
20439 385.23 0.00
20444 378.80 0.00
20446 289.67 0.00
20451 223.36 0.00
20461 305.31 0.00
8953538000122 12865 686.00 0.00
54921580000189 112428 108000.00 0.00
57101370000132 2042 200.00 0.00
In [53]: joined['diff'] = joined['bc_icms'] - joined['vlr_item']
In [54]: joined
Out[54]:
bc_icms vlr_item diff
cnpj num_doc
2364118000124 15263 37353.10 37353.10 0.00
15264 56214.14 56214.14 0.00
15265 2115.92 2115.92 0.00
2817342000124 10154 17827.07 13625.31 4201.76
7374346000107 14224 320.12 320.12 0.00
14231 385.04 385.04 0.00
14263 401.28 401.28 0.00
14279 391.26 391.26 0.00
7720786000160 20128 355.53 0.00 355.53
20437 461.32 0.00 461.32
20438 168.04 0.00 168.04
20439 385.23 0.00 385.23
20444 378.80 0.00 378.80
20446 289.67 0.00 289.67
20451 223.36 0.00 223.36
20461 305.31 0.00 305.31
8953538000122 12865 686.00 0.00 686.00
54921580000189 112428 108000.00 0.00 108000.00
57101370000132 2042 200.00 0.00 200.00
单线:
joined = df1.groupby(['cnpj','num_doc']).sum() \
.join(df2.groupby(['cnpj','num_doc']).sum()) \
.fillna(0)
关于python - 如何获得2个数据帧的条件值总和之间的差?,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/36410207/