我有2个熊猫DataFramesdf1df2),其标题如下:

DF1: cnpj     num_doc   bc_icms
DF2: cnpj     num_doc   vlr_item


我的问题:


  如何比较每个sum()BC_ICMSNUM_DOC
  来自CNPJDF1,相对于每次出现的sum()VLR_ITEM
  来自同一NUM_DOCCNPJ来自DF2吗?




我的想法:

我相信这就像"subtract the row that contains the result of aggfunc="sum" of df1 from df2"

或者也许找到一种创建类似的方法:

cnpj    num_doc    sum_bc_icms_df1    sum_vlr_item_df2    diff


其中diff是包含vlr_item_df1 - vlr_item_df2结果的列



我的资料:

DF1:

               cnpj     num_doc    bc_icms
0    02817342000124  0000010154   17827.07
1    54921580000189  0000112428  108000.00
2    08953538000122  0000012865     232.00
3    08953538000122  0000012865     239.00
4    08953538000122  0000012865     215.00
5    07374346000107  0000014224     320.12
6    07374346000107  0000014231     385.04
7    07374346000107  0000014263     401.28
8    07374346000107  0000014279     391.26
9    02364118000124  0000015263   37353.10
10   02364118000124  0000015264   56214.14
11   02364118000124  0000015265    2115.92
12   07720786000160  0000020128     355.53
13   57101370000132  0000002042     200.00
14   07720786000160  0000020437     461.32
15   07720786000160  0000020438     168.04
16   07720786000160  0000020439     385.23
17   07720786000160  0000020444     378.80
18   07720786000160  0000020446     289.67
19   07720786000160  0000020451     223.36
20   07720786000160  0000020461     305.31


DF2:

               cnpj     num_doc  vlr_item
0    02817342000124  0000010154   1391,04
1    02817342000124  0000010154     100,0
2    02817342000124  0000010154   1587,09
3    02817342000124  0000010154     200,0
4    02817342000124  0000010154    1430,0
5    02817342000124  0000010154     592,0
6    02817342000124  0000010154   7508,99
7    02817342000124  0000010154    443,03
8    02817342000124  0000010154      29,9
9    02817342000124  0000010154    343,26
10   07374346000107  0000014224    320,12
11   07374346000107  0000014231    385,04
12   07374346000107  0000014263    401,28
13   07374346000107  0000014279    391,26
14   02364118000124  0000015263  29725,96
15   02364118000124  0000015263   6590,31
16   02364118000124  0000015263   1036,83
17   02364118000124  0000015264  47348,71
18   02364118000124  0000015264   7641,97
19   02364118000124  0000015264   1223,46
20   02364118000124  0000015265   2115,92


规则:


1 CNPJ发出N NUM_DOC
CNPJ A可能会发出NUM_DOC 1,而CNPJ B也可能会发出NUM_DOC 1。
每个CNPJ可以出现多次。
每个NUM_DOC可以出现多次。




我的代码:

从df1:

g1 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()

Output:

cnpj            num_doc
00903175000109  0000059253       100.00
01114678000168  0000004705       358.00
01116428000167  0000006883       165.00
02295440000149  0000051845        72.65
                0000051866        69.90
02364118000124  0000015263     37353.10
                0000015264     56214.14
                0000015265      2115.92
                0000015571     39873.74
                0000015572     52611.06
                0000015573      1121.64
02817342000124  0000010154     17827.07
                0000010299       407.02
03509978001143  0000145539     17000.00
                0000145744    108000.00
                0000145745    108000.00
                0000145746    108000.00
                0000145749     15000.00
                0000145750     15000.00
                0000145753     15000.00
03572854000130  0000007298      7190.54
03688838000108  0000006286       155.77
04967974000280  0000025175       388.18
                0000025230       384.07
                0000025273       363.35
                0000025356       351.88
                0000025455       434.77
                0000025487       352.18
                0000025527       382.65
                0000025630       383.61


从df2:

g2 = df2.groupby(['cnpj','num_doc'])['vlr_item'].sum()

Output:

cnpj            num_doc
02364118000124  0000015263    37353.10
                0000015264    56214.14
                0000015265     2115.92
                0000015571    39873.74
                0000015572    52611.06
                0000015573     1121.64
02817342000124  0000010154    13625.31
                0000010299      407.02
03572854000130  0000007298     7170.54
04967974000280  0000025175      388.18
                0000025230      384.07
                0000025273      363.35
                0000025356      351.88
                0000025455      434.77
                0000025487      352.18
                0000025527      382.65
                0000025630      383.61
                0000025660      400.15
                0000025663      325.72
                0000025785      342.14
                0000025860      287.19
                0000025933      378.77
                0000026016      334.84
                0000026083      357.55
                0000026149      367.37
                0000026197      388.14
05413136000147  0000005704      199.85
                0000005709      187.22
                0000005710      157.40
                0000005711       81.30



  所以..如何创建以下数据框?


cnpj    num_doc    sum_bc_icms_df1    sum_vlr_item_df2    diff
0001    0000001    100.000,00          98.000,00          2.000,00
0001    0000002    50.000,00           50.000,00          0,00
0002    0000001    10.000,00           12.000,00          -2.000,00
0003    0000001    60.000,00           55.000,00          5.000,00


或者也许通过创建一种使用AND条件有条件地对值求和的方法来实现我想要的(得到值的总和之间的差)……但我不知道如何……。

最佳答案

您可以这样操作:

In [48]: g1 = df1.groupby(['cnpj','num_doc']).sum()

In [49]: g2 = df2.groupby(['cnpj','num_doc']).sum()

In [50]: joined = g1.join(g2).fillna(0)

In [51]: joined
Out[51]:
                          bc_icms  vlr_item
cnpj           num_doc
2364118000124  15263     37353.10  37353.10
               15264     56214.14  56214.14
               15265      2115.92   2115.92
2817342000124  10154     17827.07  13625.31
7374346000107  14224       320.12    320.12
               14231       385.04    385.04
               14263       401.28    401.28
               14279       391.26    391.26
7720786000160  20128       355.53      0.00
               20437       461.32      0.00
               20438       168.04      0.00
               20439       385.23      0.00
               20444       378.80      0.00
               20446       289.67      0.00
               20451       223.36      0.00
               20461       305.31      0.00
8953538000122  12865       686.00      0.00
54921580000189 112428   108000.00      0.00
57101370000132 2042        200.00      0.00

In [53]: joined['diff'] = joined['bc_icms'] - joined['vlr_item']

In [54]: joined
Out[54]:
                          bc_icms  vlr_item       diff
cnpj           num_doc
2364118000124  15263     37353.10  37353.10       0.00
               15264     56214.14  56214.14       0.00
               15265      2115.92   2115.92       0.00
2817342000124  10154     17827.07  13625.31    4201.76
7374346000107  14224       320.12    320.12       0.00
               14231       385.04    385.04       0.00
               14263       401.28    401.28       0.00
               14279       391.26    391.26       0.00
7720786000160  20128       355.53      0.00     355.53
               20437       461.32      0.00     461.32
               20438       168.04      0.00     168.04
               20439       385.23      0.00     385.23
               20444       378.80      0.00     378.80
               20446       289.67      0.00     289.67
               20451       223.36      0.00     223.36
               20461       305.31      0.00     305.31
8953538000122  12865       686.00      0.00     686.00
54921580000189 112428   108000.00      0.00  108000.00
57101370000132 2042        200.00      0.00     200.00


单线:

joined = df1.groupby(['cnpj','num_doc']).sum() \
            .join(df2.groupby(['cnpj','num_doc']).sum()) \
            .fillna(0)

关于python - 如何获得2个数据帧的条件值总和之间的差?,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/36410207/

10-11 03:50