How to calculate a moving average over two years

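Question

I have a large data frame (900k rows) about mergers and acquisitions (M&As).

The df has four columns: date (the year the M&A was completed), target_nation (the nation of the company that was merged or acquired), acquiror_nation (the nation of the acquiring company), and big_corp_TF (whether the acquiror was a big corporation; TRUE means it was).

Here is a sample of my df: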

> df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))

> df

   date target_nation acquiror_nation big_corp_TF
1: 2000        Uganda          France        TRUE
2: 2000        Uganda         Germany       FALSE
3: 2001        Uganda          France        TRUE
4: 2001        Uganda          France       FALSE
5: 2001        Uganda         Germany       FALSE
6: 2002        Uganda          France        TRUE
7: 2002        Uganda          France        TRUE
8: 2002        Uganda         Germany        TRUE

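From these data I want to create a new variable giving the share of M&As done by big corporations of a specific acquiror nation, computed as a two-year average. (For my actual work I will use a five-year average, but let's keep things simple here.) So there would be one new variable for French big corporations and another for German big corporations.

What I have done so far is 1) count the total number of M&As in a given target nation in a given year, and 2) count the number of M&As done by big corporations of a given acquiror nation in a given target nation in a given year. I joined these two data frames so that I can compute the share I want. Here is the code I used and the resulting df: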

library(dplyr)

##counting total rows for target nations
df2 <- df %>%
 group_by(date, target_nation) %>%
 count(target_nation)

##counting total rows conducted by small or big corps for certain acquiror nations

df3 <- df %>%
  group_by(date, target_nation, acquiror_nation) %>%
  count(big_corp_TF)

##selecting rows that were conducted by big corps

df33 <- df3 %>%
  filter(big_corp_TF == TRUE)

##merging df2 and df33

df4 <- df2 %>%
  left_join(df33, by = c("date" = "date", "target_nation" = "target_nation"))

df4 <- as.data.frame(df4)

> df4

  date target_nation n.x acquiror_nation big_corp_TF n.y
1 2000        Uganda   2          France        TRUE   1
2 2001        Uganda   3          France        TRUE   1
3 2002        Uganda   3          France        TRUE   2
4 2002        Uganda   3         Germany        TRUE   1

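Here n.x is the total number of M&As in a given target nation in a given year, and n.y is the number of M&As done by big corporations of a given acquiror nation in that target nation in that year.

With this new data frame df4, I can easily compute the single-year share of M&As in a given target nation that were done by big corporations of a given acquiror nation. For example, here is the share for France: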

df5 <- df4 %>%
  filter(acquiror_nation == "France") %>%
  mutate(France_bigcorp_share_1year = n.y / n.x)

  date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_1year
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.3333333
3 2002        Uganda   3          France        TRUE   2                  0.6666667

However, I cannot figure out how to compute that share as a two-year average.

Here is what the desired variable would look like:

  date target_nation n.x acquiror_nation big_corp_TF n.y France_bigcorp_share_2years
1 2000        Uganda   2          France        TRUE   1                  0.5000000
2 2001        Uganda   3          France        TRUE   1                  0.4000000
3 2002        Uganda   3          France        TRUE   2                  0.5000000

Note that the share for 2000 stays the same, because there is no previous year to average it with; 2001 becomes 0.4 (because (1+1)/(2+3) = 0.4); and 2002 becomes 0.5 (because (1+2)/(3+3) = 0.5).
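
The arithmetic above pools the counts of two consecutive years (sum of n.y over sum of n.x) rather than averaging the two yearly shares. A minimal base-R sketch of that calculation, using the France counts from df4; the helper name roll2 is made up for illustration:

```r
# Yearly counts for France in Uganda, taken from df4 (years 2000, 2001, 2002)
n_x <- c(2, 3, 3)  # all M&As in the target nation per year
n_y <- c(1, 1, 2)  # M&As done by French big corporations per year

# Hypothetical helper: current value plus the previous one
# (the first element has no predecessor, so it is left unchanged)
roll2 <- function(x) x + c(0, head(x, -1))

# Pooled two-year share: sum of numerators over sum of denominators
share_2y <- roll2(n_y) / roll2(n_x)
print(share_2y)
```

Averaging the yearly shares instead would give (0.5 + 0.3333)/2, roughly 0.417, for 2001, which is not the 0.4 asked for here.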

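Do you have any idea how to write code that computes the two-year average share? I think I would need a for loop here, but I do not know how to write it. Any suggestions would be appreciated.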

--

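EDIT: AnilGoyal's code works well with the sample data, but my actual data is apparently messier, so I wonder whether there is a solution to the problem I ran into.

My actual dataset sometimes skips a year, or sometimes does not include an acquiror_nation that appeared in earlier rows. Here is a more accurate sample of my actual data: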

> df_new <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L, 2003L, 2003L, 2004L, 2004L, 2004L, 2006L, 2006L
), target_nation = c("Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany", "Germany", "Germany", "France", "France", "Germany",
"France", "France"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)), row.names = c(NA,
-15L))

> df_new

    date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France     TRUE
 2: 2000        Uganda         Germany    FALSE
 3: 2001        Uganda          France     TRUE
 4: 2001        Uganda          France    FALSE
 5: 2001        Uganda         Germany    FALSE
 6: 2002        Uganda          France     TRUE
 7: 2002        Uganda          France     TRUE
 8: 2002        Uganda         Germany     TRUE
 9: 2003        Uganda         Germany     TRUE
10: 2003        Uganda         Germany    FALSE
11: 2004        Uganda          France     TRUE
12: 2004        Uganda          France    FALSE
13: 2004        Uganda         Germany     TRUE
14: 2006        Uganda          France     TRUE
15: 2006        Uganda          France     TRUE

NB: there are no rows for France in 2003, and no rows at all for 2005.

If I run Anil's first code, the result is the following tibble:

   date target_nation acquiror_nation    n1    n2 share
  <int> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1   0.5
2  2001 Uganda        France              3     1   0.4
3  2002 Uganda        France              3     2   0.5
4  2004 Uganda        France              3     1   0.5
5  2006 Uganda        France              2     2   0.6
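
Note: there are no results for France in 2003 and 2005. I would like to have results for those years (since we are computing a two-year average, we should be able to get them). Also, the share for 2006 is actually incorrect: it should be 1, because the average should use the 2005 values (which are 0) rather than the 2004 values.

I would like to get the following tibble: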

       date target_nation acquiror_nation    n1    n2 share
      <int> <chr>         <chr>           <dbl> <int> <dbl>
    1  2000 Uganda        France              2     1   0.5
    2  2001 Uganda        France              3     1   0.4
    3  2002 Uganda        France              3     2   0.5
    4  2003 Uganda        France              2     0   0.4
    5  2004 Uganda        France              3     1   0.2
    6  2005 Uganda        France              0     0   0.33
    7  2006 Uganda        France              2     2   1.0

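Note that the 2006 result is also different (because the two-year average now uses 2005 rather than 2004).

Do you think it is possible to produce the desired tibble? I understand that this is a problem with the original data: it simply lacks certain data points. However, adding them to the original dataset seems very inconvenient; it may be better to add them midway, for example after computing n1 and n2. What would be the most convenient way to do that?

EDIT2: Anil's new code works well with the data sample above, but it runs into an undesirable issue with more complex data samples that include more than one target_nation. Here is a shorter but more complex data sample: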

> df_new_complex <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L,
2003L, 1999L, 2001L, 2002L, 2002L), target_nation = c("Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Uganda", "Mozambique",
"Mozambique", "Mozambique", "Mozambique"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "Germany", "Germany",
"Germany", "France", "France", "Germany"), big_corp_TF = c(TRUE,
FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE
)), row.names = c(NA, -11L))

> df_new_complex

    date target_nation acquiror_nation big_corp_TF
 1: 2000        Uganda          France        TRUE
 2: 2000        Uganda         Germany       FALSE
 3: 2001        Uganda          France        TRUE
 4: 2001        Uganda          France       FALSE
 5: 2001        Uganda         Germany       FALSE
 6: 2003        Uganda         Germany        TRUE
 7: 2003        Uganda         Germany       FALSE
 8: 1999    Mozambique         Germany       FALSE
 9: 2001    Mozambique          France        TRUE
10: 2002    Mozambique          France       FALSE
11: 2002    Mozambique         Germany        TRUE

As you can see, this data sample includes two target nations. Anil's code, with param <- c("France", "Germany"), produces the following tibble:

    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0
 2  1999 Mozambique    Germany             1     0 0
 3  1999 Uganda        France              0     0 0
 4  1999 Uganda        Germany             0     0 0
 5  2000 Mozambique    France              0     0 0
 6  2000 Mozambique    Germany             0     0 0
 7  2000 Uganda        France              2     1 0.25
 8  2000 Uganda        Germany             2     0 0.167
 9  2001 Mozambique    France              1     1 0.4
10  2001 Mozambique    Germany             1     0 0.333
11  2001 Uganda        France              3     1 0.333
12  2001 Uganda        Germany             3     0 0.25
13  2002 Mozambique    France              2     0 0.2
14  2002 Mozambique    Germany             2     1 0.25
15  2002 Uganda        France              0     0 0.25
16  2002 Uganda        Germany             0     0 0.25
17  2003 Mozambique    France              0     0 0.25
18  2003 Mozambique    Germany             0     0 0.25
19  2003 Uganda        France              2     0 0.167
20  2003 Uganda        Germany             2     1 0.25
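
What is undesirable here is that the code created a year 1999 for Uganda and a year 2003 for Mozambique (the latter is not really a problem). As the data sample shows, there were no investments in Uganda in 1999, so figures for that year make no sense (the row could contain NAs, or not exist at all). Mozambique also had no investments in 2003, so I do not want shares computed for Mozambique for that year.

I have found a workaround, which is to filter for a specific target nation early in the code, like this: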

correct1 <- df_new_complex %>%
  filter(target_nation == "Mozambique") %>%
  mutate(d = 1) %>% ...

#I do the same for another target_nation

correct2 <- df_new_complex %>%
  filter(target_nation == "Uganda") %>%
  mutate(d = 1) %>% ...

#I then use rbind

correct <- rbind(correct1, correct2)

#which produces the desired tibble (without a year 2003 for Mozambique and 1999 for Uganda).

> correct

    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0
 2  1999 Mozambique    Germany             1     0 0
 3  2000 Mozambique    France              0     0 0
 4  2000 Mozambique    Germany             0     0 0
 5  2001 Mozambique    France              1     1 1
 6  2001 Mozambique    Germany             1     0 0
 7  2002 Mozambique    France              2     0 0.33
 8  2002 Mozambique    Germany             2     1 0.333
 9  2000 Uganda        France              2     1 0.5
10  2000 Uganda        Germany             2     0 0.25
11  2001 Uganda        France              3     1 0.286
12  2001 Uganda        Germany             3     0 0.2
13  2002 Uganda        France              0     0 0.167
14  2002 Uganda        Germany             0     0 0.167
15  2003 Uganda        France              2     0 0
16  2003 Uganda        Germany             2     1 0.25

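What would be a faster way of doing this? I have a list of the target nations I want. Perhaps I could write a loop that computes the result for one target nation, then for the next, and then rbinds them together, and so on. Or is there a better way?

Recommended answer

Using the package runner, you can do the following: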

df <- structure(list(date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2002L,
2002L, 2002L), target_nation = c("Uganda", "Uganda", "Uganda",
"Uganda", "Uganda", "Uganda", "Uganda", "Uganda"), acquiror_nation = c("France",
"Germany", "France", "France", "Germany", "France", "France",
"Germany"), big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE)), row.names = c(NA, -8L))

library(runner)
library(tidyverse)
df <- df %>% as.data.frame()
param <- 'France'
df %>%
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 3 x 6
#>    date target_nation acquiror_nation    n1    n2 share
#>   <int> <chr>         <chr>           <dbl> <int> <dbl>
#> 1  2000 Uganda        France              2     1   0.5
#> 2  2001 Uganda        France              3     1   0.4
#> 3  2002 Uganda        France              3     2   0.5

You can even do this for all acquiror nations at once:


df %>%
  group_by(date, target_nation) %>%
  mutate(n1 = n()) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  group_by(acquiror_nation) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))
#> # A tibble: 6 x 6
#> # Groups:   acquiror_nation [2]
#>    date target_nation acquiror_nation    n1    n2 share
#>   <int> <chr>         <chr>           <dbl> <int> <dbl>
#> 1  2000 Uganda        France              2     1 0.5
#> 2  2000 Uganda        Germany             2     0 0
#> 3  2001 Uganda        France              3     1 0.4
#> 4  2001 Uganda        Germany             3     0 0
#> 5  2002 Uganda        France              3     2 0.5
#> 6  2002 Uganda        Germany             3     1 0.167
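
The question mentions that the real analysis uses a five-year window; in the calls above that would presumably mean k = 5 instead of k = 2. As a rough illustration of what a windowed sum of width k computes, here is a base-R sketch; roll_sum is a hypothetical helper written for this example and is not part of runner:

```r
# Hypothetical helper: for each position, the sum of the last k values
# (fewer than k values at the start of the series)
roll_sum <- function(x, k) {
  cs <- cumsum(x)
  lagged <- c(rep(0, k), cs)[seq_along(cs)]  # cumulative sum k positions earlier
  cs - lagged
}

# France counts from the sample data (years 2000, 2001, 2002)
n1 <- c(2, 3, 3)  # all M&As in Uganda per year
n2 <- c(1, 1, 2)  # M&As by French big corporations per year

share_k2 <- roll_sum(n2, 2) / roll_sum(n1, 2)
share_k5 <- roll_sum(n2, 5) / roll_sum(n1, 5)
print(share_k2)
print(share_k5)
```

With only three years of data, the five-year window never fills up, so each share_k5 value simply pools every year seen so far.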

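For the revised scenario, you need to do two things:

  • Include the argument idx = date in both sum_run calls. This corrects the output as required, but shares for missing rows/years are still not included.
  • To include the missing years, you need tidyr::complete, as follows:
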
param <- 'France'
df_new %>%
  mutate(d = 1) %>%
  complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
           fill = list(d =0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  filter(acquiror_nation == param) %>%
  mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))

# A tibble: 7 x 6
   date target_nation acquiror_nation    n1    n2 share
  <dbl> <chr>         <chr>           <dbl> <int> <dbl>
1  2000 Uganda        France              2     1 0.5
2  2001 Uganda        France              3     1 0.4
3  2002 Uganda        France              3     2 0.5
4  2003 Uganda        France              2     0 0.4
5  2004 Uganda        France              3     1 0.2
6  2005 Uganda        France              0     0 0.333
7  2006 Uganda        France              2     2 1

As above, you can do this for all acquiror nations at once (replace the filter with group_by):

df_new %>%
  mutate(d = 1) %>%
  complete(date = seq(min(date), max(date), 1), nesting(target_nation, acquiror_nation),
           fill = list(d =0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  group_by(acquiror_nation) %>%
  mutate(share = sum_run(n2, k=2, idx = date)/sum_run(n1, k=2, idx = date))

# A tibble: 14 x 6
# Groups:   acquiror_nation [2]
    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  2000 Uganda        France              2     1 0.5
 2  2000 Uganda        Germany             2     0 0
 3  2001 Uganda        France              3     1 0.4
 4  2001 Uganda        Germany             3     0 0
 5  2002 Uganda        France              3     2 0.5
 6  2002 Uganda        Germany             3     1 0.167
 7  2003 Uganda        France              2     0 0.4
 8  2003 Uganda        Germany             2     1 0.4
 9  2004 Uganda        France              3     1 0.2
10  2004 Uganda        Germany             3     1 0.4
11  2005 Uganda        France              0     0 0.333
12  2005 Uganda        Germany             0     0 0.333
13  2006 Uganda        France              2     2 1
14  2006 Uganda        Germany             2     0 0
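
To see why both steps matter, here is a base-R sketch comparing a window over the previous row with a window over the previous calendar year. It uses the France rows that the first pipeline produces for df_new (no rows for 2003 or 2005). The helpers row_sum and window_sum are hypothetical and only mimic the idea of a date-indexed window; they are not runner's implementation:

```r
# France rows from the first pipeline on df_new (2003 and 2005 are absent)
date <- c(2000, 2001, 2002, 2004, 2006)
n1   <- c(2, 3, 3, 3, 2)
n2   <- c(1, 1, 2, 1, 2)

# Row-based window: current row plus the previous row, whatever its year
row_sum <- function(x) x + c(0, head(x, -1))

# Date-based window: sum over rows whose date lies in (date - k, date]
window_sum <- function(x, date, k) {
  sapply(seq_along(x), function(i) sum(x[date > date[i] - k & date <= date[i]]))
}

share_rows  <- row_sum(n2) / row_sum(n1)
share_dates <- window_sum(n2, date, 2) / window_sum(n1, date, 2)
print(round(share_rows, 3))   # 2006 is paired with 2004
print(round(share_dates, 3))  # 2006 stands alone, since 2005 has no row
```

In this sketch the date-based window already gives 1 for 2006, but 2004 comes out as 1/3 rather than the desired 0.2, because the 2003 total (n1 = 2) is missing from the France rows. That gap is what complete() fills.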

Further edit

  • This is easy: remove target_nation from nesting and add a group_by(target_nation) before complete.

Simple, isn't it?

df_new_complex %>%
  mutate(d = 1) %>%
  group_by(target_nation) %>%
  complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
           fill = list(d =0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1),
            n2 = sum(big_corp_TF), .groups = 'drop') %>%
  group_by(acquiror_nation) %>%
  mutate(share = sum_run(n2, k=2)/sum_run(n1, k=2))

# A tibble: 16 x 6
# Groups:   acquiror_nation [2]
    date target_nation acquiror_nation    n1    n2 share
   <dbl> <chr>         <chr>           <dbl> <int> <dbl>
 1  1999 Mozambique    France              1     0 0
 2  1999 Mozambique    Germany             1     0 0
 3  2000 Mozambique    France              0     0 0
 4  2000 Mozambique    Germany             0     0 0
 5  2000 Uganda        France              2     1 0.5
 6  2000 Uganda        Germany             2     0 0
 7  2001 Mozambique    France              1     1 0.667
 8  2001 Mozambique    Germany             1     0 0
 9  2001 Uganda        France              3     1 0.5
10  2001 Uganda        Germany             3     0 0
11  2002 Mozambique    France              2     0 0.2
12  2002 Mozambique    Germany             2     1 0.2
13  2002 Uganda        France              0     0 0
14  2002 Uganda        Germany             0     0 0.5
15  2003 Uganda        France              2     0 0
16  2003 Uganda        Germany             2     1 0.5
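
One thing to watch in the output above: the last step groups only by acquiror_nation and omits idx = date, so the two-row window can pair rows from different target nations. For example, the 0.667 for Mozambique/France in 2001 combines the 2000 Uganda row (n1 = 2, n2 = 1) with the 2001 Mozambique row (n1 = 1, n2 = 1). A hedged variant follows, assuming the intent is one rolling window per target nation and acquiror nation, indexed by year. Only the final grouping and the idx argument differ from the code above, and the data is rebuilt with data.frame() so the block is self-contained:

```r
library(dplyr)
library(tidyr)
library(runner)

# Same values as df_new_complex in the question
df_new_complex <- data.frame(
  date = c(2000L, 2000L, 2001L, 2001L, 2001L, 2003L, 2003L, 1999L, 2001L, 2002L, 2002L),
  target_nation = c(rep("Uganda", 7), rep("Mozambique", 4)),
  acquiror_nation = c("France", "Germany", "France", "France", "Germany", "Germany",
                      "Germany", "Germany", "France", "France", "Germany"),
  big_corp_TF = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

res <- df_new_complex %>%
  mutate(d = 1) %>%
  group_by(target_nation) %>%
  # fill in missing years separately within each target nation
  complete(date = seq(min(date), max(date), 1), nesting(acquiror_nation),
           fill = list(d = 0, big_corp_TF = FALSE)) %>%
  group_by(date, target_nation) %>%
  mutate(n1 = sum(d)) %>%
  group_by(date, target_nation, acquiror_nation) %>%
  summarise(n1 = mean(n1), n2 = sum(big_corp_TF), .groups = 'drop') %>%
  # assumption: one window per (target, acquiror) pair, indexed by year
  group_by(target_nation, acquiror_nation) %>%
  mutate(share = sum_run(n2, k = 2, idx = date) / sum_run(n1, k = 2, idx = date)) %>%
  ungroup()

print(res, n = Inf)
```

Under these assumptions, the Mozambique/France share for 2001 should come out as (0+1)/(0+1) = 1, the value shown for that row in the asker's `correct` table. If a two-year window ever contains no deals at all, the division is 0/0 and the share is NaN, which may need separate handling.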
