Consider the following sample dataset:

counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
sample.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price)

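Rows 1 and 4 actually describe the same observation. The only difference is that row 1 says "A" bought the asset (implying that "B" sold it), while row 4 says "B" sold the asset (implying that "A" bought it).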
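To make the mirroring explicit, here is a minimal base R sketch that rewrites every row as buyer and seller, regardless of which side reported it. The `buyer` and `seller` vectors are helper names introduced here for illustration only; they are not columns of the dataset.

```r
# Sample data as given in the question (strings kept as characters).
sample.data <- data.frame(
  counterparty1      = c("A", "B", "B", "B", "B"),
  counterparty2      = c("B", "C", "A", "A", "C"),
  counterparty1_side = c("buy", "sell", "buy", "sell", "sell"),
  price              = c(1.2, 3.7, 2.5, 1.2, 3.7),
  stringsAsFactors   = FALSE
)

# Hypothetical helpers: who bought and who sold in each row.
is_buy <- sample.data$counterparty1_side == "buy"
buyer  <- ifelse(is_buy, sample.data$counterparty1, sample.data$counterparty2)
seller <- ifelse(is_buy, sample.data$counterparty2, sample.data$counterparty1)

# Side-by-side view of the normalized rows.
data.frame(row = seq_len(nrow(sample.data)), buyer, seller,
           price = sample.data$price)
```

Rows 1 and 4 both come out as buyer A, seller B, price 1.2, which is why they describe a single transaction.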

I would like to create the following dataset with code:
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
transaction_number <- c(1,2,3,1,4)
duplicate <- c(1,0,0,1,0)
clean.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price,transaction_number,duplicate)

Of course, my real dataset is much larger, so I can't hard-code this.

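Update: I added row 5, which is identical to row 2, including the same order of counterparties 1 and 2. I want the "duplicate" variable to flag only rows 1 and 4 as duplicates (because they mirror each other), not rows 2 and 5.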
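The distinction between a mirrored report and an exact repeat can be stated in code. The following is a rough base R sketch, not a finished solution: `mirror`, `row_key`, `mirrored` and `exact_repeat` are names made up for this illustration, and the sketch assumes that a mirrored report carries exactly the same price.

```r
sample.data <- data.frame(
  counterparty1      = c("A", "B", "B", "B", "B"),
  counterparty2      = c("B", "C", "A", "A", "C"),
  counterparty1_side = c("buy", "sell", "buy", "sell", "sell"),
  price              = c(1.2, 3.7, 2.5, 1.2, 3.7),
  stringsAsFactors   = FALSE
)

# Mirror image of each row: counterparties swapped, side flipped, same price.
mirror <- data.frame(
  counterparty1      = sample.data$counterparty2,
  counterparty2      = sample.data$counterparty1,
  counterparty1_side = ifelse(sample.data$counterparty1_side == "buy",
                              "sell", "buy"),
  price              = sample.data$price,
  stringsAsFactors   = FALSE
)

# Hypothetical helper: collapse each row into one string so rows can be matched.
row_key <- function(df) do.call(paste, c(df, sep = "|"))

# 1 if the row's mirror image also appears in the data, 0 otherwise.
mirrored <- as.integer(row_key(mirror) %in% row_key(sample.data))

# TRUE for rows that appear more than once in exactly the same form.
exact_repeat <- duplicated(sample.data) | duplicated(sample.data, fromLast = TRUE)

mirrored      # 1 0 0 1 0
exact_repeat  # FALSE TRUE FALSE FALSE TRUE
```

Only `mirrored` corresponds to the desired "duplicate" column; `exact_repeat` picks out rows 2 and 5, which should stay unflagged.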

Best answer

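Updated answer:
This addresses the OP's follow-up, which states that if the same transaction happens twice it should not be treated as a duplicate (e.g. B sells something to C at a price of 3.7 twice); see the comments and the updated question.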

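library(dplyr)  # cur_group_id() below needs dplyr >= 1.0.0

sample.data %>%
  # Key each trade as buyer followed by seller, so mirrored reports share a key
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  # Number exact repeats of a row 1, 2, ... so they land in separate groups
  group_by_all() %>%
  mutate(dup_dum = 1:n()) %>%
  group_by(transaction, dup_dum) %>%
  # A group with more than one row is a trade reported from both sides
  mutate(transaction_number = cur_group_id(),
         duplicate = +(n() > 1)) %>%
  ungroup() %>%
  select(-transaction, -dup_dum)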

#> # A tibble: 5 x 6
#>   counterparty1 counterparty2 counterparty1_s~ price transaction_num~ duplicate
#>   <fct>         <fct>         <fct>            <dbl>            <int>     <int>
#> 1 A             B             buy                1.2                1         1
#> 2 B             C             sell               3.7                3         0
#> 3 B             A             buy                2.5                2         0
#> 4 B             A             sell               1.2                1         1
#> 5 B             C             sell               3.7                4         0
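
To see why the pipeline above keeps rows 2 and 5 apart while pairing rows 1 and 4, it can help to inspect the two intermediate columns. This is only an illustrative sketch: it assumes dplyr 1.0.0 or later, uses `group_by(across(everything()))` as the current spelling of `group_by_all()`, and `steps` is a name introduced here.

```r
library(dplyr)

sample.data <- data.frame(
  counterparty1      = c("A", "B", "B", "B", "B"),
  counterparty2      = c("B", "C", "A", "A", "C"),
  counterparty1_side = c("buy", "sell", "buy", "sell", "sell"),
  price              = c(1.2, 3.7, 2.5, 1.2, 3.7),
  stringsAsFactors   = FALSE
)

steps <- sample.data %>%
  # Buyer-then-seller key: mirrored reports of one trade get the same key.
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  # Within each set of fully identical rows, count 1, 2, ...
  group_by(across(everything())) %>%
  mutate(dup_dum = row_number()) %>%
  ungroup()

steps$transaction  # "AB" "CB" "BA" "AB" "CB"
steps$dup_dum      # 1 1 1 1 2
```

Rows 1 and 4 share the pair ("AB", 1) and so fall into one group of size two, while rows 2 and 5 get ("CB", 1) and ("CB", 2) and therefore stay in separate groups of size one.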
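
Original answer:
This version treats rows as duplicates whether they are duplicates because the counterparties' roles are swapped or because they are exact repeats (see the question's edit history for its first version).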
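library(dplyr)  # cur_group_id() below needs dplyr >= 1.0.0

sample.data %>%
  # Key each trade as buyer followed by seller, so mirrored reports share a key
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  group_by(transaction) %>%
  # Any key that occurs more than once is flagged as a duplicate
  mutate(transaction_number = cur_group_id(),
         duplicate = +(n() > 1)) %>%
  ungroup() %>%
  select(-transaction)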

# # A tibble: 4 x 6
#  counterparty1 counterparty2 counterparty1_side price transaction_number duplicate
#  <fct>         <fct>         <fct>              <dbl> <int>              <int>
# 1 A             B             buy               1.2   1                  1
# 2 B             C             sell              3.7   3                  0
# 3 B             A             buy               2.5   2                  0
# 4 B             A             sell              1.2   1                  1

This Q&A on pairing observations in R comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/57011800/
