Consider the following sample data set:
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
sample.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price)
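Rows 1 and 4 actually record the same observation. The only difference is that row 1 says "A" bought the asset (implying "B" sold it), while row 4 says "B" sold the asset (implying "A" bought it).
I would like to create the following data set with code: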
counterparty1 <- c("A","B","B","B","B")
counterparty2 <- c("B","C","A","A","C")
counterparty1_side <- c("buy","sell","buy","sell","sell")
price <- c(1.2,3.7,2.5,1.2,3.7)
transaction_number <- c(1,2,3,1,4)
duplicate <- c(1,0,0,1,0)
clean.data <- data.frame(counterparty1,counterparty2,counterparty1_side,price,transaction_number,duplicate)
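Of course, my real data set is much larger, so I cannot hard-code this.
Update: I have added row 5, which is identical to row 2, including the order of counterparties 1 and 2. I want the "duplicate" variable to flag only rows 1 and 4 as duplicates (because they are mirror images of each other), not rows 2 and 5.
Best answer
Updated answer:
This addresses the OP's follow-up: if the same transaction genuinely occurs twice, it should not be treated as a duplicate (for example, B selling something to C for 3.7 twice); see the comments and the updated question.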
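library(dplyr)

sample.data %>%
  # Buyer-first key, so "A buys from B" and "B sells to A" both become "AB"
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  # Number exact repeats of a row (1, 2, ...) so that they land in separate groups
  group_by_all() %>%
  mutate(dup_dum = 1:n()) %>%
  group_by(transaction, dup_dum) %>%
  # In dplyr >= 1.0.0, cur_group_id() replaces the argument-free group_indices()
  mutate(transaction_number = group_indices(),
         duplicate = +(n() != n_distinct(transaction, dup_dum))) %>%
  ungroup() %>%
  select(-transaction, -dup_dum)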
#> # A tibble: 5 x 6
#>   counterparty1 counterparty2 counterparty1_s~ price transaction_num~ duplicate
#>   <fct>         <fct>         <fct>            <dbl>            <int>     <int>
#> 1 A             B             buy                1.2                1         1
#> 2 B             C             sell               3.7                3         0
#> 3 B             A             buy                2.5                2         0
#> 4 B             A             sell               1.2                1         1
#> 5 B             C             sell               3.7                4         0
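For reference, the same idea can be sketched in base R without dplyr. This is an illustrative sketch, not part of the original answer: the helper vectors `is_buy`, `buyer`, `seller`, `key`, `row_id`, `occurrence` and `group` are names chosen here, and it assumes the counterparty columns are character vectors rather than factors. It numbers transactions in order of first appearance, which happens to reproduce the numbering requested in the question.

```r
# Sample data from the question (character columns assumed, not factors)
sample.data <- data.frame(
  counterparty1      = c("A", "B", "B", "B", "B"),
  counterparty2      = c("B", "C", "A", "A", "C"),
  counterparty1_side = c("buy", "sell", "buy", "sell", "sell"),
  price              = c(1.2, 3.7, 2.5, 1.2, 3.7),
  stringsAsFactors   = FALSE
)

# Buyer-first key: "A buys from B" and "B sells to A" both become "AB"
is_buy <- sample.data$counterparty1_side == "buy"
buyer  <- ifelse(is_buy, sample.data$counterparty1, sample.data$counterparty2)
seller <- ifelse(is_buy, sample.data$counterparty2, sample.data$counterparty1)
key    <- paste0(buyer, seller)

# Running count among fully identical rows: 1 for the first copy, 2 for the next
row_id     <- do.call(paste, c(sample.data, sep = "|"))
occurrence <- ave(seq_along(row_id), row_id, FUN = seq_along)

# A mirrored pair shares key and occurrence; an exact repeat gets a new occurrence
group <- paste(key, occurrence)

clean.data <- sample.data
# Number groups in order of first appearance
clean.data$transaction_number <- match(group, unique(group))
# Flag rows whose group contains more than one row
clean.data$duplicate <- as.integer(ave(seq_along(group), group, FUN = length) > 1)
```

With the sample data this gives transaction_number 1, 2, 3, 1, 4 and duplicate 1, 0, 0, 1, 0. As in the dplyr pipeline, the key itself ignores price, so two mirrored reports of the same buyer and seller at different prices would still be paired; appending price to `key` is one possible way to tighten that, if it matches the real data.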
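Original answer: this version treats both kinds of duplicates as duplicates, whether a row is repeated with the counterparties' roles swapped or is an exact repeat (see the question's edit history for its first version).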
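library(dplyr)

sample.data %>%
  # Buyer-first key, so "A buys from B" and "B sells to A" both become "AB"
  mutate(transaction = if_else(counterparty1_side == "buy",
                               paste0(counterparty1, counterparty2),
                               paste0(counterparty2, counterparty1))) %>%
  group_by(transaction) %>%
  # In dplyr >= 1.0.0, cur_group_id() replaces the argument-free group_indices()
  mutate(transaction_number = group_indices(),
         duplicate = +(n() != n_distinct(transaction))) %>%
  ungroup() %>%
  select(-transaction)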
# # A tibble: 4 x 6
#   counterparty1 counterparty2 counterparty1_side price transaction_number duplicate
#   <fct>         <fct>         <fct>              <dbl>              <int>     <int>
# 1 A             B             buy                  1.2                  1         1
# 2 B             C             sell                 3.7                  3         0
# 3 B             A             buy                  2.5                  2         0
# 4 B             A             sell                 1.2                  1         1
Source: the corresponding Stack Overflow question on pairing observations in R (tag: r): https://stackoverflow.com/questions/57011800/