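I am cleaning a database of email addresses and the information associated with each one. Some emails appear more than once, but the information is complementary from one row to the next. So I want to merge rows using EMAIL as the key, and drop any row that only repeats information already present.
My database is a CSV file, loaded into a data frame with read.csv.
Input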

  EMAIL     Country     Gender        Language
1 y@y.com   US                           S
2 z@z.com   AR           female          S
3 z@z.com                female
4 s@f.com   US           female          E
5 s@f.com   US           female          E
6 y@y.com   US           male

Output
  EMAIL     Country     Gender        Language
1 y@y.com   US           male            S
2 z@z.com   AR           female          S
3 s@f.com   US           female          E

Best answer

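We can use dplyr. After grouping by EMAIL, summarise every remaining column down to its unique non-empty value. This assumes each email has exactly one distinct non-empty value per column, as in the sample data. The result comes back sorted by EMAIL.

library(dplyr)

df %>%
   group_by(EMAIL) %>%
   # Within each group, drop empty strings and keep the distinct values.
   # across() (dplyr >= 1.0.0) replaces the deprecated summarise_all(funs(...)).
   summarise(across(everything(), ~ unique(.x[.x != ""])))
# A tibble: 3 x 4
#   EMAIL   Country Gender Language
#   <chr>   <chr>   <chr>  <chr>
# 1 s@f.com US      female E
# 2 y@y.com US      male   S
# 3 z@z.com AR      female S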
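
If some email has no value at all in a column, or has two conflicting values, taking unique() of the non-empty entries no longer yields exactly one value per group. The following is a minimal sketch of one way to handle that, not part of the answer above: the helper first_non_empty and the sample df2 are hypothetical names made up for illustration, and "keep the first non-empty value, otherwise NA" is an assumed policy that may not suit every dataset.

```r
library(dplyr)

# Hypothetical sample with two edge cases the original data does not contain:
# a@a.com never has a Language, and b@b.com has two different countries.
df2 <- data.frame(
  EMAIL    = c("a@a.com", "a@a.com", "b@b.com", "b@b.com"),
  Country  = c("US", "", "AR", "BR"),
  Gender   = c("", "male", "female", ""),
  Language = c("", "", "S", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first non-empty, non-NA value,
# or NA if the group has none.
first_non_empty <- function(x) {
  x <- x[!is.na(x) & x != ""]
  if (length(x) == 0) NA_character_ else x[1]
}

merged <- df2 %>%
  group_by(EMAIL) %>%
  summarise(across(everything(), first_non_empty), .groups = "drop")

print(merged)
```

With this policy a@a.com keeps an NA Language, and b@b.com keeps "AR" because it appears first; a different rule (for example, flagging conflicts instead of silently picking one) would need a different helper.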

Data
df <- structure(list(EMAIL = c("y@y.com", "z@z.com", "z@z.com", "s@f.com",
"s@f.com", "y@y.com"), Country = c("US", "AR", "", "US", "US",
"US"), Gender = c("", "female", "female", "female", "female",
"male"), Language = c("S", "S", "", "E", "E", "")), .Names = c("EMAIL",
 "Country", "Gender", "Language"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Regarding "r - merging rows and removing duplicates in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48017315/
