Problem description
I have a data table with 4 columns: ID, Name, Rate1, Rate2.
I want to remove duplicates where ID, Rate1, and Rate2 are the same, but if both rates are NA, I would like to keep both rows.
Basically, I want to remove duplicates conditionally: only when the rate values are not NA.
For example, I would like this:
ID Name Rate1 Rate2
1 Xyz 1 2
1 Abc 1 2
2 Def NA NA
2 Lmn NA NA
3 Hij 3 5
3 Qrs 3 7
to become this:
ID Name Rate1 Rate2
1 Xyz 1 2
2 Def NA NA
2 Lmn NA NA
3 Hij 3 5
3 Qrs 3 7
Thanks in advance!
EDIT: I know it's possible to take the subset of the data table where the rates are NA, remove duplicates from what's left, and then add the NA rows back in, but I would rather avoid this strategy. In reality there are quite a few pairs of rate columns that I want to do this for consecutively.
EDIT2: Added in some more rows to the example for clarity.
A base R option is to use duplicated on the dataset without the 'Name' column (column index 2) to create a logical vector, then negate it with ! (TRUE becomes FALSE and vice versa) so that TRUE marks the non-duplicated rows. Along with that, create a second condition with rowSums on the logical matrix is.na(df1[3:4]) (the Rate columns) to find the rows where every rate is NA; here the row sum is compared with 2, the number of Rate columns in the dataset. The two conditions are joined with | to create the expected logical index:
i1 <- !duplicated(df1[-2])| rowSums(is.na(df1[3:4])) == 2
df1[i1,]
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
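To make the two conditions easier to follow, here is a small sketch that builds them separately on the same df1 before combining them. The names not_dup, all_na, and keep are only illustrative and do not appear in the answer.

```r
# Same data as df1 in the "data" section of the answer.
df1 <- data.frame(
  ID    = c(1L, 1L, 2L, 2L),
  Name  = c("Xyz", "Abc", "Def", "Lmn"),
  Rate1 = c(1L, 1L, NA, NA),
  Rate2 = c(2L, 2L, NA, NA)
)

# Condition 1: TRUE for the first occurrence of each ID/Rate1/Rate2 combination.
# duplicated() treats NA as equal to NA, so row 4 counts as a repeat of row 3.
not_dup <- !duplicated(df1[-2])

# Condition 2: TRUE when every Rate column is NA in that row.
all_na <- rowSums(is.na(df1[3:4])) == 2

# A row is kept if either condition holds.
keep <- not_dup | all_na

# Show the three logical vectors side by side, then the filtered rows.
print(data.frame(not_dup, all_na, keep))
print(df1[keep, ])
```

Row 2 fails both conditions and is dropped; row 4 is a duplicate but is rescued by the all-NA condition.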
Or with Reduce from base R:
df1[Reduce(`&`, lapply(df1[3:4], is.na)) | !duplicated(df1[-2]), ]
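As a rough illustration of what the Reduce version computes, the sketch below unpacks it into steps and compares it with the rowSums condition. The variable names are illustrative only.

```r
# Same data as df1 in the "data" section of the answer.
df1 <- data.frame(
  ID    = c(1L, 1L, 2L, 2L),
  Name  = c("Xyz", "Abc", "Def", "Lmn"),
  Rate1 = c(1L, 1L, NA, NA),
  Rate2 = c(2L, 2L, NA, NA)
)

# lapply() returns a list with one logical vector per Rate column.
na_list <- lapply(df1[3:4], is.na)

# Reduce() folds the list with `&`: is.na(Rate1) & is.na(Rate2).
all_na_reduce <- Reduce(`&`, na_list)

# The rowSums() form counts the NAs per row and compares with the column count.
all_na_rowsums <- rowSums(is.na(df1[3:4])) == 2

# Both forms flag the same rows.
print(all(all_na_reduce == all_na_rowsums))
```

The Reduce form does not need the number of columns spelled out, which is convenient when the set of Rate columns changes.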
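Wrapping it in a function:
# dat: data frame; i: index of the column(s) ignored when checking duplicates;
# method: "rowSums" or "Reduce", i.e. how the all-NA condition is computed
f1 <- function(dat, i, method) {
  # names of the Rate columns
  nm1 <- grep("^Rate", colnames(dat), value = TRUE)
  # TRUE for non-duplicated rows
  i1 <- !duplicated(dat[-i])
  # TRUE where every Rate column is NA
  i2 <- switch(method,
    "rowSums" = rowSums(is.na(dat[nm1])) == length(nm1),
    "Reduce" = Reduce(`&`, lapply(dat[nm1], is.na))
  )
  i3 <- i1 | i2
  dat[i3, ]
}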
Testing:
f1(df1, 2, "rowSums")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
f1(df1, 2, "Reduce")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
f1(df2, 2, "rowSums")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
#5 3 Hij 3 5
#6 3 Qrs 3 7
f1(df2, 2, "Reduce")
# ID Name Rate1 Rate2
#1 1 Xyz 1 2
#3 2 Def NA NA
#4 2 Lmn NA NA
#5 3 Hij 3 5
#6 3 Qrs 3 7
If there are more 'Rate' columns (say 100 or more), the only thing to change in the first solution is the 2, which should be replaced by the number of 'Rate' columns.
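The question also mentions several pairs of rate columns to be handled one after another. One possible way to do that is sketched below. It assumes that duplicates are judged on ID plus the current pair only, which is a reading of the question and not something the answer states. The data frame df3, the helper dedupe_pair, and the list rate_pairs are all hypothetical.

```r
# Hypothetical data with two pairs of rate columns (not from the question).
df3 <- data.frame(
  ID    = c(1L, 1L, 2L, 2L, 3L, 3L),
  Name  = c("Xyz", "Abc", "Def", "Lmn", "Hij", "Qrs"),
  Rate1 = c(1L, 1L, NA, NA, 3L, 3L),
  Rate2 = c(2L, 2L, NA, NA, 5L, 7L),
  Rate3 = c(4L, 4L, 6L, 6L, NA, NA),
  Rate4 = c(8L, 9L, 6L, 6L, NA, NA)
)

# Hypothetical helper: drop rows that repeat an earlier key + rates combination,
# unless every column named in `rates` is NA for that row.
dedupe_pair <- function(dat, key, rates) {
  not_dup <- !duplicated(dat[c(key, rates)])
  all_na  <- rowSums(is.na(dat[rates])) == length(rates)
  dat[not_dup | all_na, , drop = FALSE]
}

# The pairs to process, in order.
rate_pairs <- list(c("Rate1", "Rate2"), c("Rate3", "Rate4"))

# Apply the pairs consecutively, feeding each result into the next step.
result <- Reduce(function(d, p) dedupe_pair(d, "ID", p), rate_pairs, df3)
print(result)
```

With this data, the first pair removes the "Abc" row and the second pair removes the "Lmn" row, leaving Xyz, Def, Hij, and Qrs. Because length(rates) replaces the hard-coded 2, the same helper works for groups of any size.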
Or using tidyverse:
library(tidyverse)
df1 %>%
group_by(ID) %>%
filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 3 x 4
# Groups: ID [2]
# ID Name Rate1 Rate2
# <int> <chr> <int> <int>
#1 1 Xyz 1 2
#2 2 Def NA NA
#3 2 Lmn NA NA
df2 %>%
group_by(ID) %>%
filter_at(vars(Rate1, Rate2), any_vars(!duplicated(.)|is.na(.)))
# A tibble: 5 x 4
# Groups: ID [3]
# ID Name Rate1 Rate2
# <int> <chr> <int> <int>
#1 1 Xyz 1 2
#2 2 Def NA NA
#3 2 Lmn NA NA
#4 3 Hij 3 5
#5 3 Qrs 3 7
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L), Name = c("Xyz", "Abc",
"Def", "Lmn"), Rate1 = c(1L, 1L, NA, NA), Rate2 = c(2L, 2L, NA,
NA)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Name = c("Xyz",
"Abc", "Def", "Lmn", "Hij", "Qrs"), Rate1 = c(1L, 1L, NA, NA,
3L, 3L), Rate2 = c(2L, 2L, NA, NA, 5L, 7L)), class = "data.frame",
row.names = c(NA, -6L))