Problem description
I have a large data set that looks like the following:
Name SNP.x ILMN.Strand.x Customer.Strand.x SNP.y ILMN.Strand.y Customer.Strand.y
exm-rs10128711 [T/C] BOT BOT [T/C] BOT BOT
exm-rs10134944 [A/G] TOP BOT NA NA NA
exm-rs10218696 NA NA NA [T/C] BOT TOP
exm-rs10223421 [A/C] TOP BOT NA NA NA
How do I create new columns "SNP", "ILMN.Strand", "Customer.Strand", whereby:
- if SNP.x equals SNP.y, then "SNP", "ILMN.Strand", "Customer.Strand" are taken from "SNP.x", "ILMN.Strand.x", "Customer.Strand.x"
- if SNP.x is not equal to SNP.y and SNP.x is NA (missing value), then the values in the new columns are taken from "SNP.y", "ILMN.Strand.y", "Customer.Strand.y"
- if SNP.x is not equal to SNP.y and SNP.y is NA (missing value), then the values in the new columns are taken from "SNP.x", "ILMN.Strand.x", "Customer.Strand.x"
Many thanks in advance! :)
I am assuming that if both SNP.x and SNP.y are NA, the row is dropped from the data frame. If SNP.x != SNP.y and neither is NA, the row is also dropped (if that case were to occur).
The code below is not pretty or very efficient, but it ought to do the trick.
tmp <- apply(df, 1, function(x){
  # if SNP.x == SNP.y and neither is NA, pass the .x values
  if(!is.na(x["SNP.x"] == x["SNP.y"])) {
    if(x["SNP.x"] == x["SNP.y"])
      data.frame(Name = x["Name"], SNP = x["SNP.x"],
                 ILMN.Strand = x["ILMN.Strand.x"], Customer.Strand = x["Customer.Strand.x"])
  } else if(is.na(x["SNP.x"])) { # else if SNP.x is NA, pass the .y values
    if(!is.na(x["SNP.y"]))
      data.frame(Name = x["Name"], SNP = x["SNP.y"],
                 ILMN.Strand = x["ILMN.Strand.y"], Customer.Strand = x["Customer.Strand.y"])
  } else if(is.na(x["SNP.y"])) { # else if SNP.y is NA, pass the .x values
    if(!is.na(x["SNP.x"]))
      data.frame(Name = x["Name"], SNP = x["SNP.x"],
                 ILMN.Strand = x["ILMN.Strand.x"], Customer.Strand = x["Customer.Strand.x"])
  } else NULL # otherwise pass NULL
})
# rows where SNP.x != SNP.y (neither NA) or where both are NA also yield NULL
# rbind the list output of the previous apply() call, skipping the NULL entries
result <- do.call(rbind, tmp[!sapply(tmp, is.null)])
The result is a dataframe with the following structure:
str(result)
'data.frame': 81 obs. of 4 variables:
$ Name : Factor w/ 81 levels "exm-rs666","exm-rs3510",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SNP : Factor w/ 2 levels "[A/C]","[T/G]": 1 2 1 1 1 2 2 2 2 2 ...
$ ILMN.Strand : Factor w/ 2 levels "TOP","BOT": 1 1 1 2 1 1 2 2 1 1 ...
$ Customer.Strand: Factor w/ 2 levels "TOP","BOT": 1 1 1 2 1 2 1 1 1 1 ...
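For comparison, the same selection rules can be sketched in vectorised base R, without calling apply() row by row. This is only an illustration under stated assumptions: the data frame below rebuilds the four sample rows from the question as character columns, and the names use_x, use_y and pick are made up for this sketch.

```r
# Sample rows from the question, stored as character columns (assumption)
df <- data.frame(
  Name              = c("exm-rs10128711", "exm-rs10134944", "exm-rs10218696", "exm-rs10223421"),
  SNP.x             = c("[T/C]", "[A/G]", NA, "[A/C]"),
  ILMN.Strand.x     = c("BOT", "TOP", NA, "TOP"),
  Customer.Strand.x = c("BOT", "BOT", NA, "BOT"),
  SNP.y             = c("[T/C]", NA, "[T/C]", NA),
  ILMN.Strand.y     = c("BOT", NA, "BOT", NA),
  Customer.Strand.y = c("BOT", NA, "TOP", NA),
  stringsAsFactors  = FALSE
)

# TRUE where the .x columns apply: SNP.x is present and SNP.y is missing or equal
use_x <- !is.na(df$SNP.x) & (is.na(df$SNP.y) | df$SNP.x == df$SNP.y)
# TRUE where the .y columns apply: only SNP.y is present
use_y <- is.na(df$SNP.x) & !is.na(df$SNP.y)

# Hypothetical helper: choose element-wise between an .x and a .y column
pick <- function(x, y) ifelse(use_x, x, ifelse(use_y, y, NA))

result <- data.frame(
  Name             = df$Name,
  SNP              = pick(df$SNP.x, df$SNP.y),
  ILMN.Strand      = pick(df$ILMN.Strand.x, df$ILMN.Strand.y),
  Customer.Strand  = pick(df$Customer.Strand.x, df$Customer.Strand.y),
  stringsAsFactors = FALSE
)

# Drop rows where neither rule applied (both SNPs NA, or two different SNPs)
result <- result[use_x | use_y, ]
```

Because the two logical masks are computed once for the whole data frame, this avoids the matrix coercion that apply() performs, which may matter on a large data set.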
EDIT:
This might be the better solution (R 3.2.4 with dplyr 0.5.0), since apply() coerces the data frame into a matrix etc. The solution below also returns a unique '..' if SNP.x != SNP.y and neither of them is NA. Hope this will do the trick, although without more information on your data it is hard to anticipate which problems you may run into. In this solution factors are coerced into characters, so keep that in mind.
# This is a helper function for the logic; it expects scalar inputs (one row at a time)
# a and b will be tested; retA, retB, NA or '..' (see below) will be returned
logicalTest <- function(a, b, retA, retB){
  # coerce factors into character; comparing two factors with == fails
  # when their level sets differ, so a and b are converted as well
  if(is.factor(a)) a <- as.character(a)
  if(is.factor(b)) b <- as.character(b)
  if(is.factor(retA)) retA <- as.character(retA)
  if(is.factor(retB)) retB <- as.character(retB)
  tmp <- a == b # compare a and b (surrogates for SNP.x and SNP.y)
  if(is.na(tmp)){ # if the comparison was NA, at least one of a or b must have been NA ...
    if(is.na(a) && is.na(b)) return(NA) # if both were NA just return NA,
    else if(is.na(a)) return(retB) # if a was NA return retB,
    else return(retA) # otherwise return retA
  } else if(tmp){ # if tmp is TRUE (a == b)
    return(retA) # return retA
  } else return("..") # else (a != b) return ".."
}
# load dplyr for the bit below
library(dplyr)
# group_by(Name) makes each call to logicalTest() see one row, assuming Name is unique
result <- df %>%
  group_by(Name) %>%
  transmute(SNP = logicalTest(SNP.x, SNP.y, SNP.x, SNP.y),
            ILMN.Strand = logicalTest(SNP.x, SNP.y, ILMN.Strand.x, ILMN.Strand.y),
            Customer.Strand = logicalTest(SNP.x, SNP.y, Customer.Strand.x, Customer.Strand.y))
# get cleaned results
result[!rowSums(is.na(result)),] # drop rows with NAs
result[!(rowSums(is.na(result)) | result$SNP == ".."),] # drop rows with NAs and ".."
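If Name is not unique, or grouping by every row turns out to be slow, the same logic can be sketched with dplyr's vectorised coalesce() and if_else() instead of a scalar helper. This is a sketch under assumptions: the columns are character (not factor), the small data frame below is made up for illustration (its last row, "exm-hypothetical", is an invented conflicting pair), and conflict is a name introduced here.

```r
library(dplyr)

# Illustrative data: three rows modelled on the question plus one invented conflict
df <- data.frame(
  Name              = c("exm-rs10128711", "exm-rs10134944", "exm-rs10218696", "exm-hypothetical"),
  SNP.x             = c("[T/C]", "[A/G]", NA, "[A/G]"),
  ILMN.Strand.x     = c("BOT", "TOP", NA, "TOP"),
  Customer.Strand.x = c("BOT", "BOT", NA, "BOT"),
  SNP.y             = c("[T/C]", NA, "[T/C]", "[T/C]"),
  ILMN.Strand.y     = c("BOT", NA, "BOT", "BOT"),
  Customer.Strand.y = c("BOT", NA, "TOP", "TOP"),
  stringsAsFactors  = FALSE
)

result <- df %>%
  # conflict is TRUE only when both SNPs are present and differ
  mutate(conflict = !is.na(SNP.x) & !is.na(SNP.y) & SNP.x != SNP.y) %>%
  transmute(
    Name,
    # coalesce() takes the .x value unless it is NA, then falls back to .y
    SNP             = if_else(conflict, "..", coalesce(SNP.x, SNP.y)),
    ILMN.Strand     = if_else(conflict, "..", coalesce(ILMN.Strand.x, ILMN.Strand.y)),
    Customer.Strand = if_else(conflict, "..", coalesce(Customer.Strand.x, Customer.Strand.y))
  )

# Drop rows where both SNPs were NA and rows flagged as conflicts
cleaned <- result %>% filter(!is.na(SNP), SNP != "..")
```

Note that coalesce() decides per column, so a row whose SNP.x is present but whose ILMN.Strand.x is NA would take the strand from the .y side; if that can happen in the real data, the choice should be driven by the SNP columns alone.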