How do I create new columns that satisfy different "IF" conditions in R?

Problem description

I have a big set of data that looks like the following:

Name            SNP.x  ILMN.Strand.x  Customer.Strand.x  SNP.y  ILMN.Strand.y  Customer.Strand.y
exm-rs10128711  [T/C]  BOT            BOT                [T/C]  BOT            BOT
exm-rs10134944  [A/G]  TOP            BOT                NA     NA             NA
exm-rs10218696  NA     NA             NA                 [T/C]  BOT            TOP
exm-rs10223421  [A/C]  TOP            BOT                NA     NA             NA

How do I create new columns "SNP", "ILMN.Strand" and "Customer.Strand", whereby:

  1. if SNP.x is equal to SNP.y, then "SNP", "ILMN.Strand", "Customer.Strand" are taken from "SNP.x", "ILMN.Strand.x", "Customer.Strand.x"
  2. if SNP.x is not equal to SNP.y, and SNP.x is NA (missing value), then the values in the new columns are taken from "SNP.y", "ILMN.Strand.y", "Customer.Strand.y"
  3. if SNP.x is not equal to SNP.y, and SNP.y is NA (missing value), then the values in the new columns are taken from "SNP.x", "ILMN.Strand.x", "Customer.Strand.x"

Many thanks in advance! :)

Solution

I am assuming that if both SNP.x and SNP.y are NA, the row is dropped from the data frame. If SNP.x != SNP.y and neither is NA, the row is also dropped (should that case occur).

The code below is not pretty or very efficient, but it ought to do the trick.

tmp <- apply(df, 1, function(x){
  # both SNPs present: pass x if SNP.x == SNP.y, otherwise NULL (the row is dropped)
  if(!is.na(x["SNP.x"] == x["SNP.y"])) {
    if(x["SNP.x"] == x["SNP.y"]) data.frame(Name = x["Name"], SNP = x["SNP.x"],  ILMN.Strand = x["ILMN.Strand.x"], Customer.Strand = x["Customer.Strand.x"])
  } else if(is.na(x["SNP.x"])) { # SNP.x is NA: pass y, unless SNP.y is NA too (then NULL)
    if(!is.na(x["SNP.y"])) data.frame(Name = x["Name"], SNP = x["SNP.y"],  ILMN.Strand = x["ILMN.Strand.y"], Customer.Strand = x["Customer.Strand.y"])
  } else if(is.na(x["SNP.y"])) { # SNP.y is NA: pass x
    if(!is.na(x["SNP.x"])) data.frame(Name = x["Name"], SNP = x["SNP.x"],  ILMN.Strand = x["ILMN.Strand.x"], Customer.Strand = x["Customer.Strand.x"])
  } else NULL # fallback; not reached, since an NA comparison implies at least one SNP is NA
})

# rbind the list-output of the previous apply() function
result <- do.call(rbind, tmp[!sapply(tmp, is.null)])

The result is a dataframe with the following structure:

str(result)

'data.frame':   81 obs. of  4 variables:
 $ Name           : Factor w/ 81 levels "exm-rs666","exm-rs3510",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SNP            : Factor w/ 2 levels "[A/C]","[T/G]": 1 2 1 1 1 2 2 2 2 2 ...
 $ ILMN.Strand    : Factor w/ 2 levels "TOP","BOT": 1 1 1 2 1 1 2 2 1 1 ...
 $ Customer.Strand: Factor w/ 2 levels "TOP","BOT": 1 1 1 2 1 2 1 1 1 1 ...
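
The same rule can also be expressed without apply(), using vectorized base R. The following is only a sketch under stated assumptions: the columns are character vectors rather than factors (hence stringsAsFactors = FALSE), the first four rows are the sample rows from the question, and the last two rows (exm-hypo-bothNA, exm-hypo-mismatch) are hypothetical rows added to exercise the two "drop" cases.

```r
# Sample rows from the question, plus two hypothetical rows
# (one with both SNPs NA, one where the SNPs differ)
df <- data.frame(
  Name = c("exm-rs10128711", "exm-rs10134944", "exm-rs10218696",
           "exm-rs10223421", "exm-hypo-bothNA", "exm-hypo-mismatch"),
  SNP.x = c("[T/C]", "[A/G]", NA, "[A/C]", NA, "[A/G]"),
  ILMN.Strand.x = c("BOT", "TOP", NA, "TOP", NA, "TOP"),
  Customer.Strand.x = c("BOT", "BOT", NA, "BOT", NA, "BOT"),
  SNP.y = c("[T/C]", NA, "[T/C]", NA, NA, "[T/C]"),
  ILMN.Strand.y = c("BOT", NA, "BOT", NA, NA, "BOT"),
  Customer.Strand.y = c("BOT", NA, "TOP", NA, NA, "TOP"),
  stringsAsFactors = FALSE
)

# Take the .x columns when SNP.x is present, otherwise the .y columns
use_x <- !is.na(df$SNP.x)

merged <- data.frame(
  Name = df$Name,
  SNP = ifelse(use_x, df$SNP.x, df$SNP.y),
  ILMN.Strand = ifelse(use_x, df$ILMN.Strand.x, df$ILMN.Strand.y),
  Customer.Strand = ifelse(use_x, df$Customer.Strand.x, df$Customer.Strand.y),
  stringsAsFactors = FALSE
)

# Drop rows where both SNPs are NA, or where both are present but differ
both_na  <- is.na(df$SNP.x) & is.na(df$SNP.y)
mismatch <- !is.na(df$SNP.x) & !is.na(df$SNP.y) & df$SNP.x != df$SNP.y
result   <- merged[!(both_na | mismatch), ]
```

With factor columns, ifelse() would return the underlying integer codes, so the columns would need to be converted with as.character() first.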


EDIT:

This might be the better solution (R 3.2.4 with dplyr 0.5.0), since apply() coerces the data frame into a matrix. The solution below also returns a unique '..' if SNP.x != SNP.y and neither is NA. Hope this will do the trick, although without more information on your data it is hard to anticipate which problems you may run into. In this solution factors are coerced into characters, so keep that in mind.

# This is a helper function for the logic
# a and b will be tested; retA, retB, NA or '..' (see below) will be returned
# note: a and b are expected to be single values (one row at a time)
logicalTest <- function(a, b, retA, retB){ 

  # coerce factors into character
  if(is.factor(retA)) retA <- as.character(retA)
  if(is.factor(retB)) retB <- as.character(retB)

  tmp <- a == b                        # compare a and b (surrogates for SNP.x and SNP.y) and put in tmp variable

  if(is.na(tmp)){                      # if the comparison was NA one of a or b must have been NA ...
    if(is.na(a) && is.na(b)) return(NA) # if both were NA just return NA,
    else if(is.na(a)) return(retB)      # if a was NA return retB,
    else return(retA)                   # otherwise return retA
  } else if(tmp){                      # if tmp is TRUE (a == b)
    return(retA)                        # return retA
  } else return("..")                  # else (a != b) return ".."
}

# load dplyr for the bit below
library(dplyr)

result <- df %>% 
  group_by(Name) %>% 
  transmute(SNP = logicalTest(SNP.x, SNP.y, SNP.x, SNP.y),
            ILMN.Strand = logicalTest(SNP.x, SNP.y, ILMN.Strand.x, ILMN.Strand.y),
            Customer.Strand = logicalTest(SNP.x, SNP.y, Customer.Strand.x, Customer.Strand.y))

# get cleaned results (use one of the two lines below)
result[!rowSums(is.na(result)),] # drop rows with NAs
result[!(rowSums(is.na(result)) | result$SNP == ".."),] # drop rows with NAs and ".."
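
The helper above tests a single pair of values at a time, which is why the pipeline groups by Name. A vectorized variant could look like the sketch below. The name logicalTestVec and the sample vectors are hypothetical and not part of the original answer, and the sketch assumes all inputs are character or factor vectors of equal length.

```r
# Vectorized variant of the helper: works on whole columns at once.
# Returns retA where a is present, retB where only b is present,
# NA where both are NA, and ".." where both are present but differ.
logicalTestVec <- function(a, b, retA, retB) {
  # coerce everything to character (factors would otherwise compare by level)
  a <- as.character(a)
  b <- as.character(b)
  retA <- as.character(retA)
  retB <- as.character(retB)

  out <- rep(NA_character_, length(a))      # default: both a and b are NA
  has_a <- !is.na(a)
  out[has_a] <- retA[has_a]                 # a present: take retA
  only_b <- is.na(a) & !is.na(b)
  out[only_b] <- retB[only_b]               # only b present: take retB
  differ <- !is.na(a) & !is.na(b) & a != b
  out[differ] <- ".."                       # both present but different
  out
}

# Hypothetical sample vectors covering all four cases plus a match
snp_x    <- c("[T/C]", "[A/G]", NA, NA, "[A/G]")
snp_y    <- c("[T/C]", NA, "[T/C]", NA, "[T/C]")
strand_x <- c("BOT", "TOP", NA, NA, "TOP")
strand_y <- c("BOT", NA, "BOT", NA, "BOT")

snp    <- logicalTestVec(snp_x, snp_y, snp_x, snp_y)
strand <- logicalTestVec(snp_x, snp_y, strand_x, strand_y)
```

Because it operates on whole columns, a function like this could be called inside transmute() without group_by(Name), and it would not depend on every Name being unique.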


09-17 07:01