Applying a function over several subsets of a data frame
Problem description
I have a data frame, data, that holds the chromosome (chrom) and position (pos) of mutated nucleotides in a genome:
structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L,
20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom",
"pos"), class = "data.frame", row.names = c(NA, -14L))
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
and a second data frame, tss_locations, that holds the position of a feature (tss) for each gene, together with the chromosome (chrom) it lies on:
structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2",
"gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9"
), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L,
88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA,
-11L))
gene chrom tss
1 gene1 1 5
2 gene2 1 10
3 gene3 1 23
4 gene4 2 1340
5 gene5 2 313
6 gene6 2 88
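I am trying to calculate the distance from each pos in data to the closest tss on the same chromosome.
So far, I can calculate the distance from each data$pos to any tss_locations$tss (i.e. the closest tss to each pos, regardless of chromosome):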
fun <- function(p) {
# Get index of nearest tss
index<-which.min(abs(tss_locations$tss - p))
# Lookup the value
closestTss<-tss_locations$tss[[index]]
# Calculate the distance
dist<-(closestTss-p)
list(snp=p, closest=closestTss, distance2nearest=dist)
}
# Run function for each 'pos' in data
dist2tss<-lapply(data$pos, fun)
# Convert to data frame and sort ascending by distance
# (arrange() is not base R; it is assumed here to come from dplyr)
library(dplyr)
dist2tss<-do.call(rbind, dist2tss)
dist2tss<-as.data.frame(dist2tss)
dist2tss<-arrange(dist2tss,(as.numeric(distance2nearest)))
dist2tss$distance2nearest<-as.numeric(dist2tss$distance2nearest)
head(dist2tss)
snp closest distance2nearest
1 600 313 -287
2 400 313 -87
3 200 127 -73
4 100 88 -12
5 33 23 -10
6 134 127 -7
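However, I want to find the closest tss for each pos on the same chromosome, so that a pos is only compared with the tss values that share its chrom.
How can I adapt this to achieve that? Should I subset both data frames by chromosome and combine the results?
Is this the right approach so far?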
Recommended answer
Something like this should get the closest tss on the same chromosome for each row of the data data frame.
data <- structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L,
20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom",
"pos"), class = "data.frame", row.names = c(NA, -14L))
tss_locations <- structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2",
"gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9"
), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L,
88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA,
-11L))
# Generate needed values by applying function to all rows and transposing t() the results
data[,c("closest_gene", "closest_tss", "min_dist")] <- t(apply(data, 1, function(x){
# Get subset of tss_locations where the chromosome matches the current row
genes <- tss_locations[tss_locations$chrom == x["chrom"], ]
# Find the minimum distance from the current row's pos to the nearest tss location
min.dist <- min(abs(genes$tss - x["pos"]))
# Find the closest tss location to the current row's pos
closest_tss <- genes[which.min(abs(genes$tss - x["pos"])), "tss"]
# Check if closest tss location is less than pos and set min.dist to negative if true
min.dist <- ifelse(closest_tss < x["pos"], min.dist * -1, min.dist)
# Find the closest gene to the current row's pos
closest_gene <- as.character(genes[which.min(abs(genes$tss - x["pos"])), "gene"])
# Return the values to the matrix
return(c(closest_gene, closest_tss, min.dist))
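}))
# apply() returned a character matrix, so convert the numeric columns back to integers
data$closest_tss <- as.integer(data$closest_tss)
data$min_dist <- as.integer(data$min_dist)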
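The question also asks whether subsetting both data frames by chromosome and combining the results would work. The sketch below shows one way that idea could look in base R, using split(), lapply() and rbind(). It is not part of the original answer: the helper name nearest_on_chrom and the result name nearest are made up for illustration, the example data is rebuilt in a compact form, and the code assumes every chromosome in data has at least one tss in tss_locations.

```r
# Same example data as in the question, rebuilt compactly
data <- data.frame(
  chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L),
  pos   = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L, 45L,
            50L, 55L, 100L, 123L)
)
tss_locations <- data.frame(
  gene  = paste0("gene", 1:11),
  chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
  tss   = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L, 88L, 74L, 127L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: d holds the rows of data for a single chromosome.
# Assumes tss_locations has at least one row for that chromosome.
nearest_on_chrom <- function(d) {
  # Keep only the tss entries on the same chromosome as this subset
  genes <- tss_locations[tss_locations$chrom == d$chrom[1], ]
  # For every pos, the row index of the closest tss within that subset
  idx <- vapply(d$pos, function(p) which.min(abs(genes$tss - p)), integer(1))
  d$closest_gene <- genes$gene[idx]
  d$closest_tss  <- genes$tss[idx]
  # Signed distance: negative when the closest tss lies before pos
  d$min_dist     <- genes$tss[idx] - d$pos
  d
}

# Split by chromosome, process each subset, then stack the pieces again
nearest <- do.call(rbind, lapply(split(data, data$chrom), nearest_on_chrom))
rownames(nearest) <- NULL
nearest
```

Because each subset is handled as a whole, the result columns keep their natural types (character for closest_gene, integer for closest_tss and min_dist), so no conversion step is needed afterwards. The sign convention matches the question's function, where the distance is the closest tss minus pos.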