问题描述
我有一个等位基因身份的 data.table(行是个体,列是基因座),按单独的列分组.我想按组有效地计算每个基因座的等位基因频率(比例).示例数据表:
I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:
DT = data.table(Loc1=rep(c("G","T"),each=5),
Loc2=c("C","A"), Loc3=c("C","G","G","G",
"C","G","G","G","G","G"),
Group=c(rep("G1",3),rep("G2",4),rep("G3",3)))
for(i in 1:3)
set(DT, sample(10,2), i, NA)
> DT
Loc1 Loc2 Loc3 Group
1: G NA C G1
2: G A G G1
3: G C G G1
4: NA NA NA G2
5: G C NA G2
6: T A G G2
7: T C G G2
8: T A G G3
9: T C G G3
10: NA A G G3
我遇到的问题是,当我尝试按组进行计算时,只有组中存在的等位基因 i.d.s 被识别,所以我很难找到可以告诉我的代码,例如基因座 1 的 G 在所有 3 组中的比例.简单的例子,计算每个基因座的第一个等位基因的总和(不是比例):
The problem I have is that when I try to do calculations by group, only the allele i.d.s present in the group are recognized, so I'm struggling to find code that can tell me e.g. the proportion of G's for locus 1 in all 3 groups. Simple example, calculating a sum (not proportion) for the first allele at each locus:
> fun1<- function(x){sum(na.omit(x==unique(na.omit(x))[1]))}
> DT[,lapply(.SD,fun1),by=Group,.SDcols=1:3]
Group Loc1 Loc2 Loc3
1: G1 3 1 1
2: G2 1 2 2
3: G3 2 2 3
对于 G1,结果是 Loc1 有 3 个 G,但对于 G3,它显示 Loc1 有 2 个 T,而不是 G 的数量.在这种情况下,我想要两者的 G 数.所以关键问题是等位基因身份是由组确定的,而不是整个列.我尝试使用要在计算中使用的等位基因身份制作一个单独的表格,但无法弄清楚如何将其包含在 fun1 中,以便在上面的 lapply 中引用正确的单元格.等位基因身份表:
For G1 the result is that Loc1 has 3 G's, but for G3 it shows Loc1 has 2 T's, not the number of G's. I want the number of G's for both in this case. So the key problem is that the allele identities are determined by group, not over the whole column. I tried making a separate table with the allele identities I want to use in calculations, but can't figure out how to include it in fun1 so that the correct cells are referenced in lapply above. Allele identities table:
> fun2<- function(x){sort(na.omit(unique(x)))}
> allele.id<-data.table(DT[,lapply(.SD,fun2),.SDcols=1:3])
> allele.id
Loc1 Loc2 Loc3
1: G A C
2: T C G
推荐答案
首先将您的 data.table 转换为长格式可能是明智的.这将使它更容易用于进一步的计算(或使用 ggplot2
进行可视化).使用 data.table
的 melt
函数(与 reshape2
包的 melt
函数相同)您可以从宽格式转换为长格式:
It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2
for example). With the melt
function of data.table
(which works the same as the melt
function of the reshape2
package) you can transform from wide to long format:
DT2 <- melt(DT, id = "Group", variable.name = "loci")
当您想在熔解操作期间删除 NA
-值时,您可以在上述调用中添加 na.rm = TRUE
(na.rm = FALSE
是默认行为).
When you want to remove the NA
-values during the melt-operation, you can add na.rm = TRUE
in the above call (na.rm = FALSE
is the default behaviour).
然后你可以制作如下计数和比例变量:
Then you can make count and proportion variables as follows:
DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]
给出以下结果:
> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000
如果你希望它以宽格式返回,你可以在多个变量上使用 dcast
:
I you want it back in wide format, you can use dcast
on multiple variables:
DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)
导致:
> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
另一种直接的方法是在一次调用中使用 melt
和 dcast
(这是@Frank 答案第一部分的简化版本):
Another and straightforward approach is using melt
and dcast
in one call (which is a simplified version of the first part of @Frank's answer):
DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)
给出:
> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0
因为dcast
中默认的聚合函数是length
,你会自动得到每个值的计数.
Because the default aggregation function in dcast
is length
, you will automatically get the counts for each of the values.
使用过的数据:
DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
这篇关于如何使用 data.table 跨多个列(基因座)按组有效计算等位基因频率(比例)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!