Problem description
I have a data frame where several columns may have the same name. In this small example, both columns "A" and "G" occur twice:
   A  C  G  A  G  T
1  1 NA NA NA  1 NA
2  1 NA  5  3  1 NA
3 NA  1 NA NA NA  1
4 NA NA  1  2 NA NA
5 NA NA  1  1 NA NA
6 NA  1 NA NA NA  1
7 NA  1 NA NA NA  1
I wish to create a data set with one column per column name. For each row, the individual column values should be replaced with the sum (sum(..., na.rm = TRUE)) of the values within each column name. For example, in row two, the two individual "A" values (1 and 3) should be replaced with 4. I don't know in advance which column names occur several times.
The expected output would then be:
# A C G T
# 1 1 0 1 0
# 2 4 0 6 0
# 3 0 1 0 1
# 4 2 0 1 0
# 5 1 0 1 0
# 6 0 1 0 1
# 7 0 1 0 1
So I guess I could do something like:
noms = colnames(dat)
for (x in noms[duplicated(noms)]) {
  dat[, x] = rowSums(dat[, x == noms], na.rm = TRUE)
}
dat = dat[, !duplicated(noms)]
But this is a bit clunky and for loops are meant to be evil. Is there any way to do this more simply?
Recommended answer
We can transpose dat, calculate rowsum per group (colnames of the original dat), then transpose the result back to original structure.
t(rowsum(t(dat), group = colnames(dat), na.rm = TRUE))
# A C G T
#1 1 0 1 0
#2 4 0 6 0
#3 0 1 0 1
#4 2 0 1 0
#5 1 0 1 0
#6 0 1 0 1
#7 0 1 0 1
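For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the example data by hand (the way dat is constructed and the name res are my own assumptions, not from the question) and then applies the same transpose / rowsum / transpose steps.

# Rebuild the example data. Names are assigned afterwards because
# data.frame() would make duplicated names unique by default.
dat <- data.frame(
  c(1, 1, NA, NA, NA, NA, NA),
  c(NA, NA, 1, NA, NA, 1, 1),
  c(NA, 5, NA, 1, 1, NA, NA),
  c(NA, 3, NA, 2, 1, NA, NA),
  c(1, 1, NA, NA, NA, NA, NA),
  c(NA, NA, 1, NA, NA, 1, 1)
)
names(dat) <- c("A", "C", "G", "A", "G", "T")

# t(dat) turns columns into rows, rowsum() adds up the rows that share
# a name (ignoring NA), and the outer t() restores the original layout.
res <- t(rowsum(t(dat), group = colnames(dat), na.rm = TRUE))

# Row 2 now holds 4, 0, 6, 0 for A, C, G, T.
res[2, ]

Note that res is a numeric matrix rather than a data frame; as.data.frame(res) converts it back if a data frame is needed. By default rowsum also sorts the groups, so the columns come out in alphabetical order of their names.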