Return the rows "closest to" a value in R

Question

I have a data frame with several IDs, and I want to build a subset from it: for each ID, I want only the one row whose value in variable Y is closest to 0.5.

This is my data frame:

df <- data.frame(
  ID = c("DB1", "DB1", "DB2", "DB2", "DB3", "DB3", "DB4", "DB4", "DB4"),
  X  = c(0.04, 0.10, 0.10, 0.20, 0.02, 0.30, 0.01, 0.20, 0.30),
  Y  = c(0.34, 0.49, 0.51, 0.53, 0.48, 0.49, 0.49, 0.50, 1.0)
)

This is what I want:

 ID    X    Y
DB1 0.10 0.49
DB2 0.10 0.51
DB3 0.30 0.49
DB4 0.20 0.50

I know I can add a filter with ddply using something like this:

library(plyr)
ddply(df, .(ID), function(z) { z[z$Y == 0.50, ][1, ]})
and this would work fine if there were always a 0.50 value in Y, which is not the case.

How do I change the == into a "nearest to 0.5" comparison, or is there another function I could use instead?

Thanks in advance!

Recommended answer

You need to calculate each row's absolute difference from 0.5 and then keep the row with the smallest one. One way to do this:

ddply(df, .(ID), function(z) {
  z[abs(z$Y - 0.50) == min(abs(z$Y - 0.50)), ]
})

Note that the way I've coded it above, omitting your [1, ], if two rows are exactly tied both will be kept.

It should be fine since we're doing the exact same calculation on either side of ==, but I often worry about numerical precision problems, so we could instead use which.min. Note that which.min will return the first minimum in the case of a tie.

ddply(df, .(ID), function(z) {
  z[which.min(abs(z$Y - 0.50)), ]
})
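The same which.min idea can be sketched in base R, without plyr. This is only an illustration: closest_per_id is a hypothetical helper name, not something from the answer, and the sample data assumes the second ID in the question was meant to be "DB1" so that it matches the desired output.

```r
# Sample data from the question (assumption: the second ID is "DB1")
df <- data.frame(
  ID = c("DB1", "DB1", "DB2", "DB2", "DB3", "DB3", "DB4", "DB4", "DB4"),
  X  = c(0.04, 0.10, 0.10, 0.20, 0.02, 0.30, 0.01, 0.20, 0.30),
  Y  = c(0.34, 0.49, 0.51, 0.53, 0.48, 0.49, 0.49, 0.50, 1.0)
)

# Hypothetical helper: split by ID, keep the row nearest to `target`
# in each piece, then bind the pieces back together.
closest_per_id <- function(d, target = 0.5) {
  pieces <- split(d, d$ID)
  picked <- lapply(pieces, function(z) {
    # which.min returns the index of the first minimum, so ties keep one row
    z[which.min(abs(z$Y - target)), ]
  })
  out <- do.call(rbind, picked)
  rownames(out) <- NULL  # drop the ID-based row names added by rbind
  out
}

result <- closest_per_id(df)
print(result)
```

Because which.min picks exactly one index per group, this version never compares floating-point values with ==.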

Another robust way to do it would be to order the data frame by difference from 0.5 and keep the first row per ID. At this point I'll transition over to dplyr, though of course you could use dplyr or plyr::ddply for any of these methods.

library(dplyr)
df %>% group_by(ID) %>%
  arrange(abs(Y - 0.5)) %>%
  slice(1)

I'm not sure how arrange handles ties. For more methods, see the question "Get rows with minimum of variable, but only first row if multiple minima", and just always use abs(Y - 0.5) as the variable you are minimizing.
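If tie handling matters, one option that makes it explicit is dplyr's slice_min, which is not part of the original answer and, as far as I know, requires dplyr 1.0.0 or later. A minimal sketch, again assuming the second ID in the question's data was meant to be "DB1":

```r
library(dplyr)

# Sample data from the question (assumption: the second ID is "DB1")
df <- data.frame(
  ID = c("DB1", "DB1", "DB2", "DB2", "DB3", "DB3", "DB4", "DB4", "DB4"),
  X  = c(0.04, 0.10, 0.10, 0.20, 0.02, 0.30, 0.01, 0.20, 0.30),
  Y  = c(0.34, 0.49, 0.51, 0.53, 0.48, 0.49, 0.49, 0.50, 1.0)
)

nearest <- df %>%
  group_by(ID) %>%
  # Keep the row with the smallest distance to 0.5 in each group.
  # with_ties = FALSE returns exactly one row per group even on a tie;
  # with_ties = TRUE (the default) would keep every tied row.
  slice_min(order_by = abs(Y - 0.5), n = 1, with_ties = FALSE) %>%
  ungroup()

print(nearest)
```

Switching with_ties lets you choose between the "keep all tied rows" behaviour of the == version and the "keep one row" behaviour of which.min.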

08-14 23:37