Problem description
This is a small example. In my larger dataset, I have multiple years of data, and the number of observations per group (div) is not always equal.
Example data:
set.seed(1)
df <- data.frame(
  year = 2014,
  id = sample(LETTERS[1:26], 12),
  div = rep(c("1", "2a", "2b"), each = 4),
  pts = c(9, 7, 9, 3, 7, 5, 3, 7, 2, 7, 7, 1),
  x = c(10, 12, 11, 7, 7, 5, 4, 12, 4, 6, 7, 2)
)
df
# year id div pts x
#1 2014 G 1 9 10
#2 2014 J 1 7 12
#3 2014 N 1 9 11
#4 2014 U 1 3 7
#5 2014 E 2a 7 7
#6 2014 S 2a 5 5
#7 2014 W 2a 3 4
#8 2014 M 2a 7 12
#9 2014 L 2b 2 4
#10 2014 B 2b 7 6
#11 2014 D 2b 7 7
#12 2014 C 2b 1 2
I want to rank this data so that individuals in div 1 rank above those in div 2a/2b. Within div 1, individuals are ranked 1, 2, 3, 4 by highest 'pts', with ties broken by highest 'x'.
Individuals in div 2a and div 2b should each be ranked separately using the same criteria. This would look like this:
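library(dplyr)
df %>%
  # arrange() ignores grouping by default, so sort by div explicitly
  arrange(div, desc(pts), desc(x)) %>%
  group_by(div) %>%
  # row_number() with no argument numbers rows in their current order
  mutate(position = row_number())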
# year id div pts x position
#1 2014 N 1 9 11 1
#2 2014 G 1 9 10 2
#3 2014 J 1 7 12 3
#4 2014 U 1 3 7 4
#5 2014 M 2a 7 12 1
#6 2014 E 2a 7 7 2
#7 2014 S 2a 5 5 3
#8 2014 W 2a 3 4 4
#9 2014 D 2b 7 7 1
#10 2014 B 2b 7 6 2
#11 2014 L 2b 2 4 3
#12 2014 C 2b 1 2 4
However, I want to produce a final column/variable that is another rank. It should rank all individuals in div 1 above those in 2a/2b, while treating 2a and 2b as equal: individuals ranked 1 in 2a/2b should now get 5.5, individuals ranked 2 should get 7.5, and so on. There are always an equal number of individuals in div 2a and div 2b in every year.
It should look like this:
# year id div pts x position final
#1 2014 N 1 9 11 1 1.0
#2 2014 G 1 9 10 2 2.0
#3 2014 J 1 7 12 3 3.0
#4 2014 U 1 3 7 4 4.0
#5 2014 M 2a 7 12 1 5.5
#6 2014 E 2a 7 7 2 7.5
#7 2014 S 2a 5 5 3 9.5
#8 2014 W 2a 3 4 4 11.5
#9 2014 D 2b 7 7 1 5.5
#10 2014 B 2b 7 6 2 7.5
#11 2014 L 2b 2 4 3 9.5
#12 2014 C 2b 1 2 4 11.5
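The target values are the averages of the row slots that the tied individuals would occupy. With n1 individuals in div 1 and exactly two tied sub-divisions, position p in 2a/2b occupies slots n1 + 2(p - 1) + 1 and n1 + 2(p - 1) + 2, whose mean is n1 + 2p - 0.5. A minimal sketch of that arithmetic (the names n1, p and final_div2 are illustrative only):

```r
# Assumes exactly two tied sub-divisions (2a and 2b) of equal size.
n1 <- 4     # number of individuals in div 1 (from the example data)
p  <- 1:4   # positions within div 2a / 2b

# Mean of the two slots shared by the individuals at position p
final_div2 <- n1 + 2 * p - 0.5
final_div2
# [1]  5.5  7.5  9.5 11.5
```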
Ideally, I need a dplyr solution. It also needs to generalize to years where the number of individuals in div 1 varies and the number of individuals in div 2a/2b varies (although length(div2a) == length(div2b) always holds).
Recommended answer
This is how I'd do it:
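library(data.table)
dt = as.data.table(df)
# Rank within each division by pts, then x (both descending)
dt[order(-pts, -x), rank.init := 1:.N, by = div]
# Strip the letter suffix so that "2a" and "2b" share the key "2"
dt[, div.clean := sub('(\\d+).*', '\\1', div)]
# Interleave the sub-divisions: all rank-1 rows, then all rank-2 rows, ...
setorder(dt, div.clean, rank.init)
# Tied rows get the mean of the row indices they occupy
dt[, rank.final := mean(.I), by = .(div.clean, rank.init)]
# Restore the display order
setorder(dt, div, rank.final)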
# year id div pts x rank.init div.clean rank.final
# 1: 2014 N 1 9 11 1 1 1.0
# 2: 2014 G 1 9 10 2 1 2.0
# 3: 2014 J 1 7 12 3 1 3.0
# 4: 2014 U 1 3 7 4 1 4.0
# 5: 2014 M 2a 7 12 1 2 5.5
# 6: 2014 E 2a 7 7 2 2 7.5
# 7: 2014 S 2a 5 5 3 2 9.5
# 8: 2014 W 2a 3 4 4 2 11.5
# 9: 2014 D 2b 7 7 1 2 5.5
#10: 2014 B 2b 7 6 2 2 7.5
#11: 2014 L 2b 2 4 3 2 9.5
#12: 2014 C 2b 1 2 4 2 11.5