Problem description
I have a data set, shown below:
library(tibble)
df <- tribble(
~id, ~price, ~number_of_book,
"1", 10, 3,
"1", 5, 1,
"2", 7, 4,
"2", 6, 2,
"2", 3, 4,
"3", 4, 1,
"4", 5, 1,
"4", 6, 1,
"5", 1, 2,
"5", 9, 3,
)
As you can see in the data set, for id "1" there are 3 books that cost 10 dollars each and 1 book that costs 5 dollars. Basically, I want to see the share (%) of the number of books in each price bin. Here is my desired data set:
df <- tribble(
~id, ~less_than_three, ~`three-five`, ~`five-six`, ~more_than_six,
"1", "0%", "25%", "0%", "75%",
"2", "0%", "40%", "20%", "40%",
"3", "0%", "100%", "0%", "0%",
"4", "0%", "50%", "50%", "0%",
"5", "40%", "0%", "0%", "60%",
)
As a first step, I binned the prices. To do this, I ran the code below:
out <- cut(df$price, breaks = c(0, 3, 5, 6, 10),
labels = c("<3","3-5","5-6", ">6"))
out = table(out) / sum(table(out))
Unfortunately, I could not go any further because I lack the coding knowledge. Could you help me get the desired data?
Recommended answer
With dplyr, you can add a column cols, which will be used for the column names. Then you can sum the number of books for each value of cols within each id. Next, you can compute the percentage by dividing these sums by the total for that id, then applying scales::percent to format the result as a percentage rather than a decimal. Finally, you just need to call pivot_wider, giving the variables from which to get the names and values, and reorder the columns to match the original label order. (This is a little more involved than the other answer since it accounts for the case where there is more than one row for a given (id, cols/interval) pair, and janitor simplifies things.)
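library(dplyr)
library(tidyr)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")
df %>%
  # bin each price, then group by id and bin
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
  # total number of books per (id, bin)
  summarise(n = sum(number_of_book)) %>%
  group_by(id) %>%
  # share of each bin within its id, formatted as a whole percent
  mutate(pct = scales::percent(n / sum(n), accuracy = 1)) %>%
  pivot_wider(id_cols = id, names_from = cols, values_from = pct) %>%
  select(id, all_of(labels)) %>%
  ungroup()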
# # A tibble: 5 x 5
# id less_than_three three_to_five five_to_six more_than_six
# <chr> <chr> <chr> <chr> <chr>
# 1 1 NA 25% NA 75%
# 2 2 40% NA 20% 40%
# 3 3 NA 100% NA NA
# 4 4 NA 50% 50% NA
# 5 5 40% NA NA 60%
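Note that cut() uses right-closed intervals by default, so a price of exactly 3 lands in less_than_three here, whereas the desired table in the question counts it under three-five. The sketch below is one way to match the question's table, assuming the intended bins are price < 3, 3 to 5 inclusive, above 5 up to 6, and above 6. Those edges are inferred from the desired output, not stated by the asker, and the name res is only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~id, ~price, ~number_of_book,
  "1", 10, 3,
  "1", 5, 1,
  "2", 7, 4,
  "2", 6, 2,
  "2", 3, 4,
  "3", 4, 1,
  "4", 5, 1,
  "4", 6, 1,
  "5", 1, 2,
  "5", 9, 3
)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")

res <- df %>%
  # Assumed bin edges (inferred from the question's desired table):
  # conditions are checked in order, so each price gets the first match
  mutate(cols = case_when(
    price < 3  ~ labels[1],
    price <= 5 ~ labels[2],
    price <= 6 ~ labels[3],
    TRUE       ~ labels[4]
  )) %>%
  group_by(id, cols) %>%
  # total books per (id, bin); keep the grouping by id for the next step
  summarise(n = sum(number_of_book), .groups = "drop_last") %>%
  mutate(pct = scales::percent(n / sum(n), accuracy = 1)) %>%
  ungroup() %>%
  pivot_wider(id_cols = id, names_from = cols, values_from = pct,
              values_fill = list(pct = "0%")) %>%
  select(id, all_of(labels))

res
```

With these edges, id "2" gets 40% under three_to_five rather than less_than_three. If the bins are meant to be right-closed after all, the cut() version above is the simpler choice.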
If you want to replace the NAs with 0% (which I think makes sense in this context, and matches the output shown in the question), you can use the values_fill argument of pivot_wider, as mentioned in the comment below.
df %>%
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
  summarise(n = sum(number_of_book)) %>%
  group_by(id) %>%
  mutate(pct = scales::percent(n / sum(n), accuracy = 1)) %>%
  # values_fill supplies '0%' for (id, bin) pairs that have no rows
  pivot_wider(id_cols = id, names_from = cols, values_from = pct,
              values_fill = list(pct = '0%')) %>%
  select(id, all_of(labels)) %>%
  ungroup()
# # A tibble: 5 x 5
# id less_than_three three_to_five five_to_six more_than_six
# <chr> <chr> <chr> <chr> <chr>
# 1 1 0% 25% 0% 75%
# 2 2 40% 0% 20% 40%
# 3 3 0% 100% 0% 0%
# 4 4 0% 50% 50% 0%
# 5 5 40% 0% 0% 60%
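For comparison, here is a minimal base R sketch in the spirit of the table() attempt from the question. It is not part of the original answer. It uses xtabs() to sum the books per id and bin, then prop.table() to turn each row into shares. The result holds numeric shares rather than formatted strings, and the names counts and shares are only illustrative.

```r
df <- data.frame(
  id = c("1", "1", "2", "2", "2", "3", "4", "4", "5", "5"),
  price = c(10, 5, 7, 6, 3, 4, 5, 6, 1, 9),
  number_of_book = c(3, 1, 4, 2, 4, 1, 1, 1, 2, 3)
)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")

# Same right-closed bins as in the answer: (0,3], (3,5], (5,6], (6,10]
df$cols <- cut(df$price, breaks = c(0, 3, 5, 6, 10), labels = labels)

# Sum number_of_book for every (id, bin) pair; empty pairs come out as 0
counts <- xtabs(number_of_book ~ id + cols, data = df)

# margin = 1 divides each row by its row total, giving shares per id
shares <- prop.table(counts, margin = 1)

# Whole-number percentages, one row per id and one column per bin
round(100 * shares)
```

Because cols is a factor, xtabs() keeps all four bins as columns even when an id has no books in a bin, so no separate fill step is needed.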