


I have the data set shown below:

library(tibble)

df <- tribble(
  ~id,  ~price, ~number_of_book,
  "1",    10,         3,
  "1",     5,         1,
  "2",     7,         4,
  "2",     6,         2,
  "2",     3,         4,
  "3",     4,         1,
  "4",     5,         1,
  "4",     6,         1,
  "5",     1,         2,
  "5",     9,         3
)


As you can see in the data set, for id "1" there are 3 books that cost 10 dollars each and 1 book that costs 5 dollars. Basically, I want to see the share (%) of the number of books in each price bin, for every id. Here is my desired data set:

desired <- tribble(
  ~id,    ~less_than_three,   ~`three-five`,  ~`five-six`, ~more_than_six,
  "1",          "0%",              "25%",     "0%",         "75%",
  "2",          "0%",              "40%",     "20%",        "40%",
  "3",          "0%",              "100%",    "0%",         "0%",
  "4",          "0%",              "50%",     "50%",        "0%",
  "5",          "40%",             "0%",      "0%",         "60%"
)


I started by binning the prices. To do this, I ran the code below:

out <- cut(df$price, breaks = c(0, 3, 5, 6, 10),
           labels = c("<3","3-5","5-6", ">6"))

out = table(out) / sum(table(out))


Unfortunately, I could not get any further because of my limited coding knowledge. Could you help me get the desired data?



With dplyr, you can add a column cols, built by binning price, whose values will become the column names. Then sum the number of books for each value of cols within each id. Next, compute the percentage by dividing these sums by the total for that id, and apply scales::percent to format the result as a percentage rather than a decimal. Finally, call pivot_wider with the variables that supply the names and values, and reorder the columns to match the original label order. (This is a little more involved than the other answer because it accounts for the case where there is more than one row for a given (id, cols/interval) pair, and janitor simplifies things there.)

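library(dplyr)
library(tidyr)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")

df %>%
  # bin each price; the bin label becomes the future column name
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
  # total number of books per (id, bin)
  summarise(n = sum(number_of_book)) %>%
  group_by(id) %>%
  # share of each bin within the id, formatted as a whole percent
  mutate(pct = scales::percent(n/sum(n), 1)) %>%
  pivot_wider(id_cols = id, names_from = cols, values_from = pct) %>%
  # reorder the columns to match the label order
  select_at(c('id', labels))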

# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>
# 1 1     NA              25%           NA          75%
# 2 2     40%             NA            20%         40%
# 3 3     NA              100%          NA          NA
# 4 4     NA              50%           50%         NA
# 5 5     40%             NA            NA          60%
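
One detail worth checking in the output above is where a price that sits exactly on a break ends up. The sketch below uses only base R and the prices from the question; the variable names and the choice of Inf as the top break are assumptions made for illustration. By default cut() builds right-closed intervals, so a price of exactly 3 lands in the first bin, whereas the desired table in the question appears to count it under three to five.

```r
# Prices from the question, binned with the same breaks as above.
price  <- c(10, 5, 7, 6, 3, 4, 5, 6, 1, 9)
breaks <- c(0, 3, 5, 6, 10)

# Without labels, cut() shows the intervals it actually uses:
# (0,3], (3,5], (5,6], (6,10]
bins <- cut(price, breaks = breaks)
print(data.frame(price, bin = bins))

# right = FALSE switches to left-closed intervals: [0,3), [3,5), [5,6), [6,Inf).
# The top break is widened to Inf (an arbitrary choice for this sketch),
# because a price of 10 would otherwise fall outside [6,10) and become NA.
bins_left <- cut(price, breaks = c(0, 3, 5, 6, Inf), right = FALSE)
print(data.frame(price, bin = bins_left))
```

Note that the left-closed variant also moves a price of exactly 6 into the top bin, so neither convention reproduces every cell of the desired table; which one is appropriate depends on how the bins are meant to be defined.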


If you want to replace the NAs with 0% (which I think makes sense in this context, and matches the output shown in the question), you can pass values_fill to pivot_wider, the method mentioned in the comment below.

df %>%
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>%
  summarise(n = sum(number_of_book)) %>%
  group_by(id) %>%
  mutate(pct = scales::percent(n/sum(n), 1)) %>%
  pivot_wider(id_cols = id, names_from = cols, values_from = pct,
              values_fill = list(pct = '0%')) %>%
  select_at(c('id', labels))

# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>
# 1 1     0%              25%           0%          75%
# 2 2     40%             0%            20%         40%
# 3 3     0%              100%          0%          0%
# 4 4     0%              50%           50%         0%
# 5 5     40%             0%            0%          60%
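
For comparison, here is a base R sketch that stays closer to the table() attempt in the question. It is not the method used above, just an alternative under the same bin definitions: xtabs() sums number_of_book for each (id, bin) pair, and prop.table() with margin = 1 turns each row into shares. The data frame is rebuilt with data.frame() so the block runs on its own; the names bin, counts and shares are made up for this example.

```r
# Same data as in the question, rebuilt with base R only.
df <- data.frame(
  id             = c("1", "1", "2", "2", "2", "3", "4", "4", "5", "5"),
  price          = c(10, 5, 7, 6, 3, 4, 5, 6, 1, 9),
  number_of_book = c(3, 1, 4, 2, 4, 1, 1, 1, 2, 3)
)

labels <- c("less_than_three", "three_to_five", "five_to_six", "more_than_six")
df$bin <- cut(df$price, breaks = c(0, 3, 5, 6, 10), labels = labels)

# Sum the number of books per (id, bin); empty combinations become 0.
counts <- xtabs(number_of_book ~ id + bin, data = df)

# Divide each row by its total to get the share per id.
shares <- prop.table(counts, margin = 1)

# Show the shares as whole percentages.
print(round(100 * shares))
```

Because bin is a factor, every level gets a column even when an id has no books in it, so no separate fill step is needed here.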

