r - How to calculate the highest years of education of the older generation in a family

Suppose I have a data frame like this:

   family relationship meanings              edu
 1      1 A            respondent             12
 2      1 B            respondent's spouse    18
 3      1 C            A's father             10
 4      1 D            A's mother              9
 5      1 E1           A's first son          15
 6      1 F1           E1's spouse            14
 7      1 G11          E1's first son          3
 8      1 G12          E1's second son         1
 9      1 E2           A's second son         13
10      2 A            respondent             21
11      2 B            respondent's spouse    16
12      2 C            A's father             12
13      2 D            A's mother             16
14      2 E1           A's first son          18
15      2 F1           E1's spouse            15
16      2 E2           A's second son         17
17      2 E3           A's third son          16

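family is the family ID. relationship is each person's relationship code within the family. meanings describes what the code in the relationship column means.

For each person, I want to calculate the highest education level in their father's generation within the same family.
We don't need the spouses' information.

The expected result is as follows: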

   family id      edu fedu
 1      1 A        12 10
 2      1 C        10 NA
 3      1 E1       15 18
 4      1 E2       13 18
 5      1 G11       3 15
 6      1 G12       1 15
 7      2 A        21 16
 8      2 C        12 NA
 9      2 E1       18 21
10      2 E2       17 21
11      2 E3       16 21

The data:

 d = structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1", "G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"), meanings = c("respondent", "respondent's spouse", "A's father","A's mother", "A's first son", "E1's spouse", "E1's first son","E1's second son", "A's second son", "respondent", "respondent's spouse","A's father", "A's mother", "A's first son", "E1's spouse", "A's second son","A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,16, 12, 16, 18, 15, 17, 16)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"))

Best answer

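Here is what I tried. I think it is necessary to create a generation variable. Looking at the sample image in the original question, C and D are the first generation, A and B are the second, E and F are the third, and G is the fourth. The first mutate() with case_when() creates the generation variable. I then grouped the data by family and generation and, for each group, found the longest education duration (max_ed_duration). Because you said you don't need the spouses' information, I removed the rows whose meanings contain "mother" or "spouse". Note that this happens after the maximum is computed, so spouses and mothers still count toward their generation's maximum.

Next, I grouped the data by family again. Within each family, if generation is 1, fedu is set to NA; otherwise fedu takes the max_ed_duration of the previous generation. Finally, I sorted the data by family and relationship.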

library(dplyr)

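# mydf is defined in the Data section below.
# Step 1: assign a generation (C/D = 1, A/B = 2, E*/F* = 3, G* = 4).
# "^[EF]" anchors both letters; the original "^E|F" matched an F anywhere.
mutate(mydf, generation = case_when(relationship %in% c("C", "D") ~ 1,
                                    relationship %in% c("A", "B") ~ 2,
                                    grepl(x = relationship, pattern = "^[EF]") ~ 3,
                                    grepl(x = relationship, pattern = "^G") ~ 4)) %>%
  # Step 2: longest education per family and generation (spouses still count here).
  group_by(family, generation) %>%
  mutate(max_ed_duration = max(edu)) %>%
  # Step 3: drop the mother and spouse rows, which are not wanted in the output.
  filter(!grepl(x = meanings, pattern = "mother|spouse")) %>%
  # Step 4: within each family, look up the previous generation's maximum.
  group_by(family) %>%
  mutate(fedu = if_else(generation == 1,
                        NA_real_,
                        max_ed_duration[match(x = generation - 1, table = generation)])) %>%
  arrange(family, relationship)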

#   family relationship meanings          edu generation max_ed_duration  fedu
#    <dbl> <chr>        <chr>           <dbl>      <dbl>           <dbl> <dbl>
# 1      1 A            respondent         12          2              18    10
# 2      1 C            A's father         10          1              10    NA
# 3      1 E1           A's first son      15          3              15    18
# 4      1 E2           A's second son     13          3              15    18
# 5      1 G11          E1's first son      3          4               3    15
# 6      1 G12          E1's second son     1          4               3    15
# 7      2 A            respondent         21          2              21    16
# 8      2 C            A's father         12          1              16    NA
# 9      2 E1           A's first son      18          3              18    21
#10      2 E2           A's second son     17          3              18    21
#11      2 E3           A's third son      16          3              18    21
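
The lookup inside the last mutate() is the least obvious step, so here is a minimal base R sketch of it in isolation. The two vectors are toy copies of family 1's generation and max_ed_duration columns after the filter (rows A, C, E1, G11, G12, E2, in their original order); the names prev_idx and fedu are only illustrative.

```r
# Toy vectors for family 1 after mothers and spouses are dropped.
generation      <- c(2, 1, 3, 4, 4, 3)
max_ed_duration <- c(18, 10, 15, 3, 3, 15)

# For each row, the position of the first row in the previous generation.
# Generation 1 looks for generation 0, which does not exist, so match() gives NA.
prev_idx <- match(generation - 1, generation)

# Indexing with NA yields NA, so generation 1 ends up with a missing fedu.
fedu <- max_ed_duration[prev_idx]
print(fedu)
```

Under these assumptions, match() already returns NA for the first generation, so the if_else() in the answer mainly serves to make that case explicit.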

Data

mydf <- structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1",
"G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"
), meanings = c("respondent", "respondent's spouse", "A's father",
"A's mother", "A's first son", "E1's spouse", "E1's first son",
"E1's second son", "A's second son", "respondent", "respondent's spouse",
"A's father", "A's mother", "A's first son", "E1's spouse", "A's second son",
"A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,
16, 12, 16, 18, 15, 17, 16)), class = "data.frame", row.names = c(NA,
-17L))
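
For readers who prefer not to depend on dplyr, the same four steps could be sketched in base R roughly as follows. This is not the answer's method, only an assumed equivalent: the data frame df rebuilds the question's data without the meanings column, spouses and mothers are identified by their relationship codes (B, D, F*) rather than by the meanings text, and the names gen_lookup, gen_max, and keep are illustrative.

```r
# Data from the question, rebuilt compactly (meanings column omitted).
df <- data.frame(
  family = rep(c(1, 2), times = c(9, 8)),
  relationship = c("A", "B", "C", "D", "E1", "F1", "G11", "G12", "E2",
                   "A", "B", "C", "D", "E1", "F1", "E2", "E3"),
  edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21, 16, 12, 16, 18, 15, 17, 16)
)

# Step 1: generation from the first letter (C/D = 1, A/B = 2, E/F = 3, G = 4).
gen_lookup <- c(C = 1, D = 1, A = 2, B = 2, E = 3, F = 3, G = 4)
df$generation <- unname(gen_lookup[substr(df$relationship, 1, 1)])

# Step 2: maximum education per family and generation (spouses still included).
gen_max <- aggregate(edu ~ family + generation, data = df, FUN = max)

# Step 3: drop respondents' spouses (B), mothers (D), and children's spouses (F*).
keep <- df[!grepl("^[BDF]", df$relationship), ]

# Step 4: look up the previous generation's maximum; no match yields NA.
key_prev <- paste(keep$family, keep$generation - 1)
key_max  <- paste(gen_max$family, gen_max$generation)
keep$fedu <- gen_max$edu[match(key_prev, key_max)]

# Sort as in the answer and show the columns of the expected result.
keep <- keep[order(keep$family, keep$relationship), ]
print(keep[, c("family", "relationship", "edu", "fedu")])
```

Keying the lookup on both family and generation avoids the grouped mutate(), at the cost of building string keys with paste().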

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/59466424/