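I have a database consisting of:

  • home neighborhood ID (id_h),
  • home block ID (blk_h, a sub-geography of the neighborhood),
  • work block (blk_w),
  • the commute flow between the two (Flow),
  • the median number of commuters for each home neighborhood (Med_C), and
  • the cumulative flow of workers by home neighborhood (CumFlow).

The data are sorted by the distance between blk_h and blk_w (descending) and grouped by id_h. I need to subset the data to extract, for each home neighborhood, the case where CumFlow FIRST equals or exceeds Med_C.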

I have tried various dplyr functions but could not get this to work. Here is an example:
    df <- data.frame(
      id_h=c("A","A","A","A","B","B","B"),
      blk_h=c("A1","A1","A2","A2","B1","B2","B2"),
      blk_w=c("W1","W2","W3","W3","W1","W2","W2"),
      dist=c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
      Flow=c(3,6,3,7,5,4,2),
      CumFlow=c(3,9,12,19,5,9,11),
      Med_C=c(10,10,10,10,6,6,6)
    )
    df
    

I need this to return a table like this:
    id_h  blk_h  blk_w  dist  Flow  CumFlow  Med_C
    A     A2     W3     7.0   3     12       10
    B     B2     W2     6.5   4     9        6
    

Here are some of my attempts to accomplish this:

Attempt #1
    library(dplyr)
    df.g <- group_by(df, id_h)
    df.g2 <- filter(df.g, CumFlow == which.min(CumFlow >= Med_C))
    

Attempt #2
    library(data.table)
    setDT(df)[, .SD[which.min(CumCount >= Med_C)], by = id_h]
    

Attempt #3
    library(dplyr)
    test <- df %>% group_by(id_h) %>% filter(min(CumFlow) >= Med_C)
    

I think I am misunderstanding how to use the which.min function. Any advice is greatly appreciated.
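As a side note on which.min, here is a minimal base R sketch of what it returns for a logical vector, using the group A values from the example above. The variable names (reached, first_false, first_true, none_reached) are illustrative and not part of the original post. It suggests that Attempt #1 compares CumFlow values against a row position rather than against another CumFlow value.

```r
# Logical vector like the one CumFlow >= Med_C produces for group A
reached <- c(3, 9, 12, 19) >= 10   # FALSE FALSE TRUE TRUE

# which.min() treats FALSE as 0 and TRUE as 1,
# so it returns the position of the first FALSE
first_false <- which.min(reached)

# which.max() returns the position of the first TRUE instead
first_true <- which.max(reached)

# Caveat: when no element is TRUE, which.max() still returns 1
none_reached <- which.max(c(FALSE, FALSE))

print(c(first_false, first_true, none_reached))
```

So which.max is probably the closer fit for "first row where the condition holds", provided the case where no row qualifies is handled separately.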

Best answer

Two filter calls will take care of this.

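Working within each id_h by using group_by, the first filter returns a data.frame with all rows where CumFlow is greater than or equal to Med_C. The second filter returns the row with the lowest CumFlow within each id_h. This only works because the data are sorted. To make things more robust, you might consider adding a call to arrange after the call to group_by.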

    library(dplyr)
    
    df <- data.frame(
      id_h    = c("A","A","A","A","B","B","B"),
      blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
      blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
      dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
      Flow    = c(3,6,3,7,5,4,2),
      CumFlow = c(3,9,12,19,5,9,11),
      Med_C   = c(10,10,10,10,6,6,6)
    )
    df
    
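    df %>%
      group_by(id_h) %>%
      # keep only rows where the cumulative flow has reached the median
      filter(CumFlow >= Med_C) %>%
      # of those, keep the row with the smallest cumulative flow per id_h
      filter(CumFlow == min(CumFlow))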
    
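The more robust variant mentioned above could be sketched as follows. This is only one possible reading of that suggestion, not code from the original answer: it assumes a dplyr version in which arrange() accepts .by_group = TRUE, sorts by dist in the same order as the example data, and uses slice(1) to take the first qualifying row. The names shuffled and result are made up for illustration.

```r
library(dplyr)

df <- data.frame(
  id_h    = c("A","A","A","A","B","B","B"),
  blk_h   = c("A1","A1","A2","A2","B1","B2","B2"),
  blk_w   = c("W1","W2","W3","W3","W1","W2","W2"),
  dist    = c(4.3,5.6,7.0,8.7,5.2,6.5,6.8),
  Flow    = c(3,6,3,7,5,4,2),
  CumFlow = c(3,9,12,19,5,9,11),
  Med_C   = c(10,10,10,10,6,6,6)
)

# Deliberately scramble the row order to show that sorting is restored
shuffled <- df[c(4, 2, 7, 1, 6, 3, 5), ]

result <- shuffled %>%
  group_by(id_h) %>%
  # sort by distance within each id_h, as in the example data
  arrange(dist, .by_group = TRUE) %>%
  # keep rows where the cumulative flow has reached the median
  filter(CumFlow >= Med_C) %>%
  # take the first such row per id_h
  slice(1) %>%
  ungroup()

print(result)
```

Using slice(1) after an explicit sort returns exactly one row per group even if two rows share the same CumFlow, whereas filtering on min(CumFlow) would return all tied rows.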

Regarding "r - Filter within groups where x first exceeds y", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38837909/
