Canonical tidyverse method to update some values of a vector from a look-up table

Problem description

I frequently need to recode some (not all!) values in a data frame column based off of a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I want to make sure I'm not duplicating something standard that's already out there.

## Toy example
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)

lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)

## desired result
#   id  x
# 1  1  a
# 2  2  a
# 3  3  B
# 4  4  C
# 5  5  d
# 6  6 AA
# 7  7  !

I can do it with a join, coalesce, unselect as below, but this isn't as clear as I'd like - too many steps.

## This works, but is more steps than I want
library(dplyr)
data %>%
  left_join(lookup, by = c("x" = "old")) %>%
  mutate(x = coalesce(new, x)) %>%
  select(-new)

It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named vector solution. My concern here is that recode is in the Questioning lifecycle phase, so I'm worried that this method isn't stable.

lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
  mutate(x = recode(x, !!!lookup_v))

It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose forcats::fct_recode is a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).

I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)

Summary of my requirements:

A single data column is updated based off of a lookup data frame (preferably) or named vector (allowable)
Not all values in the data are included in the lookup--the ones that are not present are not modified
Must work on character class input. Working more generally is a nice-to-have.
No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
Concise, clear code
I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
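
For comparison, the requirements above can be met with base R alone by using match() for exact, whole-value lookup. This is only a minimal sketch, not a standard tidyverse function: the helper name recode_from_lookup and its arguments are hypothetical, and it assumes the values in old are unique.

```r
# Hypothetical helper (not part of any package): replace the values of `x`
# that appear in `old` with the corresponding entries of `new`;
# values with no match are left unchanged.
recode_from_lookup <- function(x, old, new) {
  idx <- match(x, old)      # position of each element of x in `old`, NA if absent
  hit <- !is.na(idx)        # TRUE where a replacement exists
  x[hit] <- new[idx[hit]]   # overwrite only the matched elements
  x
}

# Toy data from the question (stringsAsFactors = FALSE keeps x as character
# on R versions before 4.0 as well)
data <- data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", "."),
  stringsAsFactors = FALSE
)

lookup <- data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!"),
  stringsAsFactors = FALSE
)

data$x <- recode_from_lookup(data$x, lookup$old, lookup$new)
data
```

Because it operates on plain vectors, the same helper could also be called inside mutate(), for example mutate(x = recode_from_lookup(x, lookup$old, lookup$new)). No regex is involved, and "AA" is untouched because match() compares whole values.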
