Problem description
I have a very large dataset in which some columns are formatted as currency, some are numeric, and some are character. When the data is read in, all currency columns are identified as factors, and I need to convert them to numeric. The dataset is too wide to identify the columns manually. I am trying to find a programmatic way to determine whether a column contains currency data (e.g. values starting with '$') and then pass that list of columns to be cleaned.
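name <- c('john', 'carl', 'hank')
salary <- c('$23,456.33', '$45,677.43', '$76,234.88')
emp_data <- data.frame(name, salary)
# Strip everything except letters, digits, and the decimal point, then convert to numeric
clean <- function(ttt) {
  as.numeric(gsub('[^a-zA-Z0-9.]', '', ttt))
}
# Applied to every column, so the name column is coerced to NA
sapply(emp_data, clean)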
The issue in this example is that sapply applies clean to every column, so the name column is replaced with NA. I need a way to programmatically identify just the columns that the clean function should be applied to (in this example, salary).
Recommended answer
Using the dplyr and stringr packages, you can use mutate_if to identify columns that contain any string starting with a $ and then convert them accordingly.
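library(dplyr)
library(stringr)
emp_data %>%
  # Select columns in which at least one value starts with "$"
  mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
            # Drop "$" and thousands separators, then convert to numeric
            ~as.numeric(str_replace_all(., '[$,]', '')))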
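If you would rather not depend on dplyr and stringr, the same idea can be sketched in base R: first build the list of currency columns, then clean only those. This is a minimal sketch, not the answer's method. The helper names is_currency and currency_cols are made up for illustration, and stringsAsFactors = TRUE is an assumption used here to reproduce the factor columns described in the question.

```r
# Example data; stringsAsFactors = TRUE is assumed so that the string
# columns are factors, as described in the question.
emp_data <- data.frame(
  name   = c("john", "carl", "hank"),
  salary = c("$23,456.33", "$45,677.43", "$76,234.88"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: TRUE if at least one value in the column starts with "$".
# as.character() makes this work for factor, character, and numeric columns.
is_currency <- function(x) {
  any(grepl("^\\$", as.character(x)), na.rm = TRUE)
}

# Remove "$" and thousands separators, then convert to numeric.
clean <- function(x) {
  as.numeric(gsub("[$,]", "", as.character(x)))
}

# Names of the columns that look like currency.
currency_cols <- names(emp_data)[vapply(emp_data, is_currency, logical(1))]

# Clean only those columns; all other columns are left untouched.
emp_data[currency_cols] <- lapply(emp_data[currency_cols], clean)

str(emp_data)
```

Keeping currency_cols as a separate vector gives you the list of detected columns to inspect before anything is overwritten, which can be useful on a very wide dataset.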