Problem description
I am relatively new to R and I am attempting to write my first multi-step function. Essentially, I want to create a function that takes a directory and searches within that directory to find a certain column (in this case, pollutant). Then find the mean value of that column and remove the NAs. This is what I have so far:
pollutantmean <- function(directory, pollutant, min_id = 1, max_id = 332) {
  setwd(directory)
  dirdata <- list.files(path = getwd(), pattern = '*.csv', full.names = TRUE) %>% lapply(read_csv) %>% bind_rows
  specdata <- dirdata %>% filter(between(ID, min_id, max_id))
  polspecdata <- specdata %>% select(pollutant)
  polspecdatamean <- polspecdata %>% summarize(mean_pollutant = mean(pollutant, na.rm = TRUE))
}
I feel that I am so close, but the result is a warning rather than a number: Warning message: In mean.default(pollutant, na.rm = TRUE) : argument is not numeric or logical: returning NA. I believe the problem is due to the column class being col_double. This may be because dirdata is created from multiple csv files. Any help would be greatly appreciated. Thank you!
This is the data: zipfile_data
The code in the original post fails because it uses dplyr within a function but does not use dplyr's quoting functions. Running the code through the RStudio debugger and stopping at line 7 shows the problem.
dplyr does not render the function argument within mean(pollutant, na.rm = TRUE) as expected, so the summarize() call on line 9 fails. The mean() function fails because the pollutant argument renders as a text string, not as a column in the polspecdata data frame.
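The behaviour can be reproduced on a small made-up data frame. This is only a sketch: the toy data and its values are assumptions for illustration, not the course files. Because no column is literally named pollutant, summarize() falls back to the variable of that name outside the data frame, which holds a character string:

```r
library(dplyr)

# Toy data standing in for the assignment files (values are made up)
toy <- data.frame(ID = c(1, 1, 2), sulfate = c(2.5, NA, 4.5))

pollutant <- "sulfate"  # what the function argument would hold

# `pollutant` is not a column of `toy`, so summarize() picks up the string
# "sulfate" from the calling environment and hands it to mean()
res <- suppressWarnings(
  toy %>% summarize(mean_pollutant = mean(pollutant, na.rm = TRUE))
)

res$mean_pollutant  # NA (the "argument is not numeric or logical" warning is suppressed here)
```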
One way to fix the error is to adjust line 9 so that it explicitly references the data frame passed in through the %>% pipe operator via dplyr's .data pronoun, using the [[ form of the extract operator to look up the column by the string held in the argument.
polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(.data[[pollutant]],na.rm=TRUE))
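As a self-contained check of this fix, here is a minimal sketch on made-up data. The helper name mean_of and the toy values are hypothetical and not part of the original post:

```r
library(dplyr)

# Toy data standing in for the assignment files (values are made up)
toy <- data.frame(ID = c(1, 1, 2), sulfate = c(2.5, NA, 4.5))

# Hypothetical helper: `pollutant` is a column name supplied as a string.
# .data[[pollutant]] looks the column up in the piped data frame by name.
mean_of <- function(df, pollutant) {
  df %>% summarize(mean_pollutant = mean(.data[[pollutant]], na.rm = TRUE))
}

out <- mean_of(toy, "sulfate")
out$mean_pollutant  # 3.5
```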
Finally, the last expression in the function is an assignment, so its value is returned only invisibly. To return the mean visibly to the parent environment, we add the object created in line 9 as the last line of the function.
polspecdatamean
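As background on why that extra line matters: in R, a function whose last expression is an assignment still returns the value, but invisibly, so nothing is printed at the console. A small base R sketch (the function names f_assign and f_value are made up for illustration):

```r
# Last expression is an assignment: the value is returned invisibly
f_assign <- function(x) {
  result <- mean(x, na.rm = TRUE)
}

# Last expression is the object itself: the value is returned visibly
f_value <- function(x) {
  result <- mean(x, na.rm = TRUE)
  result
}

vis_assign <- withVisible(f_assign(c(1, 2, NA)))
vis_value  <- withVisible(f_value(c(1, 2, NA)))

vis_assign$visible  # FALSE
vis_value$visible   # TRUE
vis_assign$value    # 1.5 (still returned, just not auto-printed)
```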
Since this is a programming assignment for the Johns Hopkins University R Programming course on Coursera, I won't post a complete answer because that violates the Coursera Honor Code.
Simplifying the solution
Once the data has been filtered in line 5, the function could simply return the mean as follows.
mean(specdata[[pollutant]],na.rm=TRUE)
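The base R form works because [[ evaluates its argument, whereas $ takes the name literally. A brief sketch on made-up data (the toy data frame and its values are assumptions for illustration):

```r
# Toy data standing in for the filtered specdata (values are made up)
toy <- data.frame(ID = c(1, 1, 2),
                  sulfate = c(2.5, NA, 4.5),
                  nitrate = c(1.0, 3.0, NA))

pollutant <- "nitrate"

# [[ ]] uses the string stored in `pollutant` to pick the column
m <- mean(toy[[pollutant]], na.rm = TRUE)
m  # 2

# $ looks for a column literally called "pollutant", which does not exist
is.null(toy$pollutant)  # TRUE
```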
Conclusions
For this particular assignment, using dplyr makes the task more difficult than it needs to be: dplyr relies on non-standard evaluation, and it isn't even covered in the JHU curriculum until the third course in the sequence.
The code has some other subtle defects whose correction we'll leave as an exercise for the reader. For example, given the assignment requirements, the function should be able to handle the following inputs:
pollutantmean("specdata","sulfate",23) # calc mean for sensor 23
pollutantmean("specdata","nitrate",70:72) # calc mean for sensors 70 - 72
pollutantmean("specdata","sulfate",c(3,5,7,9)) # calc mean for sensors 3, 5, 7, and 9
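As a hint toward one of those defects rather than a solution: in the calls above the third argument is a vector of sensor IDs, while the posted function takes a minimum and a maximum. The following base R sketch, using a made-up ID vector, shows how the two kinds of selection differ:

```r
# Made-up sensor IDs for illustration
ids <- 1:10

# A min/max range keeps every ID between the endpoints
in_range <- ids[ids >= 3 & ids <= 9]

# %in% keeps only the IDs that were actually requested
requested <- ids[ids %in% c(3, 5, 7, 9)]

in_range   # 3 4 5 6 7 8 9
requested  # 3 5 7 9
```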