Subsetting with data.table using inequality conditions

Problem description

I have a data table with 400k rows, and subsetting it is very slow.

Here is a sample data frame:

                     date   name value size car1 car2
    1 2015-01-01 07:44:00    bob     1    5    A    D
    2 2015-02-02 09:46:00 george   522    2    B    F

Currently I subset it the slow way, using subset():

    main <- data.frame(
      date  = as.POSIXct(c("2015-01-01 07:44:00", "2015-02-02 09:46:00"), tz = "GMT"),
      name  = c("bob", "george"),
      value = c(1, 522),
      size  = c(5, 2),
      car1  = c("A", "B"),
      car2  = c("D", "F")
    )
    main$date

    subset(main,
           size > 1 & value == 522 & name == "george" &
             date >= as.POSIXct("2015-01-01 03:44:00", tz = "GMT") &
             date >= as.POSIXct("2015-01-01 08:44:00", tz = "GMT") &
             (car1 == "F" | car2 == "F"))

                     date   name value size car1 car2
    2 2015-02-02 09:46:00 george   522    2    B    F

This works and returns 1 row, but it is very slow.

Judging by some responses to another question, data.table looks to be much faster, so I would like to use it to do the same thing as above. I have a number of questions, though. Here is what I have so far:

    library(data.table)
    mdt <- as.data.table(main)
    setkey(mdt, date, name, value, size, car1, car2)
    mdt[.(as.POSIXct("2015-01-01 03:44:00"), "george", 522, 2, "F", "F")]

This returns:

                      date   name value size car1 car2
    1: 2015-01-01 03:44:00 george   522    2   NA    F

Here are my questions:

(1) I want a criterion of the form date >= and date <=. Is this possible using data.table? If not, any ideas on how to make the subsetting faster?

(2) I want a criterion of the form (car1 == "F" | car2 == "F"). Is this possible? If not, any ideas on how to make the subsetting faster?

(3) The output of mdt[] contains the date 2015-01-01 03:44:00, but this date is not in the original "main" data frame. What is happening here?

(4) The output of mdt[] shows a car1 value of NA, although car1 is not NA in the original "main" data frame. What is happening here?

Thank you.

Solution

Sure, you just put the criteria in the i expression.

    setDT(main)
    main[size > 1 & value == 522 & name == "george" &
           date >= as.POSIXct("2015-01-01 03:44:00", tz = "GMT") &
           date >= as.POSIXct("2015-01-01 08:44:00", tz = "GMT") &
           (car1 == "F" | car2 == "F"), ]

Result:

                      date   name value size car1 car2
    1: 2015-02-02 09:46:00 george   522    2    B    F

So, is that faster than subset? Yup.

    library(data.table)
    library(ggplot2)
    library(reshape2)

    set.seed(1)

    # Build n random rows, then time the same filter with data.table and subset()
    cf <- function(n) {
      main <- data.frame(
        date  = as.POSIXct(Sys.Date() + runif(n, 0, 100)),
        name  = sample(c("bob", "george"), n, replace = TRUE),
        value = round(runif(n, 400, 600), 0),
        size  = sample(1:5, n, replace = TRUE),
        car1  = sample(LETTERS[1:6], n, replace = TRUE),
        car2  = sample(LETTERS[1:6], n, replace = TRUE),
        stringsAsFactors = FALSE
      )
      mdt <- data.table(main)
      setkey(mdt, date, name, value, size, car1, car2)

      # Time the data.table filter; force seconds so all timings share one unit
      pre <- Sys.time()
      mdt[size > 1 & value > 100 & name == "george" &
            date >= as.POSIXct(Sys.Date()) &
            date <= as.POSIXct(Sys.Date() + 50) &
            (car1 == "F" | car2 == "F"), ]
      dt_time <- as.numeric(difftime(Sys.time(), pre, units = "secs"))

      # Time the equivalent base R subset()
      pre <- Sys.time()
      subset(main,
             size > 1 & value > 100 & name == "george" &
               date >= as.POSIXct(Sys.Date()) &
               date <= as.POSIXct(Sys.Date() + 50) &
               (car1 == "F" | car2 == "F"))
      subset_time <- as.numeric(difftime(Sys.time(), pre, units = "secs"))

      c(n = n, dt_time = dt_time, subset_time = subset_time)
    }

    result <- sapply(10^(2:7), cf)
    result <- melt(data.frame(t(result)), id.vars = "n")
    ggplot(result, aes(x = n, y = value, color = variable)) +
      geom_point() + geom_line() + theme_bw() + scale_x_log10()
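For questions (1) and (2), here is a minimal sketch of a two-sided date range combined with an OR across two columns, all inside the i expression. The bounds `lo` and `hi` are hypothetical values chosen only for illustration, and `between()` is shown as data.table's shorthand for the same closed-interval test; treat this as one possible way to write it rather than the answerer's own code.

```r
library(data.table)

# Sample data mirroring the two rows from the question
main <- data.table(
  date  = as.POSIXct(c("2015-01-01 07:44:00", "2015-02-02 09:46:00"), tz = "GMT"),
  name  = c("bob", "george"),
  value = c(1, 522),
  size  = c(5, 2),
  car1  = c("A", "B"),
  car2  = c("D", "F")
)

# Hypothetical window bounds (assumptions, not taken from the question)
lo <- as.POSIXct("2015-01-01 03:44:00", tz = "GMT")
hi <- as.POSIXct("2015-03-01 00:00:00", tz = "GMT")

# Range test (date >= lo & date <= hi) plus an OR over car1 / car2
res <- main[date >= lo & date <= hi & (car1 == "F" | car2 == "F")]

# between() expresses the same closed interval more compactly
res_between <- main[between(date, lo, hi) & (car1 == "F" | car2 == "F")]

print(res)
```

Both forms are plain logical filters, so they can be combined with any other conditions using & and |.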
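The answer above does not address question (3). One plausible reading, offered here as an assumption rather than a confirmed diagnosis, is that `mdt[.(...)]` on a keyed table is a join: each row supplied in i comes back in the result, with NA in the non-key columns when no match exists, so looked-up values can appear even though they are not in the data. The small table `dt` below is hypothetical and only illustrates that default behaviour and how `nomatch = NULL` changes it.

```r
library(data.table)

# Hypothetical two-row table keyed on a single column
dt <- data.table(
  name  = c("bob", "george"),
  value = c(1, 522)
)
setkey(dt, name)

# Keyed lookup of a name that is not in the table:
# the looked-up key is returned, and non-key columns are filled with NA
miss <- dt[.("alice")]

# nomatch = NULL drops lookups that find no match
# (older code often spells this nomatch = 0L)
none <- dt[.("alice"), nomatch = NULL]

print(miss)
print(none)
```

If only rows that really exist are wanted, either pass `nomatch = NULL` or filter with logical conditions in i as in the solution.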