Applying a non-trivial function to ordered subsets of a data.table

## Problem

I'm trying to use my newfound data.table powers (for good) to compute the frequency content of a bunch of data that looks like this:

| Sample | Channel | Trial |    Voltage | Class | Subject |
|-------:|--------:|------:|-----------:|:------|--------:|
|      1 |       1 |     1 | -196.82253 | 1     |       1 |
|      1 |       2 |     1 |  488.15166 | 1     |       1 |
|      1 |       3 |     1 | -311.92386 | 1     |       1 |
|      1 |       4 |     1 | -297.06078 | 1     |       1 |
|      1 |       5 |     1 | -244.95824 | 1     |       1 |
|      1 |       6 |     1 | -265.96525 | 1     |       1 |
|      1 |       7 |     1 | -258.93263 | 1     |       1 |
|      1 |       8 |     1 | -224.07819 | 1     |       1 |
|      1 |       9 |     1 |  -87.06051 | 1     |       1 |
|      1 |      10 |     1 | -183.72961 | 1     |       1 |

There are about 57 million rows. Every variable is an integer except Voltage. Sample is an index that goes from 1:350, and Channel goes from 1:118.
There are 280 Trials.

### Sample data

Martín's example data is valid, I believe (the number of levels of each categorical variable is a non-issue with respect to the errors):

    big.table <- data.table(Sample = 1:350, Channel = 1:118, Trial = letters,
                            Voltage = rnorm(10e5, -150, 100), Class = LETTERS, Subject = 1:20)

### Process

The first thing I do is set the key to Sample, because I want anything I do to the individual data series to happen in a sane order:

    setkey(big.table, Sample)

Then I filter the Voltage signals to remove low frequencies (`high.pass` is a high-pass filter). The filtering function returns a vector of the same length as its second argument:

    require(signal)
    high.pass <- cheby1(cheb1ord(Wp = 0.14, Ws = 0.0156, Rp = 0.5, Rs = 10))
    big.table[, Voltage := filtfilt(high.pass, Voltage), by = Subject]

### Initial error

I'd like to see whether that processed the data properly (i.e. Subject by Subject, Trial by Trial, Channel by Channel, in Sample order), so I add a column containing the spectral content of the Voltage column:

    get.spectrum <- function(x) {
      spec.obj <- spectrum(x, method = "ar", plot = FALSE)
      outlist <- list()
      outlist$spec <- 20 * log10(spec.obj$spec)
      outlist$freq <- spec.obj$freq
      return(outlist)
    }

    big.table[, c("Spectrum", "Frequency") := get.spectrum(Voltage), by = Subject]
    Error: cannot allocate vector of size 6.1 Gb

I think the issue is that get.spectrum() is trying to eat the whole column at once, considering that the whole table is only around 1.7 GB. Is that so?
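As a side check on the `get.spectrum()` call above, here is a minimal sketch in base R (no data.table needed) of what the function returns for a single series. The series lengths and the random test data are assumptions chosen for illustration. It shows that the number of returned values is fixed by `spectrum()` and does not follow the length of the input, so the result does not line up row for row with the rows it was computed from.

```r
set.seed(42)

# Same function as in the question: AR spectrum in dB plus the frequency axis
get.spectrum <- function(x) {
  spec.obj <- spectrum(x, method = "ar", plot = FALSE)
  list(spec = 20 * log10(spec.obj$spec), freq = spec.obj$freq)
}

# Hypothetical test series: one of 350 samples, one ten times longer
short.series <- rnorm(350, -150, 100)
long.series  <- rnorm(3500, -150, 100)

out.short <- get.spectrum(short.series)
out.long  <- get.spectrum(long.series)

# Both outputs have the same number of frequency bins
lengths.out <- c(short = length(out.short$spec), long = length(out.long$spec))
print(lengths.out)
```

Because the output length differs from the 350 samples per series, assigning the result back into the same rows with `:=` is a poor fit; returning a new, smaller table per group is the more natural shape.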
What are my options?

## What have you tried?

### Increasing the granularity of grouping

If I make a call to get.spectrum including all of the columns I want to group by, I get a more promising error:

    big.table[, c("Spectrum", "Frequency") := get.spectrum(Voltage),
              by = c("Subject", "Trial", "Channel", "Sample")]
    Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action, :
      'order.max' must be >= 1

That implies the spectrum() function I'm calling is getting data of the wrong shape.

### Cutting points down, trying different 'where' conditions

Following Roland's advice, I cut the number of points to around 20 million and tried the below:

    big.table[, "Spectrum" := get.spectrum(Voltage),
              by = c("Subject", "Trial", "Channel")]
    Error in `[.data.table`(big.table, , `:=`("Spectrum", get.spectrum(Voltage)), :
      All items in j=list(...) should be atomic vectors or lists. If you are trying
      something like j=list(.SD,newcol=mean(colA)) then use := by group instead
      (much quicker), or cbind or merge afterwards.

My thinking was that I shouldn't group by Sample, since I want to apply this function to each group of 350 Samples given by the above `by` vector.

Improving on that with some things gleaned from section 2.16 of the data.table FAQ, I added the SQL equivalent of an ORDER BY. I know that the Sample column needs to go from 1:350 for each input to the spectrum() function:

    > big.table[Sample==c(1:350), c("Spectrum","Frequency") := as.list(get.spectrum(Voltage)),
    +           by = c("Subject","Trial","Channel")]
    Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action, :
      'order.max' must be >= 1

Again, I run into trouble with non-unique inputs.
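One way to provoke a failure of this kind, sketched in base R under the assumption that a group ends up holding a single observation (which is what grouping by every index column, Sample included, would produce when each combination is unique). The helper `try.spectrum()` is hypothetical and only wraps the call so the outcome can be inspected. This is a sketch of one possible cause, not a diagnosis of the real table.

```r
set.seed(7)

# Hypothetical helper: returns "ok" on success, otherwise the error message
try.spectrum <- function(x) {
  tryCatch({
    spectrum(x, method = "ar", plot = FALSE)
    "ok"
  }, error = function(e) conditionMessage(e))
}

# A "group" of one row: there is nothing to fit an AR model to
one.point <- try.spectrum(rnorm(1))

# A full 350-sample series: the AR fit has enough data
full.series <- try.spectrum(rnorm(350))

print(one.point)
print(full.series)
```

If the single-point call is the one that fails, that points toward leaving Sample out of `by`, so that each group carries a whole series.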
## Solution

After some extended discussion with Martín Bel, who was patient enough to listen to me thrash, I was able to work out some of what was going wrong.

### Initial error

A major issue is that spectrum(), the function being called on each time-series component of the data.table, expects a 2D structure representing a multivariate time series (in this case, channels x samples). So this call

    big.table[, c("Spectrum", "Frequency") := get.spectrum(Voltage), by = Subject]
    Error: cannot allocate vector of size 6.1 Gb

is totally bad.

### Brute 'for'ce

Here is a slow way to do it using (mostly useless) parallelization. get.spectrum() is modified to return a simple vector, which addresses the third error above, the one about return types from `j`:

    get.spectrum <- function(x) {
      spec.obj <- spectrum(x, method = "ar", plot = FALSE)
      outlist <- list()
      outlist <- 20 * log10(spec.obj$spec)
      # outlist$freq <- spec.obj$freq  # don't return me
      return(outlist)
    }

    require(parallel)
    require(foreach)
    freq.bins <- 500
    # sampling.rate and c.ind are not defined in this snippet
    spectra <- foreach(s.ind = unique(big.table$Subject), .combine = rbind) %:%
      foreach(t.ind = unique(big.table$Trial), .combine = rbind) %dopar% {
        cbind((sampling.rate * (seq_len(freq.bins) - 1) / sampling.rate),
              rep(c.ind, freq.bins),
              rep(t.ind, freq.bins),
              get.spectrum((subset(big.table,
                                   subset = (Subject == s.ind & Trial == t.ind),
                                   select = Voltage))$Voltage),
              rep(s.ind, freq.bins))
      }

This gives the right result because each input to get.spectrum() is a subset where Subject and Trial are fixed, leaving Channel and Sample to vary. However, it is quite slow, and over 80% of the computational load lands on 1 of the 4 cores I have on this machine.

### data.table approach

I went back to some toy cases that came up in the discussion, and tried this again:

    spec.dt <- big.table[, get.spectrum(Voltage), by = c("Subject", "Trial")]

This is close! It returns a data.table of almost the right structure.

    > str(spec.dt)
    Classes ‘data.table’ and 'data.frame': 140000 obs. of 3 variables:
     $ Subject: int 1 1 1 1 1 1 1 1 1 1 ...
     $ Trial  : int 1 1 1 1 1 1 1 1 1 1 ...
     $ V1     : num 110.7 109 105.4 101.6 98.2 ...

However, the Channel variable is missing. Easily fixed:

    > spec.dt <- erp.table[, get.spectrum(Voltage), by = c("Subject", "Trial", "Channel")]
    > str(spec.dt)
    Classes ‘data.table’ and 'data.frame': 16520000 obs. of 4 variables:
     $ Subject: int 1 1 1 1 1 1 1 1 1 1 ...
     $ Trial  : int 1 1 1 1 1 1 1 1 1 1 ...
     $ Channel: int 1 1 1 1 1 1 1 1 1 1 ...
     $ V1     : num 78.6 78.6 78.6 78.5 78.5 ...
     - attr(*, ".internal.selfref")=<externalptr>

Is this right? Well, it's easy to check whether it's the right shape. We know that there are 500 frequency bins in the default spectrum() call, and I stated that the data had 118 channels.

    > nrow(spec.dt)
    [1] 16520000
    > nrow(spec.dt)/500
    [1] 33040
    > nrow(spec.dt)/500/118
    [1] 280

That matches the 280 trials in the data.

### Remark

An apparent rule here is that in the `by` argument, you need to leave out the independent variable corresponding to the dependent data. If you don't, the other error appears.

    > spectra.table <- big.table[, get.spectrum(Voltage), by = c("Sample", "Subject", "Channel")]
    Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action, :
      'order.max' must be >= 1

Here Voltage is a function of Sample (since Sample is an index), and it is repeated over and over again for each Channel and each Subject.

I don't know exactly what the problem is here, though.

### Benchmarks

    > system.time(spec.dt <- erp.table[, get.spectrum(Voltage), by = c("Subject", "Trial", "Channel")])
       user  system elapsed
     86.669   3.452  87.414

    system.time(
      spectra <- foreach(s.ind = unique(erp.table$Subject), .combine = rbind) %:%
        foreach(t.ind = unique(erp.table$Trial), .combine = rbind) %dopar% {
          cbind((sampling.rate * (seq_len(freq.bins) - 1) / sampling.rate),
                rep(c.ind, freq.bins),
                rep(t.ind, freq.bins),
                get.spectrum((subset(erp.table,
                                     subset = (Subject == s.ind & Trial == t.ind),
                                     select = Voltage))$Voltage),
                rep(s.ind, freq.bins))
        })
       user  system elapsed
    114.259  17.937 131.873

The second benchmark is optimistic; I had run it a second time without cleaning up the environment or removing variables.
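The grouped call and the shape check from the answer can be tried on a small synthetic table. This is a minimal sketch, not the original data: the dimensions, the random voltages, and the names `toy`, `spec.toy` and `n.groups` are assumptions made for illustration, and it assumes the data.table package is installed. Keying on all four index columns is one way to make sure each series reaches `get.spectrum()` in Sample order, since rows keep their table order within a group.

```r
library(data.table)

set.seed(1)

# Toy dimensions (assumptions; the real data is far larger)
n.subject <- 2; n.trial <- 3; n.channel <- 4; n.sample <- 350

# CJ() builds every combination, so each (Subject, Trial, Channel)
# group holds exactly n.sample rows
toy <- CJ(Subject = seq_len(n.subject), Trial = seq_len(n.trial),
          Channel = seq_len(n.channel), Sample = seq_len(n.sample))
toy[, Voltage := rnorm(.N, -150, 100)]

# Shuffle the rows, then restore a sane order via the key
toy <- toy[sample(.N)]
setkey(toy, Subject, Trial, Channel, Sample)

# Vector-returning variant, as in the answer
get.spectrum <- function(x) {
  spec.obj <- spectrum(x, method = "ar", plot = FALSE)
  as.vector(20 * log10(spec.obj$spec))
}

# Leave Sample out of `by` so each group is one complete series
spec.toy <- toy[, get.spectrum(Voltage), by = .(Subject, Trial, Channel)]

# Shape check: number of groups times frequency bins per group
n.groups <- n.subject * n.trial * n.channel
print(nrow(spec.toy) / n.groups)
```

Dividing the row count by the number of groups should give the number of frequency bins per series, which mirrors the `nrow(spec.dt)/500/118` check above.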