This article describes how to replace the bootstrap step in the R package randomForest. It should be a useful reference for anyone who runs into the same problem.

Problem description

First some background info, which is probably more interesting on stats.stackexchange: in my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). For example, I have trained a Boosting model and compare it with a Random Forest model (R package randomForest).

I use time series data where the explanatory variables are lagged values of other data and of the dependent variable.

For some reason the Random Forest severely underperforms. One of the problems I could think of is that the Random Forest performs a sampling step of the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.

To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so-called block-wise bootstrap step. This basically means I cut the training set into k parts, where k << N and each part stays in the original order. If I sample these k parts, I can still benefit from the 'randomness' in the Random Forest, but with the time series nature left largely intact.

Now my problem is this: to achieve this I would normally copy the existing function and edit the desired step/lines:

randomForest2 <- randomForest()

But the randomForest() function seems to be a wrapper around another wrapper around deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?
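To make the block-wise sampling described above concrete, here is a minimal toy sketch (our own illustration, not part of randomForest; N, k, blocks and idx are made-up names) of cutting an index vector into contiguous blocks and resampling whole blocks with replacement:

# cut 1:N into k contiguous blocks, then resample whole blocks
N <- 12; k <- 4
blocks <- split(1:N, rep(1:k, each = N / k))      # blocks 1:3, 4:6, 7:9, 10:12
idx <- unlist(sample(blocks, k, replace = TRUE))  # one block-wise bootstrap sample
idx  # e.g. 4 5 6 1 2 3 10 11 12 4 5 6 -- order within each block is preserved

Each resampled index set varies from tree to tree but keeps the short-range autocorrelation intact inside every block.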
Solution

So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart) that fits just a single regression tree, and I aggregated the trees myself (taking the means of their predictions).

On my actual data the result is a slight but consistent improvement over the normal random forest in terms of RMSPE. For the simulated data below, performance is a coin-toss.

Taking Soren's code as an example, it looks a bit like this:

library(randomForest)
library(doParallel) # the parallel package with mclapply is better on Linux
library(rpart)

# parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)

# simulated time series (y) with time roll and lag = 1
timepoints = 1000; var = 6; noise.factor = .2

# past-to-present orientation
y = sin((1:timepoints) * pi / 30) * 1000 +
    sin((1:timepoints) * pi / 40) * 1000 + 1:timepoints
y = y + rnorm(timepoints, sd = sd(y)) * noise.factor
plot(y, type = "l")

# convert to absolute change, with lag = 1
dy = c(0, y[-1] - y[-length(y)]) # c(0, t2-t1, t3-t2, ...)
dy = dy + rnorm(timepoints) * sd(dy) * noise.factor # add noise

# compute lags
dX = sapply(1:40, function(i) {
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags < 1] = NA # remove timepoints before start
  dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)] = -100 # quick fix for when a lag exceeds the series
pairs(data.frame(dy, dX[, 1:5]), cex = .2) # data structure

# make train- and test-set
train = 1:600
dy.train = dy[ train]
dy.test  = dy[-train]
dX.train = dX[ train, ]
dX.test  = dX[-train, ]

# classic rf
rf = randomForest(dX.train, dy.train, ntree = 500)
print(rf)

# like split() for a vector, but without mixing
split2 = function(aVector, splits = 31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector / splits), splits)
  if (mod != 0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits, function(i) {
    Stop  = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}

# create a list of block-wise bootstrapped samples
aBlock   <- list()
numTrees <- 500
splits   <- 40
for (ttt in 1:numTrees) {
  aBlock[[ttt]] <- unlist(
    sample(
      split2(1:nrow(dX.train), splits = splits),
      splits,
      replace = TRUE
    )
  )
}

# put the data into a data frame so rpart understands it
df1 <- data.frame(dy.train, dX.train)

# fit one regression tree per block-wise bootstrap sample
rfBlocks = foreach(aBlock = aBlock, .packages = "rpart") %dopar% {
  dBlock = df1[aBlock, ]
  predict(
    rpart(dy.train ~ ., data = dBlock, method = "anova"),
    newdata = data.frame(dX.test)
  )
}

# predict the test set and make a results table;
# use rowMeans to aggregate the block-wise predictions
results = data.frame(
  predBlock     = rowMeans(do.call(cbind.data.frame, rfBlocks)),
  true          = dy.test,
  predBootstrap = predict(rf, newdata = dX.test)
)
plot(results[, 1:2], xlab = "OOB-CV predicted change",
     ylab = "trueChange",
     main = "black bootstrap and blue block train")
points(results[, 3:2], col = "blue")

# prediction results
print(cor(results)^2)

stopCluster(cl) # close cluster
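The comparison above is reported via squared correlations (cor(results)^2). Since the answer judges performance by RMSPE, here is a minimal sketch of making that comparison explicit; the rmspe helper is our own illustration (assuming RMSPE means root mean squared prediction error) and reuses the results data frame built above:

# root mean squared prediction error; swap in the percentage-error
# variant if that is the definition of RMSPE you use
rmspe <- function(pred, true) sqrt(mean((pred - true)^2))

rmspe(results$predBlock,     results$true) # block-wise bootstrap + rpart
rmspe(results$predBootstrap, results$true) # classic randomForest bootstrap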
This concludes this article on how to replace the bootstrap step in the R package randomForest. We hope the recommended answer helps, and thank you for your continued support!