Question
I'm running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:
> dim(basetrainf) # this is a dataframe
[1] 58168 118
The only pre-modeling step I take that significantly increases memory consumption is converting the data frame to a model matrix. This is because caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can't use it.)
> lsos()
Type Size PrettySize Rows Columns
mergem matrix 879205616 838.5 Mb 115562 943
trainf data.frame 80613120 76.9 Mb 106944 119
inttrainf matrix 76642176 73.1 Mb 907 10387
mergef data.frame 58264784 55.6 Mb 115562 75
dfbase data.frame 48031936 45.8 Mb 54555 115
basetrainf data.frame 40369328 38.5 Mb 58168 118
df2 data.frame 34276128 32.7 Mb 54555 103
tf data.frame 33182272 31.6 Mb 54555 98
m.gbm train 20417696 19.5 Mb 16 NA
res.glmnet list 14263256 13.6 Mb 4 NA
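For reference, here is a minimal sketch of why the dense model matrix dominates memory (the toy data frame and its column names are illustrative, not from the question): dummy-coded factors are mostly zeros, which model.matrix stores in full while sparse.model.matrix does not.
library(Matrix)
# toy data frame with a moderately high-cardinality factor (names are made up)
d = data.frame(y = rnorm(50000),
               f = factor(sample(sprintf('lvl%03d', 1:200), 50000, replace=TRUE)),
               x = rnorm(50000))
dense  = model.matrix(y ~ ., data=d)          # base R: every dummy column stored as doubles
sparse = sparse.model.matrix(y ~ ., data=d)   # Matrix package: only non-zero entries stored
print(object.size(dense),  units='Mb')        # many times larger than the sparse version
print(object.size(sparse), units='Mb')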
Also, since many R models don't support example weights, I had to oversample the minority class first, doubling the size of my dataset (which is why trainf, mergef, and mergem have twice as many rows as basetrainf).
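The oversampling step itself is not shown in the script below, but it is roughly of this shape (a sketch only; the balancing rule is an assumption, the object names follow the question):
# sketch: duplicate minority-class rows until the classes are balanced
minority.label = names(which.min(table(trainf[[response]])))
is.min  = trainf[[response]] == minority.label
n.extra = sum(!is.min) - sum(is.min)
extra   = trainf[is.min, ][sample(sum(is.min), n.extra, replace=TRUE), ]
trainf  = rbind(trainf, extra)   # every duplicated row is duplicated memory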
R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.
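Figures like that can be double-checked from inside R, in addition to lsos():
mem = gc()      # forces a garbage collection and reports usage by cell type
sum(mem[, 2])   # column 2 is megabytes currently in use; the sum approximates R's footprint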
What I do next is:
> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')
Bam - in a few seconds, the Linux out-of-memory killer kills rsession.
I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?
FWIW, I've never run into this problem with other ML software (Weka, Orange, etc.), even without pruning out any of my factors, perhaps because of both example weighting and "data frame" support, across all models.
Full script follows:
library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)
response = 'class'
repr = 'dummy'
do.impute = F
xmode = function(xs) names(which.max(table(xs)))
read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')
  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names,
                      na.strings='?', colClasses=classes, comment.char='')
  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair)
    !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]
  df
}
train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}
# From
sparse.cor = function(x){
  memory.limit(size=10000)
  # ... (the rest of this function and the intervening data-loading and
  # oversampling steps are missing from the source; the script resumes below)
}

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x) > 200))
mergef = mergef[, -which(badfactors)]
print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]
print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}
print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]
print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}
res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
Answer
Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.
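For example, caret's train() lets you cap or pin the mtry search so fewer forests are grown during tuning (a sketch reusing the question's objects; the candidate values are placeholders):
# try a single mtry value instead of caret's default grid
m = train(mergem[mergef$istrain,], mergef[mergef$istrain, response],
          method='rf', tuneLength=1)
# or name the candidates explicitly (the tuning column is 'mtry';
# older caret versions call it '.mtry')
m = train(mergem[mergef$istrain,], mergef[mergef$istrain, response],
          method='rf', tuneGrid=data.frame(mtry=c(10, 30)))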
Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.
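A minimal by-hand feasibility check along those lines might look like this (a sketch; ntree is just a starting value and the memory-heavy extras are switched off):
library(randomForest)
# one forest, no tuning loop
rf = randomForest(mergem[mergef$istrain,], mergef[mergef$istrain, response],
                  ntree=100,           # modest tree count for the feasibility test
                  importance=FALSE,    # skip variable-importance bookkeeping
                  keep.forest=FALSE)   # don't retain the trees (no predict() afterwards)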
At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
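Once mtry is settled, the final fit can be done directly with the interpretive extras turned back on (continuing the sketch above; best.mtry is a placeholder for whatever value the tuning selected):
best.mtry = 30   # placeholder: substitute the value caret's tuning chose
rf.final = randomForest(mergem[mergef$istrain,], mergef[mergef$istrain, response],
                        mtry=best.mtry, ntree=500,
                        importance=TRUE,    # variable importance back on for interpretation
                        keep.forest=TRUE)   # keep the trees so predict() works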