Question
I'm running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:
> dim(basetrainf) # this is a dataframe
[1] 58168 118
The only pre-modeling step I take that significantly increases memory consumption is converting the data frame to a model matrix. This is because caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can't use it.)
> lsos()
Type Size PrettySize Rows Columns
mergem matrix 879205616 838.5 Mb 115562 943
trainf data.frame 80613120 76.9 Mb 106944 119
inttrainf matrix 76642176 73.1 Mb 907 10387
mergef data.frame 58264784 55.6 Mb 115562 75
dfbase data.frame 48031936 45.8 Mb 54555 115
basetrainf data.frame 40369328 38.5 Mb 58168 118
df2 data.frame 34276128 32.7 Mb 54555 103
tf data.frame 33182272 31.6 Mb 54555 98
m.gbm train 20417696 19.5 Mb 16 NA
res.glmnet list 14263256 13.6 Mb 4 NA
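For reference, here is a minimal sketch of why the dense model matrix dominates memory (the toy data frame and its column names are illustrative, not from the question): dummy-coded factors are mostly zeros, which model.matrix stores in full while sparse.model.matrix does not.
library(Matrix)
# toy data frame with a moderately high-cardinality factor (names are made up)
d = data.frame(y = rnorm(50000),
               f = factor(sample(sprintf('lvl%03d', 1:200), 50000, replace=TRUE)),
               x = rnorm(50000))
dense  = model.matrix(y ~ ., data=d)          # base R: every dummy column stored as doubles
sparse = sparse.model.matrix(y ~ ., data=d)   # Matrix package: only non-zero entries stored
print(object.size(dense),  units='Mb')        # many times larger than the sparse version
print(object.size(sparse), units='Mb')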
Also, since many R models don't support example weights, I had to oversample the minority class first, doubling the size of my dataset (which is why trainf, mergef, and mergem have twice as many rows as basetrainf).
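The oversampling step itself is not shown in the script below, but it is roughly of this shape (a sketch only; the balancing rule is an assumption, the object names follow the question):
# sketch: duplicate minority-class rows until the classes are balanced
minority.label = names(which.min(table(trainf[[response]])))
is.min  = trainf[[response]] == minority.label
n.extra = sum(!is.min) - sum(is.min)
extra   = trainf[is.min, ][sample(sum(is.min), n.extra, replace=TRUE), ]
trainf  = rbind(trainf, extra)   # every duplicated row is duplicated memory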
R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.
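Figures like that can be double-checked from inside R, in addition to lsos():
mem = gc()      # forces a garbage collection and reports usage by cell type
sum(mem[, 2])   # column 2 is megabytes currently in use; the sum approximates R's footprint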
What I do next is:
> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')
Bam - in a few seconds, the Linux out-of-memory killer kills rsession.
I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?
FWIW, I've never run into this problem with other ML software (Weka, Orange, etc.), even without pruning out any of my factors, perhaps because of both example weighting and "data frame" support, across all models.
Full script follows:
library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)
response = 'class'
repr = 'dummy'
do.impute = F
xmode = function(xs) names(which.max(table(xs)))
read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')
  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names,
                      na.strings='?', colClasses=classes, comment.char='')
  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair)
    !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]
  df
}
train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}
# From
sparse.cor = function(x){
  memory.limit(size=10000)
  # ... (the rest of this function and the intervening data-loading and
  # oversampling steps are missing from the source; the script resumes below)
}

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x) > 200))
mergef = mergef[, -which(badfactors)]
print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]
print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}
print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]
print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}
res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
Answer
Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.
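For example, caret's train() lets you cap or pin the mtry search so fewer forests are grown during tuning (a sketch reusing the question's objects; the candidate values are placeholders):
# try a single mtry value instead of caret's default grid
m = train(mergem[mergef$istrain,], mergef[mergef$istrain, response],
          method='rf', tuneLength=1)
# or name the candidates explicitly (the tuning column is 'mtry';
# older caret versions call it '.mtry')
m = train(mergem[mergef$istrain,], mergef[mergef$istrain, response],
          method='rf', tuneGrid=data.frame(mtry=c(10, 30)))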
Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.
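A minimal by-hand feasibility check along those lines might look like this (a sketch; ntree is just a starting value and the memory-heavy extras are switched off):
library(randomForest)
# one forest, no tuning loop
rf = randomForest(mergem[mergef$istrain,], mergef[mergef$istrain, response],
                  ntree=100,           # modest tree count for the feasibility test
                  importance=FALSE,    # skip variable-importance bookkeeping
                  keep.forest=FALSE)   # don't retain the trees (no predict() afterwards)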
At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
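Once mtry is settled, the final fit can be done directly with the interpretive extras turned back on (continuing the sketch above; best.mtry is a placeholder for whatever value the tuning selected):
best.mtry = 30   # placeholder: substitute the value caret's tuning chose
rf.final = randomForest(mergem[mergef$istrain,], mergef[mergef$istrain, response],
                        mtry=best.mtry, ntree=500,
                        importance=TRUE,    # variable importance back on for interpretation
                        keep.forest=TRUE)   # keep the trees so predict() works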