Linear Regression in NumPy with Very Large Matrices


Question

So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need the quantity (X^T*X)^-1 * X^T * Y. How can I compute this as space-efficiently as possible?

Right now I have:

X = readMatrix("fileX.txt")
Y = readMatrix("fileY.txt")
return (X.getT() * X).getI() * X.getT() * Y

How many matrices are being stored in memory here? Are more than two matrices being stored at once? Is there a better way to do it?

I have about 1.5 GB of memory for this project. I can probably stretch it to 2 or 2.5 if I close every other program. Ideally the process would run in a short amount of time also, but the memory bound is more strict.

The other approach I've tried is saving the intermediate steps of the calculation as text files and reloading them after every step. But that is very slow.

Recommended answer

X is 100e6 x 10 and Y is 100e6 x 1, so the final result (X^T*X)^-1 * X^T * Y is 10 x 1.

You can calculate it in the following steps:

  1. Calculate a = X^T*X -> 10 x 10
  2. Calculate b = X^T*Y -> 10 x 1
  3. Calculate a^-1 * b
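
The three steps can be sketched on a small stand-in problem to show the shapes involved. This is only an illustration under stated assumptions: the random data, the 1000-row size, and the helper name solve_least_squares are not from the original post, and the sketch uses numpy.linalg.solve in step 3 rather than forming a^-1 explicitly, which gives the same result for an invertible a.

```python
import numpy as np


def solve_least_squares(a, b):
    """Step 3: solve a @ beta = b for beta (hypothetical helper name)."""
    return np.linalg.solve(a, b)


# Small stand-in data (assumption); the real X would be 100e6 x 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
Y = rng.normal(size=(1000, 1))

a = X.T @ X                        # step 1: 10 x 10
b = X.T @ Y                        # step 2: 10 x 1
beta = solve_least_squares(a, b)   # step 3: 10 x 1
```

Only a and b need to survive until step 3, and together they hold 110 numbers regardless of how many rows X has.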

The matrices in step 3 are very small, so you only need some intermediate steps to calculate steps 1 and 2.

For example, you can read column 0 of X (call it X0) along with Y, and compute their product with numpy.dot(X0, Y).

For the float64 dtype, X0 and Y together take about 1600 MB (100e6 values x 8 bytes each = 800 MB per array). If that does not fit in memory, you can call numpy.dot twice, once for the first half and once for the second half of X0 and Y, and add the two results.
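
Splitting the dot product into pieces works because the dot product of two long vectors equals the sum of the dot products of their matching pieces. A minimal sketch follows; chunked_dot and n_chunks are hypothetical names, and for simplicity the arrays here are already in memory, whereas with the real data each piece would be read from disk just before it is used.

```python
import numpy as np


def chunked_dot(u, v, n_chunks=2):
    """Dot product of two equal-length 1-D arrays, computed piece by piece."""
    total = 0.0
    # Integer boundaries that split the index range into n_chunks pieces.
    bounds = np.linspace(0, len(u), n_chunks + 1).astype(int)
    for start, stop in zip(bounds[:-1], bounds[1:]):
        total += np.dot(u[start:stop], v[start:stop])
    return total


u = np.arange(10.0)
v = np.ones(10)
halves = chunked_dot(u, v)               # two halves, as described above
thirds = chunked_dot(u, v, n_chunks=3)   # more pieces if memory is tighter
```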

So to calculate X^T*Y you need to call numpy.dot 20 times (10 columns x 2 halves), and to calculate X^T*X you need to call numpy.dot 200 times (10 x 10 column pairs x 2 halves).
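
The column-by-column bookkeeping might look like the sketch below. It is an illustration, not the answerer's code: normal_equation_terms and read_column are hypothetical names, and read_column(j) stands for whatever returns column j of X as a 1-D array (for a text file this could wrap numpy.loadtxt with its usecols argument, which is an assumption about the file layout). Unlike the count above, the sketch uses the symmetry of X^T*X to skip the mirrored entries, and it does not split columns into halves; each numpy.dot call here could be replaced by a piecewise version if a full column does not fit.

```python
import numpy as np


def normal_equation_terms(read_column, y, n_cols):
    """Build a = X^T X and b = X^T Y with at most two columns of X in memory.

    read_column(j) is a hypothetical callable returning column j of X.
    """
    a = np.empty((n_cols, n_cols))
    b = np.empty(n_cols)
    for i in range(n_cols):
        xi = read_column(i)
        b[i] = np.dot(xi, y)
        a[i, i] = np.dot(xi, xi)
        for j in range(i + 1, n_cols):
            # X^T X is symmetric, so one dot product fills both entries.
            a[i, j] = a[j, i] = np.dot(xi, read_column(j))
    return a, b


# Small in-memory stand-in for the real files (assumption).
rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 10))
y_small = rng.normal(size=200)

a, b = normal_equation_terms(lambda j: X_small[:, j], y_small, 10)
beta = np.linalg.solve(a, b)
```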


08-11 17:05