

My original list_ array has over 2 million rows, and I get a memory error when I run the code that calculates the rolling standard deviation. Is there a way I could get around it? The list_ shown below is only a portion of the actual NumPy array.


import pandas as pd
import math
import numpy as np
bigdata = 'input.csv'
data = pd.read_csv(bigdata, low_memory=False)
# reverse the row order of the table
data1 = data.iloc[::-1].reset_index(drop=True)
list_ = np.array(data1['Close'])


number = 5
list_ = np.array([457.334015, 424.440002, 394.795990, 408.903992, 398.821014, 402.152008, 435.790985, 423.204987, 411.574005])  # remaining values are cut off in the original post

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

std = np.std(rolling_window(list_, number), axis=1)

Error message:


MemoryError                               Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
      5     return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
----> 7 std1 = np.std(rolling_window(PC_list, number), axis=1)

<__array_function__ internals> in std(*args, **kwargs)

C:\Python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
   3496     return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3497                          **kwargs)

C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
    232 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
    233     ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 234                keepdims=keepdims)
    236     if isinstance(ret, mu.ndarray):

C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
    200     # Note that x may not be inexact and that we need it to be an array,
    201     # not a scalar.
--> 202     x = asanyarray(arr - arrmean)
    204     if issubclass(arr.dtype.type, (nt.floating, nt.integer)):

MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64



Generally, there are two ways to deal with "cannot allocate 198GiB of memory":

  • Process the data in chunks, or line by line.

Your algorithm appears to be suitable for this: rather than reading the data all at once, rewrite the rolling_window function so that it loads the initial window (the first n lines of the file), then repeatedly drops the oldest line and reads the next line from the file. That way, you'll never have more than n lines in memory, and the calculation never needs the full 198 GiB array.
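A minimal sketch of that line-by-line approach, assuming the data sits in a local CSV with a Close column as in the question. The helper names read_column and rolling_std_stream are hypothetical, not part of any library. It recomputes np.std for every window, so it keeps memory small but is not fast; an incremental update of the mean and variance would be the next step if speed matters.

```python
import csv
from collections import deque

import numpy as np


def read_column(path, column):
    """Yield one float per CSV row without loading the whole file."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield float(row[column])


def rolling_std_stream(values, window):
    """Yield np.std (ddof=0) of each full window over an iterable of numbers."""
    buf = deque(maxlen=window)  # a full deque drops its oldest item on append
    for value in values:
        buf.append(value)
        if len(buf) == window:
            # Only `window` values are ever held in memory at once.
            yield float(np.std(np.fromiter(buf, dtype=np.float64, count=window)))


# Hypothetical usage with the file name and column from the question:
# for s in rolling_std_stream(read_column("input.csv", "Close"), 10000):
#     print(s)
```

Note that the question reverses the rows before computing. Since the standard deviation does not depend on the order of values inside a window, streaming the file in its stored order should give the same numbers, only in the opposite order.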


If it's a local file, it can be kept open during the whole calculation, which is easiest. If it's a remote object, you may find connections timing out; if so, you may need to either copy the data to a local file, or use the relevant seek/offset parameter to reopen the file for each additional line (or each additional chunk, which you buffer locally).
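If the one-dimensional array itself fits in memory (a few million float64 values is only tens of megabytes), the chunked variant can be sketched without touching the file at all. The example below reuses the rolling_window helper from the question and calls np.std on a limited number of windows at a time; rolling_std_chunked and its block parameter are hypothetical names chosen for illustration. Each temporary array NumPy creates is then about block × window × 8 bytes (roughly 80 MB for block=1000 and window=10000) instead of the full 198 GiB.

```python
import numpy as np


def rolling_window(a, window):
    # Same strided view as in the question; it copies no data by itself.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def rolling_std_chunked(a, window, block=1000):
    """Rolling np.std (ddof=0), computed `block` windows at a time."""
    a = np.ascontiguousarray(a, dtype=np.float64)
    n_windows = a.shape[0] - window + 1
    if n_windows <= 0:
        return np.empty(0, dtype=np.float64)
    out = np.empty(n_windows, dtype=np.float64)
    for start in range(0, n_windows, block):
        stop = min(start + block, n_windows)
        # Windows start..stop-1 need elements start..stop+window-2.
        piece = a[start:stop + window - 1]
        out[start:stop] = np.std(rolling_window(piece, window), axis=1)
    return out


# Hypothetical usage with the names from the question:
# std = rolling_std_chunked(list_, number)
```

A smaller block lowers peak memory at the cost of more Python-level loop iterations.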

  • Alternatively, rent or buy a machine with more than 200 GiB of memory; machines with over 1 TiB of memory are available off the shelf from AWS (and presumably GCP and Azure), or for direct purchase.


This is especially suitable if you're reasonably sure your requirements won't grow further and you just need to get this one job done. It'll save you from rewriting your code, but it's not a sustainable solution in the longer term.


08-04 05:17