This article covers memory-efficient handling of massive numpy arrays in Python. It should be a useful reference if you are tackling a similar problem; if so, read on.

Problem Description


I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3), which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13" MacBook Pro with 16GB of RAM, a 500GB solid-state drive and a 3.1 GHz Intel i7 processor. Just loading the array overflows into virtual memory, but not to the point where my machine suffers or I have to stop everything else I am doing.
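
For reference, the ~20GB figure is just the raw storage for an array of that shape, assuming the default float64 dtype; a quick back-of-envelope check:

n_rows, n_cols = 868940742, 3
print(n_rows * n_cols)                # ~2.6 billion values
print(n_rows * n_cols * 8 / 1024**3)  # ~19.4 GiB at 8 bytes per float64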

I build this VERY large array step by step from 22 smaller (N, 2) subarrays.

Function FUN_1 generates 2 new (N, 1) arrays using each of the 22 subarrays which I call sub_arr.

The first output of FUN_1 is generated by interpolating values from sub_arr[:,0] on array b = array([X, F(X)]) and the second output is generated by placing sub_arr[:, 0] into bins using array r = array([X, BIN(X)]). I call these outputs b_arr and rate_arr, respectively. The function returns a 3-tuple of (N, 1) arrays:

import numpy as np

def FUN_1(sub_arr):
    """interpolate b values and rates based on position in sub_arr"""

    b = np.load(bfile)  # bfile: path to the saved [X, F(X)] table (defined elsewhere)
    r = np.load(rfile)  # rfile: path to the saved [X, BIN(X)] table (defined elsewhere)

    b_arr = np.interp(sub_arr[:,0], b[:,0], b[:,1])
    rate_arr = np.searchsorted(r[:,0], sub_arr[:,0])  # HUGE efficiency gain over np.digitize...

    return r[rate_arr, 1], b_arr, sub_arr[:,1]

I call the function 22 times in a for-loop and fill a pre-allocated array of zeros full_arr = numpy.zeros([868940742, 3]) with the values:

full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1(sub_arr)

In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
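
A minimal sketch of how that chunked fill could be arranged; the sub_arrays list and the running offset are assumptions, since the actual loop is not reproduced in the question:

# hypothetical: sub_arrays is a list holding the 22 (N, 2) subarrays
full_arr = np.zeros([868940742, 3])

offset = 0
for sub_arr in sub_arrays:
    n = sub_arr.shape[0]
    rate, b_arr, pos = FUN_1(sub_arr)
    full_arr[offset:offset + n, 0] = rate
    full_arr[offset:offset + n, 1] = b_arr
    full_arr[offset:offset + n, 2] = pos
    offset += n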

Here is the sorting routine (there are two consecutive sorts):

for idx in range(2):
    sort_idx = numpy.argsort(full_arr[:,idx])
    full_arr = full_arr[sort_idx]
    # ...
    # <additional processing, return small (1000, 3) array of stats>

Now this sort had been working, albeit slowly (takes about 10 minutes). However, I recently started using a larger, more fine resolution table of [X, F(X)] values for the interpolation step above in FUN_1 that returns b_arr and now the SORT really slows down, although everything else remains the same.

Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files - the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure if this should have any kind of effect...?

bigger, slower file:

17399307    99.4
17493652    98.8
17570460    98.2
17575180    97.6
17577127    97
17578255    96.4
17580576    95.8
17583028    95.2
17583699    94.6
17584172    94

smaller, more uniform regular file:

1       24
1001    24
2001    24
3001    24
4001    24
5001    24
6001    24
7001    24

I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!

Solution

At the moment each call to np.argsort is generating a (868940742,) array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to reorder the rows of full_arr you are generating another (868940742, 3) array of floats, since fancy indexing always returns a copy rather than a view.

One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an in-place sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:

full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)

In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True.
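
As a quick sanity check on a small toy array (the array and sizes below are just illustrative), the structured-array view sorts the rows in place and, when the keys are unique, matches the argsort-based result:

import numpy as np

arr = np.random.rand(1000, 3)                       # small stand-in for full_arr
assert arr.flags.C_CONTIGUOUS                       # required for the view trick

expected = arr[np.argsort(arr[:, 0])]               # reference: argsort + fancy indexing (makes copies)
arr.view('f8, f8, f8').sort(order=['f0'], axis=0)   # sorts arr's rows in place by the first column

assert np.array_equal(arr, expected)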

Credit for this method should go to Joe Kington for his answer here.


Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:

%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop

%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop

However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.


Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:

5  1  4  8  10  2  6  9  7  3

There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:

4  4  3  2  3  1  2  5  1  5

Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
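
That count is just the product of k! over each value that appears k times, which is easy to compute for any sequence:

import math
from collections import Counter

digits = [4, 4, 3, 2, 3, 1, 2, 5, 1, 5]

# one factor of k! for every value repeated k times: (2!)**5 = 32 here
ways = math.prod(math.factorial(k) for k in Counter(digits).values())
print(ways)  # 32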

By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
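
NumPy's sort is implemented in C; the following is only a toy pure-Python sketch of the partitioning idea described above, showing why duplicates reduce the work (everything equal to the pivot drops out of the recursion):

def toy_quicksort(a):
    """Pure-Python illustration only, not NumPy's actual implementation."""
    if len(a) <= 1:
        return list(a)
    pivot = a[len(a) // 2]
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]   # already 'sorted': no further work needed
    greater = [x for x in a if x > pivot]
    return toy_quicksort(less) + equal + toy_quicksort(greater)

print(toy_quicksort([4, 4, 3, 2, 3, 1, 2, 5, 1, 5]))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]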

For example:

%%timeit -n 1 -r 100 x = np.random.random_integers(0, 10, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop

%%timeit -n 1 -r 100 x = np.random.random_integers(0, 1000, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop

In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.
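
If you want to check the item-size effect yourself, a comparison along these lines should do it (timings will depend on your machine, so none are shown):

import numpy as np
from timeit import timeit

x64 = np.random.rand(1000000, 2)          # float64: 8 bytes per item
x32 = x64.astype(np.float32)              # float32: 4 bytes per item
idx = np.argsort(x64[:, 0])

print(timeit(lambda: np.take(x64, idx, axis=0), number=100))
print(timeit(lambda: np.take(x32, idx, axis=0), number=100))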

That's all for this article on memory-efficient handling of massive numpy arrays in Python. We hope the answer above is helpful, and thank you for your continued support!
