I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3)
which takes up about 20 GB of memory on my machine once loaded and just sitting there. I have an early-2015 13-inch MacBook Pro with 16 GB of RAM, a 500 GB SSD and a 3.1 GHz Intel i7 processor. Just loading the array overflows to virtual memory, but not to the point where my machine suffers or I have to stop everything else I am doing.
I build this VERY large array step by step from 22 smaller (N, 2)
subarrays.
Function FUN_1
generates 2 new (N, 1)
arrays using each of the 22 subarrays which I call sub_arr
.
The first output of FUN_1
is generated by interpolating values from sub_arr[:,0]
on array b = array([X, F(X)])
and the second output is generated by placing sub_arr[:, 0]
into bins using array r = array([X, BIN(X)])
. I call these outputs b_arr
and rate_arr
, respectively. The function returns a 3-tuple of (N, 1)
arrays:
import numpy as np

def FUN_1(sub_arr):
    """Interpolate b values and rates based on position in sub_arr."""
    b = np.load(bfile)  # interpolation table b = array([X, F(X)]); bfile is defined elsewhere
    r = np.load(rfile)  # binning table r = array([X, BIN(X)]); rfile is defined elsewhere
    b_arr = np.interp(sub_arr[:, 0], b[:, 0], b[:, 1])
    rate_arr = np.searchsorted(r[:, 0], sub_arr[:, 0])  # HUGE efficiency gain over np.digitize...
    return r[rate_arr, 1], b_arr, sub_arr[:, 1]
I call the function 22 times in a for-loop and fill a pre-allocated array of zeros full_arr = np.zeros([868940742, 3])
with the values:
full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1(sub_arr)
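To make that fill step explicit, here is a minimal sketch of the loop, with sub_arrays standing in as a hypothetical name for the collection of 22 sub-arrays; each call's results are written into the next block of rows:

full_arr = np.zeros([868940742, 3])
start = 0
for sub_arr in sub_arrays:  # sub_arrays: hypothetical list holding the 22 (N, 2) arrays
    stop = start + len(sub_arr)
    rate, b_vals, pos = FUN_1(sub_arr)
    full_arr[start:stop, 0] = rate
    full_arr[start:stop, 1] = b_vals
    full_arr[start:stop, 2] = pos
    start = stop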
In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
Here is the sorting routine (there are two consecutive sorts)
for idx in range(2):
    sort_idx = np.argsort(full_arr[:, idx])
    full_arr = full_arr[sort_idx]
    # ...
    # <additional processing, return small (1000, 3) array of stats>
Now this sort had been working, albeit slowly (it takes about 10 minutes). However, I recently started using a larger, finer-resolution table of [X, F(X)] values for the interpolation step above in FUN_1 (the one that returns b_arr), and now the SORT really slows down, although everything else remains the same.
Interestingly, I am not even sorting on the interpolated values at the step that is now lagging. Here are some snippets of the different interpolation files: the smaller one is about 30% smaller in each case and far more uniform in terms of the values in its second column, while the slower one has a higher resolution and many more unique values, so the results of the interpolation are likely more unique as well. I'm not sure whether that should have any effect; a quick way to check is sketched after the snippets below.
bigger, slower file:
17399307 99.4
17493652 98.8
17570460 98.2
17575180 97.6
17577127 97
17578255 96.4
17580576 95.8
17583028 95.2
17583699 94.6
17584172 94
smaller, more uniform regular file:
1 24
1001 24
2001 24
3001 24
4001 24
5001 24
6001 24
7001 24
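One quick way to check, sketched here for reference, is to count the distinct values that actually end up in each column of full_arr with np.unique (bearing in mind that np.unique itself sorts a copy of each column, so on an array this size the check is not free):

# Count distinct values per column of the assembled array.
for col in range(3):
    print(col, np.unique(full_arr[:, col]).size)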
I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!
At the moment each call to np.argsort
is generating a (868940742, 1)
array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort the columns of full_arr
you are generating another (868940742, 3)
array of floats, since fancy indexing always returns a copy rather than a view.
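The arithmetic behind those figures, for reference:

n_rows = 868940742
print(n_rows * 8 / 1e9)      # ~6.95 GB: the int64 index array returned by argsort
print(n_rows * 3 * 8 / 1e9)  # ~20.9 GB: the float64 copy produced by full_arr[sort_idx]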
One fairly obvious improvement would be to sort full_arr
in place using its .sort()
method. Unfortunately, .sort()
does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an in-place sort over one of the three columns by getting a view
onto your array as a structured array with three float fields, then sorting by one of these fields:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
In this case I'm sorting full_arr
in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8'
) - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True
.
Credit for this method should go to Joe Kington for his answer here.
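As a small sanity check (a toy example rather than anything from your data), the in-place view sort gives the same row ordering as the argsort route:

import numpy as np

a = np.random.randn(10, 3)           # C-contiguous float64, a miniature full_arr
expected = a[np.argsort(a[:, 0])]    # reference result via argsort (this route makes copies)

a.view('f8, f8, f8').sort(order=['f0'], axis=0)   # sorts a in place by the first column

print(np.array_equal(a, expected))   # True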
Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort
to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort
to obtain a set of indices to sort by, you might see a modest performance gain by using np.take
rather than direct indexing to get the sorted array:
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.
Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:
5 1 4 8 10 2 6 9 7 3
There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:
4 4 3 2 3 1 2 5 1 5
Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
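This can be verified by brute force; the following check (which takes a few seconds, since it walks all 10! orderings) treats equal digits as distinguishable and counts how many orderings come out ascending:

from itertools import permutations
from math import factorial

digits = [4, 4, 3, 2, 3, 1, 2, 5, 1, 5]
target = sorted(digits)

# Of the 10! orderings, exactly 2**5 = 32 are ascending: one swap choice per repeated pair.
n_sorted = sum(1 for p in permutations(digits) if list(p) == target)
print(n_sorted, factorial(len(digits)))  # 32 3628800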
By default, np.ndarray.sort()
uses Quicksort. The qsort
variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
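A toy sketch of such a three-way partition (for illustration only; this is not NumPy's actual implementation) makes the point concrete: the 'equal' block is finished as soon as it is formed, so the fewer distinct values there are, the less of the array survives into the recursive calls.

def quicksort3(a):
    """Toy three-way-partition quicksort on a Python list."""
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]      # already in its final position
    greater = [x for x in a if x > pivot]
    return quicksort3(less) + equal + quicksort3(greater)

print(quicksort3([4, 4, 3, 2, 3, 1, 2, 5, 1, 5]))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]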
For example:
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 10, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 1000, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.
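As a rough illustration of that last point (arbitrary sizes, and timings will vary by machine), the copy made by fancy indexing scales with the item size, so the float32 copy comes out noticeably cheaper than the float64 one:

import time
import numpy as np

idx = np.random.permutation(10_000_000)
for dtype in (np.float32, np.float64):
    x = np.random.randn(10_000_000).astype(dtype)
    t0 = time.perf_counter()
    _ = x[idx]                       # fancy indexing copies the whole array
    print(np.dtype(dtype).name, round(time.perf_counter() - t0, 3), 's')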