Problem description
I commonly need to summarize a time series that has irregular timing using a given aggregation function (e.g., sum, average). However, my current solution seems inefficient and slow.
The function that performs the aggregation:
function aggArray = aggregate(array, groupIndex, collapseFn)
    % Each unique row of groupIndex identifies one period.
    groups = unique(groupIndex, 'rows');
    aggArray = nan(size(groups, 1), size(array, 2));
    for iGr = 1:size(groups,1)
        % Logical mask of the rows that belong to the current period.
        grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
        % Collapse each series (column) separately.
        for iSer = 1:size(array, 2)
            aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
        end
    end
end
Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.
Then, when we give it an irregular time series (note that the second period is one base period longer), the timing results are poor:
a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.
Using the profiler, we can see that the grIdx line takes about 1/4 of the execution time (0.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).
Compare this with the period-indifferent case:
tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.
Is there a more efficient way to aggregate this data?
Taking each response and putting it in a separate function, here are the timing results (in seconds) I get with timeit in MATLAB R2015b on Windows 7 with an Intel i7:
original | 1.32451
felix1 | 0.35446
felix2 | 0.16432
divakar1 | 0.41905
divakar2 | 0.30509
divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977
Clarification on groupIndex
An example of a 2D groupIndex would be one where both the year number and the week number are specified for a set of daily data covering 1980-2015:
a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];
Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.
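To make that step concrete, here is a minimal sketch on a small made-up year/week index. The data and the names periods and periodId are illustrative assumptions, not part of the question:

```matlab
% Toy 2D group index: column 1 is the year, column 2 is the week.
groupIndex = [1980 1; 1980 1; 1980 2; 1981 1; 1981 1];

% The third output of unique(..., 'rows') gives, for every row of
% groupIndex, the position of its matching row in the sorted unique rows.
[periods, ~, periodId] = unique(groupIndex, 'rows');

disp(periods);    % one row per distinct (year, week) pair
disp(periodId');  % 1 1 2 3 3
```

The resulting periodId is a single column of integer labels, so any approach that works for a 1D group index can be reused for the 2D case.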
Recommended answer
A little late to the party, but a single loop using accumarray makes a huge difference:
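function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)
    % index maps every row of groupIndex to its unique period.
    [groups, ~, index] = unique(groupIndex, 'rows');
    numCols = size(array, 2);
    % Count unique rows (not elements) so a 2D groupIndex also works.
    aggArray = nan(size(groups, 1), numCols);
    for col = 1:numCols
        % Apply collapseFn to the values of this column, grouped by period.
        aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
    end
end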
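As a small worked illustration of what each accumarray call inside the loop does, the following sketch uses made-up values; the variable names are illustrative only:

```matlab
% Five observations falling into two periods (labels 1 and 2).
values = [1; 2; 3; 4; 5];
index  = [1; 1; 2; 2; 2];

% accumarray collects the values sharing a label and passes each
% collection to the supplied function handle.
groupSums  = accumarray(index, values, [], @sum);   % [3; 12]
groupMeans = accumarray(index, values, [], @mean);  % [1.5; 4]
```

Because the grouping is done in one pass per column, the cost no longer grows with the number of periods the way the original mask-per-period loop does.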
Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:
original | 1.127141
gnovice | 0.002205
A speedup of over 500x!
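The remaining loop over columns could in principle be folded into a single accumarray call by using two-column subscripts (period, series). The sketch below is a hypothetical variant, not part of the answer and not timed here; the function name aggregate_single_call is made up for illustration. Note that accumarray does not guarantee the order in which values reach the function when the subscripts are unsorted, so this is only safe for order-independent functions such as sum or mean.

```matlab
function aggArray = aggregate_single_call(array, groupIndex, collapseFn)
% Hypothetical variant (illustrative name): one accumarray call for all columns.
    [groups, ~, rowIdx] = unique(groupIndex, 'rows');
    [numRows, numCols] = size(array);
    % Column subscript for every element of array(:), in column-major order.
    colIdx = kron((1:numCols)', ones(numRows, 1));
    % Pair each element with its (period, series) destination.
    subs = [repmat(rowIdx(:), numCols, 1), colIdx];
    aggArray = accumarray(subs, array(:), [size(groups, 1), numCols], collapseFn);
end
```

Whether this is faster than the per-column loop depends on the data and the MATLAB release, so it would need to be measured with timeit before being preferred.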