Problem description

I commonly need to summarize a time series with irregular timing using a given aggregation function (e.g., sum, average, etc.). However, my current solution seems inefficient and slow.

Take the aggregation function:

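function aggArray = aggregate(array, groupIndex, collapseFn)
% Collapse each column of array within the periods given by the rows of groupIndex.

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    % Logical mask of the rows of groupIndex that belong to this period.
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
      % Apply the aggregation function to one series within this period.
      aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
    end
end

end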

Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.

Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:

a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.

Using the profiler, we can find out that the grIdx line takes about 1/4 of the execution time (0.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).

Compare this with the period-indifferent case:

tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.

Is there a more efficient way to aggregate this data?

Putting each response in a separate function, here are the timing results (in seconds) I get with timeit in MATLAB R2015b on Windows 7 with an Intel i7:

    original | 1.32451
      felix1 | 0.35446
      felix2 | 0.16432
    divakar1 | 0.41905
    divakar2 | 0.30509
    divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977

Clarification on groupIndex

An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:

a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];

Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.
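To illustrate what the third output of unique provides, here is a minimal sketch on a tiny made-up groupIndex (the six year-week rows below are hypothetical and not taken from the question's data). Each row is mapped to the position of its period in the sorted list of distinct periods, which turns a 2D groupIndex into a single integer label per observation.

```matlab
% Hypothetical example: six daily observations labelled by (year, week).
groupIndex = [1980 1; 1980 1; 1980 2; 1980 2; 1981 1; 1981 1];

% groups holds one row per distinct (year, week) pair, in sorted order.
% index(k) is the row of groups that observation k belongs to.
[groups, ~, index] = unique(groupIndex, 'rows');

disp(groups)   % 3 distinct periods: [1980 1; 1980 2; 1981 1]
disp(index')   % 1 1 2 2 3 3
```

With index in hand, the multi-column grouping problem reduces to the single-column case.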

Recommended answer

A little late to the party, but a single loop using accumarray makes a huge difference:

function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)

  [groups, ~, index] = unique(groupIndex, 'rows');
  numCols = size(array, 2);
  aggArray = nan(size(groups, 1), numCols);  % one row per distinct period, also for a 2D groupIndex
  for col = 1:numCols
    aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
  end

end
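As a quick sanity check of this approach, the sketch below runs the same accumarray loop inline on a small made-up data set (the values of array and groupIndex are illustrative assumptions, not data from the question). The first two rows form one period and the last three form another, so the expected sums and means are easy to verify by hand.

```matlab
% Hypothetical data: two series (columns), five observations, two periods.
array      = [1 10; 2 20; 3 30; 4 40; 5 50];
groupIndex = [1; 1; 2; 2; 2];

% Map each observation to an integer period label.
[groups, ~, index] = unique(groupIndex, 'rows');
numCols = size(array, 2);

aggSum  = nan(size(groups, 1), numCols);
aggMean = nan(size(groups, 1), numCols);
for col = 1:numCols
    % accumarray applies the function to all values sharing the same label.
    aggSum(:, col)  = accumarray(index, array(:, col), [], @sum);
    aggMean(:, col) = accumarray(index, array(:, col), [], @mean);
end

disp(aggSum)    % [3 30; 12 120]
disp(aggMean)   % [1.5 15; 4 40]
```

Any function handle that maps a column vector to a scalar can be passed in place of @sum or @mean.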

Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:

original | 1.127141
 gnovice | 0.002205

That is a speedup of more than 500x!
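For readers who want to take a comparable measurement themselves, here is a rough sketch of how timeit can be applied to the accumarray step, using the sample data from the question. It times a single column only, and the variable names f and t are my own; absolute numbers will depend on the machine and MATLAB release, so no expected timing is shown.

```matlab
% Sample data from the question: 20006 observations, 10 series.
a = rand(20006, 10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

% Integer period label for every observation.
[~, ~, index] = unique(b, 'rows');

% timeit expects a function handle with no input arguments.
f = @() accumarray(index, a(:, 1), [], @sum);
t = timeit(f);

fprintf('accumarray on one column: %.6f s\n', t);
```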
