问题描述
我有以下问题.
我正在开发一个随机模拟器,它随机对系统配置进行采样,并存储在特定时间实例中每个配置被访问的次数的统计信息.代码大致是这样的
I am developing a stochastic simulator which samples configurations of the system at random and stores the statistics of how many times each configuration has been visited at certain time instances. Roughly the code works like this
f[_Integer][{_Integer..}] :=0
...
someplace later in the code, e.g.,
index = get index;
c = get random configuration (i.e. a tuple of integers, say a pair {n1, n2});
f[index][c] = f[index][c] + 1;
which tags that configuration c has occurred once more in the simulation at time instance index.
一旦代码完成,就会有一个 f 的定义列表,看起来像这样(我手工输入只是为了强调最重要的部分)
Once the code has finished there is a list of definitions for f that looks something like this (I typed it by hand just to emphasize the most important parts)
?f
f[1][{1, 2}] = 112
f[1][{3, 4}] = 114
f[2][{1, 6}] = 216
f[2][{2, 7}] = 227
...
f[index][someconfiguration] = some value
...
f[_Integer][{_Integer..}] :=0
请注意,首先出现的无模式定义可能相当稀疏.也无法知道将选择哪些值和配置.
Please note that pattern free definitions that come first can be rather sparse. Also one cannot know which values and configurations will be picked.
问题是有效地提取所需索引的值,例如发出类似
The problem is to efficiently extract down values for a desired index, for example issue something like
result = ExtractConfigurationsAndOccurences[f, 2]
应该给出一个结构列表
result = {list1, list2}
哪里
list1 = {{1, 6}, {2, 7}} (* the list of configurations that occurred during the simulation*)
list2 = {216, 227} (* how many times each of them occurred *)
问题是 ExtractConfigurationsAndOccurences 应该非常快.我能想到的唯一解决方案是使用 SubValues[f](它提供完整列表)并使用 Cases
语句对其进行过滤.我意识到应该不惜一切代价避免这个过程,因为要测试的配置(定义)会成倍增加,这会大大减慢代码的速度.
The problem is that ExtractConfigurationsAndOccurences should be very fast. The only solution I could come up with was to use SubValues[f] (which gives the full list) and filter it with Cases
statement. I realize that this procedure should be avoided at any cost since there will be exponentially many configurations (definitions) to test, which slows down the code considerably.
在 Mathematica 中是否有一种自然的方法可以快速地做到这一点?
Is there a natural way in Mathematica to do this in a fast way?
我希望 Mathematica 将 f[2] 视为具有许多向下值的单个头,但使用 DownValues[f[2]] 什么也没有.同样使用 SubValues[f[2]] 会导致错误.
I was hoping that Mathematica would see f[2] as a single head with many down values but using DownValues[f[2]] gives nothing. Also using SubValues[f[2]] results in an error.
推荐答案
这是对我之前答案的完全重写.事实证明,在我之前的尝试中,我忽略了一种基于打包数组和稀疏数组组合的更简单的方法,它比所有以前的方法更快,内存更多 - 效率更高(至少在我使用的样本大小范围内)对其进行了测试),同时只对原始的基于 SubValues
的方法进行了最小程度的更改.由于这个问题是关于最有效的方法,我将从答案中删除其他的(鉴于它们相当复杂并占用大量空间.那些想要看到它们的人可以检查这个的过去修订版回答).
This is a complete rewrite of my previous answer. It turns out that in my previous attempts, I overlooked a much simpler method based on a combination of packed arrays and sparse arrays, that is much faster and more memory - efficient than all previous methods (at least in the range of sample sizes where I tested it), while only minimally changing the original SubValues
- based approach. Since the question was asked about the most efficient method, I will remove the other ones from the answer (given that they are quite a bit more complex and take a lot of space. Those who would like to see them can inspect past revisions of this answer).
我们首先引入一个函数来为我们生成配置的测试样本.这是:
We start by introducing a function to generate the test samples of configurations for us. Here it is:
Clear[generateConfigurations];
generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer,
nconfs_Integer] :=
Transpose[{
RandomInteger[{1, maxIndex}, nconfs],
Transpose[{
RandomInteger[{1, maxConfX}, nconfs],
RandomInteger[{1, maxConfY}, nconfs]
}]}];
我们可以生成一个小样本来说明:
We can generate a small sample to illustrate:
In[3]:= sample = generateConfigurations[2,2,2,10]
Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}},
{1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}}
我们这里只有 2 个索引和配置,其中x"都是和y"数字从 1 到 2 不等 - 10 个这样的配置.
We have here only 2 indices, and configurations where both "x" and "y" numbers vary from 1 to 2 only - 10 such configurations.
以下函数将帮助我们模拟配置频率的累积,因为我们为重复出现的计数器增加基于 SubValues
的计数器:
The following function will help us imitate the accumulation of frequencies for configurations, as we increment SubValues
-based counters for repeatedly occurring ones:
Clear[testAccumulate];
testAccumulate[ff_Symbol, data_] :=
Module[{},
ClearAll[ff];
ff[_][_] = 0;
Do[
doSomeStuff;
ff[#1][#2]++ & @@ elem;
doSomeMoreStaff;
, {elem, data}]];
doSomeStuff
和 doSomeMoreStaff
符号在这里代表一些可能排除或跟随计数代码的代码.data
参数应该是由 generateConfigurations
生成的表单列表.例如:
The doSomeStuff
and doSomeMoreStaff
symbols are here to represent some code that might preclude or follow the counting code. The data
parameter is supposed to be a list of the form produced by generateConfigurations
. For example:
In[6]:=
testAccumulate[ff,sample];
SubValues[ff]
Out[7]= {HoldPattern[ff[1][{1,2}]]:>2,HoldPattern[ff[1][{2,1}]]:>3,
HoldPattern[ff[1][{2,2}]]:>1,HoldPattern[ff[2][{1,1}]]:>1,
HoldPattern[ff[2][{1,2}]]:>1,HoldPattern[ff[2][{2,1}]]:>1,
HoldPattern[ff[2][{2,2}]]:>1,HoldPattern[ff[_][_]]:>0}
以下函数将从 SubValues
列表中提取结果数据(指数、配置及其频率):
The following function will extract the resulting data (indices, configurations and their frequencies) from the list of SubValues
:
Clear[getResultingData];
getResultingData[f_Symbol] :=
Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &@
Most@SubValues[f, Sort -> False];
例如:
In[10]:= result = getResultingData[ff]
Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1},
{2,{2,2},1},{1,{2,2},1}}
为了完成数据处理循环,这里有一个简单的函数来为固定索引提取数据,基于Select
:
To finish with the data-processing cycle, here is a straightforward function to extract data for a fixed index, based on Select
:
Clear[getResultsForFixedIndex];
getResultsForFixedIndex[data_, index_] :=
If[# === {}, {}, Transpose[#]] &[
Select[data, First@# == index &][[All, {2, 3}]]];
对于我们的测试示例,
In[13]:= getResultsForFixedIndex[result,1]
Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}}
这大概与@zorank 在代码中尝试过的很接近.
This is presumably close to what @zorank tried, in code.
正如@zorank 所指出的,对于具有更多索引和配置的更大样本,这会变得很慢.我们现在将生成一个大样本来说明这一点(注意!这需要大约 4-5 Gb 的 RAM,因此如果超过可用 RAM,您可能需要减少配置数量)::>
As @zorank noted, this becomes slow for larger sample with more indices and configurations. We will now generate a large sample to illustrate that (note! This requires about 4-5 Gb of RAM, so you may want to reduce the number of configurations if this exceeds the available RAM):
In[14]:=
largeSample = generateConfigurations[20,500,500,5000000];
testAccumulate[ff,largeSample];//Timing
Out[15]= {31.89,Null}
我们现在将从 ff
的 SubValues
中提取完整数据:
We will now extract the full data from the SubValues
of ff
:
In[16]:= (largeres = getResultingData[ff]); // Timing
Out[16]= {10.844, Null}
这需要一些时间,但您只需要做一次.但是当我们开始为固定索引提取数据时,我们发现它很慢:
This takes some time, but one has to do this only once. But when we start extracting data for a fixed index, we see that it is quite slow:
In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing
Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
{256,38},{352,469}},{<<1>>}}}
我们在这里使用的主要思想是将单个列表打包在 largeres
中,用于索引、组合和频率.虽然无法打包完整列表,但这些部分可以:
The main idea we will use here to speed it up is to pack individual lists inside the largeres
, those for indices, combinations and frequencies. While the full list can not be packed, those parts individually can:
In[18]:= Timing[
subIndicesPacked = Developer`ToPackedArray[largeres[[All,1]]];
subCombsPacked = Developer`ToPackedArray[largeres[[All,2]]];
subFreqsPacked = Developer`ToPackedArray[largeres[[All,3]]];
]
Out[18]= {1.672,Null}
这也需要一些时间,但又是一次性操作.
This also takes some time, but it is a one-time operation again.
然后将使用以下函数更有效地提取固定索引的结果:
The following functions will then be used to extract the results for a fixed index much more efficiently:
Clear[extractPositionFromSparseArray];
extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]
Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_,
packedFreqs_, index_Integer] :=
With[{positions =
extractPositionFromSparseArray[
SparseArray[1 - Unitize[packedIndices - index]]]},
{Extract[packedCombs, positions],Extract[packedFreqs, positions]}];
现在,我们有:
In[25]:=
getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10]
//Short//Timing
Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
{256,38},{352,469}},{<<1>>}}}
我们获得了 30 倍的加速 w.r.t.天真的 Select
方法.
We get a 30 times speed-up w.r.t. the naive Select
approach.
请注意,第二种解决方案更快,因为它使用了优化的数据结构,但其复杂度与基于 Select
的解决方案相同,即与唯一性列表的总长度成线性关系所有索引的组合.因此,理论上,之前讨论的基于嵌套哈希表等的解决方案 可能渐近更好.问题是,在实践中,我们很可能会在此之前很久就遇到内存限制.对于 1000 万配置示例,上面的代码仍然比我之前发布的最快解决方案快 2-3 倍.
Note that the second solution is faster because it uses optimized data structures, but its complexity is the same as that of Select
- based one, which is, linear in the length of total list of unique combinations for all indices. Therefore, in theory, the previously - discussed solutions based on nested hash-table etc may be asymptotically better. The problem is, that in practice we will probably hit the memory limitations long before that. For the 10 million configurations sample, the above code was still 2-3 times faster than the fastest solution I posted before.
编辑
修改如下:
Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_,
packedFreqs_, index_Integer] :=
With[{positions =
extractPositionFromSparseArray[
SparseArray[Unitize[packedIndices - index], Automatic, 1]]},
{Extract[packedCombs, positions], Extract[packedFreqs, positions]}];
使代码仍然快两倍.此外,对于更稀疏的索引(例如,使用 generateConfigurations[2000, 500, 500, 5000000]
之类的参数调用样本生成函数),相对于 Select- 的函数大约是 100 次.
makes the code twice faster still. Moreover, for more sparse indices (say, calling the sample-generation function with parameters like generateConfigurations[2000, 500, 500, 5000000]
), the speed-up with respect to the Select
- based function is about 100 times.
这篇关于从稀疏定义列表中挑选无模式下值的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!