Problem description
Consider a generator in Julia that would take a lot of memory if collected:
g=(x^2 for x=1:9999999999999999)
I want to take a small random subsample of it (say 1%), but I do not want to collect() the object because that would take a lot of memory.
Until now, the trick I have been using is this:
temp=collect((( rand()>0.01 ? nothing : x ) for x in g))
random_sample= temp[temp.!=nothing]
But this is not efficient for generators with a lot of elements: collecting something that consists mostly of nothing values doesn't seem right.
Any ideas are highly appreciated. I guess the trick is to be able to get random elements from the generator without having to allocate memory for all of it.
Thank you very much.
Recommended answer
You can use a generator with an if condition, like this:
[v for v in g if rand() < 0.01]
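If even the 1% result should not be materialized all at once, the same idea can be kept lazy. The following is a minimal sketch, assuming Julia 1.3 or later (for Random.default_rng); lazy_sample is a hypothetical helper name, not a library function. It wraps the generator in Iterators.filter, so elements are only produced while the result is being consumed.

```julia
using Random

# Hypothetical helper: lazily keep each element of `g` with probability `p`.
lazy_sample(g, p; rng = Random.default_rng()) =
    Iterators.filter(_ -> rand(rng) < p, g)

g = (x^2 for x in 1:1_000_000)

s = lazy_sample(g, 0.01)                     # nothing has been computed yet
first_few = collect(Iterators.take(s, 5))    # pulls only as many elements as needed
```

Each pass over the lazy iterator draws fresh random numbers, so two passes over s generally yield different samples.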
Or, if you want a somewhat faster but more verbose approach (I have hardcoded 0.01 and the element type of g, and I assume that your generator supports length; otherwise you can remove the sizehint! line):
function collect_sample(g)
    r = Int[]
    sizehint!(r, round(Int, length(g) * 0.01))
    for v in g
        if rand() < 0.01
            push!(r, v)
        end
    end
    r
end
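The hardcoded probability and element type can be turned into arguments. The following is a sketch under stated assumptions rather than a definitive implementation: bernoulli_sample is a hypothetical name, the element type T must be supplied by the caller, and the Base.IteratorSize trait is used to decide whether length(g) is available before calling sizehint!.

```julia
# Hypothetical generalization: probability `p` and element type `T` are arguments.
function bernoulli_sample(g, p, ::Type{T} = Int) where {T}
    r = T[]
    # Reserve space only when the iterator reports a known length.
    if Base.IteratorSize(g) isa Union{Base.HasLength, Base.HasShape}
        sizehint!(r, round(Int, length(g) * p))
    end
    for v in g
        if rand() < p
            push!(r, v)
        end
    end
    return r
end

# Generator with a known length.
squares = bernoulli_sample((x^2 for x in 1:100_000), 0.01)

# Filtered generator: its length is unknown, so the sizehint! step is skipped.
evens = bernoulli_sample((x for x in 1:100_000 if iseven(x)), 0.01)
```

The output size is random (about p times the input size on average), which is the main difference from the fixed-size samplers.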
Edit
Here are examples of a self-avoiding sampler and a reservoir sampler, both of which give you a fixed output size. The smaller the fraction of the input you want to get, the better it is to use the self-avoiding sampler:
# ith(i) must return the i-th element of the source; target_size must not exceed source_size
function self_avoiding_sampler(source_size, ith, target_size)
    rng = 1:source_size
    idx = rand(rng)
    x1 = ith(idx)
    r = Vector{typeof(x1)}(undef, target_size)
    r[1] = x1
    s = Set{Int}(idx)          # indices drawn so far
    sizehint!(s, target_size)
    for i = 2:target_size
        while idx in s         # redraw until an unused index is found
            idx = rand(rng)
        end
        @inbounds r[i] = ith(idx)
        push!(s, idx)
    end
    r
end
function reservoir_sampler(g, target_size)
    r = Vector{Int}(undef, target_size)
    for (i, v) in enumerate(g)
        if i <= target_size
            # fill the reservoir with the first target_size elements
            @inbounds r[i] = v
        else
            # keep element i with probability target_size / i
            j = rand(1:i)
            if j <= target_size
                @inbounds r[j] = v
            end
        end
    end
    r
end
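To show how the two fixed-size strategies are called, here is a self-contained sketch. The names reservoir_sample and sample_by_index are hypothetical compact variants written for this illustration, not the functions above: the reservoir variant takes the element type as an argument and returns fewer than k items when the input is shorter than k, and the index-based variant assumes you can compute the i-th element directly (for the generator in the question, i -> i^2).

```julia
# Reservoir sampling: one pass over `itr`, O(k) memory.
function reservoir_sample(::Type{T}, itr, k::Integer) where {T}
    r = T[]
    sizehint!(r, k)
    for (i, v) in enumerate(itr)
        if i <= k
            push!(r, v)              # fill the reservoir first
        else
            j = rand(1:i)            # uniform position among the i items seen so far
            if j <= k                # happens with probability k / i
                r[j] = v
            end
        end
    end
    return r                         # shorter than k if itr had fewer than k items
end

# Index-based sampling: draw k distinct indices from 1:n, then evaluate only those.
function sample_by_index(ith, n::Integer, k::Integer)
    k <= n || throw(ArgumentError("cannot draw $k distinct indices from 1:$n"))
    seen = Set{Int}()
    idxs = Int[]
    while length(idxs) < k
        i = rand(1:n)
        if !(i in seen)              # skip indices that were already drawn
            push!(seen, i)
            push!(idxs, i)
        end
    end
    return map(ith, idxs)
end

# Reservoir: has to walk the whole generator, so keep the input modest here.
res = reservoir_sample(Int, (x^2 for x in 1:10_000), 100)

# Index-based: never iterates the source; n is kept at 10^9 so i^2 fits in Int64.
picked = sample_by_index(i -> i^2, 10^9, 50)
```

The reservoir version works for any iterator but costs time proportional to the input size, while the index-based version costs time roughly proportional to k and therefore suits very large sources with cheap random access.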