An efficient way to extract and collect a random subsample of a generator in Julia

Problem description

Consider a generator in Julia that would take a lot of memory if collected:

g=(x^2 for x=1:9999999999999999)

I want to take a small random subsample of it (say 1%), but I do not want to collect() the whole object because that would take a lot of memory.

Until now, the trick I have been using is this:

temp=collect((( rand()>0.01 ? nothing : x ) for x in g))
random_sample= temp[temp.!=nothing]

But this is not efficient for generators with a lot of elements; collecting something with so many nothing elements doesn't seem right.

Any ideas are highly appreciated. I guess the trick is to get random elements from the generator without having to allocate memory for all of it.

Thank you very much.

Recommended answer

You can use a generator with an if condition, like this:

[v for v in g if rand() < 0.01]
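As a rough illustration of the filtered comprehension above, the sketch below runs it on a much smaller stand-in generator. The name `g_small`, the range size, and the seed are assumptions made only for this example. Note that the number of elements kept is itself random: each element is kept independently, so the result holds about 1% of the input rather than exactly 1%.

```julia
using Random

Random.seed!(42)                       # arbitrary seed, only to make the run repeatable
g_small = (x^2 for x in 1:1_000_000)   # small stand-in for the huge generator

# Elements are filtered as they are produced; only the kept ones are stored.
sample = [v for v in g_small if rand() < 0.01]

println(length(sample))                # close to 10_000, but not exactly
```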

Or, if you want a somewhat faster but more verbose approach (I have hardcoded 0.01 and the element type of g, and I assume that your generator supports length; otherwise you can remove the sizehint! line):

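function collect_sample(g)
    r = Int[]  # the element type of g is hardcoded as Int
    # reserve space for roughly 1% of the input; remove this line if g has no length
    sizehint!(r, round(Int, length(g) * 0.01))
    for v in g
        # keep each element independently with probability 0.01
        if rand() < 0.01
            push!(r, v)
        end
    end
    r
end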
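If the hardcoded values get in the way, one possible generalization is sketched below. The function name `collect_sample_generic` and its parameters `T` (element type) and `p` (keep probability) are hypothetical and not part of the original answer. It calls sizehint! only when the iterator reports a known length, using the `Base.IteratorSize` trait.

```julia
# Hypothetical generalization of the function above: the caller supplies the
# element type T and the keep probability p instead of hardcoding them.
function collect_sample_generic(::Type{T}, g, p::Real) where {T}
    r = T[]
    # Only reserve space when the iterator knows its length.
    if Base.IteratorSize(g) isa Union{Base.HasLength, Base.HasShape}
        sizehint!(r, round(Int, length(g) * p))
    end
    for v in g
        # keep each element independently with probability p
        rand() < p && push!(r, v)
    end
    return r
end

s = collect_sample_generic(Int, (x^2 for x in 1:100_000), 0.01)
println(length(s))   # around 1_000
```

The element type is passed explicitly because `eltype` of a generator is `Any`, so it cannot be inferred from `g` alone.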

Edit

Here are examples of a self-avoiding sampler and a reservoir sampler, both of which give you a fixed output size. The smaller the fraction of the input you want, the better it is to use the self-avoiding sampler. Note that it does not take the generator itself: it takes the source size and a function ith that returns the element at a given index.

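function self_avoiding_sampler(source_size, ith, target_size)
    # requires target_size <= source_size, otherwise the while loop never ends
    rng = 1:source_size
    idx = rand(rng)
    x1 = ith(idx)
    # the first sampled element determines the output element type
    r = Vector{typeof(x1)}(undef, target_size)
    r[1] = x1
    s = Set{Int}(idx)  # indices already used
    sizehint!(s, target_size)
    for i = 2:target_size
        # redraw until we hit an index that has not been used yet
        while idx in s
            idx = rand(rng)
        end
        @inbounds r[i] = ith(idx)
        push!(s, idx)
    end
    r
end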

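function reservoir_sampler(g, target_size)
    r = Vector{Int}(undef, target_size)  # element type hardcoded as Int
    for (i, v) in enumerate(g)
        if i <= target_size
            # fill the reservoir with the first target_size elements
            @inbounds r[i] = v
        else
            # element i replaces a random slot with probability target_size / i
            j = rand(1:i)
            if j <= target_size
                @inbounds r[j] = v
            end
        end
    end
    r
end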
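The reservoir sampler above hardcodes the element type as Int. A minimal sketch of a type-generic variant follows; the name `reservoir_sample_generic` is hypothetical. It takes the element type from the first element of the iterator and assumes that all elements share that type. It also trims the result when the input has fewer than `target_size` elements.

```julia
# Hypothetical type-generic reservoir sampler; assumes all elements of g
# have the same type as the first one.
function reservoir_sample_generic(g, target_size::Integer)
    next = iterate(g)
    next === nothing && error("cannot sample from an empty iterator")
    v, state = next
    r = Vector{typeof(v)}(undef, target_size)
    i = 0
    while next !== nothing
        v, state = next
        i += 1
        if i <= target_size
            r[i] = v                           # fill the reservoir first
        else
            j = rand(1:i)                      # uniform index in 1:i
            j <= target_size && (r[j] = v)     # replace with probability target_size / i
        end
        next = iterate(g, state)
    end
    i < target_size && resize!(r, i)           # input was shorter than the reservoir
    return r
end

s = reservoir_sample_generic((x^2 for x in 1:10_000), 100)
println(length(s))
```

Like the original, this makes a single pass over the input and never stores more than `target_size` elements.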

