What is the fastest way to make an array unique?


Problem description

I've got the following situation. I have got a big array of random strings. This array should be made unique as fast as possible.

Now through some Benchmarking I found out that ruby's uniq is quite slow:

require 'digest'
require 'benchmark'

# Make a nice random array of strings, doubled so it contains duplicates
list = (1..100000).to_a.map(&:to_s).map { |e| Digest::SHA256.hexdigest(e) }
list += list
list.shuffle!  # shuffle returns a copy and discards it; shuffle! mutates in place

def hash_uniq(a)
  a_hash = {}
  a.each do |v|
    a_hash[v] = nil
  end
  a_hash.keys
end

Benchmark.bm do |x|
  x.report(:uniq) { 100.times { list.uniq} }
  x.report(:hash_uniq) { 100.times { hash_uniq(list) } }
end


The results are quite interesting. Could it be that ruby's uniq is quite slow?

          user     system      total        real
uniq      23.750000   0.040000  23.790000 ( 23.823770)
hash_uniq 18.560000   0.020000  18.580000 ( 18.591803)

Now my questions:


1. Are there any faster ways to make an array unique?

2. Am I doing something wrong?

3. Is there something wrong in the Array.uniq method?

I am using ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]

Recommended answer

String parsing operations on large data sets are certainly not where Ruby shines. If this is business critical, you might want to write an extension in something like C or Go, or let another application handle this before passing it to your Ruby application.

That said. There seems to be something strange with your benchmark. Running the same on my MacBook Pro using Ruby 2.2.3 renders the following result:

          user        system    total     real
uniq      10.300000   0.110000  10.410000 ( 10.412513)
hash_uniq 11.660000   0.210000  11.870000 ( 11.901917)

This suggests that uniq is actually a bit faster.

If possible, you should always try to work with the right collection types. If your collection is truly unique, then use a Set. Sets feature a better memory profile and the faster lookup speed of Hash, while retaining some of the Array intuition.
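As a quick illustration (my addition, not part of the original answer), here is a minimal sketch of the Set API in use; the values are purely illustrative:

```ruby
require 'set'

# Build the set as elements arrive; duplicate insertions are ignored.
seen = Set.new
["a", "b", "a", "c"].each { |s| seen << s }

seen.size          # 3 -- duplicates were dropped on insertion
seen.include?("b") # true -- Hash-backed membership test
seen.to_a          # convert back to an Array when one is needed
```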

If your data is already in an Array, however, this might not be a good tradeoff, as insertion into Set is rather slow as well, as you can see here:

              user        system    total     real
uniq          11.040000   0.060000  11.100000 ( 11.102644)
hash_uniq     12.070000   0.230000  12.300000 ( 12.319356)
set_insertion 12.090000   0.200000  12.290000 ( 12.294562)

Where I added the following benchmark:

require 'set'  # the Set class is not loaded by default
x.report(:set_insertion) { 100.times { Set.new(list) } }
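As a side note (my addition, not from the original answer), a small sanity check confirming that all three approaches agree on the unique elements; the list is shrunk here so it runs quickly:

```ruby
require 'set'
require 'digest'

# Same construction as the benchmark, at a smaller scale.
list = (1..1000).to_a.map(&:to_s).map { |e| Digest::SHA256.hexdigest(e) }
list += list
list.shuffle!

def hash_uniq(a)
  a_hash = {}
  a.each { |v| a_hash[v] = nil }
  a_hash.keys
end

# All three should return the same set of unique strings.
p list.uniq.sort == hash_uniq(list).sort     # true
p list.uniq.sort == Set.new(list).to_a.sort  # true
```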
