Question
I've got the following situation: I have a big array of random strings, and this array should be made unique as fast as possible.
Through some benchmarking I found out that Ruby's uniq is quite slow:
require 'digest'
require 'benchmark'

# make a nice random array of strings, duplicated once over
list = (1..100000).to_a.map(&:to_s).map { |e| Digest::SHA256.hexdigest(e) }
list += list
list.shuffle!  # shuffle returns a new array and discards it; shuffle! mutates in place

def hash_uniq(a)
  a_hash = {}
  a.each do |v|
    a_hash[v] = nil
  end
  a_hash.keys
end

Benchmark.bm do |x|
  x.report(:uniq)      { 100.times { list.uniq } }
  x.report(:hash_uniq) { 100.times { hash_uniq(list) } }
end
Gist ->
The results are quite interesting. Could it be that Ruby's uniq is quite slow?
user system total real
uniq 23.750000 0.040000 23.790000 ( 23.823770)
hash_uniq 18.560000 0.020000 18.580000 ( 18.591803)
Now to my questions:

- Are there any faster ways to make an array unique?
- Am I doing something wrong?
- Is there something wrong with the Array#uniq method?
I am using ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]
Accepted answer
String parsing operations on large data sets are certainly not where Ruby shines. If this is business critical, you might want to write an extension in something like C or Go, or let another application handle this before passing the data to your Ruby application.
That said, there seems to be something strange with your benchmark. Running the same code on my MacBook Pro using Ruby 2.2.3 renders the following result:
user system total real
uniq 10.300000 0.110000 10.410000 ( 10.412513)
hash_uniq 11.660000 0.210000 11.870000 ( 11.901917)
This suggests that uniq is actually slightly faster.
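One common source of such machine-to-machine discrepancies is measurement noise, e.g. garbage-collection state bleeding from one report into the next. As a hedged sketch (the smaller array here is a stand-in for the original list, not part of the original post), the standard library's Benchmark.bmbm runs a rehearsal pass before the measured run to reduce exactly that kind of distortion:

```ruby
require 'benchmark'

# Smaller stand-in data set: 10,000 strings with 5,000 distinct values
list = (1..10_000).map { |i| (i % 5_000).to_s }

# bmbm runs each report twice: a rehearsal pass to warm up and settle
# the GC, then the pass whose timings are actually printed.
Benchmark.bmbm do |x|
  x.report(:uniq)      { 100.times { list.uniq } }
  x.report(:hash_uniq) { 100.times { list.each_with_object({}) { |v, h| h[v] = nil }.keys } }
end
```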
If possible, you should always try to work with the right collection types. If your collection is truly unique, then use a Set. Sets feature a better memory profile and the faster lookup speeds of Hash, while retaining some of the Array intuition.
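As a minimal sketch of that approach, using only the standard library's Set class, de-duplication and membership checks look like this:

```ruby
require 'set'

# Inserting into a Set silently ignores duplicates
seen = Set.new
["a", "b", "a", "c"].each { |s| seen << s }

seen.size          # 3 -- only the unique elements are kept
seen.include?("b") # true, with Hash-like constant-time lookup
seen.to_a          # convert back to an Array when needed
```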
If your data is already in an Array, however, this might not be a good tradeoff, as insertion into a Set is rather slow as well, as you can see here:
user system total real
uniq 11.040000 0.060000 11.100000 ( 11.102644)
hash_uniq 12.070000 0.230000 12.300000 ( 12.319356)
set_insertion 12.090000 0.200000 12.290000 ( 12.294562)
Where I added the following benchmark:
x.report(:set_insertion) { 100.times { Set.new(list) } }
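A practical consequence of the two points above, sketched under the assumption that you control the code producing the strings (this producer loop is hypothetical, not from the original post): if the strings arrive one at a time, inserting into a Set as you go keeps the collection unique from the start and avoids a separate de-duplication pass at the end.

```ruby
require 'set'
require 'digest'

# Hypothetical incremental producer: collect into a Set from the start
# instead of filling an Array and calling uniq afterwards.
unique = Set.new
(1..1000).each do |i|
  digest = Digest::SHA256.hexdigest((i % 500).to_s) # % 500 forces duplicates
  unique << digest                                  # duplicate inserts are no-ops
end

unique.size # 500 distinct digests
```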