Problem description
I have 2 nested loops and I am trying to do a simple parallelisation.
pseudocode:
for item1 in data1 (~100 million rows)
    for item2 in data2 (~100 rows)
        result = process(item1, item2)   // a couple of if conditions
        hashset.add(result)              // while adding, in case of a duplicate I also decide which one to retain
process(item1, item2), to be precise, has 4 if conditions based on values in item1 and item2 (time taken is less than 50 ms).
data1 size is Nx17
data2 size is Nx17
result size is 1x17 (the result is joined into a string before it is added into the hashset)
max output size: unknown, but I would like to be ready for at least 500 million, which means the hashset would be holding 500 million items. (How to handle that much data in a hashset is probably a separate question.)
Should I just use a concurrent hashset to make it thread-safe and go with Parallel.ForEach, or should I go with the Task concept? Please provide some code samples based on your opinion.
The answer depends a lot on the cost of process(item1, item2). If it is a CPU-intensive operation, then you can surely benefit from Parallel.ForEach. Of course, you should then use a concurrent dictionary, or lock around your hash table, and you should benchmark to see what works best for you. If process has too little impact on performance, you will probably gain nothing from the parallelization - the locking on the hashtable will kill it all.
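As a concrete starting point, here is a minimal sketch of the Parallel.ForEach + ConcurrentDictionary route. The row types, the Process body, and the KeyOf/Retain helpers are hypothetical placeholders for the asker's real logic, not anything given in the question; ConcurrentDictionary.AddOrUpdate is used because it gives a thread-safe hook for deciding which duplicate to retain.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelDedup
{
    // Placeholder for the real process(): four if-conditions over item1/item2,
    // returning the 1x17 result joined into a single string.
    static string Process(string[] item1, string[] item2)
        => string.Join("|", item1[0], item2[0]);

    // Hypothetical: the part of the result string that defines a "duplicate".
    static string KeyOf(string result) => result;

    // Hypothetical retention rule: decide which of two duplicates to keep.
    static string Retain(string existing, string candidate) => existing;

    static IDictionary<string, string> Run(
        IEnumerable<string[]> data1, IReadOnlyList<string[]> data2)
    {
        var results = new ConcurrentDictionary<string, string>();

        // Parallelise over the ~100-million-row collection; the ~100-row
        // inner loop stays sequential inside each worker.
        Parallel.ForEach(data1, item1 =>
        {
            foreach (var item2 in data2)
            {
                var result = Process(item1, item2);
                // Thread-safe insert; on a duplicate key, Retain decides what to keep.
                results.AddOrUpdate(KeyOf(result), result,
                    (key, existing) => Retain(existing, result));
            }
        });

        return results;
    }
}

results.Values then holds the deduplicated rows. Whether this beats a plain HashSet under a lock depends on how contended the dictionary gets, so measure both.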
You should also try to see whether enumerating data2 in the outer loop is faster. It might give you another benefit - you can keep a separate hashtable for each row of data2 and then merge the results into one hashtable at the end. This avoids locks entirely.
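A minimal sketch of that lock-free variant, assuming data1 is materialized in memory so it can be scanned once per data2 row (Process is the same hypothetical placeholder as above):

using System.Collections.Generic;
using System.Threading.Tasks;

class PerPartitionDedup
{
    // Same hypothetical placeholder as in the previous sketch.
    static string Process(string[] item1, string[] item2)
        => string.Join("|", item1[0], item2[0]);

    static HashSet<string> Run(
        IReadOnlyList<string[]> data1, IReadOnlyList<string[]> data2)
    {
        // One private set per data2 row: no shared state inside the parallel loop.
        var partial = new HashSet<string>[data2.Count];

        Parallel.For(0, data2.Count, i =>
        {
            var local = new HashSet<string>();
            foreach (var item1 in data1)
                local.Add(Process(item1, data2[i]));
            partial[i] = local; // each slot is written by exactly one iteration
        });

        // Single-threaded merge; apply the duplicate-retention rule here if needed.
        var merged = new HashSet<string>();
        foreach (var set in partial)
            merged.UnionWith(set);
        return merged;
    }
}

With only ~100 rows in data2 this caps the parallelism at 100, and every worker re-scans all of data1, so it trades lock contention for repeated enumeration and extra memory; again, only a benchmark will tell which trade-off wins.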
Again, you need to run your own tests; there is no universal answer here.