Problem description
I have 2 nested loops and I am trying to do a simple parallelisation.
pseudocode:
for item1 in data1 (~100 million rows)
    for item2 in data2 (~100 rows)
        result = process(item1, item2)   // a couple of if conditions
        hashset.add(result)              // while adding, in case of a duplicate I also decide which one to retain
process(item1, item2), to be precise, has 4 if conditions based on values in item1 and item2 (time taken is less than 50 ms).
data1 size is Nx17
data2 size is Nx17
result size is 1x17 (the result is joined into a string before it is added into the hashset)
max output size: unknown, but I would like to be ready for at least 500 million, which means the hashset would be holding 500 million items. (How to handle that much data in a hashset is probably a separate question.)
Should I just use a concurrent hashset to make it thread-safe and go with Parallel.ForEach, or should I go with the Task concept? Please provide some code samples based on your opinion.
The answer depends a lot on the cost of process(item1, item2). If it is a CPU-intensive operation, then you can surely benefit from Parallel.ForEach. Of course, you should then use a concurrent dictionary, or lock around your hash table, and you should benchmark to see what works best for you. If process has too little impact on performance, you will probably gain nothing from the parallelization - the locking on the hashtable will kill it all.
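As a concrete starting point, here is a minimal sketch of the Parallel.ForEach + ConcurrentDictionary route. The row types, the Process body, and the KeyOf/Retain helpers are hypothetical placeholders for the asker's real logic, not anything given in the question; ConcurrentDictionary.AddOrUpdate is used because it gives a thread-safe hook for deciding which duplicate to retain.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelDedup
{
    // Placeholder for the real process(): four if-conditions over item1/item2,
    // returning the 1x17 result joined into a single string.
    static string Process(string[] item1, string[] item2)
        => string.Join("|", item1[0], item2[0]);

    // Hypothetical: the part of the result string that defines a "duplicate".
    static string KeyOf(string result) => result;

    // Hypothetical retention rule: decide which of two duplicates to keep.
    static string Retain(string existing, string candidate) => existing;

    static IDictionary<string, string> Run(
        IEnumerable<string[]> data1, IReadOnlyList<string[]> data2)
    {
        var results = new ConcurrentDictionary<string, string>();

        // Parallelise over the ~100-million-row collection; the ~100-row
        // inner loop stays sequential inside each worker.
        Parallel.ForEach(data1, item1 =>
        {
            foreach (var item2 in data2)
            {
                var result = Process(item1, item2);
                // Thread-safe insert; on a duplicate key, Retain decides what to keep.
                results.AddOrUpdate(KeyOf(result), result,
                    (key, existing) => Retain(existing, result));
            }
        });

        return results;
    }
}

results.Values then holds the deduplicated rows. Whether this beats a plain HashSet under a lock depends on how contended the dictionary gets, so measure both.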
You should also try to see whether enumerating data2 in the outer loop is faster. It might give you another benefit - you can keep a separate hashtable for each row of data2 and then merge the results into one hashtable at the end. This avoids locks entirely.
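A minimal sketch of that lock-free variant, assuming data1 is materialized in memory so it can be scanned once per data2 row (Process is the same hypothetical placeholder as above):

using System.Collections.Generic;
using System.Threading.Tasks;

class PerPartitionDedup
{
    // Same hypothetical placeholder as in the previous sketch.
    static string Process(string[] item1, string[] item2)
        => string.Join("|", item1[0], item2[0]);

    static HashSet<string> Run(
        IReadOnlyList<string[]> data1, IReadOnlyList<string[]> data2)
    {
        // One private set per data2 row: no shared state inside the parallel loop.
        var partial = new HashSet<string>[data2.Count];

        Parallel.For(0, data2.Count, i =>
        {
            var local = new HashSet<string>();
            foreach (var item1 in data1)
                local.Add(Process(item1, data2[i]));
            partial[i] = local; // each slot is written by exactly one iteration
        });

        // Single-threaded merge; apply the duplicate-retention rule here if needed.
        var merged = new HashSet<string>();
        foreach (var set in partial)
            merged.UnionWith(set);
        return merged;
    }
}

With only ~100 rows in data2 this caps the parallelism at 100, and every worker re-scans all of data1, so it trades lock contention for repeated enumeration and extra memory; again, only a benchmark will tell which trade-off wins.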
Again, you need to run your own tests; there is no universal answer here.