If I have two datasets (with an equal number of rows and columns) and I wish to run a piece of code that I have written on both, there are obviously two options: sequential execution or parallel programming.
Now, the algorithm (code) that I have written is a big one and consists of multiple for loops. Is there any way to use it directly on both datasets, or will I have to transform the code in some way? A heads up would be great.
To answer your question: you do not have to transform the code to run it on two datasets in parallel; it should work fine as it is.
The need for parallel processing usually arises in two ways (for most users, I would imagine):
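- You have code that you can run sequentially, but you would like to run it in parallel.
- You have a function that takes a very long time to execute on a large dataset, and you would like to run it in parallel to speed things up.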
For the first case, you do not have to do anything special. You can execute it in parallel using one of the libraries designed for it, or just run two instances of R on the same computer and run the same code with a different dataset in each of them. It doesn't matter how many for loops you have in there, and you don't even need to have the same number of rows or columns in the datasets. If the code runs fine sequentially on each dataset by itself, there is no dependence between the parallel chains and thus no problem. Since your question falls in the first case, you can run it in parallel.
If you have the second case, you can sometimes turn it into the first case by splitting your dataset into pieces (where each piece can be processed sequentially) and then running the pieces in parallel. This is easier said than done, and won't always be possible. It is also why not all functions just have a run.in.parallel=TRUE option: it is not always obvious how you should split the data, nor is it always possible.
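As a rough illustration of turning the second case into the first, here is a minimal sketch that splits one data frame into row chunks, processes the chunks on separate workers, and stitches the results back together. The function slow_row_sums and the data frame big_data are made up for the example, and it uses the base parallel package rather than any particular library you may prefer. It only works this simply because each row is processed independently of the others.

```r
library(parallel)  # ships with base R

# Hypothetical stand-in for a slow computation that treats rows independently.
slow_row_sums <- function(df) {
  data.frame(id = df$id, total = df$x + df$y)
}

# Toy dataset, made up for this sketch.
big_data <- data.frame(id = 1:100, x = rnorm(100), y = rnorm(100))

n_chunks <- 2
# Assign each row to one of n_chunks contiguous groups, then split by group.
chunk_id <- cut(seq_len(nrow(big_data)), breaks = n_chunks, labels = FALSE)
chunks <- split(big_data, chunk_id)

# One worker per chunk; each worker runs the sequential function on its piece.
cl <- makeCluster(n_chunks)
pieces <- parLapply(cl, chunks, slow_row_sums)
stopCluster(cl)

# Put the per-chunk results back together in the original row order.
combined <- do.call(rbind, pieces)
rownames(combined) <- NULL
```

If the computation for one row needed values from other rows (a running total, say), this kind of split would give wrong answers, which is the "won't always be possible" part.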
So you have already done most of the work by writing the functions and splitting the data. Here is a general way of doing parallel processing with one function, on two datasets:
library(doParallel)  # also loads foreach and parallel
cl <- makeCluster(2)  # 2 worker processes, i.e. 2 parallel chains
registerDoParallel(cl)
datalist <- list(mydataset1, mydataset2)
# now start the chains, one per dataset
nchains <- length(datalist)  # 2 here
results_list <- foreach(i = 1:nchains,
                        .packages = c('packages_you_need')) %dopar% {
  # find.string stands for the function you wrote
  find.string(datalist[[i]])
}
stopCluster(cl)  # shut the workers down when you are done
The result will be a list with two elements, each containing the results from one chain. You can then combine them as you wish, or use a .combine function. See the foreach help for details.
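To give an idea of what .combine does, here is a small sketch with two toy datasets made up for the example. Each chain returns a one-row data frame, and .combine = rbind stacks them into a single table instead of returning a list. It assumes the foreach and doParallel packages are installed.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)
registerDoParallel(cl)

# Two toy datasets, made up for this sketch.
datalist <- list(data.frame(x = 1:3), data.frame(x = 4:6))

# Each chain returns a one-row data frame; .combine = rbind stacks the rows.
summary_table <- foreach(i = seq_along(datalist), .combine = rbind) %dopar% {
  data.frame(chain = i, total = sum(datalist[[i]]$x))
}

stopCluster(cl)
```

Here summary_table ends up with one row per chain (totals 6 and 15). Other common choices are c for plain vectors and cbind for columns.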
You can use this code any time you have a case like number 1 described above. Most of the time you can also use it for cases like number 2, if you spend some time thinking about how you want to divide the data and then combine the results. Think of it as a "parallel wrapper". It should work on Windows, GNU/Linux, and Mac OS, but I haven't tested it on all of them.
I keep this script handy whenever I need a quick speed-up, but I still always start out by writing code I can run sequentially. Thinking in parallel hurts my brain.