Problem description
R version: 3.2.4
RStudio version: 0.99.893
Windows 7
Intel i7
480 GB RAM
str(df)
161976 obs. of 11 variables
I am a relative novice to R and do not have a software programming background. My task is to perform clustering on a data set.
The variables have been scaled and centered. I am using the following code to find the optimal number of clusters:
d <- dist(df, method = "euclidean")
library(cluster)   # for pam()
library(fpc)       # pamk() comes from the fpc package
pamk.best <- pamk(d)
plot(pam(d, pamk.best$nc))
I have noticed that the system never uses more than 22% of the CPU's processing power.
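For reference, a single-threaded computation on a CPU with 8 logical cores shows up as roughly 100% / 8 ≈ 12.5% in Task Manager, so ~22% is consistent with R keeping only one or two cores busy. A minimal way to check what R can see, using the base parallel package that ships with R:

library(parallel)
detectCores()                  # logical cores, e.g. 8 on a quad-core i7 with hyper-threading
detectCores(logical = FALSE)   # physical cores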
I have taken the following actions so far:
- Unsuccessfully tried to change the Set Priority and Set Affinity settings for rsession.exe in the Processes tab of the Windows Task Manager. But, for some reason, it always comes back to Low even when I set it to High or Realtime or anything else on that list. The Set Affinity setting shows that the system is allowing R to use all of the cores.
- I have adjusted the High Performance settings on my machine by going into Control Panel -> Power Options -> Change advanced power settings -> Processor Power Management and setting it to 100%.
- I have read up on parallel processing in the CRAN Task View for High Performance Computing. I may be wrong, but I don't think that calculating the distance between observations in a data set is a task that should be parallelized, in the sense of dividing the data set into subsets and performing the distance calculations on the subsets in parallel on different cores. Please correct me if I am wrong.
One option I have is to perform clustering on a subset of the data set and then predict cluster membership for the rest of the data set. But, I am thinking that if I have the processing power and the memory available, why can't I perform the clustering on the whole data set!
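For what it's worth, the cluster package's clara() function implements essentially this subset-then-assign idea: it runs PAM on several random subsamples and then assigns every remaining observation to the nearest medoid, so the full distance matrix is never built. A minimal sketch, assuming df is the scaled data frame; the cluster count and sampling parameters below are placeholders:

library(cluster)
k  <- 4                          # hypothetical number of clusters
cl <- clara(df, k,
            metric   = "euclidean",
            samples  = 50,       # number of random subsamples; more is more stable
            sampsize = 1000)     # observations drawn per subsample
table(cl$clustering)             # cluster sizes for the whole data set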
Is there a way to have the machine or R use a higher percentage of the processing power and complete the task quicker?
Edit: I think that my issue is different from the one described in Multithreading in R because I am not trying to run different functions in R. Rather, I am running only one function on one dataset and would like the machine to use more of the processing power that is available to it.
Accepted answer
It is probably only using one core.
There is no automatic way to parallelize computations. So what you need to do is rewrite parts of R (here, probably the dist and pam functions, which supposedly are C or Fortran code) to use more than one core.
Or you use a different tool, where someone did the work already. I'm a big fan of ELKI, but it's mostly single-core. I think Julia may be worth a look because it is more similar to R (it is very similar to Matlab) and it was designed to use multiple cores better. Of course, there may also be an R module that parallelizes this. I'd look at the Rcpp modules, which are usually very fast.
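One concrete possibility along these lines (my assumption, not a package the answer names) is the RcppParallel-based parallelDist package: its parDist() is a multi-threaded stand-in for dist() and returns an ordinary dist object, so pamk()/pam() can be used unchanged. Note that it only speeds up the distance step; it still builds the full distance matrix discussed below.

# sketch only: assumes install.packages("parallelDist") has been run
library(parallelDist)
d <- parDist(as.matrix(df), method = "euclidean",
             threads = parallel::detectCores())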
But the key to fast and scalable clustering is to avoid the distance matrix. Consider the scaling: a 4-core system yields maybe a 3.5x speedup (often much less, because of turbo boost) and an 8-core system up to 6.5x better performance. But if you increase the data set size 10x, you need 100x as much memory and computation. This is a race that you cannot win, except with clever algorithms.
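To make that concrete for the data set in the question (a back-of-the-envelope estimate; a dist object stores the lower triangle as 8-byte doubles):

n <- 161976
n * (n - 1) / 2 * 8 / 1024^3   # ~98 GiB for the distance matrix alone
# 10x the observations would need ~100x that (~9.5 TiB), before pam() makes any copies.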