
Problem Description


Based on what I've read before, vectorization is a form of parallelization known as SIMD. It allows processors to execute the same instruction (such as addition) on an array simultaneously.

However, I got confused when reading "The Relationship between Vectorized and Devectorized Code" regarding Julia's and R's vectorization performance. The post claims that devectorized Julia code (written with loops) is faster than the vectorized code in both Julia and R, because:

It claims that R turns vectorized code, written in R, into devectorized code in C. If vectorization is faster (as a form of parallelization), why would R devectorize the code and why is that a plus?

Solution

"Vectorization" in R is vector processing from the R interpreter's point of view. Take the function cumsum as an example. On entry, the R interpreter sees that a vector x is passed into this function. However, the work is then handed off to C code that the R interpreter cannot analyze or track. While C is doing the work, R simply waits. By the time the R interpreter comes back to work, the vector has already been processed. So in R's view, it issued a single instruction but a whole vector got processed. This is an analogy to the concept of SIMD: "single instruction, multiple data".

It is not only functions like cumsum, which take a vector and return a vector, that are seen as "vectorized" in R; functions like sum, which take a vector and return a scalar, count as "vectorized" as well.
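Both flavors can be seen with a tiny sketch (nothing here beyond base R):

```r
x <- c(1, 2, 3, 4)

## vector in, vector out: one call from R's view, a C loop underneath
stopifnot(identical(cumsum(x), c(1, 3, 6, 10)))

## vector in, scalar out: also "vectorized" in R's sense
stopifnot(sum(x) == 10)
```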

Simply put: whenever R calls some compiled code for a loop, that is "vectorization". If you wonder why this kind of "vectorization" is useful, it is because a loop written in a compiled language is faster than a loop written in an interpreted language. A C loop is translated into machine language that the CPU can understand. However, if the CPU is to execute an R loop, it needs the R interpreter's help to read it, iteration by iteration. It is like this: if you know Chinese (the hardest human language), you can respond faster to someone speaking Chinese to you; otherwise, you need a translator to first translate the Chinese to you, sentence by sentence, into English, then you respond in English, and the translator turns that back into Chinese sentence by sentence. The effectiveness of communication is greatly reduced.

x <- runif(1e+7)

## R loop
system.time({
  sumx <- 0
  for (x0 in x) sumx <- sumx + x0
  sumx
  })
#   user  system elapsed
#  1.388   0.000   1.347

## C loop
system.time(sum(x))
#   user  system elapsed
#  0.032   0.000   0.030

Be aware that "vectorization" in R is only an analogy to SIMD, not the real thing. Real SIMD uses the CPU's vector registers for computation and is therefore true parallel computing via data parallelism. R is not a language in which you can program CPU registers; you have to write compiled code or assembly code for that purpose.

R's "vectorization" does not care how a loop written in a compiled language is actually executed; after all, that is beyond the R interpreter's knowledge. As for whether such compiled code is executed with SIMD, read Does R leverage SIMD when doing vectorized calculations?


More on "vectorization" in R

I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this, because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.

This reveals a downside of R's "vectorization": speed often comes at the price of memory usage. I am not saying that Julia won't have this problem (since I don't use it, I don't know), but it is definitely true for R.

Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. rowSums(A * B) is "vectorized" in R's sense, as both "*" and rowSums are coded in C as loops. However, R cannot fuse them into a single C loop, so it cannot avoid generating the temporary matrix C = A * B in RAM.
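A minimal sketch of what that "vectorized" expression computes (small sizes chosen just for illustration; the memory cost only matters for large matrices):

```r
set.seed(1)
A <- matrix(rnorm(10 * 3), 10, 3)
B <- matrix(rnorm(10 * 3), 10, 3)

## "Vectorized" in R's sense: `*` and rowSums each run a C loop,
## but the temporary matrix A * B is fully materialized first
d1 <- rowSums(A * B)

## Reference: an explicit (slow) R loop over the rows
d2 <- sapply(seq_len(nrow(A)), function(i) sum(A[i, ] * B[i, ]))

stopifnot(all.equal(d1, d2))
```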

Another example is R's recycling rule, or any computation relying on that rule. For example, when you add a scalar a to a matrix A with A + a, what conceptually happens is that a is first replicated into a matrix B with the same dimensions as A, i.e., B <- matrix(a, nrow(A), ncol(A)), and then an addition between two matrices is computed: A + B. Clearly the temporary matrix B is undesirable, but sorry, you can't do better unless you write your own C function for A + a and call it in R. This is what Bogumił Kamiński's answer describes as "such a fusion is possible only if explicitly implemented".
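The recycling rule can be checked directly (a tiny sketch using only base R):

```r
A <- matrix(1:6, nrow = 2, ncol = 3)
a <- 10

## The recycling rule says A + a behaves as if a were expanded
## into a full matrix of the same shape as A
B <- matrix(a, nrow(A), ncol(A))

stopifnot(identical(A + a, A + B))
```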

To deal with the memory effects of many temporary results, R has a sophisticated mechanism called "garbage collection". It helps, but memory can still explode if you generate some really big temporary result somewhere in your code. A good example is the function outer. I have written many answers using this function, but it is particularly memory-unfriendly.
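To see the scale of the temporaries outer can create, here is a small sketch (sizes chosen arbitrarily for illustration):

```r
x <- seq_len(1000)

## outer() materializes the full n-by-n matrix at once,
## so its memory footprint grows quadratically with length(x)
M <- outer(x, x)   # a 1000 x 1000 temporary

stopifnot(all(dim(M) == c(1000, 1000)))
stopifnot(M[3, 4] == 12)   # default FUN is "*", so M[i, j] = x[i] * x[j]
```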

I might have gone off-topic in this edit, as I began to discuss the side effects of "vectorization". Use it with care:

  • Keep memory usage in mind; there may be a more memory-efficient vectorized implementation. For example, as mentioned in the linked thread on row-wise dot products between two matrices, c(crossprod(x, y)) is better than sum(x * y).
  • Be prepared to use CRAN packages with compiled code. If you find that the existing vectorized functions in R are too limited for your task, explore CRAN for packages that can do it. You can ask a question about your coding bottleneck on Stack Overflow, and somebody may point you to the right function in the right package.
  • Be happy to write your own compiled code.
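The first bullet's claim is easy to verify numerically (a sketch; the speed and memory differences only show up at much larger sizes):

```r
set.seed(42)
x <- rnorm(1e5)
y <- rnorm(1e5)

## sum(x * y) materializes the temporary vector x * y first;
## crossprod(x, y) computes the same dot product in a single call
stopifnot(all.equal(sum(x * y), c(crossprod(x, y))))
```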

