Question

I am trying to perform matrix multiplication using Apache Spark and Java.

I have two main questions:

  1. How do I create an RDD in Apache Spark that can represent a matrix?
  2. How do I multiply two such RDDs?

Recommended answer

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At the moment it provides four different implementations of DistributedMatrix:

  • IndexedRowMatrix - can be created directly from an RDD[IndexedRow], where an IndexedRow consists of a row index and an org.apache.spark.mllib.linalg.Vector

    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix,
      IndexedRow}

    // Each element pairs a row index (Long) with that row's values.
    // All three rows carry index 0L here; use distinct indices
    // (0L, 1L, 2L) when the row position matters.
    val rows = sc.parallelize(Seq(
      (0L, Array(1.0, 0.0, 0.0)),
      (0L, Array(0.0, 1.0, 0.0)),
      (0L, Array(0.0, 0.0, 1.0)))
    ).map { case (i, xs) => IndexedRow(i, Vectors.dense(xs)) }

    val indexedRowMatrix = new IndexedRowMatrix(rows)

  • RowMatrix - similar to IndexedRowMatrix but without meaningful row indices. It can be created directly from an RDD[org.apache.spark.mllib.linalg.Vector]

    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val rowMatrix = new RowMatrix(rows.map(_.vector))
    

  • BlockMatrix - can be created from an RDD[((Int, Int), Matrix)], where the first element of the tuple contains the coordinates of the block and the second one is a local org.apache.spark.mllib.linalg.Matrix

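    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // 3 x 3 identity in compressed sparse column (CSC) form:
    // column pointers, row indices, values.
    val eye = Matrices.sparse(
      3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1.0, 1.0, 1.0))

    val blocks = sc.parallelize(Seq(
       ((0, 0), eye), ((1, 1), eye), ((2, 2), eye)))

    // rowsPerBlock = 3, colsPerBlock = 3, overall size 9 x 9.
    val blockMatrix = new BlockMatrix(blocks, 3, 3, 9, 9)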
    

  • CoordinateMatrix - can be created from an RDD[MatrixEntry], where a MatrixEntry consists of a row, a column and a value.

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix,
      MatrixEntry}
    
    val entries = sc.parallelize(Seq(
       (0, 0, 3.0), (2, 0, -5.0), (3, 2, 1.0),
       (4, 1, 6.0), (6, 2, 2.0), (8, 1, 4.0))
    ).map{case (i, j, v) => MatrixEntry(i, j, v)}
    
    val coordinateMatrix = new CoordinateMatrix(entries, 9, 3)
    

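The examples above mix dense vectors with a sparse block. As a minimal sketch of how the two representations relate, the snippet below builds the same values both ways using the local MLlib types. The names `denseRow`, `sparseRow`, `denseEye` and `sparseEye` are made up for illustration, and the snippet assumes spark-mllib is on the classpath (for example in spark-shell).

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

// The same row (0, 0, 5) stored densely and sparsely.
val denseRow = Vectors.dense(0.0, 0.0, 5.0)
val sparseRow = Vectors.sparse(3, Array(2), Array(5.0)) // size, indices, values

// The 3 x 3 identity as a dense and as a sparse (CSC) local matrix.
val denseEye = Matrices.eye(3)
val sparseEye = Matrices.sparse(
  3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1.0, 1.0, 1.0))

// Both representations expand to the same values.
println(sparseRow.toArray.mkString(","))
println(sparseEye.toArray.mkString(","))
```

Either form can be used as a row of a RowMatrix or as a block of a BlockMatrix; which one is cheaper depends on how many entries are zero.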

    The first two implementations support multiplication by a local Matrix:

    val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    
    indexedRowMatrix.multiply(localMatrix).rows.collect
    // Array(IndexedRow(0,[1.0,4.0]), IndexedRow(0,[2.0,5.0]),
    //   IndexedRow(0,[3.0,6.0]))
    

    The third one, BlockMatrix, can be multiplied by another BlockMatrix as long as the number of columns per block in this matrix matches the number of rows per block in the other matrix. CoordinateMatrix doesn't support multiplication, but it is easy to create and to transform into the other types of distributed matrices:

    blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))
    

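For the second question (multiplying two distributed matrices), one possible end-to-end sketch is shown here: both operands start as CoordinateMatrix, are converted to BlockMatrix, and are then multiplied. The 2 x 2 inputs, the 1 x 1 block size, the `local[2]` master and the names `a`, `b`, `blockA`, `blockB` and `blockProduct` are assumptions chosen only to keep the example small. `SparkContext.getOrCreate` reuses the shell's context when one already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Reuse the running context (e.g. in spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("block-multiply-sketch").setMaster("local[2]"))

// A = [[1, 2], [3, 4]] and B = [[0, 1], [1, 0]] as (row, column, value).
val a = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0))), 2, 2)
val b = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0, 1, 1.0), MatrixEntry(1, 0, 1.0))), 2, 2)

// Columns per block of A must equal rows per block of B (both 1 here).
val blockA = a.toBlockMatrix(1, 1).cache()
val blockB = b.toBlockMatrix(1, 1).cache()
blockA.validate() // throws if the block layout is inconsistent
blockB.validate()

val blockProduct = blockA.multiply(blockB)

// Collects the whole result to the driver; only sensible for small matrices.
val local = blockProduct.toLocalMatrix()
println(local)
```

With these inputs the product is [[2, 1], [4, 3]]. For realistically sized data the result would normally stay distributed, and the block size would be chosen much larger than 1 x 1.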

    Each type has its own strengths and weaknesses, and there are some additional factors to consider when you use sparse or dense elements (Vectors or block Matrices). Multiplying by a local matrix is usually preferable since it doesn't require expensive shuffling.

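One way to see why multiplying by a local matrix avoids a shuffle is that every distributed row can be multiplied on its own, so no row has to meet data from another partition. The plain-Scala sketch below (no Spark involved) mimics this for the identity rows and the 3 x 2 local matrix used earlier. The helpers `entry` and `multiplyRow` are hypothetical names for illustration, not part of MLlib. It also shows that the values passed to Matrices.dense are read in column-major order.

```scala
// Column-major values of the 3 x 2 local matrix:
// column 0 = (1, 2, 3), column 1 = (4, 5, 6).
val numRows = 3
val numCols = 2
val values = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

// Hypothetical helper: entry (i, j) of a column-major matrix.
def entry(i: Int, j: Int): Double = values(j * numRows + i)

// Hypothetical helper: multiply a single row vector by the local matrix.
// It needs only that row and the local matrix, nothing from other rows.
def multiplyRow(row: Array[Double]): Array[Double] =
  Array.tabulate(numCols) { j =>
    row.indices.map(i => row(i) * entry(i, j)).sum
  }

val identityRows = Seq(
  Array(1.0, 0.0, 0.0), Array(0.0, 1.0, 0.0), Array(0.0, 0.0, 1.0))

// Each row is processed independently, as a map over the rows.
val rowProducts = identityRows.map(multiplyRow)
rowProducts.foreach(r => println(r.mkString("[", ",", "]")))
// prints [1.0,4.0], [2.0,5.0] and [3.0,6.0] on separate lines
```

The printed rows match the IndexedRow values in the output shown for indexedRowMatrix.multiply(localMatrix).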

    You can find more details about each type in the MLlib Data Types guide.
