我正在尝试评估哪种数据结构最能代表Scala中的稀疏向量。这些稀疏向量包含索引列表和每个索引一个值。我实现了一个小的基准,这似乎表明Array[(Long, Double)]似乎比2个并行数组占用的空间少得多。那是对的吗?我是否正确地执行了基准测试? (如果我在某处做错了,我不会感到惊讶)

import java.lang.management.ManagementFactory
import java.text.NumberFormat

object TestSize {

  val N = 100000000
  val formatter: NumberFormat = java.text.NumberFormat.getIntegerInstance

  def twoParallelArrays(): Unit = {

    val Z1 = Array.ofDim[Long](N)
    val Z2 = Array.ofDim[Double](N)
    Z1(N-1) = 1
    Z2(N-1) = 1.0D
    println(Z2(N-1) - Z1(N-1))
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def arrayOfTuples(): Unit = {

    val Z = Array.ofDim[(Long, Double)](N)
    Z(N-1) = (1, 1.0D)
    println(Z(N-1)._2 - Z(N-1)._1)
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def main(args: Array[String]): Unit = {

    // Comment one or the other to look at the results
    //arrayOfTuples()
    twoParallelArrays()
  }
}

最佳答案

不,不正确。

Array.ofDim[(Long, Double)](N)

创建一个充满null的大数组,并且不为LongDouble或实际的Tuple2 -instance分配任何空间,这就是为什么堆内存使用中看不到任何东西的原因。

两阵列版本立即为所有LongDouble分配所需的所有空间,您会在堆空间使用情况中看到它。

只需将ofDim替换为适当的fill即可查看实数。

在大小为N = 1000000的数组上:
arrayOfTuples:     45,693,312 19,190,296
twoParallelArrays: 25,925,792 19,315,256
arrayOfTuples -solution显然会占用更多空间。

您可能想知道为什么该系数大约是1.8,而不是至少2.5。这是因为对于一些原始数据类型,特别是对于Tuple2LongDouble@specialized,因此可以将这两个8字节的原始数据存储在Tuple2中,而无需装箱。因此,从数组到Tuple2的64位引用的总开销只有8个字节,每个Tuple2实例的开销也很大。但是,它不仅仅是将LongDouble直接存储在数组中。

顺便说一句:这正是Apache Spark使用所有Encoder存储数据的原因:这样您就不必担心元组和案例类所引起的开销。

完整代码:
import java.lang.management.ManagementFactory
import java.text.NumberFormat

object TestSize {

  val N = 1000000
  val formatter: NumberFormat = java.text.NumberFormat.getIntegerInstance

  def twoParallelArrays(): Unit = {

    val Z1 = Array.fill[Long](N)(42L)
    val Z2 = Array.fill[Double](N)(42.0)
    println(Z1)
    println(Z2)
    Z1(N-1) = 1
    Z2(N-1) = 1.0D
    println(Z2(N-1) - Z1(N-1))
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    Z1((new scala.util.Random).nextInt(N)) = 1234L
    Z2((new scala.util.Random).nextInt(N)) = 345.0d
    println(Z2(N-1) - Z1(N-1))
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def arrayOfTuples(): Unit = {

    val Z = Array.fill[(Long, Double)](N)((42L, 42.0d))
    Z(N-1) = (1, 1.0D)
    println(Z(N-1)._2 - Z(N-1)._1)
    val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
    val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
    Z((new scala.util.Random).nextInt(N)) = (1234L, 543.0d)
    println(Z(N-1)._2 - Z(N-1)._1)
    println(s"${formatter.format(z1)} ${formatter.format(z2)}")
  }

  def main(args: Array[String]): Unit = {

    // Comment one or the other to look at the results
    arrayOfTuples()
    // twoParallelArrays()
  }
}

10-08 13:25