我正在尝试评估哪种数据结构最能代表Scala中的稀疏向量。这些稀疏向量包含索引列表和每个索引一个值。我实现了一个小的基准,这似乎表明Array[(Long, Double)]
似乎比2个并行数组占用的空间少得多。那是对的吗?我是否正确地执行了基准测试? (如果我在某处做错了,我不会感到惊讶)
import java.lang.management.ManagementFactory
import java.text.NumberFormat
object TestSize {
val N = 100000000
val formatter: NumberFormat = java.text.NumberFormat.getIntegerInstance
def twoParallelArrays(): Unit = {
val Z1 = Array.ofDim[Long](N)
val Z2 = Array.ofDim[Double](N)
Z1(N-1) = 1
Z2(N-1) = 1.0D
println(Z2(N-1) - Z1(N-1))
val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
println(s"${formatter.format(z1)} ${formatter.format(z2)}")
}
def arrayOfTuples(): Unit = {
val Z = Array.ofDim[(Long, Double)](N)
Z(N-1) = (1, 1.0D)
println(Z(N-1)._2 - Z(N-1)._1)
val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
println(s"${formatter.format(z1)} ${formatter.format(z2)}")
}
def main(args: Array[String]): Unit = {
// Comment one or the other to look at the results
//arrayOfTuples()
twoParallelArrays()
}
}
最佳答案
不,不正确。
Array.ofDim[(Long, Double)](N)
创建一个充满
null
的大数组,并且不为Long
,Double
或实际的Tuple2
-instance分配任何空间,这就是为什么堆内存使用中看不到任何东西的原因。两阵列版本立即为所有
Long
和Double
分配所需的所有空间,您会在堆空间使用情况中看到它。只需将
ofDim
替换为适当的fill
即可查看实数。在大小为
N = 1000000
的数组上:arrayOfTuples: 45,693,312 19,190,296
twoParallelArrays: 25,925,792 19,315,256
arrayOfTuples
-solution显然会占用更多空间。您可能想知道为什么该系数大约是1.8,而不是至少2.5。这是因为对于一些原始数据类型,特别是对于
Tuple2
和Long
,Double
是@specialized,因此可以将这两个8字节的原始数据存储在Tuple2
中,而无需装箱。因此,从数组到Tuple2
的64位引用的总开销只有8个字节,每个Tuple2
实例的开销也很大。但是,它不仅仅是将Long
和Double
直接存储在数组中。顺便说一句:这正是Apache Spark使用所有
Encoder
存储数据的原因:这样您就不必担心元组和案例类所引起的开销。完整代码:
import java.lang.management.ManagementFactory
import java.text.NumberFormat
object TestSize {
val N = 1000000
val formatter: NumberFormat = java.text.NumberFormat.getIntegerInstance
def twoParallelArrays(): Unit = {
val Z1 = Array.fill[Long](N)(42L)
val Z2 = Array.fill[Double](N)(42.0)
println(Z1)
println(Z2)
Z1(N-1) = 1
Z2(N-1) = 1.0D
println(Z2(N-1) - Z1(N-1))
val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
Z1((new scala.util.Random).nextInt(N)) = 1234L
Z2((new scala.util.Random).nextInt(N)) = 345.0d
println(Z2(N-1) - Z1(N-1))
println(s"${formatter.format(z1)} ${formatter.format(z2)}")
}
def arrayOfTuples(): Unit = {
val Z = Array.fill[(Long, Double)](N)((42L, 42.0d))
Z(N-1) = (1, 1.0D)
println(Z(N-1)._2 - Z(N-1)._1)
val z1 = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
val z2 = ManagementFactory.getMemoryMXBean.getNonHeapMemoryUsage.getUsed
Z((new scala.util.Random).nextInt(N)) = (1234L, 543.0d)
println(Z(N-1)._2 - Z(N-1)._1)
println(s"${formatter.format(z1)} ${formatter.format(z2)}")
}
def main(args: Array[String]): Unit = {
// Comment one or the other to look at the results
arrayOfTuples()
// twoParallelArrays()
}
}