Problem description
For clustering, Mahout input needs to be in vector form. There are two types of vector implementations: sparse vectors and dense vectors.
What is the difference between the two?
What are the usage scenarios for sparse and dense vectors?
Recommended answer
Conceptually, most of the values in a sparse vector are zero, while in a dense vector they are not. The same holds for dense and sparse matrices. The terms sparse and dense describe these properties in general, not only in Mahout.
In Mahout, the DenseVector assumes there are not too many zero entries and therefore "Implements vector as an array of doubles" (org.apache.mahout.math.DenseVector). In contrast, the sparse vector implementations of AbstractVector, e.g. RandomAccessSparseVector and SequentialAccessSparseVector, use different data structures which don't store the zero values at all.
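To make the storage difference concrete, here is a minimal sketch in plain Java. It does not use Mahout and is not how Mahout's classes are implemented internally; the class VectorStorageSketch and its methods are hypothetical names chosen for illustration. The dense form keeps one array slot per index, while the sparse form keeps only index-to-value pairs for the non-zero entries and treats every missing index as zero.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical helper, not part of Mahout.
class VectorStorageSketch {

    /** Dense storage: one slot per index, zeros included. */
    static double[] toDense(int size, Map<Integer, Double> nonZeros) {
        double[] values = new double[size]; // all slots start at 0.0
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
            values[e.getKey()] = e.getValue();
        }
        return values;
    }

    /** Sparse storage: keep only index -> value pairs for non-zero entries. */
    static Map<Integer, Double> toSparse(double[] values) {
        Map<Integer, Double> nonZeros = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                nonZeros.put(i, values[i]);
            }
        }
        return nonZeros;
    }

    /** Reading a sparse entry: an index that is not stored is implicitly zero. */
    static double get(Map<Integer, Double> sparse, int index) {
        return sparse.getOrDefault(index, 0.0);
    }

    public static void main(String[] args) {
        double[] dense = new double[1000];
        dense[3] = 1.5;
        dense[998] = -2.0;

        Map<Integer, Double> sparse = toSparse(dense);

        System.out.println("dense slots:    " + dense.length);
        System.out.println("sparse entries: " + sparse.size());
        System.out.println("sparse[3] = " + get(sparse, 3));
        System.out.println("sparse[4] = " + get(sparse, 4));
    }
}
```

With 1000 positions and only two non-zero values, the dense array still holds 1000 slots, while the map holds two entries.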
Which one to choose depends on the data you want to store in the vector. If you expect mostly zero values, a sparse vector implementation will be more space efficient. However, if you use it for data with only a few zero values, you introduce a lot of data structure overhead, which could lead to worse performance.
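A back-of-envelope way to reason about this trade-off is sketched below. The numbers are assumptions for illustration, not measurements of Mahout: 8 bytes per stored double in the dense case, and 8 bytes for the value plus 4 bytes for an int index per stored entry in the sparse case, ignoring all object and hash-table overhead. The class SparsityEstimateSketch and its methods are hypothetical.

```java
// Illustration only: hypothetical helper with assumed per-entry sizes.
class SparsityEstimateSketch {

    /** Counts the entries that a sparse representation would have to store. */
    static int countNonZeros(double[] values) {
        int count = 0;
        for (double v : values) {
            if (v != 0.0) {
                count++;
            }
        }
        return count;
    }

    /** Assumed dense cost: 8 bytes per position, zeros included. */
    static long denseBytes(int size) {
        return 8L * size;
    }

    /** Assumed sparse cost: 8-byte value plus 4-byte index per non-zero entry. */
    static long sparseBytes(int nonZeros) {
        return 12L * nonZeros;
    }

    public static void main(String[] args) {
        double[] values = new double[1000];
        values[3] = 1.5;
        values[998] = -2.0;

        int nonZeros = countNonZeros(values);
        System.out.println("non-zero entries:      " + nonZeros);
        System.out.println("dense estimate bytes:  " + denseBytes(values.length));
        System.out.println("sparse estimate bytes: " + sparseBytes(nonZeros));
    }
}
```

Under these assumptions the sparse form only wins while clearly fewer than two thirds of the entries are non-zero, and real implementations add further overhead per entry, so in practice the vector should be mostly zeros before a sparse implementation pays off.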
The choice of dense vs. sparse vector does not affect your calculation results on the vectors, only memory usage and calculation speed.
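As a small illustration of that point, the sketch below computes the same dot product on a dense and on a sparse representation of the same two vectors. It is plain Java with hypothetical names (DotProductSketch, dotDense, dotSparse), not Mahout's API. The values are chosen to be exactly representable; with arbitrary data, a different summation order can change the last few bits of a floating-point result.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical helper, not part of Mahout.
class DotProductSketch {

    /** Dense dot product: visits every position, including zeros. */
    static double dotDense(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Sparse dot product: visits only indices stored in the smaller map. */
    static double dotSparse(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) { // an index missing on one side contributes zero
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] denseA = {2.0, 0.0, 0.0, 3.0, 0.0};
        double[] denseB = {4.0, 0.0, 1.0, 0.5, 0.0};

        Map<Integer, Double> sparseA = new HashMap<>();
        sparseA.put(0, 2.0);
        sparseA.put(3, 3.0);

        Map<Integer, Double> sparseB = new HashMap<>();
        sparseB.put(0, 4.0);
        sparseB.put(2, 1.0);
        sparseB.put(3, 0.5);

        System.out.println("dense:  " + dotDense(denseA, denseB));
        System.out.println("sparse: " + dotSparse(sparseA, sparseB));
    }
}
```

Both calls yield 9.5 here; the sparse version simply skips the positions where either vector is zero.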