I computed TF-IDF for 3 sample text documents using PySpark's HashingTF and IDF, and got the following SparseVector results:

(1048576,[558379],[1.43841036226])
(1048576,[181911,558379,959994],[0.287682072452,0.287682072452,0.287682072452])
(1048576,[181911,959994],[0.287682072452,0.287682072452])

How can I compute the sum of the TF-IDF values of all terms in a document? For example, (0.287682072452 + 0.287682072452) for the 3rd document.
Best answer
When the output of IDF is exposed to Python, it is just a PySpark SparseVector whose values field is a standard NumPy array, so all you need is a sum call:
from pyspark.mllib.linalg import SparseVector

v = SparseVector(1048576, [181911, 959994], [0.287682072452, 0.287682072452])
v.values.sum()
## 0.57536414490400001
Or over an RDD:
rdd = sc.parallelize([
    SparseVector(1048576, [558379], [1.43841036226]),
    SparseVector(1048576, [181911, 558379, 959994],
                 [0.287682072452, 0.287682072452, 0.287682072452]),
    SparseVector(1048576, [181911, 959994], [0.287682072452, 0.287682072452])])

rdd.map(lambda v: v.values.sum()).collect()
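Since `SparseVector.values` is just a NumPy array of the stored (non-zero) entries, the arithmetic can be checked without a Spark session at all. A minimal sketch using plain NumPy, with the third document's values hard-coded from the question:

```python
import numpy as np

# Stored (non-zero) TF-IDF values of the 3rd document from the question
values = np.array([0.287682072452, 0.287682072452])

# Summing the stored values is exactly what SparseVector.values.sum() does:
# zero entries contribute nothing, so only the stored values matter.
total = values.sum()
print(total)  # ≈ 0.575364144904
```

This also makes clear why summing a sparse vector this way is cheap: the cost is proportional to the number of stored entries, not to the full dimensionality (1048576 here).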