Question
Does Spark support distributed Map collection types?
So if I have a HashMap[String,String] of key-value pairs, can it be converted to a distributed Map collection type? To access an element I could use "filter", but I doubt this performs as well as a Map lookup.
Recommended answer
Since I found some new information, I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function. I would like to point out that you should be careful: if the RDD's partitioner is None, lookup just uses a filter anyway. As for a (K,V) store on top of Spark, it looks like this is still in progress, but a usable pull request has been made here. Here is an example usage.
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))
// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
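It seems the pull request was well received and will probably be included in a future version of Spark, so it is probably safe to use that pull request in your own code. Here is the JIRA ticket if you're curious.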
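For the plain lookup route mentioned at the start of this answer, here is a minimal sketch of turning a local map into a hash-partitioned pair RDD so that lookup does not have to scan every partition. It assumes Spark 1.3 or later (so the pair RDD functions are picked up without an extra import) and a local master; the sample data, the app name, and the partition count of 4 are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for illustration only; in spark-shell, skip these two lines
// and use the `sc` that the shell already provides.
val conf = new SparkConf().setMaster("local[2]").setAppName("distributed-map-sketch")
val sc = new SparkContext(conf)

// A small local map standing in for the HashMap[String, String] from the question.
val local = Map("a" -> "apple", "b" -> "banana", "c" -> "cherry")

// A freshly parallelized RDD has no partitioner, so lookup on it
// has to filter through every partition.
val unpartitioned = sc.parallelize(local.toSeq)
println(unpartitioned.partitioner) // None

// After partitionBy, the RDD knows which partition a key hashes to,
// so lookup only needs to run on that one partition.
val partitioned = unpartitioned.partitionBy(new HashPartitioner(4)).cache()

// lookup returns a Seq because a pair RDD may hold several values per key.
val hit = partitioned.lookup("b")  // contains "banana"
val miss = partitioned.lookup("z") // empty

sc.stop()
```

Each lookup still launches a Spark job, so this is much slower than a get on an in-memory Map; it only avoids the full scan.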