Question
Is there something like the summary function in Spark, like that in R?
The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.
I am interested in getting the results for string types as well, such as the four most frequently occurring strings (a groupBy kind of operation), the number of unique values, etc.
Is there any pre-existing code for this?
If not, please suggest the best way to deal with string types.
Answer
I don't think there is such a thing for String in MLlib. But it would probably be a valuable contribution if you implement it.
Calculating just one of these metrics is easy. E.g. for the top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]): Array[String] =
  rdd
    .map(s => (s, 1))
    .reduceByKey(_ + _)                       // count occurrences of each string
    .map { case (s, count) => (count, s) }    // key by count so top() orders by it
    .top(4)                                   // take the 4 highest counts
    .map { case (count, s) => s }
Or the number of unique values:

def numUnique(rdd: org.apache.spark.rdd.RDD[String]): Long =
  rdd.distinct.count
But computing all of these metrics in a single pass takes more work.
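One way to get several string metrics in one traversal is to fold a single accumulator over the data. The sketch below uses plain Scala collections to show the shape of that accumulator; the names (`StringStats`, `summarize`, `topN`) are my own, not from any library, and the same zero/merge structure could be lifted onto Spark's `RDD.aggregate` with a combine step that merges two count maps.

```scala
// Accumulator holding everything needed for the metrics above:
// per-string counts (enough for top-N and uniques) and a total.
case class StringStats(counts: Map[String, Int], total: Long)

// One pass over the data, updating the accumulator per element.
def summarize(data: Iterable[String]): StringStats =
  data.foldLeft(StringStats(Map.empty, 0L)) { (acc, s) =>
    StringStats(acc.counts.updated(s, acc.counts.getOrElse(s, 0) + 1),
                acc.total + 1)
  }

// Derive the individual metrics from the accumulator afterwards.
def topN(stats: StringStats, n: Int): Seq[String] =
  stats.counts.toSeq.sortBy { case (_, c) => -c }.map(_._1).take(n)

def numUnique(stats: StringStats): Int = stats.counts.size
```

Keeping a full count map is fine for low-cardinality columns; for very high cardinality you would swap in approximate structures (e.g. a count-min sketch or HyperLogLog) but the single-pass shape stays the same.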
These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.
What I mean by splitting the columns:
import org.apache.spark.rdd.RDD

def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache  // We will do N passes over this RDD.
  (0 until columns).map {
    i => together.mapValues(s => s(i))  // project out column i, keeping the row key
  }
}
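To make the projection concrete without a Spark cluster, here is the same per-row logic on plain Scala collections; `splitLocal` is a hypothetical local analogue of `split` above, not Spark API.

```scala
// Local analogue of split: rows are (rowId, cells), and each output
// sequence holds one column's (rowId, value) pairs.
def splitLocal(together: Seq[(Long, Seq[String])],
               columns: Int): Seq[Seq[(Long, String)]] =
  (0 until columns).map { i =>
    together.map { case (key, row) => (key, row(i)) }
  }
```

Each resulting column can then be fed independently into per-column summaries like `top4` or `numUnique`.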