Question
Is there something like the summary function in Spark, like that in R?
The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.
I am interested in getting the results for string types as well, such as the four most frequently occurring strings (a groupBy kind of operation), the number of unique values, etc.
Is there any pre-existing code for this?
If not, please suggest the best way to deal with string types.
Answer
I don't think there is such a thing for String in MLlib. But it would probably be a valuable contribution if you implement it.
Calculating just one of these metrics is easy. E.g. for the top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]): Array[String] =
  rdd
    .map(s => (s, 1))
    .reduceByKey(_ + _)                       // count occurrences of each string
    .map { case (s, count) => (count, s) }    // key by count so top() orders by it
    .top(4)                                   // take the 4 highest counts
    .map { case (count, s) => s }
Or the number of unique values:

def numUnique(rdd: org.apache.spark.rdd.RDD[String]): Long =
  rdd.distinct.count
But computing all of these metrics in a single pass takes more work.
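One way to get several string metrics in one traversal is to fold a single accumulator over the data. The sketch below uses plain Scala collections to show the shape of that accumulator; the names (`StringStats`, `summarize`, `topN`) are my own, not from any library, and the same zero/merge structure could be lifted onto Spark's `RDD.aggregate` with a combine step that merges two count maps.

```scala
// Accumulator holding everything needed for the metrics above:
// per-string counts (enough for top-N and uniques) and a total.
case class StringStats(counts: Map[String, Int], total: Long)

// One pass over the data, updating the accumulator per element.
def summarize(data: Iterable[String]): StringStats =
  data.foldLeft(StringStats(Map.empty, 0L)) { (acc, s) =>
    StringStats(acc.counts.updated(s, acc.counts.getOrElse(s, 0) + 1),
                acc.total + 1)
  }

// Derive the individual metrics from the accumulator afterwards.
def topN(stats: StringStats, n: Int): Seq[String] =
  stats.counts.toSeq.sortBy { case (_, c) => -c }.map(_._1).take(n)

def numUnique(stats: StringStats): Int = stats.counts.size
```

Keeping a full count map is fine for low-cardinality columns; for very high cardinality you would swap in approximate structures (e.g. a count-min sketch or HyperLogLog) but the single-pass shape stays the same.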
These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.
What I mean by splitting the columns:
import org.apache.spark.rdd.RDD

def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache  // We will do N passes over this RDD.
  (0 until columns).map {
    i => together.mapValues(s => s(i))  // project out column i, keeping the row key
  }
}
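To make the projection concrete without a Spark cluster, here is the same per-row logic on plain Scala collections; `splitLocal` is a hypothetical local analogue of `split` above, not Spark API.

```scala
// Local analogue of split: rows are (rowId, cells), and each output
// sequence holds one column's (rowId, value) pairs.
def splitLocal(together: Seq[(Long, Seq[String])],
               columns: Int): Seq[Seq[(Long, String)]] =
  (0 until columns).map { i =>
    together.map { case (key, row) => (key, row(i)) }
  }
```

Each resulting column can then be fed independently into per-column summaries like `top4` or `numUnique`.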