This article explains `combineByKey` in PySpark and should be a useful reference for anyone working through the same question.

Problem Description

Just wondering what exactly this does? I understand keyBy, but I struggle to understand what exactly combineByKey is. I have read through the pages (link) and still don't understand.

df.rdd.keyBy(
        lambda row: row['id']
    ).combineByKey(
        lambda row: [row],
        lambda rows, row: rows + [row],
        lambda rows1, rows2: rows1 + rows2,
    )

Recommended Answer

In short, combineByKey lets you specify explicitly the 3 stages of aggregating (or reducing) your RDD. A small runnable sketch putting the three stages together follows the list below.

1. What do we do with a single row the first time we encounter it?

In the example you have provided, the row is put in a list.

2. What do we do when a single row meets a previously reduced row?

In the example, a previously reduced row is already a list; we add the new row to it and return the new, extended list.

3. What do we do with two previously reduced rows?

In the example above, both rows are already lists and we return a new list with the items from both of them.
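
Putting the three stages together, here is a minimal, self-contained sketch of the same list-building logic. The local SparkSession, the sample data, and the variable names are illustrative assumptions, not part of the original question.

from pyspark.sql import SparkSession

# Illustrative local session; any existing SparkSession / SparkContext works as well.
spark = SparkSession.builder.master("local[2]").appName("combineByKey-demo").getOrCreate()

# Hypothetical (key, value) pairs standing in for the output of keyBy(lambda row: row['id']).
rdd = spark.sparkContext.parallelize([(1, "a"), (1, "b"), (2, "c")])

grouped = rdd.combineByKey(
    lambda value: [value],             # stage 1: first value seen for a key -> start a list
    lambda acc, value: acc + [value],  # stage 2: another value for that key -> append it to the list
    lambda acc1, acc2: acc1 + acc2,    # stage 3: lists built on different partitions -> concatenate them
)

print(sorted(grouped.collect()))
# e.g. [(1, ['a', 'b']), (2, ['c'])] -- the order inside each list depends on partitioning

With these particular functions the result is effectively the same grouping of values into per-key lists that groupByKey would produce.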

More well-explained, step-by-step examples are available at these links:

http://hadoopexam.com/adi/index.php/spark-blog/90-how-combinebykey-works-in-spark-step-by-step-explained

http://etlcode.com/index.php/blog/info/Bigdata/Apache-Spark-Difference-between-reduceByKey-groupByKey-and-combineByKey

A key explanation from the second link is:

Since all this happens in parallel in different partitions, there is a chance that the same key exists on another partition with another set of accumulators. So when the results from different partitions have to be merged, the mergeCombiners function is used.
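
To make the quoted point concrete, here is a small sketch (reusing the illustrative local spark session and the same list-building functions from the sketch above) in which the same key lands on two different partitions, so it is mergeCombiners that joins the two per-partition lists.

# Two records with the same key, forced onto two separate partitions.
rdd = spark.sparkContext.parallelize([(1, "a"), (1, "b")], numSlices=2)
print(rdd.glom().collect())            # e.g. [[(1, 'a')], [(1, 'b')]] -- one record per partition

merged = rdd.combineByKey(
    lambda v: [v],                     # createCombiner runs once per key per partition
    lambda acc, v: acc + [v],          # mergeValue never fires here (only one record per partition)
    lambda acc1, acc2: acc1 + acc2,    # mergeCombiners joins ['a'] and ['b'] across partitions
)
print(merged.collect())                # e.g. [(1, ['a', 'b'])]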

This concludes the article on `combineByKey` in PySpark. Hopefully the recommended answer above is helpful.
