How to determine the "preferred locations" of a PySpark DataFrame's partitions



Problem description

I'm trying to understand how coalesce determines how to merge the initial partitions into the final ones, and apparently the "preferred location" has something to do with it.

According to this question, Scala Spark has a function preferredLocations(split: Partition) that can identify this. But I'm not at all familiar with the Scala side of Spark. Is there a way to determine the preferred location of a given row or partition ID at the PySpark level?

Recommended answer

Yes, it is theoretically possible. Example data to force some form of preference (there could be a simpler example):

# Two identically keyed RDDs: keys 0-3 spread across 8 partitions
rdd1 = sc.range(10).map(lambda x: (x % 4, None)).partitionBy(8)
rdd2 = sc.range(10).map(lambda x: (x % 4, None)).partitionBy(8)

# Force caching so the downstream plan has location preferences
rdd1.cache().count()

rdd3 = rdd1.union(rdd2)
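As a side note, the key layout above can be reasoned about without Spark: `partitionBy(8)` applies a hash partitioner, and since `x % 4` only produces the keys 0 through 3, at most four of the eight partitions receive any data. A minimal pure-Python sketch (an assumption for illustration: the default partitioner behaves like `hash(key) % numPartitions`, and for small Python ints `hash(x) == x`):

```python
# Sketch of how partitionBy(8) distributes the integer keys above.
# Assumption: default partitioner maps key -> hash(key) % num_partitions.
def partition_for(key, num_partitions=8):
    return hash(key) % num_partitions

keys = [x % 4 for x in range(10)]                   # keys are only 0, 1, 2, 3
occupied = sorted({partition_for(k) for k in keys})
print(occupied)  # -> [0, 1, 2, 3]; partitions 4-7 stay empty
```

This matches the helper's output further down, where only partitions 0 through 3 report preferred locations.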

Now you can define a helper:

from pyspark import SparkContext

def prefered_locations(rdd):
    def to_py_generator(xs):
        """Convert Scala List to Python generator"""
        j_iter = xs.iterator()
        while j_iter.hasNext():
            yield j_iter.next()

    # Get JVM
    jvm = SparkContext._active_spark_context._jvm
    # Get Scala RDD
    srdd = jvm.org.apache.spark.api.java.JavaRDD.toRDD(rdd._jrdd)
    # Get partitions
    partitions = srdd.partitions()
    return {
        p.index(): list(to_py_generator(srdd.preferredLocations(p)))
        for p in partitions
    }
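The inner `to_py_generator` simply walks a Java-style iterator (`hasNext`/`next`) and yields each element. Its behaviour can be illustrated without a JVM using a stand-in class (`FakeJavaIterator` below is a hypothetical mock for illustration, not a py4j type):

```python
# Hypothetical stand-in mimicking the hasNext()/next() protocol
# that py4j exposes for Java/Scala iterators.
class FakeJavaIterator:
    def __init__(self, items):
        self._it = iter(items)
        self._advance()

    def _advance(self):
        try:
            self._pending = next(self._it)
            self._has = True
        except StopIteration:
            self._has = False

    def hasNext(self):
        return self._has

    def next(self):
        value = self._pending
        self._advance()
        return value

def to_py_generator(j_iter):
    """Same loop as in the helper above."""
    while j_iter.hasNext():
        yield j_iter.next()

print(list(to_py_generator(FakeJavaIterator(["host1", "host2"]))))
# -> ['host1', 'host2']
```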

Applied:

prefered_locations(rdd3)

# {0: ['...'],
#  1: ['...'],
#  2: ['...'],
#  3: ['...'],
#  4: [],
#  5: [],
#  6: [],
#  7: []}

