Problem description
I have a very large PySpark DataFrame named df, and I need a way to enumerate its rows (so that records can be accessed by index) without converting it to Pandas.
It looks like window functions called without a PARTITION BY clause move all the data into a single partition, so the approach above may not be the best solution after all.
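For reference, a minimal sketch of the kind of window-based indexing being referred to (assuming a PySpark version where pyspark.sql.functions.row_number is available; the ordering column some_col is hypothetical). Because the window has no partitionBy, Spark has to collapse all rows into one partition to evaluate it:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# No partitionBy: the whole DataFrame is shuffled into a single partition.
w = Window.orderBy(col("some_col"))
df_with_index = df.withColumn("index", row_number().over(w) - 1)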
Is there a faster and simpler way to handle this?
Not really. Spark DataFrames don't support random row access.
A PairedRDD can be accessed using the lookup method, which is relatively fast if the data is partitioned with a HashPartitioner. There is also the indexed-rdd project, which supports efficient lookups.
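As an illustration, here is a minimal sketch of the pair-RDD route, assuming df is the large DataFrame from the question; using the row index as the key and the partition count of 100 are assumptions, not part of the original answer. In PySpark, partitionBy hash-partitions by default, and lookup can then restrict the scan to the partition that holds the key:

# Build (index, row) pairs, hash-partition them by key, and cache the result.
pairs = (df.rdd
         .zipWithIndex()
         .map(lambda ri: (ri[1], ri[0]))  # (index, Row)
         .partitionBy(100)                # hash partitioning by default
         .cache())

# With a known partitioner, lookup only scans the relevant partition.
rows_at_42 = pairs.lookup(42)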
Edit:
Independent of the PySpark version, you can try something like this:
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, LongType
row = Row("char")
row_with_index = Row("char", "index")
df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)
## +----+
## |char|
## +----+
## | a|
## | b|
## | c|
## | d|
## | e|
## +----+
## only showing top 5 rows
# This part is not tested but should work and save some work later
schema = StructType(
    df.schema.fields[:] + [StructField("index", LongType(), False)])

indexed = (df.rdd  # Extract rdd
           .zipWithIndex()  # Add index
           .map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]]))  # Map to rows
           .toDF(schema))  # It will work without schema but will be more expensive
# `indexes` is a list of the row indices to keep (use inSet instead of isin in Spark < 1.3)
indexed.where(col("index").isin(indexes))
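A quick usage sketch continuing the example above (the index list is only an illustration):

# With the data above, indices 0, 2 and 4 correspond to 'a', 'c' and 'e'.
indexes = [0, 2, 4]
for r in indexed.where(col("index").isin(indexes)).collect():
    print(r.char, r.index)
## a 0
## c 2
## e 4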