This post explains how to split/filter a PySpark DataFrame by column value; it may serve as a useful reference if you face the same problem.
Question
I have a DataFrame similar to this example:
Timestamp | Word | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
... | ... | ...
and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:
DF1
Timestamp | Word | Count
30/12/2015 | example_1 | 3
DF2
Timestamp | Word | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
DF3
Timestamp | Word | Count
27/12/2015 | example_3 | 7
Is there a way to do this with PySpark (1.6)?
Answer
It won't be efficient, but you can map a filter over the list of unique values:
# Collect the distinct words into a local Python list
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# Build one filtered DataFrame per distinct word
dfs = [df.where(df["Word"] == word) for word in words]
Post Spark 2.0, DataFrame no longer exposes flatMap directly, so go through .rdd:
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
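The pattern itself (collect the distinct keys, then build one filtered view per key) is independent of Spark. A minimal plain-Python sketch of the same idea, using a list of dicts in place of the DataFrame (the names here are illustrative, not Spark API):

```python
# The example table as plain dicts, standing in for DataFrame rows
rows = [
    {"Timestamp": "30/12/2015", "Word": "example_1", "Count": 3},
    {"Timestamp": "29/12/2015", "Word": "example_2", "Count": 1},
    {"Timestamp": "28/12/2015", "Word": "example_2", "Count": 9},
    {"Timestamp": "27/12/2015", "Word": "example_3", "Count": 7},
]

# Step 1: distinct values of the split column (mirrors distinct().collect())
words = sorted({row["Word"] for row in rows})

# Step 2: one filtered sub-table per distinct value (mirrors df.where(...))
tables = {word: [row for row in rows if row["Word"] == word] for word in words}

print(words)                     # ['example_1', 'example_2', 'example_3']
print(len(tables["example_2"]))  # 2
```

As in the Spark version, this does one full pass over the data per distinct key, which is why the answer notes it won't be efficient for many keys.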