PySpark: merging 2 columns of sets
Question
I have a Spark DataFrame with 2 columns produced by the collect_set function. I would like to combine these 2 columns of sets into 1 column containing a single set. How should I do that? Both are sets of strings.
For instance, I have 2 columns formed by calling collect_set:
Fruits | Meat
[Apple, Orange, Pear] | [Beef, Chicken, Pork]
How do I turn that into:
Food
[Apple,Orange,Pear, Beef, Chicken, Pork]
Thank you very much in advance for your help.
Recommended answer
Let's say df has
+--------------------+--------------------+
| Fruits| Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+
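then
import itertools
# For each row, chain the two array columns into one flat Python list,
# then collect the per-row lists back to the driver.
df.rdd.map(lambda x: list(itertools.chain(x.Fruits, x.Meat))).collect()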
combines Fruits and Meat into a single list per row (a value present in both columns would appear twice, since the result is a list rather than a true set), i.e.
[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
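If a value can occur in both columns, chaining alone keeps it twice. The following is a minimal sketch of a per-row merge that also removes duplicates. The helper name merge_unique and the namedtuple standing in for a Spark Row are assumptions made for illustration, so the example runs without a Spark session; the extra 'Apple' in Meat is made-up sample data to show the de-duplication.

```python
import itertools
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
Row = namedtuple("Row", ["Fruits", "Meat"])


def merge_unique(row):
    """Chain both list columns and drop duplicates, keeping first-seen order."""
    combined = itertools.chain(row.Fruits, row.Meat)
    # dict keys keep insertion order (Python 3.7+), so this de-duplicates stably.
    return list(dict.fromkeys(combined))


# Illustrative data: 'Apple' is deliberately repeated in the second column.
rows = [Row(Fruits=["Pear", "Orange", "Apple"],
            Meat=["Chicken", "Pork", "Beef", "Apple"])]

food = [merge_unique(r) for r in rows]
print(food)
```

With a real DataFrame the same function could presumably be passed to df.rdd.map(merge_unique). On Spark 2.4 or later, pyspark.sql.functions.array_union is another option that merges two array columns without duplicates while staying at the DataFrame level.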
Hope this helps!