This article looks at how to reproduce PySpark's flatMap in pandas; it may be a useful reference for anyone facing the same problem.
Question
Is there an operation in pandas equivalent to PySpark's flatMap?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
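For reference, the apply-plus-itertools.chain idea mentioned above can be sketched roughly like this (a minimal sketch using the same toy data as the answer below):

```python
import itertools

import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# Flatten the lists row by row; chain.from_iterable concatenates the
# inner lists in order, so the result keeps row-major order.
flat = pd.Series(itertools.chain.from_iterable(df['x']))
```

Note that this yields the elements in row order (1, 2, 3, 4, 5), whereas the unstack trick in the answer below interleaves them column by column.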
Answer
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0      1
1      3
2      2
3      4
4    NaN
5      5
dtype: float64
The introduction of NaN is because the intermediate object creates a MultiIndex, but for a lot of things you can just drop that:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0      1
1      3
2      2
3      4
5      5
dtype: float64
This trick uses all pandas code, so I would expect it to be reasonably efficient, though it might not like things like very differently sized lists.
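For what it's worth, newer pandas releases (0.25 and later) ship a built-in one-step solution, Series.explode, which avoids both the NaN padding and the column-major reordering of the unstack trick; a minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# explode() repeats the original index once per list element;
# reset_index(drop=True) then gives a clean 0..n-1 range index.
flat = df['x'].explode().reset_index(drop=True)
```

One caveat: explode returns an object-dtype Series, so cast with .astype(int) (or similar) if a numeric dtype is needed downstream.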