Problem description
I have an RDD in which each element has the following format:
['979500797', ' 979500797,260973244733,2014-05-0402:05:12,645/01/105/9931,78,645/01/105/9931,a1,forward;979500797,260972593713,2014-05-0407:05:04,645/01/105/9931,22,645/01/105/863,a4,forward']
I want to transform it into another RDD such that the key stays the same, i.e. 979500797, but the value is the result of splitting on ';'. In other words, the final output should be:
[
['979500797', ' 979500797,260973244733,2014-05-0402:05:12,645/01/105/9931,78,645/01/105/9931,a1,forward']
['979500797','979500797,260972593713,2014-05-0407:05:04,645/01/105/9931,22,645/01/105/863,a4,forward']
]
I have been trying to use map like this
df_feat3 = df_feat2.map(lambda (x, y):(x, y.split(';')))
but it does not seem to work.
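For reference, splitting inside map keeps one output record per input record, so the key ends up paired with a whole list rather than one pair per segment. A minimal sketch of that behaviour, assuming a local SparkContext sc and a shortened sample value:

# Sketch only: shows why map() alone does not flatten the result.
# Assumes a SparkContext `sc` is already available; the value is shortened for readability.
rdd = sc.parallelize([('979500797', 'a1,forward;a4,forward')])
mapped = rdd.map(lambda kv: (kv[0], kv[1].split(';')))
# mapped.collect() -> [('979500797', ['a1,forward', 'a4,forward'])]
# i.e. still a single record whose value is a list, not two (key, value) pairs.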
Recommended answer
What you need here is flatMap. flatMap takes a function that returns a sequence and concatenates the results.
df_feat3 = df_feat2.flatMap(lambda (x, y): ((x, v) for v in y.split(';')))
On a side note, I would avoid using tuple parameters. It is a cool feature, but it is no longer available in Python 3. See PEP 3113.
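A minimal Python 3 compatible sketch of the same flatMap, indexing into the pair instead of unpacking it in the lambda signature (df_feat2 is assumed to be the (key, string) RDD from the question):

# Python 3 friendly version: no tuple parameters (removed by PEP 3113).
df_feat3 = df_feat2.flatMap(lambda kv: ((kv[0], v) for v in kv[1].split(';')))
# df_feat3.collect() yields one (key, segment) pair per ';'-separated piece:
# [('979500797', ' 979500797,...,a1,forward'),
#  ('979500797', '979500797,...,a4,forward')]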