Question
Hi, I'm dealing with a slightly difficult file format which I'm trying to clean up for some future processing. I've been using PySpark to process the data into a dataframe.
The file looks similar to this:
AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456
Each 'AA' record defines the start of a logical group of records. The data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record types, always identified by a two-letter code at the start of each line. Each group can contain one or many different record types (i.e. not all record types are present for each group).
As a first stage, I've managed to group the records together in this format (a sketch of one way to produce such a grouping follows the table):
+-----+--------------------------------+
|index|                          result|
+-----+--------------------------------+
|    1|[AA 1234 ZXYW,BB A 890,CC B 321]|
|    2|[AA 1234 LMNO,BB D 123,CC E 321]|
|    3|         [AA 1234 ZXYW,CC B 321]|
+-----+--------------------------------+
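A minimal sketch of one way to build that grouping, assuming a live SparkContext sc, a file small enough to collect on the driver, and a file that begins with an 'AA' record; 'records.txt' is a hypothetical path:

# Sketch only: collect() pulls every line to the driver, so this assumes
# a small file; 'records.txt' is a hypothetical path standing in for the real one.
lines = sc.textFile('records.txt').collect()
groups, index = [], 0
for line in lines:
    if line.startswith('AA'):           # 'AA' opens a new logical group
        index += 1
        groups.append((index, [line]))
    else:                               # other records join the current group
        groups[-1][1].append(line)
df = sc.parallelize(groups).toDF(['index', 'result'])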
And as a second stage, I really want to get the data into the following columns in a dataframe:
+-----+--------------------------------+------------+--------+--------+
|index|                          result|          AA|      BB|      CC|
+-----+--------------------------------+------------+--------+--------+
|    1|[AA 1234 ZXYW,BB A 890,CC B 321]|AA 1234 ZXYW|BB A 890|CC B 321|
|    2|[AA 1234 LMNO,BB D 123,CC E 321]|AA 1234 LMNO|BB D 123|CC E 321|
|    3|         [AA 1234 ZXYW,CC B 321]|AA 1234 ZXYW|    null|CC B 321|
+-----+--------------------------------+------------+--------+--------+
Because at that point, extracting the information I need should be trivial.
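For example, with plain Python slicing on one of the sample lines (the offsets here are illustrative, not the real layout):

line = 'AA 1234 ZXYW'
record_type = line[0:2]   # 'AA'
field_1 = line[3:7]       # '1234'
field_2 = line[8:12]      # 'ZXYW'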
Does anyone have any suggestions as to how I might be able to do this?
Many thanks.
Answer
You can use flatMap and pivot to achieve this. Starting from the results of the first stage:
from pyspark.sql import SparkSession  # in the PySpark shell, spark/sc already exist

sc = SparkSession.builder.getOrCreate().sparkContext
rdd = sc.parallelize([(1, ['AA 1234 ZXYW', 'BB A 890', 'CC B 321']),
                      (2, ['AA 1234 LMNO', 'BB D 123', 'CC E 321']),
                      (3, ['AA 1234 ZXYW', 'CC B 321'])])
df = rdd.toDF(['index', 'result'])
You can first explode the array into multiple rows using flatMap, and extract the two-letter identifier into a separate column:
# Explode each group into one row per record, keeping the full record,
# the two-letter code (y[0:2]) and the remainder of the line (y[3:]).
df_flattened = df.rdd.flatMap(lambda x: [(x[0], y, y[0:2], y[3:]) for y in x[1]])\
                 .toDF(['index', 'result', 'identifier', 'identifiertype'])
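With the sample data, df_flattened.show() should produce something close to:

+-----+------------+----------+--------------+
|index|      result|identifier|identifiertype|
+-----+------------+----------+--------------+
|    1|AA 1234 ZXYW|        AA|     1234 ZXYW|
|    1|    BB A 890|        BB|         A 890|
|    1|    CC B 321|        CC|         B 321|
|    2|AA 1234 LMNO|        AA|     1234 LMNO|
|    2|    BB D 123|        BB|         D 123|
|    2|    CC E 321|        CC|         E 321|
|    3|AA 1234 ZXYW|        AA|     1234 ZXYW|
|    3|    CC B 321|        CC|         B 321|
+-----+------------+----------+--------------+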
and use pivot to change the two-letter identifier into column names:
from pyspark.sql.functions import first  # 'first' must be imported explicitly

# Pivot the codes into columns, then join the original frame back on 'index'.
df_result = df_flattened.groupBy('index')\
                        .pivot('identifier')\
                        .agg(first('identifiertype'))\
                        .join(df, 'index')
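With the sample data, df_result.show(truncate=False) should give roughly the following (row order may vary):

+-----+---------+-----+-----+----------------------------------+
|index|AA       |BB   |CC   |result                            |
+-----+---------+-----+-----+----------------------------------+
|1    |1234 ZXYW|A 890|B 321|[AA 1234 ZXYW, BB A 890, CC B 321]|
|2    |1234 LMNO|D 123|E 321|[AA 1234 LMNO, BB D 123, CC E 321]|
|3    |1234 ZXYW|null |B 321|[AA 1234 ZXYW, CC B 321]          |
+-----+---------+-----+-----+----------------------------------+

Note that because the aggregation takes identifiertype, each pivoted column holds only what follows the two-letter code. If you want the full record in each column, exactly as in the desired output above, aggregate first('result') from df_flattened instead; pivot still names the new columns AA, BB and CC, so there is no clash with the joined result column.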
I added the join to get the result column back.
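From there, the fixed-width extraction the question calls trivial can be done with substring; the offsets below are illustrative, since the real record layouts aren't given:

from pyspark.sql.functions import substring

# Illustrative only: slice positional fields out of the pivoted 'AA' column,
# which per the code above holds e.g. '1234 ZXYW'; substring() is 1-based.
df_extracted = df_result.withColumn('AA_num', substring('AA', 1, 4))\
                        .withColumn('AA_code', substring('AA', 6, 4))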