Question
Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating; they should be picked from the list.
How do I do that?
Answer
You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list; just keep passing them as arguments.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat

spark = SparkSession.builder.getOrCreate()

# Create an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
In the concat() function, you pass all the columns you need to concatenate, like concat('col1','col2'). If you have a list, you can unpack it with the * operator, so concat(*['col1','col2']) is equivalent to concat('col1','col2').
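To see the unpacking mechanic in plain Python, independent of Spark, here is a tiny demonstration (the helper f below is purely illustrative):

# A throwaway helper that just returns the positional arguments it receives.
def f(*args):
    return args

print(f(*['col1','col2']))  # ('col1', 'col2') - the list is expanded into separate arguments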
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|  A1|  11|  A3|  A4|             A111|
|  B1|  22|  B3|  B4|             B122|
|  C1|  33|  C3|  C4|             C133|
+----+----+----+----+-----------------+
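If you need a separator between the values, pyspark.sql.functions.concat_ws() accepts the same unpacked list after a separator argument. Unlike concat(), which returns NULL as soon as any input column is NULL, concat_ws() skips NULL inputs. A minimal sketch, reusing the df and col_list from above:

from pyspark.sql.functions import concat_ws

# The first argument is the separator; the unpacked list supplies the columns.
df = df.withColumn('concatenated_cols', concat_ws('-', *col_list))
df.show()  # concatenated_cols is now A1-11, B1-22, C1-33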