This article describes how to concatenate multiple columns of a DataFrame with PySpark. The question and recommended answer below should serve as a useful reference.

Problem Description

Suppose I have a list of columns, for example:

col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']

I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating, but need to pick them from the list.

How can I do this?

Recommended Answer

You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list. Just keep passing them as arguments.

from pyspark.sql.functions import concat
# Create an example DataFrame (assumes an active SparkSession named `spark`)
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values, ['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+

In the concat() function, you pass all the columns you need to concatenate, like concat('col1', 'col2'). If you have a list, you can unpack it with *, so concat(*['col1','col2']) is equivalent to concat('col1','col2').

col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|  A1|  11|  A3|  A4|             A111|
|  B1|  22|  B3|  B4|             B122|
|  C1|  33|  C3|  C4|             C133|
+----+----+----+----+-----------------+
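
As a side note, if you also want a separator between the joined values, pyspark.sql.functions.concat_ws() takes the separator first and then the columns, and the list can be unpacked the same way with *. Below is a minimal sketch under the same assumptions (an active SparkSession named spark and the example df above); the column name concatenated_with_sep and the '_' separator are just illustrative choices.

from pyspark.sql.functions import concat_ws

col_list = ['col1','col2']
# concat_ws takes the separator as its first argument, then the columns to join;
# non-string columns are cast to string, so rows produce values like 'A1_11'
df = df.withColumn('concatenated_with_sep', concat_ws('_', *col_list))
df.show()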

That concludes this article on concatenating multiple columns of a DataFrame with PySpark. We hope the recommended answer is helpful.
