Question
Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating; they should be picked from the list.
How do I do that?
Answer
You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list; just keep passing them as arguments.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat

spark = SparkSession.builder.getOrCreate()

# Create an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
In the concat() function, you pass all the columns you need to concatenate, like concat('col1','col2'). If you have a list, you can unpack it with the * operator, so concat(*['col1','col2']) is equivalent to concat('col1','col2').
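To see the unpacking mechanic in plain Python, independent of Spark, here is a tiny demonstration (the helper f below is purely illustrative):

# A throwaway helper that just returns the positional arguments it receives.
def f(*args):
    return args

print(f(*['col1','col2']))  # ('col1', 'col2') - the list is expanded into separate arguments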
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|  A1|  11|  A3|  A4|             A111|
|  B1|  22|  B3|  B4|             B122|
|  C1|  33|  C3|  C4|             C133|
+----+----+----+----+-----------------+
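If you need a separator between the values, pyspark.sql.functions.concat_ws() accepts the same unpacked list after a separator argument. Unlike concat(), which returns NULL as soon as any input column is NULL, concat_ws() skips NULL inputs. A minimal sketch, reusing the df and col_list from above:

from pyspark.sql.functions import concat_ws

# The first argument is the separator; the unpacked list supplies the columns.
df = df.withColumn('concatenated_cols', concat_ws('-', *col_list))
df.show()  # concatenated_cols is now A1-11, B1-22, C1-33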