This article covers how to conditionally add a column to a data frame in PySpark.
Problem Description

I have a data frame in PySpark. I would like to add a column to the data frame conditionally.
Say if the data frame doesn't have the column, then add a column with null values. If the column is present, do nothing and return the same data frame as a new data frame.
How do I pass the conditional statement in PySpark?
Recommended Answer
It is not hard, but you'll need a bit more than a column name to do it right. Required imports:
from pyspark.sql import types as t
from pyspark.sql.functions import lit
from pyspark.sql import DataFrame
Example data:
df = sc.parallelize([("a", 1, [1, 2, 3])]).toDF(["x", "y", "z"])
A helper function (strip the type annotations if you're on a legacy Python version):
def add_if_not_present(df: DataFrame, name: str, dtype: t.DataType) -> DataFrame:
    return (df if name in df.columns
            else df.withColumn(name, lit(None).cast(dtype)))
Example usage:
add_if_not_present(df, "foo", t.IntegerType())
DataFrame[x: string, y: bigint, z: array<bigint>, foo: int]
add_if_not_present(df, "x", t.IntegerType())
DataFrame[x: string, y: bigint, z: array<bigint>]
add_if_not_present(df, "foobar",
                   t.StructType([
                       t.StructField("foo", t.IntegerType()),
                       t.StructField("bar", t.IntegerType())]))
DataFrame[x: string, y: bigint, z: array<bigint>, foobar: struct<foo:int,bar:int>]