Problem description
I have a data frame:
user day amount
a 2 10
a 1 14
a 4 5
b 1 4
You can see that the maximum value of day is 4 and the minimum is 1. I want to fill 0 in the amount column for all missing days of all users, so the above data frame will become:
user day amount
a 2 10
a 1 14
a 4 5
a 3 0
b 1 4
b 2 0
b 3 0
b 4 0
How could I do that in PySpark? Many thanks.
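For anyone who wants to run the answer below end to end, here is a minimal sketch of building this example DataFrame (the SparkSession setup and variable names are assumptions, not part of the original question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 2, 10), ("a", 1, 14), ("a", 4, 5), ("b", 1, 4)],
    ["user", "day", "amount"],
)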
Recommended answer
Here is one approach. You can get the min and max values first, then group on the user column and pivot, then fill in the missing columns and fill all nulls with 0, and finally stack them back:
from pyspark.sql import functions as F

# get the overall min and max of day
min_max = df.agg(F.min("day"), F.max("day")).collect()[0]

# pivot day values into columns, one row per user, with nulls filled as 0
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)

# add a 0-valued column for every day in [min, max] that the pivot missed
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0], min_max[1] + 1)
                if str(i) not in df1.columns]
df1 = df1.select("*", *missing_cols)
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
# unpivot: build an array of (day, amount) structs from the pivoted columns,
# then explode it back into one row per user per day
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
                         for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
    .select(["user"] + ["kvs.day", "kvs.amount"]).orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
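As an aside, the same unpivot step can also be expressed with Spark SQL's stack function; this is just a sketch of an alternative, not the original answer's method, and assumes the df1 built above:

n = len(df1.columns) - 1  # number of pivoted day columns
stack_expr = "stack({}, {}) as (day, amount)".format(
    n, ", ".join("'{0}', `{0}`".format(c) for c in df1.columns[1:]))
df1.select("user", F.expr(stack_expr)).orderBy("user").show()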
Note: since the day column was pivoted, the dtypes might have changed (for example, day comes back as a string here), so you may have to cast them back to the original dtypes.
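For example, here is a minimal sketch of casting day back to an integer (the df2 name is introduced for illustration and is not part of the original answer):

df2 = (df1.select(["user"] + [arr])
          .select(["user"] + ["kvs.day", "kvs.amount"]))
df2 = df2.withColumn("day", F.col("day").cast("int"))
df2.orderBy("user", "day").show()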