

I have a Spark (2.4.0) data frame with a column that has just two values (either 0 or 1). I need to calculate the streak of consecutive 0s and 1s in this data, resetting the streak to zero if the value changes.


from pyspark.sql import (SparkSession, Window)
from pyspark.sql.functions import (to_date, row_number, lead, col)

spark = SparkSession.builder.appName('test').getOrCreate()

# Create dataframe
df = spark.createDataFrame([
    ('2018-01-01', 'John', 0, 0),
    ('2018-01-01', 'Paul', 1, 0),
    ('2018-01-08', 'Paul', 3, 1),
    ('2018-01-08', 'Pete', 4, 0),
    ('2018-01-08', 'John', 3, 0),
    ('2018-01-15', 'Mary', 6, 0),
    ('2018-01-15', 'Pete', 6, 0),
    ('2018-01-15', 'John', 6, 1),
    ('2018-01-15', 'Paul', 6, 1),
], ['str_date', 'name', 'value', 'flag'])

df.orderBy('name', 'str_date').show()
## +----------+----+-----+----+
## |  str_date|name|value|flag|
## +----------+----+-----+----+
## |2018-01-01|John|    0|   0|
## |2018-01-08|John|    3|   0|
## |2018-01-15|John|    6|   1|
## |2018-01-15|Mary|    6|   0|
## |2018-01-01|Paul|    1|   0|
## |2018-01-08|Paul|    3|   1|
## |2018-01-15|Paul|    6|   1|
## |2018-01-08|Pete|    4|   0|
## |2018-01-15|Pete|    6|   0|
## +----------+----+-----+----+

With this data, I'd like to calculate the streak of consecutive zeros and ones, ordered by date and "windowed" by name:

# Expected result:
## +----------+----+-----+----+--------+--------+
## |  str_date|name|value|flag|streak_0|streak_1|
## +----------+----+-----+----+--------+--------+
## |2018-01-01|John|    0|   0|       1|       0|
## |2018-01-08|John|    3|   0|       2|       0|
## |2018-01-15|John|    6|   1|       0|       1|
## |2018-01-15|Mary|    6|   0|       1|       0|
## |2018-01-01|Paul|    1|   0|       1|       0|
## |2018-01-08|Paul|    3|   1|       0|       1|
## |2018-01-15|Paul|    6|   1|       0|       2|
## |2018-01-08|Pete|    4|   0|       1|       0|
## |2018-01-15|Pete|    6|   0|       2|       0|
## +----------+----+-----+----+--------+--------+


Of course, I would need the streak to reset itself to zero if the 'flag' changes.


Is there a way of doing this?



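This calls for a difference-of-row-numbers approach: first group consecutive rows that share the same flag value, then number the rows within each group. Per name, the row number over all rows minus the row number among rows with the same flag stays constant for as long as the flag does not change, so that difference (together with the name and flag) identifies each run.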

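from pyspark.sql import Window
from pyspark.sql import functions as f

# Window definitions: w1 numbers every row per name, w2 numbers rows per (name, flag).
# The date column in the question's dataframe is 'str_date'; ISO date strings sort chronologically.
w1 = Window.partitionBy('name').orderBy('str_date')
w2 = Window.partitionBy('name', 'flag').orderBy('str_date')

# The difference of the two row numbers is constant within a run of equal flags
res = df.withColumn('grp', f.row_number().over(w1) - f.row_number().over(w2))

# Window definition for the streak: number the rows inside each run
w3 = Window.partitionBy('name', 'flag', 'grp').orderBy('str_date')
streak_res = res.withColumn('streak_0', f.when(f.col('flag') == 1, 0).otherwise(f.row_number().over(w3))) \
                .withColumn('streak_1', f.when(f.col('flag') == 0, 0).otherwise(f.row_number().over(w3)))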
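
To see why the row-number difference isolates each run, the same bookkeeping can be traced in plain Python for a single name. This is only an illustrative sketch, not Spark code: the helper `streaks` is hypothetical, and its input is assumed to be one name's flags already sorted by date.

```python
from collections import defaultdict


def streaks(flags):
    """Trace the row-number-difference trick for one name's flags.

    `flags` is assumed to be a list of 0/1 values already ordered by date.
    Returns one (grp, streak_0, streak_1) tuple per input row.
    """
    rows_per_flag = defaultdict(int)   # plays the role of row_number() over (name, flag)
    rows_per_group = defaultdict(int)  # plays the role of row_number() over (name, flag, grp)
    result = []
    # rn_all plays the role of row_number() over (name)
    for rn_all, flag in enumerate(flags, start=1):
        rows_per_flag[flag] += 1
        # Constant while the flag repeats, changes once the flag has flipped
        grp = rn_all - rows_per_flag[flag]
        rows_per_group[(flag, grp)] += 1
        streak = rows_per_group[(flag, grp)]
        result.append((grp,
                       streak if flag == 0 else 0,
                       streak if flag == 1 else 0))
    return result


# John's flags from the question, ordered by date
print(streaks([0, 0, 1]))  # [(0, 1, 0), (0, 2, 0), (2, 0, 1)]
```

Note that `grp` on its own is not unique across flag values, which is why the streak window partitions by flag and `grp` together rather than by `grp` alone.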

