Problem Description
Following up on this question and its dataframes, I am trying to convert one dataframe
into another (I know it looks the same, but refer to the next line of code to see the difference):
In pandas, I used the line of code teste_2 = (value/value.groupby(level=0).sum()),
and in PySpark I tried several solutions; the first one was:
df_2 = (df/df.groupby(["age"]).sum())
However, I am getting the following error: TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'
The second one was:
df_2 = (df.filter(col('Siblings'))/gr.groupby(col('Age')).sum())
But it's still not working. Can anyone help me?
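For context, here is a minimal pandas sketch of what the one-liner above computes, using assumed toy data rather than the original dataframe (transform('sum') is used here so the group totals align back to the original index):

import pandas as pd

# Toy stand-in for `value`: one count per (age, siblings) pair; assumed
# data for illustration, not the asker's actual dataframe
value = pd.Series(
    [1, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [(15, 0), (10, 3), (14, 1)], names=['age', 'siblings']
    ),
)

# Divide each count by the total count of its age group
teste_2 = value / value.groupby(level=0).transform('sum')
print(teste_2)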
Hope I've understood the question correctly. It seems you want to divide each count by the sum of counts within its age group. Unlike pandas objects, Spark DataFrames don't support element-wise arithmetic between two DataFrames (hence the TypeError above); a window function can compute the per-group sum instead:
from pyspark.sql import functions as F, Window

# Count rows per (age, siblings) pair, then divide each count by the
# total count of its age group, computed over a window partitioned by age
df2 = df.groupBy('age', 'siblings').count().withColumn(
    'count',
    F.col('count') / F.sum('count').over(Window.partitionBy('age'))
)
df2.show()
+---+--------+-----+
|age|siblings|count|
+---+--------+-----+
| 15| 0| 1.0|
| 10| 3| 1.0|
| 14| 1| 1.0|
+---+--------+-----+
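As a quick sanity check (a minimal sketch assuming the df2 above), the normalized counts should total 1.0 within every age group:

# Fractions within each age group should sum to 1.0
df2.groupBy('age').agg(F.sum('count').alias('total')).show()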