Problem Description
Following up on this question and its dataframes, I am trying to convert one dataframe
into another (I know it looks the same, but refer to the next line of code to see the difference):
In pandas, I used the line of code teste_2 = (value/value.groupby(level=0).sum()),
and in PySpark I tried several solutions; the first one was:
df_2 = (df/df.groupby(["age"]).sum())
However, I am getting the following error: TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'
The second one was:
df_2 = (df.filter(col('Siblings'))/gr.groupby(col('Age')).sum())
But it's still not working. Can anyone help me?
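For context, here is a minimal pandas sketch of what the one-liner above computes, using assumed toy data rather than the original dataframe (transform('sum') is used here so the group totals align back to the original index):

import pandas as pd

# Toy stand-in for `value`: one count per (age, siblings) pair; assumed
# data for illustration, not the asker's actual dataframe
value = pd.Series(
    [1, 1, 1],
    index=pd.MultiIndex.from_tuples(
        [(15, 0), (10, 3), (14, 1)], names=['age', 'siblings']
    ),
)

# Divide each count by the total count of its age group
teste_2 = value / value.groupby(level=0).transform('sum')
print(teste_2)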
Hope I've understood the question correctly. It seems you want to divide each count by the sum of counts within its age group. Unlike pandas objects, Spark DataFrames don't support element-wise arithmetic between two DataFrames (hence the TypeError above); a window function can compute the per-group sum instead:
from pyspark.sql import functions as F, Window

# Count rows per (age, siblings) pair, then divide each count by the
# total count of its age group, computed over a window partitioned by age
df2 = df.groupBy('age', 'siblings').count().withColumn(
    'count',
    F.col('count') / F.sum('count').over(Window.partitionBy('age'))
)
df2.show()
+---+--------+-----+
|age|siblings|count|
+---+--------+-----+
| 15| 0| 1.0|
| 10| 3| 1.0|
| 14| 1| 1.0|
+---+--------+-----+
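As a quick sanity check (a minimal sketch assuming the df2 above), the normalized counts should total 1.0 within every age group:

# Fractions within each age group should sum to 1.0
df2.groupBy('age').agg(F.sum('count').alias('total')).show()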