I want to get the percent frequency in pyspark. In Python I do it like this:
Companies = df['Company'].value_counts(normalize = True)
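For reference, value_counts(normalize=True) returns each company's share of the rows rather than a raw count. A minimal standalone illustration with made-up data:
import pandas as pd

toy = pd.DataFrame({'Company': ['A', 'A', 'B', 'C']})
print(toy['Company'].value_counts(normalize=True))
# A    0.50
# B    0.25
# C    0.25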
Getting the raw frequencies is fairly simple:
# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
FROM Comp \
GROUP BY Company \
ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+
| Company| cnt|
+--------------------+----+
|BANK OF AMERICA, ...|1387|
| EQUIFAX, INC.|1285|
|WELLS FARGO & COM...|1119|
|Experian Informat...|1115|
|TRANSUNION INTERM...|1001|
|JPMORGAN CHASE & CO.| 905|
| CITIBANK, N.A.| 772|
|OCWEN LOAN SERVIC...| 481|
How do I get the percent frequencies from here? I have tried many things, but with no luck.
Any help would be appreciated.
Best Answer
As Suresh hinted in the comments, assuming total_count is the total number of rows in df, you can use withColumn to add a new column called percentage to CompDF:
total_count = df.count()
CompDF = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
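If you want to avoid the extra count() job, the same ratio can also be computed in one pass with a window function spanning the whole DataFrame. This is a sketch using the question's column name cnt, not part of the original answer:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy()  # no partition key: one window over all rows
CompDF = CompDF.withColumn('percentage', F.col('cnt') / F.sum('cnt').over(w))
CompDF.show()
Note that Spark warns when a window has no partitioning, since all rows are moved to a single partition; for a small aggregated frame like CompDF this is harmless.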
On the topic of pyspark - how to get the percent frequency in pyspark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46574860/