I want to get percentage frequencies in PySpark. In Python (with pandas) I do it as follows:

Companies = df['Company'].value_counts(normalize = True)

Getting the frequencies is fairly simple:
# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
                    FROM Comp \
                    GROUP BY Company \
                    ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+
|             Company| cnt|
+--------------------+----+
|BANK OF AMERICA, ...|1387|
|       EQUIFAX, INC.|1285|
|WELLS FARGO & COM...|1119|
|Experian Informat...|1115|
|TRANSUNION INTERM...|1001|
|JPMORGAN CHASE & CO.| 905|
|      CITIBANK, N.A.| 772|
|OCWEN LOAN SERVIC...| 481|

How do I get percentage frequencies from here? I've tried a number of things, but without much luck.
Any help would be appreciated.

Best Answer

As Suresh hinted in the comments, assuming total_count is the total number of rows in df, you can use withColumn to add a new column called percentage to CompDF:

total_count = df.count()

CompDF = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
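If you prefer to stay entirely in the DataFrame API, here is a minimal self-contained sketch of the same computation that uses a window function for the grand total, so no separate count() action is needed. It assumes the same df and Company column as in the question; the global window (and the helper name everything) is my own addition, not part of the accepted answer.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

everything = Window.partitionBy()  # empty partition spec = the whole dataframe

# Count complaints per company, then divide each count by the grand total
# taken over the global window, all inside a single query plan.
CompDF = (df.groupBy('Company')
            .count()
            .withColumnRenamed('count', 'cnt')
            .withColumn('percentage', F.col('cnt') / F.sum('cnt').over(everything))
            .orderBy(F.desc('cnt')))
CompDF.show()

The same ratio can also be computed directly in the original SQL query by adding count(*) / sum(count(*)) OVER () AS percentage to the SELECT list. A global window moves all rows to a single partition, which is harmless here because the window runs after the groupBy, over one row per company.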

Regarding "pyspark - How to get percentage frequencies in pyspark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46574860/
