Question
I am new to Spark. I want to do a broadcast join, and before that I am trying to get the size of the DataFrame that I want to broadcast.
Is there any way to find the size of a DataFrame?
I am using Python as my programming language for Spark.
Any help is much appreciated.
Recommended answer
If you are looking for the size in bytes as well as the size in row count, follow this (the snippets below are written in Scala):
// ### Alternative 1
import org.apache.spark.sql.functions.col // needed for the col("col_name") filters below
/**
* file content
* spark-test-data.json
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
val fileName = "spark-test-data.json"
val path = getClass.getResource("/" + fileName).getPath
spark.catalog.createTable("df", path, "json")
  .show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1  |abc1|
* |2  |abc2|
* |3  |abc3|
* +---+----+
*/
// Collect only statistics that do not require scanning the whole table (that is, size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+---------+-------+
* |col_name  |data_type|comment|
* +----------+---------+-------+
* |Statistics|68 bytes |       |
* +----------+---------+-------+
*/
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+----------------+-------+
* |col_name  |data_type       |comment|
* +----------+----------------+-------+
* |Statistics|68 bytes, 3 rows|       |
* +----------+----------------+-------+
*/
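Since the question is about Python, here is a minimal sketch of how the Statistics value shown above could be turned into numbers on the Python side. The helper name `parse_table_statistics` is hypothetical, and the sketch assumes the string has the same shape as in the output above ("68 bytes" or "68 bytes, 3 rows"); other Spark versions may format it differently.

```python
import re

def parse_table_statistics(stats):
    """Parse a string such as '68 bytes, 3 rows' into (size_in_bytes, row_count).

    Hypothetical helper: either element is None when that part is absent,
    for example the row count after ANALYZE ... NOSCAN.
    """
    size_match = re.search(r"(\d+)\s+bytes", stats)
    rows_match = re.search(r"(\d+)\s+rows", stats)
    size = int(size_match.group(1)) if size_match else None
    rows = int(rows_match.group(1)) if rows_match else None
    return size, rows

print(parse_table_statistics("68 bytes"))          # (68, None)
print(parse_table_statistics("68 bytes, 3 rows"))  # (68, 3)
```

In PySpark the input string would come from the `data_type` column of the `DESCRIBE EXTENDED` row whose `col_name` is `Statistics`.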
// ### Alternative 2
val df = spark.range(10)
df.createOrReplaceTempView("myView")
spark.sql("explain cost select * from myView").show(false)
/**
* +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
* |plan |
* +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
* |== Optimized Logical Plan ==
* Range (0, 10, step=1, splits=Some(2)), Statistics(sizeInBytes=80.0 B, hints=none)
*
* == Physical Plan ==
* *(1) Range (0, 10, step=1, splits=2)|
* +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
*/
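The plan text above carries the estimate as `sizeInBytes=80.0 B`. A rough Python sketch for extracting that value from the plan string follows. The function name `parse_size_in_bytes` is hypothetical, and the unit table is an assumption: it accepts both the binary spellings (KiB, MiB, ...) and the older KB/MB spellings, treating all of them as powers of 1024.

```python
import re

# Assumed unit spellings, all treated as powers of 1024.
_UNIT_FACTORS = {
    "B": 1,
    "KiB": 1024, "KB": 1024,
    "MiB": 1024 ** 2, "MB": 1024 ** 2,
    "GiB": 1024 ** 3, "GB": 1024 ** 3,
    "TiB": 1024 ** 4, "TB": 1024 ** 4,
}

def parse_size_in_bytes(plan_text):
    """Return the sizeInBytes estimate found in an 'explain cost' plan, or None."""
    match = re.search(
        r"sizeInBytes=([0-9.]+(?:E[+-]?[0-9]+)?)\s*([A-Za-z]+)", plan_text
    )
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit not in _UNIT_FACTORS:
        raise ValueError("unrecognized size unit: " + unit)
    return int(value * _UNIT_FACTORS[unit])

plan = ("Range (0, 10, step=1, splits=Some(2)), "
        "Statistics(sizeInBytes=80.0 B, hints=none)")
print(parse_size_in_bytes(plan))  # 80
```

Because the printed value is rounded for display, this recovers only an approximation for larger sizes.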
// ### Alternative 3
println(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes)
// 80
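For PySpark, a commonly used workaround is to reach the same optimizer statistics through the DataFrame's underlying JVM object. The sketch below assumes that `df._jdf` (a private, unsupported attribute) exposes `queryExecution().optimizedPlan().stats().sizeInBytes()` and that the result converts to a Python int via its string form; both are assumptions that may not hold on every Spark version. The helper names are hypothetical. The 10 MB constant is the documented default of `spark.sql.autoBroadcastJoinThreshold`.

```python
# Default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def estimated_size_bytes(df):
    """Return the optimizer's size estimate for a PySpark DataFrame.

    Assumption: relies on the private df._jdf handle; the JVM returns a
    big-integer object, so it is converted through its string form.
    """
    stats = df._jdf.queryExecution().optimizedPlan().stats()
    return int(str(stats.sizeInBytes()))

def fits_broadcast(size_in_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """True when the estimate is non-negative and within the threshold.

    A threshold of -1 (automatic broadcast disabled) always yields False.
    """
    return 0 <= size_in_bytes <= threshold

# With a live SparkSession one might write:
#   size = estimated_size_bytes(spark.range(10))
print(fits_broadcast(80))  # True
```

The figure is an estimate from the optimizer, not a measurement, so it is best treated as a hint when deciding whether to broadcast.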