Problem Description
How do you set the display precision in PySpark when calling .show()?
Consider the following example:
from math import sqrt
import pyspark.sql.functions as f

# build two columns of square roots; sqlCtx is the SQLContext provided by the
# PySpark shell (on Spark 2+, use spark.createDataFrame instead)
data = zip(
    map(lambda x: sqrt(x), range(100, 105)),
    map(lambda x: sqrt(x), range(200, 205))
)
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()
which outputs:
#+------------------+------------------+
#| col1| col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+
How can I change it so that it only displays 3 digits after the decimal point?
Desired output:
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This is the PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it may help others in the future.
Recommended Answer
round
The easiest option is to use pyspark.sql.functions.round():
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This will keep the values as numeric types.
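To verify, dtypes shows the rounded columns are still doubles (a quick check, reusing the df defined above):
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).dtypes
#[('col1', 'double'), ('col2', 'double')]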
The functions are the same for Scala and Python; the only difference is the import.
format_number
You can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation:
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#| col1| col2|
#+------+------+
#|10.099|14.213|
#+------+------+
The transformed columns will be of StringType, and a comma is used as a thousands separator, as in the output below.
#+-----------+--------------+
#| col1| col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
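dtypes confirms that format_number yields strings rather than numbers (reusing the df from the question):
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).dtypes
#[('col1', 'string'), ('col2', 'string')]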
As stated in the Scala version of this answer, we can use regexp_replace to replace the , with any string you want:
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#| col1| col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
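If you need a numeric column back after stripping the separator, you can cast the cleaned string to double (a sketch, not part of the original answer; note the display then loses the fixed three decimals, so plain round() is usually simpler):
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    # strip the separator, then cast the string back to a numeric type
    [regexp_replace(format_number(avg(c), 3), ",", "").cast("double").alias(c)
     for c in df.columns]
).dtypes
#[('col1', 'double'), ('col2', 'double')]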