Question
I want to add a new column to a DataFrame whose values are either 0 or 1. I tried the 'randint' function from the random module:
from random import randint
df1 = df.withColumn('isVal',randint(0,1))
But this raises an error.
How can I use a custom function, or the randint function, to generate random values for the column?
Recommended answer
You are using Python's built-in random module. randint(0, 1) is evaluated once, before Spark sees it, and returns a single constant Python integer rather than a per-row expression.
As the error message shows, withColumn expects a Column that represents an expression.
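A quick check in plain Python, with no Spark involved, illustrates the point. This is only a sketch to show what the second argument in the question actually was:

```python
from random import randint

# randint runs immediately and hands back an ordinary Python int.
value = randint(0, 1)
print(type(value).__name__, value)

# So df.withColumn('isVal', randint(0, 1)) passes a bare 0 or 1,
# not a Column expression that Spark could evaluate for each row.
```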
To get one, build the column from Spark's own functions:
from pyspark.sql.functions import rand,when
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))
rand() draws each row's value from a uniform distribution on [0, 1), so the column is 0 or 1 with roughly equal probability. See the functions documentation for more options (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)