Problem description
I have a column (myCol) in a Spark DataFrame that holds the values 1 and 2, and I want to create a new column with a description of each value, e.g. 1 -> 'A', 2 -> 'B', etc.
I know this can be done with a join, but I tried the following because it seems more elegant:
from pyspark.sql.functions import udf

dictionary = {1: 'A', 2: 'B'}
add_descriptions = udf(lambda x, dictionary: dictionary[x] if x in dictionary.keys() else None)
df.withColumn("description", add_descriptions(df.myCol, dictionary))
It fails with the error:
lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Is it not possible to have a user-defined function that takes a dictionary as an argument?
Recommended answer
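It is possible; you just have to do it a bit differently. Wrap the UDF in a function that takes the dictionary, so the dictionary is captured in a closure instead of being passed as a column argument: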
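from pyspark.sql.functions import udf

dictionary = {1: 'A', 2: 'B'}

def add_descriptions(in_dict):
    # in_dict is captured by the closure; the UDF itself only receives a column.
    def f(x):
        return in_dict.get(x)
    return udf(f)

df.withColumn(
    "description",
    add_descriptions(dictionary)(df.myCol)
)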
If you want to pass the dict directly to the UDF, note that UDFs only accept columns as arguments, so you would need a map column in place of the dict.
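A map column can be built from the dict with `create_map`, which takes alternating key and value columns. The following is only a minimal sketch: the helper names `flatten_pairs` and `with_description` are hypothetical, and the column names `myCol` and `description` are taken from the question. It indexes the map column directly rather than passing it to a UDF.

```python
from itertools import chain


def flatten_pairs(mapping):
    """Turn {1: 'A', 2: 'B'} into [1, 'A', 2, 'B'], the order create_map expects."""
    return list(chain.from_iterable(mapping.items()))


def with_description(df, mapping, key_col="myCol", out_col="description"):
    """Hypothetical helper: add out_col by looking up key_col in a map column."""
    # Imported here so that flatten_pairs can be used without Spark installed.
    from pyspark.sql.functions import col, create_map, lit

    # Wrap every key and value in lit() to get literal columns for create_map.
    mapping_expr = create_map(*[lit(v) for v in flatten_pairs(mapping)])
    # Index the map column by the key column; keys absent from the map
    # should come back as null under Spark's default settings.
    return df.withColumn(out_col, mapping_expr[col(key_col)])
```

With the DataFrame from the question, the call would be `with_description(df, dictionary)`. All keys must share one type and all values another, since a Spark map column is homogeneous.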