Problem description
I want to use some string similarity functions that are not native to PySpark, such as the Jaro and Jaro-Winkler measures, on DataFrames. These are readily available in Python modules such as jellyfish. I can write PySpark UDFs that work fine when there are no null values present, e.g. comparing cat to dog. When I apply these UDFs to data where null values are present, they don't work. In problems such as the one I'm solving, it is very common for one of the strings to be null.
I need help getting my string similarity UDF to work in general and, more specifically, to work in cases where one of the values is null.
I wrote a UDF that works when there are no null values in the input data:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish

def jaro_winkler_func(df, column_left, column_right):
    # Wrap jellyfish's Jaro-Winkler function as a UDF that returns a double
    jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())

    df = (df
          .withColumn('test',
                      jaro_winkler_udf(df[column_left], df[column_right])))

    return df
Sample input and output:
+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
+-----------+------------+

+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
+-----------+------------+------------------+
When I run this on data that contains a null value, I get the usual reams of Spark errors; the most relevant one seems to be TypeError: str argument expected. I assume this is due to the null values in the data, since it worked when there were none.
I modified the function above to check whether both values are not null and to run the UDF only in that case, returning 0 otherwise.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F
import jellyfish

def jaro_winkler_func(df, column_left, column_right):
    jaro_winkler_udf = udf(f=lambda s1, s2: jellyfish.jaro_winkler(s1, s2), returnType=DoubleType())

    # Intended guard: call the UDF only when both columns are non-null
    df = (df
          .withColumn('test',
                      F.when(df[column_left].isNotNull() & df[column_right].isNotNull(),
                             jaro_winkler_udf(df[column_left], df[column_right]))
                      .otherwise(0.0)))

    return df
However, I still get the same errors as before.
Sample input and what I would like the output to be:
+-----------+------------+
|string_left|string_right|
+-----------+------------+
|       dude|         dud|
|       spud|         dud|
|       spud|        null|
|       null|        null|
+-----------+------------+

+-----------+------------+------------------+
|string_left|string_right|              test|
+-----------+------------+------------------+
|       dude|         dud|0.9166666666666666|
|       spud|         dud|0.7222222222222222|
|       spud|        null|               0.0|
|       null|        null|               0.0|
+-----------+------------+------------------+
Recommended answer
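We will modify your code slightly, and it should work fine. The null check moves inside the UDF itself, because Spark does not guarantee that the when condition is evaluated before the Python UDF runs, so the UDF can still receive null (None) inputs: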
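from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import jellyfish

@udf(DoubleType())
def jaro_winkler(s1, s2):
    # Null (None) or empty input: return 0.0 instead of calling jellyfish.
    # The value must be a float; an int does not match DoubleType.
    if not all((s1, s2)):
        out = 0.0
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out

def jaro_winkler_func(df, column_left, column_right):
    df = df.withColumn(
        'test',
        jaro_winkler(df[column_left], df[column_right])
    )
    return df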
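The null guard in the answer can also be factored out and checked without a Spark session. The following is only a sketch under stated assumptions: `null_safe` and `similarity` are hypothetical names that do not appear in the original post, and `difflib.SequenceMatcher` is a stand-in metric (not Jaro-Winkler) so the example runs without jellyfish installed. Unlike `not all((s1, s2))`, this version tests `is None` explicitly, so empty strings are passed on to the metric rather than being treated as null.

```python
from difflib import SequenceMatcher
from functools import wraps


def null_safe(default=0.0):
    """Hypothetical helper: return `default` when any argument is None."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            # A SQL null reaches a Python UDF as None, so guard on that.
            if any(arg is None for arg in args):
                return default
            # Cast to float so the result matches a DoubleType return type.
            return float(func(*args))
        return wrapper
    return decorator


@null_safe(default=0.0)
def similarity(s1, s2):
    # Stand-in metric (assumption): swap in jellyfish's Jaro-Winkler
    # function here when jellyfish is available.
    return SequenceMatcher(None, s1, s2).ratio()


# The same rows as the sample input, with None standing in for null.
rows = [("dude", "dud"), ("spud", "dud"), ("spud", None), (None, None)]
scores = [similarity(left, right) for left, right in rows]
```

To use such a function on a DataFrame, it would presumably be wrapped once with `udf(similarity, DoubleType())` and applied with `withColumn` as in the answer above.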