Problem description
I have two DataFrames:

df_1 has a column id_normalized, which contains only characters and numbers (==> normalized), and a column id_no_normalized.

Example:
id_normalized | id_no_normalized
--------------|-----------------
ABC           | A_B.C
ERFD          | E.R_FD
12ZED         | 12_Z.ED
df_2 has a column name, which contains only characters and numbers; the normalized ids are embedded inside those strings.

Example:
name
---------------------------
googleisa12ZEDgoodnavigator
internetABCexplorer
I would like to check whether id_normalized (dataset_1) exists in name (dataset_2). If I find it, I take the value of id_no_normalized and store it in a new column in dataset_2.
Expected result:
name                        | result
----------------------------|--------
googleisa12ZEDgoodnavigator | 12_Z.ED
internetABCexplorer         | A_B.C
I did it using this code:
df_result = df_2.withColumn("id_no_normalized", df_2.name.contains(df_1.id_normalized))
return df_result.select("name", "id_no_normalized")
It is not working because it doesn't find the id_normalized in df_2.
My second solution works only when I limit the output to roughly 300 rows; when I return all the data, it runs for a long time and never finishes:
from pyspark.sql import functions as F

df_1 = df_1.select("id_no_normalized").drop_duplicates()
df_1 = df_1.withColumn(
    "id_normalized",
    F.regexp_replace(F.col("id_no_normalized"), "[^a-zA-Z0-9]+", ""))
df_2 = df_2.select("name")
extract = F.expr('position(id_normalized IN name) > 0')
result = df_1.join(df_2, extract)
return result
How can I correct my code to resolve this? Thank you.
Recommended answer
We can solve this using a cross join and applying a UDF on the joined DataFrame, but again, we need to ensure it works on a big dataset.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

data1 = [
    {"id_normalized": "ABC", "id_no_normalized": "A_B.C"},
    {"id_normalized": "ERFD", "id_no_normalized": "E.R_FD"},
    {"id_normalized": "12ZED", "id_no_normalized": "12_Z.ED"}
]
data2 = [
    {"name": "googleisa12ZEDgoodnavigator"},
    {"name": "internetABCexplorer"}
]
df1 = spark.createDataFrame(data1, ["id_no_normalized", "id_normalized"])
df2 = spark.createDataFrame(data2, ["name"])

# Pair every id with every name, then test each pair for a substring match.
df3 = df1.crossJoin(df2)

# Python's str.find returns the match position, or -1 when there is no match.
search_for_udf = udf(lambda name, id_normalized: name.find(id_normalized),
                     returnType=IntegerType())

df4 = df3.withColumn("contain", search_for_udf(df3["name"], df3["id_normalized"]))
df4.filter(df4["contain"] > -1).show()
+----------------+-------------+--------------------+-------+
|id_no_normalized|id_normalized| name|contain|
+----------------+-------------+--------------------+-------+
| A_B.C| ABC| internetABCexplorer| 8|
| 12_Z.ED| 12ZED|googleisa12ZEDgoo...| 9|
+----------------+-------------+--------------------+-------+
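To produce exactly the expected output from the question, a small follow-up select on the df4 built above renames id_no_normalized to result:

# Keep only the columns from the expected result,
# renaming id_no_normalized to result.
df4.filter(df4["contain"] > -1) \
   .select("name", df4["id_no_normalized"].alias("result")) \
   .show(truncate=False)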
I believe there are some Spark techniques available to make the cross join more efficient.
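For example, since df1 is the small side here, one common technique is to broadcast it so Spark replicates the lookup table to every executor instead of shuffling both tables. This is a minimal sketch, not part of the original answer, assuming df1 fits in executor memory; the name matched is illustrative:

from pyspark.sql.functions import broadcast

# Broadcasting the small lookup table turns the substring match into
# a broadcast nested-loop join instead of a full shuffle cross join.
matched = df2.join(
    broadcast(df1),
    df2["name"].contains(df1["id_normalized"]),
    "inner")
matched.select("name", "id_no_normalized").show(truncate=False)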