Problem description
I have a pyspark dataframe:
Example:
text <String>                 | name <String>    | original_name <String>
------------------------------|------------------|------------------------
HELLOWORLD2019THISISGOOGLE    | WORLD2019        | WORLD_2019
NATUREISVERYGOODFOROURHEALTH  | null             | null
THESUNCONTAINVITAMIND         | VITAMIND         | VITAMIN_D
BECARETOOURHEALTHISVITAMIND   | OURHEALTH        | OUR_/HEALTH
I want to loop over the name column and check whether each name value exists in text. If it does, I want to create a new_column that will contain the original_name value of every name found in text. Note that the name and original_name columns are sometimes null.
Example:
In line 4 of the dataframe example, text contains 2 values from the name column: [OURHEALTH, VITAMIND], so I should take their original_name values and store them in new_column.
In line 2, text contains OURHEALTH from the name column, so I should store in new_column the original_name value that was found ==> [OUR_/HEALTH]
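To make the matching rule concrete, here is a minimal plain-Python sketch (no Spark) of the desired transformation over the sample rows; it is only an illustration of the logic, not the Spark solution:

```python
rows = [
    ("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
    ("NATUREISVERYGOODFOROURHEALTH", None, None),
    ("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
    ("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_/HEALTH"),
]

# Collect every non-null (name, original_name) pair from the whole table.
pairs = [(name, orig) for _, name, orig in rows if name is not None]

# new_column = the original_name of every name found as a substring of text.
result = [
    (text, [orig for name, orig in pairs if name in text])
    for text, _, _ in rows
]
```

Note that each row is matched against the names from all rows, not just its own, which is why line 2 (whose own name is null) still gets [OUR_/HEALTH].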
Expected result:
text <String>                 | name <String>    | original_name <String>    | new_column <Array>
------------------------------|------------------|---------------------------|--------------------------
HELLOWORLD2019THISISGOOGLE    | WORLD2019        | WORLD_2019                | [WORLD_2019]
NATUREISVERYGOODFOROURHEALTH  | null             | null                      | [OUR_/HEALTH]
THESUNCONTAINVITAMIND         | VITAMIND         | VITAMIN_D                 | [VITAMIN_D]
BECARETOOURHEALTHISVITAMIND   | OURHEALTH        | OUR_/HEALTH               | [OUR_/HEALTH, VITAMIN_D]
I hope my explanation is clear.
I tried the following code:
df = df.select("text", "name", "original_name").agg(collect_set("name").alias("name_array"))

for name_item in name_array:
    df.withColumn("new_column", F.when(df.text.contains(name_item), "original_name").otherwise(None))
Can someone help me? Thank you.
Recommended answer
One simple solution is to use a join between the original DataFrame and a derived DataFrame containing just the name column. Since the join condition can be satisfied by multiple rows, we have to group by the original columns after the join.
Here is a detailed example for your input:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, collect_list

spark = SparkSession.builder.getOrCreate()

data = [
    ("HELLOWORLD2019THISISGOOGLE", "WORLD2019", "WORLD_2019"),
    ("NATUREISVERYGOODFOROURHEALTH", None, None),
    ("THESUNCONTAINVITAMIND", "VITAMIND", "VITAMIN_D"),
    ("BECARETOOURHEALTHISVITAMIND", "OURHEALTH", "OUR_ / HEALTH")
]
df = spark.createDataFrame(data, ["text", "name", "original_name"])

# create a new DF with the search words;
# as it's the original_name which interests us for the final list, we select it too
search_df = df.select(struct(col("name"), col("original_name")).alias("search_match"))

# left join on df.text contains search_df.name
df_join = df.join(search_df, df.text.contains(search_df["search_match.name"]), "left")

# group by the original columns and collect matches into a list
df_join.groupBy("text", "name", "original_name") \
    .agg(collect_list(col("search_match.original_name")).alias("new_column")) \
    .show(truncate=False)
Output:
+----------------------------+---------+-------------+--------------------------+
|text |name |original_name|new_column |
+----------------------------+---------+-------------+--------------------------+
|HELLOWORLD2019THISISGOOGLE |WORLD2019|WORLD_2019 |[WORLD_2019] |
|THESUNCONTAINVITAMIND |VITAMIND |VITAMIN_D |[VITAMIN_D] |
|NATUREISVERYGOODFOROURHEALTH|null |null |[OUR_ / HEALTH] |
|BECARETOOURHEALTHISVITAMIND |OURHEALTH|OUR_ / HEALTH|[VITAMIN_D, OUR_ / HEALTH]|
+----------------------------+---------+-------------+--------------------------+