I want to remove from col1 the strings that are present in col2:

val df = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"),
("I wish Java could use case classes", "Java"),
("Logistic regression models are neat", "models")
)).toDF("sentence", "label")


Using regexp_replace or translate (ref: Spark functions API):

val res = df.withColumn("sentence_without_label", regexp_replace
(col("sentence") , "(?????)", "" ))


so that res looks like this:

[image in the original post: expected result, not recoverable]

Best answer

You can simply use regexp_replace, passing the label column as the pattern:

df5.withColumn("sentence_without_label", regexp_replace($"sentence", $"label", lit("")))


Or you can use a simple udf function, as below:

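// Imports needed when running outside spark-shell
import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"column" syntax

val df5 = spark.createDataFrame(Seq(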
  ("Hi I heard about Spark", "Spark"),
  ("I wish Java could use case classes", "Java"),
  ("Logistic regression models are neat", "models")
)).toDF("sentence", "label")

// udf that removes every match of rep (interpreted as a regex) from data
val replace = udf((data: String, rep: String) => data.replaceAll(rep, ""))

val res = df5.withColumn("sentence_without_label", replace($"sentence" , $"label"))

res.show()
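
Both approaches above treat the label as a regular expression, so a label containing characters such as parentheses or dots may not be removed as expected. The following is a minimal sketch in plain Scala (no Spark required) of the difference between regex and literal removal. The object name, helper names, and sample strings are hypothetical and are not part of the original answer.

```scala
import java.util.regex.Pattern

object LiteralReplaceSketch {
  // Regex-based removal: label is interpreted as a pattern.
  def removeAsRegex(sentence: String, label: String): String =
    sentence.replaceAll(label, "")

  // Literal removal: Pattern.quote escapes regex metacharacters in label.
  def removeLiteral(sentence: String, label: String): String =
    sentence.replaceAll(Pattern.quote(label), "")

  def main(args: Array[String]): Unit = {
    val sentence = "Spark (beta) is neat" // hypothetical sample row
    val label = "(beta)"
    println(removeAsRegex(sentence, label)) // prints "Spark () is neat"
    println(removeLiteral(sentence, label)) // prints "Spark  is neat"
  }
}
```

If the labels can contain such characters, the body of the udf could call Pattern.quote(rep) before replaceAll, or simply use data.replace(rep, ""), which performs a literal replacement.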


Output:

+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
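
Note that the output keeps the spaces that surrounded the removed label, for example the two spaces in "I wish  could use case classes" and the trailing space after "about". If that matters, one option is to collapse whitespace after the removal. Below is a small sketch in plain Scala. The object and helper names are hypothetical, and it uses literal replacement rather than the regex-based calls shown in the answer.

```scala
object TidyWhitespaceSketch {
  // Remove the label literally, then collapse runs of whitespace
  // into a single space and trim both ends.
  def removeAndTidy(sentence: String, label: String): String =
    sentence.replace(label, "").replaceAll("\\s+", " ").trim

  def main(args: Array[String]): Unit = {
    println(removeAndTidy("I wish Java could use case classes", "Java"))
    // prints "I wish could use case classes"
    println(removeAndTidy("Hi I heard about Spark", "Spark"))
    // prints "Hi I heard about"
  }
}
```

On the DataFrame side, the same tidy-up could presumably be expressed by wrapping the result column in trim(regexp_replace(..., "\\s+", " ")), both of which are in org.apache.spark.sql.functions.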
