I want to remove from col1 (sentence) any string that appears in col2 (label):
val df = spark.createDataFrame(Seq(
("Hi I heard about Spark", "Spark"),
("I wish Java could use case classes", "Java"),
("Logistic regression models are neat", "models")
)).toDF("sentence", "label")
I would like to do this with regexp_replace or translate (reference: Spark functions API):
val res = df.withColumn("sentence_without_label", regexp_replace(col("sentence"), "(?????)", ""))
so that res looks like this:

Best answer
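You can simply use regexp_replace with the label column as the pattern (the overload that takes Column arguments, available since Spark 2.1). Note that each label value is interpreted as a regular expression:
df5.withColumn("sentence_without_label", regexp_replace($"sentence", $"label", lit("")))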
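For anyone who wants to run the one-liner above outside spark-shell, here is a minimal self-contained sketch. The object name RemoveLabelExample, the helper removeLabel, and the local[*] session are my own choices for illustration, not part of the original answer, and it assumes Spark 2.1 or later is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, regexp_replace}

object RemoveLabelExample {
  // Hypothetical helper: removes every match of the `label` column
  // (interpreted as a regular expression) from the `sentence` column.
  def removeLabel(df: DataFrame): DataFrame =
    df.withColumn("sentence_without_label",
      regexp_replace(col("sentence"), col("label"), lit("")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("remove-label-example")
      .getOrCreate()

    val df = spark.createDataFrame(Seq(
      ("Hi I heard about Spark", "Spark"),
      ("I wish Java could use case classes", "Java"),
      ("Logistic regression models are neat", "models")
    )).toDF("sentence", "label")

    // truncate = false prints the full column values
    removeLabel(df).show(false)
    spark.stop()
  }
}
```

Using col("...") instead of $"..." avoids the need for import spark.implicits._, which is only a convenience here.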
Alternatively, you can use the following simple UDF:
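import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the $"..." column syntax

val df5 = spark.createDataFrame(Seq(
  ("Hi I heard about Spark", "Spark"),
  ("I wish Java could use case classes", "Java"),
  ("Logistic regression models are neat", "models")
)).toDF("sentence", "label")

// replaceAll interprets rep as a regular expression
val replace = udf((data: String, rep: String) => data.replaceAll(rep, ""))
val res = df5.withColumn("sentence_without_label", replace($"sentence", $"label"))
// truncate = false so the full sentences are printed
res.show(false)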
Output:
+-----------------------------------+------+------------------------------+
|sentence                           |label |sentence_without_label        |
+-----------------------------------+------+------------------------------+
|Hi I heard about Spark             |Spark |Hi I heard about              |
|I wish Java could use case classes |Java  |I wish  could use case classes|
|Logistic regression models are neat|models|Logistic regression  are neat |
+-----------------------------------+------+------------------------------+
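One caveat with the UDF above: replaceAll treats the label as a regular expression, so a label containing metacharacters such as . or + may not match the way you expect. If the labels should be matched as plain text, the UDF body could be swapped for a literal replacement. This is a sketch in plain Scala (no Spark needed to try it); the names LiteralRemoval, removeLiteral and removeQuoted are hypothetical and not from the original answer.

```scala
import java.util.regex.Pattern

object LiteralRemoval {
  // String.replace treats `rep` as plain text, not as a regex.
  def removeLiteral(data: String, rep: String): String =
    data.replace(rep, "")

  // Same result through the regex API: Pattern.quote escapes
  // any metacharacters in `rep` before it is compiled.
  def removeQuoted(data: String, rep: String): String =
    data.replaceAll(Pattern.quote(rep), "")

  def main(args: Array[String]): Unit = {
    println(removeLiteral("I wish Java could use case classes", "Java"))
    println(removeQuoted("Version a.b is out", "a.b"))
  }
}
```

Either function can be wrapped the same way as in the answer, for example udf(LiteralRemoval.removeLiteral _). As in the output above, removing a word from the middle of a sentence leaves the two surrounding spaces in place.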