This article covers how to filter records out of a Spark DataFrame; it should be a useful reference for anyone facing the same problem.
Problem Description
val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records whose first 3 characters in column 'c2' are either 'MSL' or 'HCP'.
So the output should look like this:
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can anyone help?
I know that df.filter($"c2".rlike("MSL")) is for selecting matching records, but how do I exclude records?
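For reference, a Column predicate such as the rlike above can also be negated directly, e.g. df.filter(!$"c2".rlike("^(MSL|HCP)")) (the anchored regex here is an assumption, not from the original post). The regex logic itself can be sketched in plain Scala:

```scala
import scala.util.matching.Regex

// Anchored pattern: match only when the value *starts* with MSL or HCP.
// (rlike in Spark matches the pattern anywhere in the string, so the
// ^ anchor is what restricts it to the first characters.)
val pattern: Regex = "^(MSL|HCP)".r

val rows = Seq((1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"),
               (5, "Other"), (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12"))

// Keep rows where the pattern does NOT match, mirroring a negated rlike.
val kept = rows.filterNot { case (_, c2) => pattern.findFirstIn(c2).isDefined }
```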
Version: Spark 1.6.2, Scala: 2.10
Recommended Answer
import org.apache.spark.sql.functions.{col, not, substring}

df.filter(not(
  // substring positions are 1-based in Spark SQL
  substring(col("c2"), 1, 3).isin("MSL", "HCP")
))
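The keep/drop logic of that filter can be checked without a Spark session; here is a minimal plain-Scala sketch over the sample rows, where take(3) plays the role of substring(col("c2"), 1, 3):

```scala
val rows = Seq((1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"),
               (5, "Other"), (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12"))

// Prefixes whose rows should be excluded.
val excludedPrefixes = Set("MSL", "HCP")

// Drop rows whose first 3 characters are in the excluded set,
// mirroring not(substring(col("c2"), 1, 3).isin("MSL", "HCP")).
val result = rows.filterNot { case (_, c2) => excludedPrefixes.contains(c2.take(3)) }
```

Note that isin does an exact match on the 3-character prefix, so "Mail" is kept (its prefix "Mai" is not in the set) while "MSL" and "MSL12" are both dropped.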
That concludes this article on Spark DataFrame filters; hopefully the recommended answer above is helpful.