Problem description
val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+
I want to exclude the records whose first 3 characters of column 'c2' are either 'MSL' or 'HCP'.
So the output should look like this:
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
Can anyone help?
I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them instead?
Version: Spark 1.6.2, Scala 2.10
Recommended answer
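import org.apache.spark.sql.functions.{col, not, substring}
// Keep rows whose first 3 characters of c2 are neither "MSL" nor "HCP" (substring is 1-based; a start of 0 acts like 1).
df.filter(not(
  substring(col("c2"), 1, 3).isin("MSL", "HCP")
))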
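Since the question started from rlike, here are two other ways to write the same exclusion: negating an anchored regex, or negating prefix checks. This is only a sketch, not part of the original answer. It assumes a spark-shell session on Spark 1.6.x, where `sc` and `sqlContext` are predefined; the names `byRegex` and `byPrefix` are made up for illustration.

```scala
// Assumption: running inside spark-shell (Spark 1.6.x), so `sc` and `sqlContext` already exist.
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
  (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
)).toDF("c1", "c2")

// Option 1: negate an rlike match. rlike matches anywhere in the string,
// so the pattern is anchored with ^ to test only the start of c2.
val byRegex = df.filter(!$"c2".rlike("^(MSL|HCP)"))

// Option 2: negate plain prefix checks, with no regex involved.
val byPrefix = df.filter(!($"c2".startsWith("MSL") || $"c2".startsWith("HCP")))

byRegex.show()
byPrefix.show()
```

On the sample data, both filters should keep only the rows with c1 from 1 to 5. Without the ^ anchor, the regex version would also drop any value that merely contains "MSL" or "HCP" somewhere in the middle.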