val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+


I want to exclude the rows where the first three characters of column "c2" are "MSL" or "HCP".

So the output should look like this:

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+


Can anyone help?

I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them instead?

Version: Spark 1.6.2
Scala: 2.10

Best answer

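import org.apache.spark.sql.functions.{col, not, substring}

// substring positions are 1-based in Spark SQL (0 is treated like 1).
// Keep only rows whose first three characters are neither "MSL" nor "HCP".
df.filter(not(substring(col("c2"), 1, 3).isin("MSL", "HCP")))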
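
Since the question asks about rlike specifically, here is a possible alternative sketch that negates the condition instead of comparing a substring. It assumes a Spark 1.6 spark-shell session where `sqlContext` is predefined and `df` is the DataFrame from the question. The names `excludedByRegex` and `excludedByPrefix` are only illustrative. Note that rlike looks for a match anywhere in the string, so the pattern is anchored with `^` to restrict it to the prefix.

```scala
// Assumption: running in spark-shell (Spark 1.6), with `sqlContext` predefined
// and `df` built as in the question.
import sqlContext.implicits._                 // enables the $"..." column syntax
import org.apache.spark.sql.functions.not

// Variant 1: negate rlike with the Column `!` operator.
// "^(MSL|HCP)" only matches values that start with MSL or HCP.
val excludedByRegex = df.filter(!$"c2".rlike("^(MSL|HCP)"))

// Variant 2: negate startsWith checks, with no regular expression involved.
val excludedByPrefix = df.filter(
  not($"c2".startsWith("MSL") || $"c2".startsWith("HCP")))

excludedByRegex.show()
```

With the sample data above, both variants should keep rows 1 through 5. The regex form is handy when the list of prefixes grows, while the startsWith form avoids having to escape special regex characters.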

09-11 18:31