Spark DataFrame filter

Question

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+

I want to filter out the records in which the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.

The output should look like this:

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+

Can anyone help?

I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?

Version: Spark 1.6.2, Scala 2.10

Recommended answer

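import org.apache.spark.sql.functions.{col, not, substring}

// Keep only rows whose first 3 characters of c2 are neither "MSL" nor "HCP" (substring positions are 1-based)
df.filter(not(
    substring(col("c2"), 1, 3).isin("MSL", "HCP"))
    )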
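
Since the question starts from rlike, the same exclusion can probably also be written by negating a Column predicate directly. The sketch below is not part of the original answer; it assumes a spark-shell session on Spark 1.6 where `sc` and `sqlContext` already exist, and the names `byRegex` and `byPrefix` are only illustrative. Note that rlike("MSL") matches 'MSL' anywhere in the string, so the pattern is anchored with ^ to test only the prefix.

```scala
// Assumption: run inside spark-shell (Spark 1.6), which provides `sc` and `sqlContext`.
import org.apache.spark.sql.functions.{col, not}
import sqlContext.implicits._

// Same data as in the question.
val df = sc.parallelize(Seq(
  (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
  (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
)).toDF("c1", "c2")

// Option 1: negate an anchored regex; ^ restricts the match to the start of c2.
val byRegex = df.filter(not(col("c2").rlike("^(MSL|HCP)")))

// Option 2: negate startsWith for each prefix and combine the conditions with &&.
val byPrefix = df.filter(!col("c2").startsWith("MSL") && !col("c2").startsWith("HCP"))

byRegex.show()
byPrefix.show()
```

Both variants should keep the rows with c1 from 1 to 5. The regex form is shorter when there are many prefixes, while startsWith avoids having to escape regex metacharacters in the prefix.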
