Spark DataFrame filter

Problem description

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+

I want to exclude the records in which the first 3 characters of column "c2" are either "MSL" or "HCP".

So the output should look like this:

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+

Can anyone help?

I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?

Version: Spark 1.6.2, Scala 2.10

Recommended answer

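import org.apache.spark.sql.functions.{col, not, substring}

// Keep only the rows whose first 3 characters of c2 are neither "MSL" nor "HCP".
// substring positions are 1-based, so (1, 3) takes characters 1 through 3.
df.filter(not(substring(col("c2"), 1, 3).isin("MSL", "HCP")))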
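
Since the question starts from rlike, the same exclusion can also be written by negating a condition with !. Note that rlike("MSL") on its own matches "MSL" anywhere in the string, so the pattern needs a leading ^ to restrict it to the prefix. The following is a minimal sketch rather than part of the original answer; it assumes a spark-shell session where df is the DataFrame built in the question, and the names viaRegex and viaPrefix are only illustrative.

```scala
// Assumes spark-shell (Spark 1.6.x) with `df` defined as in the question.
import org.apache.spark.sql.functions.col

// Regex variant: keep rows whose c2 does NOT start with MSL or HCP.
// The leading ^ anchors the match to the start of the string.
val viaRegex = df.filter(!col("c2").rlike("^(MSL|HCP)"))

// Prefix variant: the same condition expressed with Column.startsWith.
val viaPrefix = df.filter(
  !(col("c2").startsWith("MSL") || col("c2").startsWith("HCP")))

viaRegex.show()
viaPrefix.show()
```

Both variants should return rows 1 to 5 for the sample data. The regex form is shorter when there are many prefixes, while the startsWith form avoids regex escaping if a prefix contains special characters.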

