本文介绍了使用正则表达式查找字符串中的特定字符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有这些样本值

prm_2020 P02 United Kingdom London 2 for 2
prm_2020 P2 United Kingdom London 2 for 2
prm_2020 P10 United Kingdom London 2 for 2
prm_2020 P11 United Kingdom London 2 for 2

需要找到这样的P2,P02,P11,p06,p05,并尝试在数据块中使用Regexp_extract函数.努力寻找正确的表达方式.一旦找到P10,字符串中的p6,我需要将数字放在名为ID的新列中

Need to find P2, P02, P11,p06,p05 like this, trying to use Regexp_extract function in databricks. struggling to find the correct expression. Once i find P10, p6 from string i need to put numbers in new column called ID

select distinct
    promo_name
   ,regexp_extract(promo_name, '(?<=p\d+\s+)P\d+') as regexp_id
from stock
where promo_name is not null


select distinct
    promo_name
   ,regexp_extract(promo_name, 'P[0-9]+') as regexp_id
from stock
where promo_name is not null

两者都会产生错误

推荐答案

函数 regexp_extract 将使用3个参数.

  • 列值
  • 正则表达式模式
  • 组索引
def regexp_extract(e: org.apache.spark.sql.Column,exp: String,groupIdx: Int): org.apache.spark.sql.Column

您缺少 regexp_extract 函数中的最后一个参数.

You are missing last parameter in regexp_extract function.

检查以下代码.

scala> df.show(truncate=False)
+------------------------------------------+
|data                                      |
+------------------------------------------+
|prm_2020 P02 United Kingdom London 2 for 2|
|prm_2020 P2 United Kingdom London 2 for 2 |
|prm_2020 P10 United Kingdom London 2 for 2|
|prm_2020 P11 United Kingdom London 2 for 2|
+------------------------------------------+
df
.withColumn("parsed_data",regexp_extract(col("data"),"(P[0-9]*)",0))
.show(truncate=False)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+
df.createTempView("tbl")
spark
.sql("select data,regexp_extract(data,'(P[0-9]*)',0) as parsed_data from tbl")
.show(truncate=False)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+

这篇关于使用正则表达式查找字符串中的特定字符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-12 12:13