问题描述
我有一个 CTE
,其中包含这样的数据.它几乎遵循两种格式,其中 counts
和 process_ids
将具有这两种类型的数据.
I have a CTE
that has data like this. It follows two formats pretty much where counts
and process_ids
will have these two types of data.
client_id day counts process_ids
--------------------------------------------------------------------------------------------
abc1 Feb-01-2021 3 C1,C2 | C3,C4,C5 | C6,C7
abc2 Feb-05-2021 2, 3 C10,C11,C12 | C13,C14 # C15,C16 | C17,C18
现在,在按 counts
和 process_ids
-
client_id day counts process_ids
--------------------------------------------------------
abc1 Feb-01-2021 3 C1
abc1 Feb-01-2021 3 C2
abc1 Feb-01-2021 3 C3
abc1 Feb-01-2021 3 C4
abc1 Feb-01-2021 3 C5
abc1 Feb-01-2021 3 C6
abc1 Feb-01-2021 3 C7
abc2 Feb-05-2021 2 C10
abc2 Feb-05-2021 2 C11
abc2 Feb-05-2021 2 C12
abc2 Feb-05-2021 2 C13
abc2 Feb-05-2021 2 C14
abc2 Feb-05-2021 3 C15
abc2 Feb-05-2021 3 C16
abc2 Feb-05-2021 3 C17
abc2 Feb-05-2021 3 C18
基本上,这个想法是根据以下两种使用情况(如果采用任何一种格式)来划分 counts
和 process_ids
.
Basically, the idea is to split counts
and process_ids
basis on the below two use cases if they follow any of those formats.
用例1
如果 counts
列仅包含一位数字,而 process_ids
列具有 |
分隔符.
If counts
column only has single-digit and process_ids
column has |
delimiter.
用例2
如果 counts
列仅用两位数字分隔,并由,
分隔符和 process_ids
列具有#
分隔符以及 pipe
.
If counts
column only has two-digit separated by a ,
delimiter and process_ids
column has #
delimiter along with pipe
.
我在这里与 Amazon Redshift
一起工作,我对如何根据需要将它们分开感到困惑.
I am working with Amazon Redshift
here and I am confused about how can I split them out as needed.
这有可能吗?
推荐答案
乍一看,这似乎有些毛茸茸,但它是通过扎实的技术构建而成的,并提供了理想的结果...
This might look a bit hairy at first sight but has been built up from solid techniques and gives the desired result...
WITH seq_0_9 AS (
SELECT 0 AS d
UNION ALL SELECT 1 AS d
UNION ALL SELECT 2 AS d
UNION ALL SELECT 3 AS d
UNION ALL SELECT 4 AS d
UNION ALL SELECT 5 AS d
UNION ALL SELECT 6 AS d
UNION ALL SELECT 7 AS d
UNION ALL SELECT 8 AS d
UNION ALL SELECT 9 AS d
),
numbers AS (
SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
FROM seq_0_9 a, seq_0_9 b, seq_0_9 c
),
processed AS
(SELECT client_id,
day,
REPLACE(counts, ' ', '') AS counts,
REPLACE(REPLACE(process_ids, ' ', ''), '|', ',') AS process_ids
FROM tbl),
split_pids AS
(SELECT
client_id,
day,
counts,
split_part(process_ids, '#', n) AS process_ids,
n AS n1
FROM processed
CROSS JOIN numbers
WHERE
split_part(process_ids, '#', n) IS NOT NULL
AND split_part(process_ids, '#', n) != ''),
split_counts AS
(SELECT
client_id,
day,
split_part(counts, ',', n) AS counts,
process_ids,
n1,
n AS n2
FROM split_pids
CROSS JOIN numbers
WHERE
split_part(counts, ',', n) IS NOT NULL
and split_part(counts, ',', n) != ''),
matched_up AS
(SELECT * FROM split_counts WHERE n1 = n2)
SELECT
client_id,
day,
counts,
split_part(process_ids, ',', n) AS process_ids
FROM
matched_up
CROSS JOIN
numbers
WHERE
split_part(process_ids, ',', n) IS NOT NULL
AND split_part(process_ids, ',', n) != '';
演示
在线rextester演示(使用PostgreSQL,但应与Redshift兼容): https://rextester.com/FNA16497
此技术用于生成数字表(范围从1到1000).然后,此技术与多个公用表表达式可以在单个SQL语句中实现.
This technique is used to generate a numbers table (from 1 to 1000 inclusive). This technique is then used multiple times with multiple Common Table Expressions to achieve it in a single SQL statement.
这篇关于如何在Redshift中使用两个不同的定界符从两列中拆分数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!