Problem description
I need to create a temporary table in Hive from an existing table that has 7 columns. I want to remove duplicates with respect to the first three columns while retaining the corresponding values in the other 4 columns. I don't care which row is actually dropped when de-duplicating on the first three columns alone.
Solution
You could use something like the following if you are not concerned about ordering:
create table table2 as
select col1, col2, col3
      -- split() takes a regular expression, so the pipe must be escaped
      ,split(agg_col, "\\|")[0] as col4
      ,split(agg_col, "\\|")[1] as col5
      ,split(agg_col, "\\|")[2] as col6
      ,split(agg_col, "\\|")[3] as col7
from (select col1, col2, col3,
             -- concatenate the four value columns so that max() takes them all from one row
             max(concat(cast(col4 as string), "|",
                        cast(col5 as string), "|",
                        cast(col6 as string), "|",
                        cast(col7 as string))) as agg_col
      from table1
      group by col1, col2, col3) A;
Below is another approach, which gives much more control over ordering but is slower than the approach above:
create table table2 as
select col1, col2, col3,
       -- alias the aggregates so the new table keeps the original column names
       max(col4) as col4, max(col5) as col5, max(col6) as col6, max(col7) as col7
from (select col1, col2, col3, col4, col5, col6, col7,
             rank() over (partition by col1, col2, col3
                          order by col4 desc, col5 desc, col6 desc, col7 desc) as col_rank
      from table1) A
where A.col_rank = 1
group by col1, col2, col3;
The rank() over (...) function assigns rank 1 to more than one row if the order by columns are all equal. In our case, if two rows have exactly the same values in all seven columns, both pass the col_rank = 1 filter and show up as duplicates. These duplicates are eliminated by the max and group by clauses in the query above.
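As a possible variation on the ranking approach, row_number() can be used in place of rank(). It numbers tied rows 1, 2, 3, ... instead of giving them all rank 1, so the filter keeps exactly one row per (col1, col2, col3) group and the outer max and group by are no longer needed. The sketch below is untested; it assumes the same table1 and column names as above and a Hive version that supports windowing functions, and the row_num alias is only illustrative.

```sql
-- Sketch only: keep one row per (col1, col2, col3) using row_number().
create table table2 as
select col1, col2, col3, col4, col5, col6, col7
from (select col1, col2, col3, col4, col5, col6, col7,
             -- tied rows get distinct numbers, so only one row per group has row_num = 1
             row_number() over (partition by col1, col2, col3
                                order by col4 desc, col5 desc, col6 desc, col7 desc) as row_num
      from table1) A
where A.row_num = 1;
```

Unlike the concat and split approach, this keeps col4 to col7 in their original data types and takes them from a single source row.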