Below is the dataset I loaded into a Hive table named temp_stat:
COUNTRY CITY TEMP
---------- -------------------- -----
US Arizona 51.7
US California 56.7
US Bullhead City 51.1
India Jaisalmer 42.4
Libya Aziziya 57.8
Iran Lut Desert 70.7
India Banda 42.4
When I try to view the data with a select command, I get the following:
US,Arizona,51.7 NULL NULL
US,California,56.7 NULL NULL
US,Bullhead City,51.1 NULL NULL
India,Jaisalmer,42.4 NULL NULL
Libya,Aziziya,57.8 NULL NULL
Iran,Lut Desert,70.7 NULL NULL
India,Banda,42.4 NULL NULL
Next, I want to group these records by country and get the highest temperature for each country along with the city name, so I ran the following query:
select country,city,temp
from (
select country,city,temp,
row_number() over (partition by country order by temp desc) as part
from temp_stat
) a
where part = 1
order by country, city;
Running the above query in the Hive shell gives the following result:
US,Arizona,51.7 NULL NULL
US,California,56.7 NULL NULL
US,Bullhead City,51.1 NULL NULL
India,Jaisalmer,42.4 NULL NULL
Libya,Aziziya,57.8 NULL NULL
Iran,Lut Desert,70.7 NULL NULL
India,Banda,42.4 NULL NULL
Even when I run only the inner query to generate row_number, every record gets the same row number.
(Like this:)
India,Banda,42.4 NULL NULL 1
India,Jaisalmer,42.4 NULL NULL 1
Iran,Lut Desert,70.7 NULL NULL 1
Libya,Aziziya,57.8 NULL NULL 1
US,Arizona,51.7 NULL NULL 1
US,Bullhead City,51.1 NULL NULL 1
US,California,56.7 NULL NULL 1
I also tried dense_rank() and rank(), with no change in the result. Is there a problem with the table definition?
Any help would be appreciated!
Best Answer
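Your data fields are terminated by ',', but the table does not appear to be defined with that delimiter, so each whole line lands in the first column and the remaining columns are NULL. Because every first-column value is then distinct, each row forms its own partition and gets row number 1.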
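One way to confirm this is to inspect the storage settings of the existing table. This is only a sketch and assumes the table is a plain text table using Hive's default SerDe; the exact output layout varies between Hive versions.

```sql
-- Show the table's storage details. For a delimited text table, look for
-- field.delim under "Storage Desc Params". If it is missing, Hive falls
-- back to its default field delimiter, Ctrl-A (\001), not a comma.
describe formatted temp_stat;
```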
Your table definition should look like this:
create external table temp_stat
(
country string
,city string
,temp decimal(11,1)
)
row format delimited
fields terminated by ','
;
select * from temp_stat;
+---------+---------------+------+
| country | city | temp |
+---------+---------------+------+
| US | Arizona | 51.7 |
| US | California | 56.7 |
| US | Bullhead City | 51.1 |
| India | Jaisalmer | 42.4 |
| Libya | Aziziya | 57.8 |
| Iran | Lut Desert | 70.7 |
| India | Banda | 42.4 |
+---------+---------------+------+
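Once the columns parse correctly, the window query from the question should partition by country as intended. The variant below is a minimal sketch, assuming the table was recreated with the comma delimiter. It uses rank() instead of row_number() because the two Indian cities are tied at 42.4: row_number() would return one of them arbitrarily, while rank() returns both. The alias rnk is only an illustrative name.

```sql
-- Highest temperature per country, keeping ties.
select country, city, temp
from (
    select country, city, temp,
           -- rank() gives tied rows the same rank within each country
           rank() over (partition by country order by temp desc) as rnk
    from temp_stat
) a
where rnk = 1
order by country, city;
```

If exactly one row per country is required, switch back to row_number() and add a tie-breaker such as city to the order by clause.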