Problem description
I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output; however, I want to try to optimize it further.
Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive, and for now this is the shortest query that I've come up with.
SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum
FROM flights_stats
WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
The table columns are as follows:
Airports
|iata|airport|city|state|country|
Flights_stats
|originAirport|destAirport|FlightsNum|Cancelled|Month|
Recommended answer
Filter by airport (the inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with joins should run in parallel, and should be faster than joining a bigger dataset after the UNION ALL.
SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
UNION ALL
SELECT a.airport, COUNT(*) as cnt
FROM flights_stats f
INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport
) f
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;
Tune map joins and enable parallel execution:
set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
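To check whether the join against airports is actually converted to a map join with these settings, one option is to look at the query plan. The following is a minimal sketch using the first branch of the query above; it assumes the airports table is small enough to fit under the configured size threshold. If the conversion happens, the plan should contain a Map Join Operator rather than a common (shuffle) join.

```sql
-- Sketch: inspect the plan for one branch of the UNION ALL.
-- Assumes airports is smaller than hive.mapjoin.smalltable.filesize.
set hive.auto.convert.join=true;

EXPLAIN
SELECT a.airport, COUNT(*) AS cnt
FROM flights_stats f
INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY a.airport;
```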
Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
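As a rough illustration of what such tuning might look like, here is a sketch of session settings. The numeric values are assumptions chosen for illustration only, not recommendations from the linked answer; suitable values depend on the cluster and data volume, and vectorization generally helps most with columnar formats such as ORC.

```sql
-- Illustrative values only; tune for your cluster and data volume.
set hive.execution.engine=tez;                     -- run on Tez instead of MapReduce
set hive.vectorized.execution.enabled=true;        -- vectorized map-side execution
set hive.vectorized.execution.reduce.enabled=true; -- vectorized reduce-side execution

-- Mapper parallelism on Tez: smaller split groups give more mappers.
set tez.grouping.min-size=16777216;                -- 16 MB (assumed value)
set tez.grouping.max-size=268435456;               -- 256 MB (assumed value)

-- Reducer parallelism: fewer bytes per reducer gives more reducers.
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB (assumed value)
```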