我有一个这样的文件:

232404812.913232|1248|ip:tcp:jxta
232404812.913238|66|ip:udp:data
232404812.913615|98|ip:udp:l2tp:ppp:ip:tcp

我执行了以下HiveQL命令:
CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 's3://b-file/input/';

CREATE EXTERNAL TABLE b_packet_out (protocol string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 's3://b-file/output/1/';

INSERT OVERWRITE TABLE b_packet_out SELECT 'overall',
COUNT(*) FROM b_packet GROUP BY protocol;

INSERT INTO TABLE b_packet_out SELECT 'tcp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:tcp';

INSERT INTO TABLE b_packet_out SELECT 'udp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:udp';

INSERT INTO TABLE b_packet_out SELECT 'icmp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:icmp';

这样我在输出表中有以下内容。
hive> select * from b_packet_out;
OK
udp 2241
overall 10000
icmp    64
tcp 7633

HiveQL查询是否有更优美的方式,因此我可以减少行数以获得相同的输出?

最佳答案

select
count(*) as overall,
sum( if(protocol like '^ip:tcp',1,0) as tcp,
sum( if(protocol like '^ip:udp',1.0) as udp,
sum( if(protocol like '^ip:icmp'1,0) as icmp
from b_packet

这样一遍数据就会产生相同的计数。

如果您有更多协议(protocol),也可以说
选择
split(协议(protocol),':')[1],
计数(*)
按拆分分组(协议(protocol),':')[1]
但这不会给出总数。

关于sql - 优美的HiveQL查询,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/20317756/

10-11 08:53