Hi, I have a document loaded into a Hive table named Data. Sample rows look like this:

He is a good boy and but his brother is a bad boy.
He is a naughty boy.

The table's schema is:
create table Data(
    document_data STRING)
row format delimited
fields terminated by '\n'
stored as textfile;

I want to write a query that counts only the occurrences of the words boy and naughty and outputs them as:
 boy 3
 naughty 1

Best answer

Here we use LATERAL VIEW together with explode, which turns a single row into multiple rows (one row per word).

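SELECT
    word,
    COUNT(*) AS occurrences
FROM Data
LATERAL VIEW
    -- LATERAL VIEW belongs to the FROM clause, so it must come before WHERE.
    -- Split on runs of non-letters so that "boy." is counted as "boy".
    explode(split(document_data, '[^a-zA-Z]+')) lateralTable AS word
WHERE
    word = 'boy' OR
    word = 'naughty'
GROUP BY word;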

I adapted the version found in "Word Count program in Hive".
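To sanity-check the expected counts on the two sample rows without a Hive cluster, the split, explode, filter, and group steps can be mimicked in a few lines of Python. This is only a rough sketch, not Hive code: the function name count_words is made up for illustration, and it assumes words are separated by runs of non-letter characters, so that "boy." at the end of a sentence counts as "boy" (splitting on a single space would leave the period attached).

```python
import re
from collections import Counter

def count_words(rows, targets):
    """Count how often each target word occurs across all rows."""
    counts = Counter()
    for row in rows:
        # Like explode(split(...)): one token per word, split on runs of non-letters.
        for word in re.split(r"[^a-zA-Z]+", row):
            # Like the WHERE clause: keep only the words of interest.
            if word in targets:
                counts[word] += 1
    return dict(counts)

# The two sample rows from the question.
rows = [
    "He is a good boy and but his brother is a bad boy.",
    "He is a naughty boy.",
]
result = count_words(rows, {"boy", "naughty"})
print(result)
```

On the sample rows this gives 3 for boy and 1 for naughty, matching the output requested in the question.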

This question (hadoop: searching for occurrences of specific words in a document using Hive) comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/33302316/
