I want to create a list of n-grams using HiveQL. My idea was to use a regular expression with the split function, but this doesn't work:
select split('This is my sentence', '(\\S+) +(?=(\\S+))');
The input is a column of a table:
|sentence |
|-------------------------|
|This is my sentence |
|This is another sentence |
The output should be:
["This is","is my","my sentence"]
["This is","is another","another sentence"]
Hive has a built-in n-gram UDF, but that function directly computes n-gram frequencies; I want to see the list of all n-grams.
Thanks in advance!
Best answer
This may not be the optimal solution, but it works. Split the sentence on delimiters (one or more spaces or commas in my example), then explode and self-join to build the n-grams, and finally use collect_set (if you need unique n-grams) or collect_list to assemble the n-gram array:
with src as
(
select source_data.sentence, words.pos, words.word
from
(--Replace this subquery (source_data) with your table
select stack (2,
'This is my sentence',
'This is another sentence'
) as sentence
) source_data
--split and explode words
lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams
from src s1
inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
group by s1.sentence;
Result:
OK
This is another sentence ["This is","is another","another sentence"]
This is my sentence ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
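The same pattern extends to longer n-grams by adding one self-join per extra word: each join matches the next position in the same sentence. The sketch below (untested, reusing the same src CTE as above) shows a hypothetical trigram variant:

with src as
(
select source_data.sentence, words.pos, words.word
from
(--Replace this subquery (source_data) with your table
select stack (2,
'This is my sentence',
'This is another sentence'
) as sentence
) source_data
--split and explode words
lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
--join three consecutive positions: pos, pos+1, pos+2
select s1.sentence, collect_set(concat_ws(' ', s1.word, s2.word, s3.word)) as trigrams
from src s1
inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
inner join src s3 on s1.sentence=s3.sentence and s1.pos+2=s3.pos
group by s1.sentence;

Note that sentences with fewer than three words simply produce no rows from the joins, so they drop out of the result.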
Regarding "sql - How to generate all n-grams in Hive", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52782188/