Question
I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested FOREACH:
You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).
It is simple to perform a DISTINCT operation on all of the columns:
A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;
Let's say that I am interested in performing the DISTINCT across a1, a2, and a3 only. Can anyone provide an example showing how to perform this operation with a nested FOREACH, as suggested in the documentation?
Here's an example of input and expected output:
A = LOAD 'data' AS (a1,a2,a3,a4);
DUMP A;
(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)
-- insert DISTINCT operation on a1,a2,a3 here:
-- ...
DUMP A_unique;
(1 2 3 4)
(1 2 4 4)
Recommended answer
Group on all the other columns (here, a4), project just the columns of interest into a bag, apply DISTINCT to that bag, and then use FLATTEN to expand it out again:
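A_unique =
    FOREACH (GROUP A BY a4) {
        -- project only the columns of interest into a bag
        b = A.(a1,a2,a3);
        -- remove duplicate (a1,a2,a3) tuples within this a4 group
        s = DISTINCT b;
        -- expand the bag back into rows and re-attach the group key
        GENERATE FLATTEN(s), group AS a4;
    };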
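Note that on the sample input this query also keeps (1 2 3 5), because that row falls into a separate a4 group; deduplicating (a1,a2,a3) within each a4 group behaves like DISTINCT over all four columns. If exactly one row per (a1,a2,a3) combination is wanted, as in the expected output, one possible variation is to group by those three columns instead and keep a single row from each group. This is only a sketch, not part of the original answer: it assumes a Pig version that allows ORDER BY and LIMIT inside a nested block, the aliases by_a4 and kept are illustrative names, and sorting by a4 is an assumption made so that the retained row is predictable.

```pig
A = LOAD 'data' AS (a1,a2,a3,a4);

A_unique =
    FOREACH (GROUP A BY (a1,a2,a3)) {
        -- order the rows of this group so the kept row is predictable (assumption)
        by_a4 = ORDER A BY a4;
        -- keep one full row per (a1,a2,a3) combination
        kept = LIMIT by_a4 1;
        GENERATE FLATTEN(kept);
    };
```

On the sample rows this should leave one row for (1 2 3) and one for (1 2 4), each carrying the smallest a4 value in its group.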