Pig: How to join data on keys in a nested bag

Question

I'm simply trying to merge the values from data2 into data1 on the 'value1'/'value2' keys seen in both data1 and data2 (note the nested structure of

Easy, right? In object-oriented code it's a nested for loop. But in Pig it feels like solving a Rubik's cube.

data1 = 'item1'     111     { ('thing1', 222, {('value1'),('value2')}) }
data2 = 'value1'    'result1'
        'value2'    'result2'

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );

expected: 'item1', 111, {('thing1', 222, {('value1','result1'), ('value2','result2')})}
                                           ^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^

For the curious: data1 comes from an object-oriented datastore, which explains the double nesting (a simple object-oriented format).

Recommended answer

It sounds like you basically just want to do a join (it's unclear from the question whether this should be INNER, LEFT, RIGHT, or FULL). I think @SNeumann basically has the right answer, but I'll add some code to make it clearer.

Assuming the data looks like this:

data1 = 'item1'     111     { ('thing1', 222, {('value1'),('value2')}) }
        ...
data2 = 'value1'    'result1'
        'value2'    'result2'
        ...

I'd do something like (untested):

A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
A_flattened = FOREACH A GENERATE item, d, things.thing AS thing, things.d1 AS d1, FLATTEN(things.values) AS value;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1'
--'item1', 111, 'thing1', 222, 'value2'
A_B_joined = JOIN A_flattened BY value, B BY v;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1', 'value1', 'result1'
--'item1', 111, 'thing1', 222, 'value2', 'value2', 'result2'
A_B_joined1 = FOREACH A_B_joined GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
A_B_grouped = GROUP A_B_joined1 BY (value, result);
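
On the join type left open above: the JOIN in the script is an inner join, so any value with no match in data2 is dropped. If those rows should be kept instead, a left outer join is one option. The lines below are a minimal sketch that reuses the aliases from the script above; the aliases A_B_left and A_B_left1 are hypothetical, and like the rest of this answer the code is untested.

```pig
-- Keep every row of A_flattened; B's fields are null where no v matches.
A_B_left = JOIN A_flattened BY value LEFT OUTER, B BY v;
-- Same projection as A_B_joined1, but result may now be null.
A_B_left1 = FOREACH A_B_left GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
```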

From there, rebagging however you like should be trivial.
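
One possible way to rebag into the layout shown under "expected" in the question is sketched below. It continues from A_B_joined1 and groups by the item/thing fields rather than by (value, result). The aliases by_thing, things_rebuilt, and rebagged and the field name vr_pairs are hypothetical, and the code is untested.

```pig
-- Continues from A_B_joined1: (item, d, thing, d1, value, result).
-- Collect all (value, result) pairs that belong to the same item/thing.
by_thing = GROUP A_B_joined1 BY (item, d, thing, d1);
things_rebuilt = FOREACH by_thing GENERATE
    FLATTEN(group) AS (item, d, thing, d1),
    A_B_joined1.(value, result) AS vr_pairs;
-- Wrap each thing in a one-element bag to mirror the expected output.
rebagged = FOREACH things_rebuilt GENERATE
    item, d, TOBAG(TOTUPLE(thing, d1, vr_pairs)) AS things;
```

This produces one row per (item, thing) pair. If an item can own several things, a further GROUP BY (item, d) would be needed to collect them into a single bag.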

EDIT: The above should have used '.' as the projection operator on tuples. I've switched that in. It also assumed things was a big tuple, which it isn't. It's a bag of one item. If the OP never plans to have more than one item in that bag, I'd highly recommend using a tuple instead and loading as:

A = load 'data1' as (item:chararray, d:int, things:(thing:chararray, d1:int, values:bag{(v:chararray)}));

and then using the rest of the code essentially as is (note: still untested).

If a bag is absolutely required, then the entire problem changes, and it becomes unclear what the OP wants to happen when there are multiple things objects in the bag. Bag projection is also quite a bit more complicated as noted here
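
If the bag has to stay, one possible reading is that every thing in the bag should be joined independently. Under that assumption, flattening twice avoids bag projection altogether. The sketch below keeps the question's original bag schema; the alias A_things and the field name vals are hypothetical, and the code is untested.

```pig
-- Original bag schema from the question.
A = LOAD 'data6' AS (item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})});
-- First FLATTEN: one row per thing in the outer bag.
A_things = FOREACH A GENERATE item, d, FLATTEN(things) AS (thing, d1, vals);
-- Second FLATTEN: one row per value inside each thing.
A_flattened = FOREACH A_things GENERATE item, d, thing, d1, FLATTEN(vals) AS value;
```

A_flattened then has the same shape as in the script above, so the JOIN against B can follow unchanged.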

08-24 03:16