Problem description
- I have a table with two columns (col1: string, col2: boolean).
- Let's say col1 = "aaa".
- For col1 = "aaa", there are many True/False values of col2.
- I want to calculate the percentage of True values for each col1 value (e.g. aaa).
Input:
aaa T
aaa F
aaa F
bbb T
bbb T
ccc F
ccc F
Output:
COL1 TOTAL_ROWS_IN_INPUT_TABLE PERCENTAGE_TRUE_IN_INPUT_TABLE
aaa 3 33%
bbb 2 100%
ccc 2 0%
How would I do this using Pig (Pig Latin)?
Recommended answer
In Pig 0.10, SUM(INPUT.col2) does not work, and casting to boolean is not possible, because Pig treats INPUT.col2 as a bag of booleans and a bag is not a primitive type. Also, if col2 is declared as boolean in the load schema, a dump of the input shows no values for col2, whereas declaring it as chararray works just fine.
Pig is well suited to this kind of task, because operators nested inside a FOREACH can work on each group individually. Here is a solution that works:
inpt = load '....' as (col1 : chararray, col2 : chararray);
grp = group inpt by col1; -- creates a bag of rows for each value of col1
result = foreach grp {
    total = COUNT(inpt);
    t = filter inpt by col2 == 'T'; -- bag containing only the 'T' rows
    generate flatten(group) as col1,
             total as TOTAL_ROWS_IN_INPUT_TABLE,
             100*(double)COUNT(t)/(double)total as PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
dump result;
Output:
(aaa,3,33.333333333333336)
(bbb,2,100.0)
(ccc,2,0.0)
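A possible variation, offered only as an untested sketch: because the percentage of 'T' rows is the mean of a 0/1 flag, the nested filter could be replaced by mapping col2 to an integer before grouping and then taking AVG. The relation name flags and the field name is_true are illustrative, not from the original answer, and the load path is left elided as above. The last digits of the result may differ slightly from the output shown.

```pig
inpt = load '....' as (col1 : chararray, col2 : chararray);
-- map 'T' to 1 and anything else to 0 (is_true is an illustrative field name)
flags = foreach inpt generate col1, ((col2 == 'T') ? 1 : 0) as is_true;
grp = group flags by col1;
result = foreach grp generate
    group as col1,
    COUNT(flags) as TOTAL_ROWS_IN_INPUT_TABLE,
    100.0 * AVG(flags.is_true) as PERCENTAGE_TRUE_IN_INPUT_TABLE; -- mean of the 0/1 flag, scaled to a percentage
dump result;
```

This keeps col2 as a chararray, so it sidesteps the boolean loading problem described above in the same way the nested FOREACH version does.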