I'm trying to write a Bloom filter generator in Pig using the built-in BuildBloom UDF. The syntax for calling the BuildBloom UDF is:

define bb BuildBloom('hash_type', 'vector_size', 'false_positive_rate');

The vector size and false positive rate parameters are passed in as chararrays. I don't necessarily know the vector size ahead of time, but it is always available in the script before the BuildBloom UDF is called, so I'd like to use the output of the built-in COUNT UDF rather than some hard-coded value. Something like:
records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE
    (long)     $0 AS value_fld:long,
    (chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);
define bb BuildBloom('jenkins', n, '$false_positive_rate');

The problem is that when I DESCRIBE n, I get: n: {count: chararray}. As expected, the call to the BuildBloom UDF then fails, because it is handed a tuple where it expects a plain chararray. How do I extract just the chararray (i.e. the integer returned by COUNT, cast to chararray) and assign it to n so it can be used in the call to BuildBloom(...)?

EDIT: Here is the error produced when I try to pass N::count into the BuildBloom(...) UDF. DESCRIBE N yields: N: {count: chararray}. The offending line (line 40) reads: define bb BuildBloom('jenkins', N::count, '$fpr');
ERROR 1200: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:604)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: Failed to parse: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
    ... 14 more

Best Answer

If you are working in the grunt shell, the obvious approach is to call DUMP n;, wait for the job to finish running, and paste the resulting value into the define bb BuildBloom(...) call.
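For example, from grunt (the count shown is purely illustrative):

    grunt> DUMP n;
    (12345)
    grunt> define bb BuildBloom('jenkins', '12345', '0.0001');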

I'm guessing that's not a very satisfying answer; most likely you want to run this inside a script. Here is a rather hacky way to do it. You will need three files:

  • 'n_start.txt', which contains:
    n='

  • 'n_end.txt', which contains the single character:
    '

  • 'bloom_build.pig', which contains:
    define bb BuildBloom('jenkins', '$n', '0.0001');

Once you have those, you can run this script:
    records = LOAD '$input' using PigStorage();
    records = FOREACH records GENERATE
        (long)     $0 AS value_fld:long,
        (chararray)$1 AS filter_fld:chararray;
    records_fltr = FILTER records by (filter_fld=='$filter_value')
        AND (value_fld is not null);
    records_grp = GROUP records_fltr all;
    records_count = FOREACH records_grp GENERATE
        (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
    n = FOREACH records_count GENERATE flatten(count);
    
    --the new part
    STORE records_count INTO 'n' USING PigStorage(',');
    --this will copy what you just stored into a local directory
    fs -copyToLocal n n
    --this will cat the two static files we created prior to running pig
    --with the count we just generated.  it will pass it through tr which will
    --strip out the newlines and then store it into a file called 'n.txt' which we
    --will use as a parameter file
    sh cat -s n_start.txt n/part-r-00000 n_end.txt | tr -d '\n' > n.txt
    --RUN makes pig call one script within another.  Be forewarned that if pig returns
    --a message with an error on a certain line, it is the line number of the expanded script
    RUN -param_file n.txt bloom_build.pig;
    

After that, you can call bb just as you originally intended. It's ugly, and someone more fluent in Unix could probably get rid of the n_start.txt and n_end.txt files.
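For reference, a minimal sketch of what that follow-up call could look like, following the GROUP ... ALL pattern shown for BuildBloom in the Pig documentation (the output path 'mybloom' is just a placeholder):

    -- bb is now defined via the RUN-expanded bloom_build.pig
    bloom_filter = FOREACH records_grp GENERATE bb(records_fltr.value_fld);
    STORE bloom_filter INTO 'mybloom';

The stored filter can then be handed to the companion Bloom UDF (define my_bloom Bloom('mybloom');) and used in a FILTER on the large relation.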

Another, cleaner but more involved option is to write a new UDF (like BuildBloom) that extends BuildBloomBase.java but has an empty constructor and handles everything in the exec() method.
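To illustrate that idea only: the sketch below does not actually extend BuildBloomBase (whose protected API is not shown in this post), but builds the filter directly on Hadoop's org.apache.hadoop.util.bloom.BloomFilter, so the sizing can be computed inside exec() from however many elements are actually present. Class and package names are hypothetical, and unlike the built-in BuildBloom this is a plain (non-algebraic) EvalFunc meant to be called on a GROUP ... ALL bag:

    package example.udf;  // hypothetical package

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;

    public class BuildBloomAtExec extends EvalFunc<DataByteArray> {
        private final double fpr;

        // Only the false positive rate is fixed at define time; the element
        // count is derived from the input bag inside exec().
        public BuildBloomAtExec(String falsePositiveRate) {
            this.fpr = Double.parseDouble(falsePositiveRate);
        }

        @Override
        public DataByteArray exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) return null;
            DataBag values = (DataBag) input.get(0);  // e.g. records_fltr.value_fld
            long n = values.size();
            if (n == 0) return null;

            // Standard Bloom sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
            int vectorSize = (int) Math.ceil(-n * Math.log(fpr) / (Math.log(2) * Math.log(2)));
            int numHash = Math.max(1, (int) Math.round(vectorSize / (double) n * Math.log(2)));

            BloomFilter filter = new BloomFilter(vectorSize, numHash, Hash.JENKINS_HASH);
            for (Tuple t : values) {
                Object v = t.get(0);
                if (v != null) {
                    filter.add(new Key(v.toString().getBytes("UTF-8")));
                }
            }

            // Serialize the filter as bytes so the result can be written out with STORE
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            filter.write(new DataOutputStream(baos));
            return new DataByteArray(baos.toByteArray());
        }
    }

It would be wired in with something like define bb example.udf.BuildBloomAtExec('$false_positive_rate'); and called the same way as bb above. Whether its serialized form is byte-for-byte what the built-in Bloom UDF expects would need to be checked; the point is simply that the sizing decision moves out of the constructor and into exec(), so no count has to be known at define time.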

Original question: hadoop - How to extract a simple Pig data type from a complex Pig data type, on Stack Overflow: https://stackoverflow.com/questions/23373000/
