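I have the following Pig code, in which I check two fields from the main files (srcgt and destgt in record) against the values listed in another file (intlgt.txt), which contains the values 338, 918299, 181, 238. It throws the error shown below. Can you suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?
Pig code: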
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;
The error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Best answer
When you use
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
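Pig is looking for a scalar. It can be a number or a chararray, but it must be a single value. Pig therefore assumes that intlgt::intlgt refers to a relation with exactly one row, for example the result of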
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0);
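(this produces a single row holding the number of records in the original relation).
In your case, intlgt contains more than one row, because you have not done any grouping on it. That is exactly what the error reports: the first row is (338) and the second is (918299).
Based on your code, you are trying to find SMS messages that have an intlgt value at either end (source or destination). Possible solutions: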
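For contrast, here is a minimal sketch of a scalar reference that Pig does accept, because the referenced relation has exactly one row. The alias names total and withtotal and the field names n and prefix_count are hypothetical; cdrfilter and intlgtrec are the relations from the question.

```pig
-- GROUP ... ALL collapses intlgtrec into a single group, so total has one row.
total = FOREACH (GROUP intlgtrec ALL) GENERATE COUNT_STAR(intlgtrec) AS n;

-- total.n is a scalar projection: legal only because total has exactly one row.
withtotal = FOREACH cdrfilter GENERATE srcgt, destgt, total.n AS prefix_count;
```

If total ever held two or more rows, the same "Scalar has more than one row in the output" error would be raised at run time.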
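One way to do this, offered as a sketch rather than a definitive fix, is to pair every CDR with every prefix using CROSS and then filter with STARTSWITH. It assumes intlgt.txt holds one prefix per line (as the error output suggests) and that the prefix list is short enough for a cross product to be affordable. The alias names prefixes, crossed, matched and projected are hypothetical.

```pig
record    = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile');
-- Assumption: one prefix per line, read as a single chararray field.
prefixes  = LOAD '/u02/config/intlgt.txt' AS (prefix:chararray);

cdrfilter = FOREACH record GENERATE
    (chararray) $1  AS aparty,  (chararray) $2  AS bparty,
    (chararray) $3  AS dt,      (chararray) $4  AS timestamp,
    (chararray) $29 AS status,  (chararray) $26 AS srcgt,
    (chararray) $27 AS destgt,  (chararray) $0  AS cdrfname,
    (chararray) $13 AS prepost;

-- Pair each CDR with each prefix, then keep the pairs where either end matches.
crossed = CROSS cdrfilter, prefixes;
matched = FILTER crossed BY STARTSWITH(cdrfilter::srcgt,  prefixes::prefix)
                         OR STARTSWITH(cdrfilter::destgt, prefixes::prefix);

-- Drop the prefix column and restore the original field names.
projected = FOREACH matched GENERATE
    cdrfilter::aparty AS aparty, cdrfilter::bparty   AS bparty,
    cdrfilter::dt     AS dt,     cdrfilter::timestamp AS timestamp,
    cdrfilter::status AS status, cdrfilter::srcgt    AS srcgt,
    cdrfilter::destgt AS destgt, cdrfilter::cdrfname AS cdrfname,
    cdrfilter::prepost AS prepost;

-- A CDR that matches several prefixes appears once per match; DISTINCT removes the repeats.
intlcdrs = DISTINCT projected;
```

Note that DISTINCT also collapses CDRs that are genuinely identical in every projected field, so leave it out if such duplicates must be preserved and each CDR can match at most one prefix.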
Source: "hadoop - PIG: Scalar has more than one row in the output" on Stack Overflow: https://stackoverflow.com/questions/35739830/