How to enforce correct data types in Apache Pig

Problem description

I am having trouble SUMming a bag of values due to a data type error.

When I load a CSV file whose lines look like this:

6   574 false   10.1.72.23  2010-05-16 13:56:19 +0930   fbcdn.net   static.ak.fbcdn.net 304 text/css    1   /rsrc.php/zPTJC/hash/50l7x7eg.css   http    pwong

using the following:

logs_base = FOREACH raw_logs GENERATE
  FLATTEN(
     EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
  )
  as (
    account_id: int,
    bytes: long,
    cached: chararray,
    ip: chararray,
    time: chararray,
    domain: chararray,
    host: chararray,
    status: chararray,
    mime_type: chararray,
    page_view: chararray,
    path: chararray,
    protocol: chararray,
    username: chararray
  );
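
EXTRACT here is a UDF, presumably piggybank's string EXTRACT, which would have been registered with something along these lines (the jar path below is just a placeholder):

REGISTER /path/to/piggybank.jar;  -- adjust to wherever piggybank.jar actually lives
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();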

如"describe"命令所示,所有字段似乎都可以正确加载并具有正确的类型:

All fields seem to be loaded fine, and with the right type, as shown by the "describe" command:

grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}

Whenever I perform a SUM using:

bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);

and then store or dump the contents, the MapReduce job fails with the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87)
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79)
    ... 15 more

The line that catches my attention is:

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long

This leads me to believe that the EXTRACT function is not converting the bytes field to the required data type (long).

Is there a way to force the EXTRACT function to convert to the correct data types? How can I cast the values without having to do a FOREACH over all the records? (The same problem happens when converting the time to a Unix timestamp and attempting to find the MIN; I would definitely like a solution that does not require unnecessary projections.)
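
For reference, the kind of per-record projection I would rather avoid looks something like this (logs_typed is just an illustrative alias):

-- Cast bytes row by row, re-listing every other field unchanged:
logs_typed = FOREACH logs_base GENERATE
    account_id, (long) bytes AS bytes, cached, ip, time, domain,
    host, status, mime_type, page_view, path, protocol, username;
bytesCount = FOREACH (GROUP logs_typed ALL) GENERATE SUM(logs_typed.bytes);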

Any pointers will be appreciated. Thanks a lot for your help.

Regards, Jorge C.

P.S. I am running this in interactive mode on the Amazon Elastic MapReduce service.

Answer

Have you tried to cast the data retrieved from the UDF? Applying the schema here does not perform any casting.

For example:

logs_base =
   FOREACH raw_logs
   GENERATE
       FLATTEN(
           (tuple(INT,LONG,CHARARRAY,....)) EXTRACT(line, '^...')
       )
       AS (account_id: INT, ...);
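
Spelled out against the schema from the question, the full cast would look something like this (a sketch, untested):

logs_base = FOREACH raw_logs GENERATE
   FLATTEN(
      -- Cast the tuple returned by the UDF so each field gets a real type:
      (tuple(int, long, chararray, chararray, chararray, chararray, chararray,
             chararray, chararray, chararray, chararray, chararray, chararray))
      EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
   )
   as (
     account_id: int,
     bytes: long,
     cached: chararray,
     ip: chararray,
     time: chararray,
     domain: chararray,
     host: chararray,
     status: chararray,
     mime_type: chararray,
     page_view: chararray,
     path: chararray,
     protocol: chararray,
     username: chararray
   );

With the cast in place, the values in logs_base.bytes reach SUM as actual longs rather than chararrays, and the LongSum ClassCastException goes away.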
