How to enforce correct data types in Apache Pig

Problem description

I am having trouble SUMming a bag of values due to a data type error.

When I load a CSV file whose lines look like this:

6   574 false   10.1.72.23  2010-05-16 13:56:19 +0930   fbcdn.net   static.ak.fbcdn.net 304 text/css    1   /rsrc.php/zPTJC/hash/50l7x7eg.css   http    pwong

using the following:

logs_base = FOREACH raw_logs GENERATE
  FLATTEN(
     EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
  )
  as (
    account_id: int,
    bytes: long,
    cached: chararray,
    ip: chararray,
    time: chararray,
    domain: chararray,
    host: chararray,
    status: chararray,
    mime_type: chararray,
    page_view: chararray,
    path: chararray,
    protocol: chararray,
    username: chararray
  );
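
EXTRACT here is a UDF, presumably piggybank's string EXTRACT, which would have been registered with something along these lines (the jar path below is just a placeholder):

REGISTER /path/to/piggybank.jar;  -- adjust to wherever piggybank.jar actually lives
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();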

如"describe"命令所示,所有字段似乎都可以正确加载并具有正确的类型:

All fields seem to be loaded fine, and with the right type, as shown by the "describe" command:

grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}

Whenever I perform a SUM using:

bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);

and then store or dump the contents, the MapReduce job fails with the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87)
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79)
    ... 15 more

The line that catches my attention is:

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long

This leads me to believe that the EXTRACT function is not converting the bytes field to the required data type (long).

Is there a way to force the EXTRACT function to convert to the correct data types? How can I cast the values without having to do a FOREACH over all the records? (The same problem happens when converting the time to a Unix timestamp and attempting to find the MIN; I would definitely like a solution that does not require unnecessary projections.)
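
For reference, the kind of per-record projection I would rather avoid looks something like this (logs_typed is just an illustrative alias):

-- Cast bytes row by row, re-listing every other field unchanged:
logs_typed = FOREACH logs_base GENERATE
    account_id, (long) bytes AS bytes, cached, ip, time, domain,
    host, status, mime_type, page_view, path, protocol, username;
bytesCount = FOREACH (GROUP logs_typed ALL) GENERATE SUM(logs_typed.bytes);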

Any pointers will be appreciated. Thanks a lot for your help.

Regards, Jorge C.

P.S. I am running this in interactive mode on the Amazon Elastic MapReduce service.

Answer

Have you tried to cast the data retrieved from the UDF? Applying the schema here does not perform any casting.

For example:

logs_base =
   FOREACH raw_logs
   GENERATE
       FLATTEN(
           (tuple(INT,LONG,CHARARRAY,....)) EXTRACT(line, '^...')
       )
       AS (account_id: INT, ...);
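
Spelled out against the schema from the question, the full cast would look something like this (a sketch, untested):

logs_base = FOREACH raw_logs GENERATE
   FLATTEN(
      -- Cast the tuple returned by the UDF so each field gets a real type:
      (tuple(int, long, chararray, chararray, chararray, chararray, chararray,
             chararray, chararray, chararray, chararray, chararray, chararray))
      EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
   )
   as (
     account_id: int,
     bytes: long,
     cached: chararray,
     ip: chararray,
     time: chararray,
     domain: chararray,
     host: chararray,
     status: chararray,
     mime_type: chararray,
     page_view: chararray,
     path: chararray,
     protocol: chararray,
     username: chararray
   );

With the cast in place, the values in logs_base.bytes reach SUM as actual longs rather than chararrays, and the LongSum ClassCastException goes away.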
