This article covers an Apache Pig error that occurs when dumping JSON data loaded with JsonLoader, and how to resolve it.

Problem description

I have a JSON file that I want to load with Apache Pig.

I am using the built-in JsonLoader to load the JSON data. Below is the sample data.

cat jsondata1.json
{ "response": { "id": 10123, "thread": "Sloths", "comments": ["Sloths are adorable So chill"] }, "response_time": 0.425 }
{ "response": { "id": 13828, "thread": "Bigfoot", "comments": ["hello world"] } , "response_time": 0.517 }

Here I load the JSON data with the built-in JsonLoader. The load statement itself raises no error, but dumping the data gives the following error.

grunt> a = load '/home/cloudera/jsondata1.json' using JsonLoader('response:tuple (id:int, thread:chararray, comments:bag {tuple(comment:chararray)}), response_time:double');

grunt> dump a;

2016-04-17 01:11:13,286 [pool-4-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/home/cloudera/jsondata1.json:0+229
2016-04-17 01:11:13,287 [pool-4-thread-1] WARN  org.apache.hadoop.conf.Configuration - dfs.https.address is deprecated. Instead, use dfs.namenode.https-address
2016-04-17 01:11:13,311 [pool-4-thread-1] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2016-04-17 01:11:13,321 [pool-4-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: a[5,4] C:  R:
2016-04-17 01:11:13,349 [Thread-16] INFO  org.apache.hadoop.mapred.LocalJobRunner - Map task executor complete.
2016-04-17 01:11:13,351 [Thread-16] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local801054416_0004
java.lang.Exception: org.codehaus.jackson.JsonParseException: Current token (FIELD_NAME) not numeric, can not use numeric value accessors
 at [Source: java.io.ByteArrayInputStream@2484de3c; line: 1, column: 120]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: org.codehaus.jackson.JsonParseException: Current token (FIELD_NAME) not numeric, can not use numeric value accessors
 at [Source: java.io.ByteArrayInputStream@2484de3c; line: 1, column: 120]
    at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
    at org.codehaus.jackson.impl.JsonNumericParserBase._parseNumericValue(JsonNumericParserBase.java:399)
    at org.codehaus.jackson.impl.JsonNumericParserBase.getDoubleValue(JsonNumericParserBase.java:311)
    at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:203)
    at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2016-04-17 01:11:13,548 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local801054416_0004
2016-04-17 01:11:13,548 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a
2016-04-17 01:11:13,548 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[5,4] C:  R:
2016-04-17 01:11:18,059 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-04-17 01:11:18,059 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local801054416_0004 has failed! Stop running all dependent jobs
2016-04-17 01:11:18,059 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-04-17 01:11:18,059 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2016-04-17 01:11:18,060 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
2016-04-17 01:11:18,060 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.0.0-cdh4.7.0  0.11.0-cdh4.7.0 cloudera    2016-04-17 01:11:12 2016-04-17 01:11:18 UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_local801054416_0004 a   MAP_ONLY    Message: Job failed!    file:/tmp/temp-1766116741/tmp1151698221,

Input(s):
Failed to read data from "/home/cloudera/jsondata1.json"

Output(s):
Failed to produce result in "file:/tmp/temp-1766116741/tmp1151698221"

Job DAG:
job_local801054416_0004


2016-04-17 01:11:18,060 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2016-04-17 01:11:18,061 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias a
Details at logfile: /home/cloudera/pig_1460877001124.log

I have not been able to find the issue. How do I define the correct schema for the JSON data above?

Solution

Try this schema for the comments field:

comments:{(chararray)}

The version in your load statement:

comments:bag {tuple(comment:chararray)}

fits JSON in which each comment is a nested object:

"comments": [{"comment": "hello world"}]

whereas your data contains plain string values, not nested documents:

"comments": ["hello world"]
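
If changing the schema is not convenient, the same mismatch can in principle be approached from the data side: reshape each record so that every comment becomes a nested object, which is the shape the original bag {tuple(comment:chararray)} schema describes. The Python sketch below illustrates that reshaping on the two sample records. It has not been run against Pig, and the helper names wrap_comments and convert_lines, as well as the "comment" key, are assumptions chosen to mirror the schema in the question.

```python
import json


def wrap_comments(record, key="comment"):
    """Return a copy of record where each plain-string comment is wrapped
    in an object, e.g. "hello world" -> {"comment": "hello world"}.
    The key name is an assumption taken from the schema in the question."""
    response = dict(record["response"])
    response["comments"] = [
        c if isinstance(c, dict) else {key: c}
        for c in response.get("comments", [])
    ]
    # Rebuild the record with the original top-level field order preserved.
    return {**record, "response": response}


def convert_lines(lines):
    """Convert newline-delimited JSON records, skipping blank lines."""
    return [
        json.dumps(wrap_comments(json.loads(line)))
        for line in lines
        if line.strip()
    ]


# The two sample records from the question.
sample = [
    '{ "response": { "id": 10123, "thread": "Sloths", "comments": ["Sloths are adorable So chill"] }, "response_time": 0.425 }',
    '{ "response": { "id": 13828, "thread": "Bigfoot", "comments": ["hello world"] } , "response_time": 0.517 }',
]

converted = convert_lines(sample)
for line in converted:
    print(line)
```

Writing the converted lines to a new file and loading that file with the original schema would be the next step; whether that loads cleanly depends on the Pig version, so treat this as a starting point for experimentation.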
