Loading a file delimited by double colons (::) in Pig

Problem description

The following is a sample dataset delimited by double colons (::).

1::Toy Story (1995)::Animation|Children's|Comedy

I want to extract three fields from the dataset above: movieID, title, and genre. I wrote the following code for that:

movies = LOAD 'location/of/dataset/on/hdfs ' 
using PigStorage('::')
as 
(MovieID:int,title:chararray,genre:chararray);

But I get the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to  parse:  
 <file script.pig, line 1, column 9> pig script failed to validate:
 java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[::]'

Recommended answer

Use MyRegExLoader, which requires piggybank.jar. PigStorage accepts only a single-character delimiter, which is why PigStorage('::') cannot be instantiated.

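-- Register piggybank.jar, which provides MyRegExLoader.
REGISTER '/path/to/piggybank.jar';
-- Each capture group in the regular expression becomes one field.
A = LOAD '/path/to/dataset'
    USING org.apache.pig.piggybank.storage.MyRegExLoader('([^\\:]+)::([^\\:]+)::([^\\:]+)')
    AS (movieid:int, title:chararray, genre:chararray);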

Output:

(1,Toy Story (1995),Animation|Children's|Comedy)
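
To see how the three capture groups split a record, here is a minimal Python sketch that applies an equivalent pattern to the sample line. It is only an illustration, not part of the Pig solution: the helper name parse_line is hypothetical, the character class is written as [^:] on the assumption that it is meant to exclude colons, and the sketch requires the whole line to match, which may differ from how MyRegExLoader applies the pattern.

```python
import re

# Three groups of non-colon characters separated by "::" (assumed to be
# equivalent to the pattern passed to MyRegExLoader).
PATTERN = re.compile(r"([^:]+)::([^:]+)::([^:]+)")


def parse_line(line):
    """Return (movie_id, title, genre), or None if the line does not match."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    movie_id, title, genre = match.groups()
    # Mirror the movieid:int cast declared in the Pig schema.
    return int(movie_id), title, genre


record = parse_line("1::Toy Story (1995)::Animation|Children's|Comedy")
print(record)
```

Because each group excludes every colon, a title that itself contains a single colon would not match in this sketch, so it is worth checking the real data for such titles.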
