Problem description
I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the tweet column itself contains comma-separated values.
For example, one row of data:
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
Recommended answer
In the use case shared, reading the data using PigStorage(',') drops savava143 (the last field value). PigStorage splits on every comma, including the one inside the quoted tweet, so the line yields four fields and the three-field schema keeps only the first three.
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output of A (observe that the last field value is missing):
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1: Using CSVExcelStorage
Ref: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input: a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig script:
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output: A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2: Using CSVLoader
Ref: http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below uses CSVLoader() instead. Assuming piggybank.jar has been registered as in Approach 1, DUMP A produces the same output as shown earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
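On the error message itself: FILTER expects a boolean expression, whereas REGEX_EXTRACT returns the captured group as a chararray (or null when nothing matches), which is why Pig rejects the script in the question. The following is a minimal sketch of ways around that, not a definitive fix. It assumes the goal is to work with tweets matching some pattern; the pattern '.*(mad).*', the field names, and the relation names B, C, and D are illustrative and do not come from the original question.

```pig
REGISTER piggybank.jar;

-- Load with a quote-aware loader so the whole tweet stays in one field.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader()
    AS (tweetid:chararray, tweet:chararray, userid:chararray);

-- Option 1: MATCHES evaluates to a boolean, so it is valid inside FILTER.
B = FILTER A BY tweet MATCHES '.*mad.*';

-- Option 2: turn REGEX_EXTRACT into a boolean by testing its result for null.
C = FILTER A BY REGEX_EXTRACT(tweet, '.*(mad).*', 1) IS NOT NULL;

-- To pull a substring out instead of filtering rows, call it in FOREACH.
D = FOREACH A GENERATE tweetid, userid,
        REGEX_EXTRACT(tweet, '.*(mad).*', 1) AS hit;

DUMP D;
```

If the three fields only need to be separated, the loaders above already do that and no regular expression is required.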