Problem description
I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the values in the tweet column themselves contain commas.
For example, one row of data:
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
Recommended answer
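In the use case shared, reading the data using PigStorage(',') loses savava143 (the last field value). PigStorage splits on every comma, including the one inside the quoted tweet, so the row yields four pieces and the three-field schema keeps only the first three. Separately, the reported error occurs because FILTER requires a boolean condition, while REGEX_EXTRACT returns a string.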
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output of A (observe that the last field value is missing):
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
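The difference between a plain delimiter split and quote-aware CSV parsing is not specific to Pig. The following is a minimal Python sketch of the idea, not Pig code: the helper names naive_split and quote_aware_split are hypothetical, and Python's csv module only approximates what the piggybank loaders do.

```python
import csv
import io

# The sample row from the question, as it appears in a.csv.
ROW = ('396124437168537600,"I really wish I didn\'t give up everything I did '
       "for you, I'm so mad at my self for even letting it get as far as it "
       'did.",savava143')


def naive_split(line):
    """Split on every comma and ignore quotes (hypothetical helper)."""
    return line.split(",")


def quote_aware_split(line):
    """Parse one CSV line, treating commas inside double quotes as data."""
    return next(csv.reader(io.StringIO(line)))


naive = naive_split(ROW)
aware = quote_aware_split(ROW)

# A three-field schema would keep only the first three naive pieces,
# which is why the user id goes missing.
print(len(naive), naive[:3])
print(len(aware), aware)
```

The naive split cuts the tweet in two at its internal comma, so the user id ends up in a fourth piece. The quote-aware parse returns exactly three fields and strips the surrounding double quotes from the tweet.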
Approach 1: Using CSVExcelStorage
Reference: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input: a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig script:
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output: A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2: Using CSVLoader
Reference: http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below uses CSVLoader(); DUMP A will produce the same output as seen earlier.
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);