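I am trying to work with the following dataset in Pig:
https://www.kaggle.com/zynicide/wine-reviews/version/4
My query returns wrong values. The only cause I can think of is missing data in the dataset,
but I don't know whether that is really the case, or why I'm getting the wrong values.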

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted  5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;

DESCRIBE allWinesPricePoints;

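The actual result I get is:
(56203, buttered toast and spice flavors, wrapped in a creamy texture. Should hold for a year or two.")
(61341, sweet tannins. Fresh acidity gives it more bite. Give it time. Best 2007–2012.")
(16417, Chardonnay is also known)
(115384, almond and vanilla)
(136804, almond and vanilla)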

I think the output should be:
(56203,23)
(61341,30)
(16417,16)
(115384,250)
(136804,250)

I expected the second value to be a number taken from the price column.

Best answer

Proceed as follows:

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

-- Add the FOREACH below to generate every column explicitly; this should help the data parse correctly.
-- Generate the columns in the same order as they appear in the text file.
allWines = FOREACH allWines GENERATE
id AS id,
country AS country,
description AS description,
designation AS designation,
points AS points,
price AS price,
province AS province,
region_2 AS region_2,
region_1 AS region_1,
variety AS variety,
winery AS winery;

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted  5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
DESCRIBE allWinesPricePoints;
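
If the price column still shows pieces of the description after this, a likely cause is that PigStorage(',') splits on every comma, including the commas inside the quoted description text, which pushes the later columns out of place. A minimal sketch of one possible workaround, assuming piggybank.jar is available (the path below is a placeholder) and a Pig version whose CSVExcelStorage accepts the header option; declaring points and price as numeric is also my assumption, not part of the original query:

```pig
-- Sketch only: the jar path is an assumption, adjust it to your installation.
REGISTER piggybank.jar;

-- CSVExcelStorage honors quoted fields, so commas inside description stay in one column.
-- YES_MULTILINE allows line breaks inside quoted fields; SKIP_INPUT_HEADER drops the header row.
-- Column order is copied from the question; points/price types are an assumption.
allWines = LOAD 'winemag-data_first150k.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (id:chararray, country:chararray, description:chararray, designation:chararray,
        points:int, price:double, province:chararray, region_2:chararray,
        region_1:chararray, variety:chararray, winery:chararray);

-- Values that cannot be cast to a number become null and are filtered out here.
allWinesNotNull = FILTER allWines BY price is not null AND points is not null;

-- With a numeric price, ORDER sorts by value rather than as text (add DESC for the highest prices).
allWinesPriceSorted = ORDER allWinesNotNull BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted 5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
```

With price kept as chararray, as in the original query, ORDER compares the values as strings, so '100' sorts before '23'; that is why the sketch declares a numeric type.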

Hope this helps.
Let me know if you have any questions.

Regarding "hadoop - Why is my Pig query returning the wrong values", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55909565/
