我不明白HiveQL rank()。我在WWW上找到了一些排名UDF的实现,例如Edward's nice example。我可以加载和访问功能,但无法让它们做我想做的事情。这是一个详细的示例:

将UDF加载到CLI进程中:

$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com/m6d/hiveudf/Rank2.java
$ jar -cvf Rank2.jar com/m6d/hiveudf/Rank2.class
hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;
hive> CREATE TEMPORARY FUNCTION Rank2 AS 'com.m6d.hiveudf.Rank2';

创建一个表:
create table purchases (
  SalesRepId String,
  PurchaseOrderId INT,
  Amount INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

从此CSV加载数据:
Jana,1,100
Nadia,2,200
Nadia,3,600
Daniel,4,80
Jana,5,120
William,6,170
Daniel,7,140

通过CLI进行以下操作:
LOAD DATA
  LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
  INTO TABLE purchases;

现在,我可以看到我的主要销售代表:
select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC;

纳迪亚卖了800美元,丹尼尔和贾娜都卖了220美元,威廉卖了170美元
SalesRep    Amount
--------    ------
Nadia       800
Daniel      220
Jana        220
William     170

现在我只想给他们编号:纳迪亚(Nadia)是第一名,丹尼尔(Daniel)和贾娜(Jana)并列第二名,威廉(William)是第四名(不是第三名)。
select SalesRepId, V.volume,rank2(V.volume)
from
(select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC) V;

这是我得到的,但不是我想要的:
SalesRep   Amount  Rank
--------   ------  ----
Nadia       800      1
Daniel      220      1
Jana        220      2
William     170      1

这就是我想要的,但是我不能让它为我做:
SalesRep   Amount  Rank
--------   ------  ----
Nadia       800      1
Daniel      220      2
Jana        220      2
William     170      4

您可以通过正确的HiveQL帮助我对销售代表进行排名吗?

感谢JtheRocker的回应。他的更改产生了以下列表:
SalesRep   Amount  Rank
--------   ------  ----
William     170     1
Daniel      220     2
Jana        220     2
Nadia       800     3

稍作修改以将Nadia显示为第四(不是第三):
private row_number;
@Override
public Object evaluate(DeferredObject[] currentKey) throws HiveException {
  row_number++;
  if (!sameAsPreviousKey(currentKey)) {
    this.counter = row_number;
    copyToPreviousKey(currentKey);
  }
return new Long(this.counter);
}

最佳答案

通过Hive 0.11中引入的Windowing and Analytics functions,您可以使用:

select SalesRepId, volume as amount , rank() over (order by V.volume desc) as rank from
(select SalesRepId,sum(amount) as volume from purchases group by SalesRepId) V;

关于hadoop - HiveQL和rank(),我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/18156105/

10-12 19:08