hive--UDF、UDAF

1、UDF

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.Text;

public final class Lower extends UDF {

  public Text evaluate(final Text s) {

    if (s == null) { return null; }

    return new Text(s.toString().toLowerCase());

  }

}

add jar my_jar.jar;

create temporary function my_lower as 'com.example.hive.udf.Lower';

主要描述了实现一个udf的过程，首先自然是实现一个UDF函数，然后编译为jar并加入到hive的classpath中，最后创建一个临时变量名字让hive中调用。

2、UDAF

package org.apache.hadoop.hive.contrib.udaf.example;

import org.apache.hadoop.hive.ql.exec.UDAF;

import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

/**

 * This is a simple UDAF that calculates average.

 *

 * It should be very easy to follow and can be used as an example for writing

 * new UDAFs.

 *

 * Note that Hive internally uses a different mechanism (called GenericUDAF) to

 * implement built-in aggregation functions, which are harder to program but

 * more efficient.

 *

 */

public final class UDAFExampleAvg extends UDAF {

  /**

   * The internal state of an aggregation for average.

   *

   * Note that this is only needed if the internal state cannot be represented

   * by a primitive.

   *

   * The internal state can also contains fields with types like

   * ArrayList<String> and HashMap<String,Double> if needed.

   */

  public static class UDAFAvgState {

    private long mCount;

    private double mSum;

  }

  /**

   * The actual class for doing the aggregation. Hive will automatically look

   * for all internal classes of the UDAF that implements UDAFEvaluator.

   */

  public static class UDAFExampleAvgEvaluator implements UDAFEvaluator {

    UDAFAvgState state;

    public UDAFExampleAvgEvaluator() {

      super();

      state = new UDAFAvgState();

      init();

    }

    /**

     * Reset the state of the aggregation.

     */

    public void init() {

      state.mSum = 0;

      state.mCount = 0;

    }

    /**

     * Iterate through one row of original data.

     *

     * The number and type of arguments need to the same as we call this UDAF

     * from Hive command line.

     *

     * This function should always return true.

     */

    public boolean iterate(Double o) {

      if (o != null) {

        state.mSum += o;

        state.mCount++;

      }

      return true;

    }

    /**

     * Terminate a partial aggregation and return the state. If the state is a

     * primitive, just return primitive Java classes like Integer or String.

     */

    public UDAFAvgState terminatePartial() {

      // This is SQL standard - average of zero items should be null.

      return state.mCount == 0 ? null : state;

    }

    /**

     * Merge with a partial aggregation.

     *

     * This function should always have a single argument which has the same

     * type as the return value of terminatePartial().

     */

    public boolean merge(UDAFAvgState o) {

      if (o != null) {

        state.mSum += o.mSum;

        state.mCount += o.mCount;

      }

      return true;

    }

    /**

     * Terminates the aggregation and return the final result.

     */

    public Double terminate() {

      // This is SQL standard - average of zero items should be null.

      return state.mCount == 0 ? null : Double.valueOf(state.mSum

          / state.mCount);

    }

  }

  private UDAFExampleAvg() {

    // prevent instantiation

  }

}

关于UDAF开发注意点：

1.需要import org.apache.hadoop.hive.ql.exec.UDAF以及org.apache.hadoop.hive.ql.exec.UDAFEvaluator,这两个包都是必须的

2.函数类需要继承UDAF类，内部类Evaluator实现UDAFEvaluator接口

3.Evaluator需要实现 init、iterate、terminatePartial、merge、terminate这几个函数

1）init函数类似于构造函数，用于UDAF的初始化

2）iterate接收传入的参数，并进行内部的轮转。其返回类型为boolean

3）terminatePartial无参数，其为iterate函数轮转结束后，返回乱转数据，iterate和terminatePartial类似于hadoop的Combiner

4）merge接收terminatePartial的返回结果，进行数据merge操作，其返回类型为boolean

5）terminate返回最终的聚集函数结果