java - Why does this Pig UDF result in "Error: Java heap space", given that I am spilling the DataBag to disk?

This is my UDF:

public DataBag exec(Tuple input) throws IOException {
    Aggregate aggregatedOutput = null;

    int spillCount = 0;

    DataBag outputBag = BagFactory.getInstance().newDefaultBag();
    DataBag values = (DataBag)input.get(0);
    for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
        Tuple tuple = iterator.next();
        //spillCount++;
        ...
        if (some condition regarding current input tuple){
            //do something to aggregatedOutput with information from input tuple
        } else {
            //Because input tuple does not apply to current aggregateOutput
            //return current aggregateOutput and apply input tuple
            //to new aggregateOutput
            Tuple returnTuple = aggregatedOutput.getTuple();
            outputBag.add(returnTuple);
            spillCount++;
            aggregatedOutput = new Aggregate(tuple);


            if (spillCount == 1000) {
                outputBag.spill();
                spillCount = 0;
            }
        }
    }
    return outputBag;
}

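Please note that the bag is spilled to disk after every 1000 tuples added to it. I have set this number as low as 50 and as high as 100,000, yet I still get the memory error: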

Pig logfile dump:

Backend error message
---------------------
Error: Java heap space

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space

How can I solve this? The UDF is processing about a million rows.

Here is the solution

Using the Accumulator interface:

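import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

public class Foo extends EvalFunc<DataBag> implements Accumulator<DataBag> {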
    private DataBag outputBag = null;
    private Aggregate currentAggregation = null;

    // Pig may call accumulate() several times for one key, each time with a batch of tuples
    public void accumulate(Tuple input) throws IOException {
        DataBag values = (DataBag)input.get(0);
        if (outputBag == null) {
            // Create the bag once per key so that earlier batches are not discarded
            outputBag = BagFactory.getInstance().newDefaultBag();
        }

        for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
            Tuple tuple = iterator.next();
            ...
            if (some condition regarding current input tuple){
                //do something to currentAggregation with information from input tuple
            } else {
                //Because input tuple does not apply to currentAggregation
                //emit currentAggregation and apply input tuple
                //to a new currentAggregation
                outputBag.add(currentAggregation.getTuple());
                currentAggregation = new Aggregate(tuple);
            }
        }
    }

    // Called when all tuples from current key have been passed to accumulate
    public DataBag getValue() {
        //Add final current aggregation
        outputBag.add(currentAggregation.getTuple());
        return outputBag;
    }
    // This is called after getValue()
    // Resets the state so that the next key starts with a fresh bag and aggregation
    public void cleanup() {
        outputBag = null;
        currentAggregation = null;
    }

    public DataBag exec(Tuple input) throws IOException {
        // Same as above ^^ but this doesn't appear to ever be called.
    }

    public Schema outputSchema(Schema input) {
        try {
            return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            e.printStackTrace();
            return null;
        }
    }

    class Aggregate {
        ...
        public Tuple getTuple() {
            Tuple output = TupleFactory.getInstance().newTuple(OUTPUT_TUPLE_SIZE);
            try {
                output.set(0, val);
                ...
            } catch (ExecException e) {
                e.printStackTrace();
                return null;
            }
            return output;
        }
        ...
    }
}

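Best answer

You should increment spillCount every time you append to outputBag, not every time you get a tuple from the iterator. Otherwise you only spill when spillCount happens to hit the threshold and your if condition is not met at the same moment, which may not happen very often (depending on the logic). That would explain why you don't see much difference between spill thresholds.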
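
To illustrate why the counter placement matters, here is a minimal plain-Java sketch. It uses no Pig classes; `SpillCounterDemo`, `countFlushes`, and the `appendEvery` parameter are hypothetical stand-ins, and the "append on every Nth tuple" rule merely imitates the else branch of the question's loop. As in the question, the threshold is compared with `==` and only inside the append branch.

```java
/** Plain-Java stand-in (no Pig classes) for the spill-counter logic. */
class SpillCounterDemo {

    /**
     * Walks `inputs` tuples. Every `appendEvery`-th tuple is treated as an
     * append to the output bag. Returns how many flushes (spills) happen.
     *
     * countPerAppend = true  : counter is incremented once per append
     * countPerAppend = false : counter is incremented once per input tuple
     */
    static int countFlushes(int inputs, int appendEvery, int threshold,
                            boolean countPerAppend) {
        int counter = 0;
        int flushes = 0;
        for (int i = 1; i <= inputs; i++) {
            if (!countPerAppend) {
                counter++;                 // counts every tuple read
            }
            boolean appended = (i % appendEvery == 0);
            if (appended) {
                if (countPerAppend) {
                    counter++;             // counts only tuples added to the bag
                }
                // The threshold check lives in the append branch only
                if (counter == threshold) {
                    flushes++;
                    counter = 0;
                }
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println("per append: " + countFlushes(1_000_000, 7, 1000, true));
        System.out.println("per tuple:  " + countFlushes(1_000_000, 7, 1000, false));
    }
}
```

With these made-up numbers the per-tuple counter is never exactly 1000 at a position where the check runs, so it skips past the threshold and no flush happens at all, while the per-append counter flushes regularly.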

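If that doesn't solve your problem, I would try extending AccumulatorEvalFunc<DataBag>. In your case you don't actually need access to the whole bag. Your logic suits an accumulator-style implementation because you only ever need the current tuple, and this may reduce memory usage. Essentially, you would have an instance variable of type DataBag that accumulates the final output, plus an instance variable holding the current aggregation. A call to accumulate() would then either 1) update the current aggregation, or 2) add the current aggregation to the output bag and start a new aggregation. This basically follows the body of your for loop.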
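
The accumulator pattern described here can be sketched in plain Java. This is only an illustration under stated assumptions, not Pig's API: `RunAccumulator` is a hypothetical class, a `List` stands in for the output DataBag, and a key/value entry summed over runs of equal keys stands in for the aggregation. The method names mirror accumulate(), getValue(), and cleanup().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for an accumulator-style UDF; no Pig classes are used. */
class RunAccumulator {
    // Plays the role of the output DataBag (hypothetical stand-in)
    private List<Map.Entry<String, Long>> output = null;
    // Plays the role of the current aggregation
    private Map.Entry<String, Long> current = null;

    /** May be called many times for one group, each time with a batch of rows. */
    void accumulate(List<Map.Entry<String, Long>> batch) {
        if (output == null) {
            output = new ArrayList<>();   // created once per group, not per batch
        }
        for (Map.Entry<String, Long> row : batch) {
            if (current != null && current.getKey().equals(row.getKey())) {
                // 1) the row belongs to the current aggregation: update it
                current = Map.entry(current.getKey(), current.getValue() + row.getValue());
            } else {
                // 2) the row starts a new aggregation: emit the old one first
                if (current != null) {
                    output.add(current);
                }
                current = row;
            }
        }
    }

    /** Called after the last batch of a group: emits the final aggregation. */
    List<Map.Entry<String, Long>> getValue() {
        if (output == null) {
            output = new ArrayList<>();
        }
        if (current != null) {
            output.add(current);
            current = null;               // avoid adding it twice
        }
        return output;
    }

    /** Resets state before the next group. */
    void cleanup() {
        output = null;
        current = null;
    }
}
```

Because the state lives in instance variables, only one batch plus the running output is held at a time, and the aggregation carries over correctly when a run of equal keys is split across two accumulate() calls.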