这是我的UDF:
public DataBag exec(Tuple input) throws IOException {
Aggregate aggregatedOutput = null;
int spillCount = 0;
DataBag outputBag = BagFactory.newDefaultBag();
DataBag values = (DataBag)input.get(0);
for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
Tuple tuple = iterator.next();
//spillCount++;
...
if (some condition regarding current input tuple){
//do something to aggregatedOutput with information from input tuple
} else {
//Because input tuple does not apply to current aggregateOutput
//return current aggregateOutput and apply input tuple
//to new aggregateOutput
Tuple returnTuple = aggregatedOutput.getTuple();
outputBag.add(returnTuple);
spillCount++;
aggregatedOutputTuple = new Aggregate(tuple);
if (spillCount == 1000) {
outputBag.spill();
spillCount = 0;
}
}
}
return outputBag;
}
请关注以下事实:每输入1000个元组,包就会溢出到磁盘上。我将此数字设置为低至50,高至100,000,但仍然收到内存错误:Pig logfile dump:
Backend error message
---------------------
Error: Java heap space
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
我该怎么解决?它正在处理大约一百万行。这是解决方案
使用累加器界面:
public class Foo extends EvalFunc<DataBag> implements Accumulator<DataBag> {
private DataBag outputBag = null;
private UltraAggregation currentAggregation = null;
public void accumulate(Tuple input) throws IOException {
DataBag values = (DataBag)input.get(0);
Aggregate aggregatedOutput = null;
outputBag = BagFactory.getInstance().newDefaultBag();
for (Iterator<Tuple> iterator = values.iterator(); iterator.hasNext();) {
Tuple tuple = iterator.next();
...
if (some condition regarding current input tuple){
//do something to aggregatedOutput with information from input tuple
} else {
//Because input tuple does not apply to current aggregateOutput
//return current aggregateOutput and apply input tuple
//to new aggregateOutput
outputBag.add(aggregatedOutput.getTuple());
aggregatedOutputTuple = new Aggregate(tuple);
}
}
}
// Called when all tuples from current key have been passed to accumulate
public DataBag getValue() {
//Add final current aggregation
outputBag.add(currentAggregation.getTuple());
return outputBag;
}
// This is called after getValue()
// Not sure if these commands are necessary as they are repeated in beginning of accumulate
public void cleanup() {
outputBag = null;
currentAggregation = null;
}
public DataBag exec(Tuple input) throws IOException {
// Same as above ^^ but this doesn't appear to ever be called.
}
public Schema outputSchema(Schema input) {
try {
return new Schema(new FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), bagSchema, DataType.BAG));
} catch {FrontendException e) {
e.printStackTrace();
return null;
}
}
class Aggregate {
...
public Tuple getTuple() {
Tuple output = TupleFactory.getInstance().newTuple(OUTPUT_TUPLE_SIZE);
try {
output.set(0, val);
...
} catch (ExecException e) {
e.printStackTrace();
return null;
}
}
...
}
}
最佳答案
每次追加到spillCount
时,都应增加outputBag
,而不是每次从迭代器中获取元组时,都应增加。仅当spillCount为1000的倍数且您的if条件不满足时,您才在进行溢出,这可能不会经常发生(取决于逻辑)。这可以解释为什么对于不同的泄漏阈值您看不出多少差异。
如果那不能解决您的问题,我将尝试扩展AccumulatorEvalFunc<DataBag>
。就您而言,您实际上不需要访问整个包。您的实现适合于累加器样式的实现,因为您只需要访问当前的元组。这可能会减少内存使用。本质上,您将拥有一个DataBag类型的实例变量来累积最终输出。您还将为aggregatedOutput
拥有一个实例变量,该变量将具有当前聚合。调用accumulate()
可能会1)更新当前聚合,或2)将当前聚合添加到aggregatedOutput
并开始新的聚合。这基本上遵循您的for循环的主体。