hadoop - Why does HBase KeyValueSortReducer need to sort all KeyValues?

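I have been studying Phoenix CSV Bulk Load recently, and I found that the source code of org.apache.phoenix.mapreduce.CsvToKeyValueReducer can cause an OOM (Java heap out of memory) when a row has many columns (in my case, each row has 44 columns and the average row size is 4 KB).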

What's more, this class is similar to the HBase bulk load reducer class, KeyValueSortReducer. That means an OOM may also occur in my case when using KeyValueSortReducer.

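So I have a question about KeyValueSortReducer: why does it need to sort all the KeyValues in a TreeSet first and then write them all to the context? If I remove the TreeSet sorting code and write all the KeyValues directly to the context, will the result be different or wrong?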

I look forward to your reply. Best wishes!

Here is the source code of KeyValueSortReducer:

public class KeyValueSortReducer extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
      org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
  throws java.io.IOException, InterruptedException {
    TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv: kvs) {
      try {
        map.add(kv.clone());
      } catch (CloneNotSupportedException e) {
        throw new java.io.IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (KeyValue kv: map) {
      context.write(row, kv);
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}

Best answer

Please take a look at this case study. There are some requirements under which the key-value pairs within the same row of an HFile need to be sorted.
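To make the ordering requirement more concrete, here is a minimal sketch in plain Java. It does not use the HBase API: Cell, CELL_ORDER and OrderCheckingWriter are hypothetical stand-ins invented for illustration. It assumes two things that I believe hold for the real classes: the shuffle sorts only by the reducer key (the row), so the KeyValues of one row reach the reducer in no particular order, and the HFile writer refuses a key that is not larger than the one appended before it. Under those assumptions, dropping the TreeSet would most likely make the write fail rather than silently produce a different file.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class SortedWriteSketch {

    // Hypothetical stand-in for a KeyValue: only the key parts that matter for ordering.
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp;
        }
    }

    // Simplified model of the KeyValue ordering: row, family and qualifier
    // ascending, then newest timestamp first.
    static final Comparator<Cell> CELL_ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Toy writer that rejects any key that is not strictly larger than the previous one.
    static final class OrderCheckingWriter {
        private Cell last;
        final List<Cell> written = new ArrayList<>();

        void append(Cell cell) throws IOException {
            if (last != null && CELL_ORDER.compare(last, cell) >= 0) {
                throw new IOException("Key " + cell + " is not larger than previous key " + last);
            }
            written.add(cell);
            last = cell;
        }
    }

    // Mirrors the reducer as posted: collect the row's cells in a TreeSet, then write.
    static List<Cell> writeSorted(List<Cell> cells) throws IOException {
        TreeSet<Cell> sorted = new TreeSet<>(CELL_ORDER);
        sorted.addAll(cells);
        OrderCheckingWriter writer = new OrderCheckingWriter();
        for (Cell cell : sorted) {
            writer.append(cell);
        }
        return writer.written;
    }

    // Mirrors the reducer with the TreeSet removed: write in arrival order.
    static List<Cell> writeUnsorted(List<Cell> cells) throws IOException {
        OrderCheckingWriter writer = new OrderCheckingWriter();
        for (Cell cell : cells) {
            writer.append(cell);
        }
        return writer.written;
    }

    public static void main(String[] args) throws IOException {
        // Cells of one row, arriving in an order that is not sorted by qualifier.
        List<Cell> arrival = Arrays.asList(
                new Cell("row1", "cf", "name", 1L),
                new Cell("row1", "cf", "age", 1L),
                new Cell("row1", "cf", "city", 1L));

        System.out.println("sorted write: " + writeSorted(arrival));
        try {
            writeUnsorted(arrival);
        } catch (IOException e) {
            System.out.println("unsorted write failed: " + e.getMessage());
        }
    }
}
```

The sorted path buffers every cell of a row before writing, which is the memory cost the question describes. The sketch only shows why the order matters, not how to reduce that cost.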

Regarding "hadoop - Why does HBase KeyValueSortReducer need to sort all KeyValues?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37047145/