Problem description
I have been learning Phoenix CSV Bulk Load recently, and I found that the source code of org.apache.phoenix.mapreduce.CsvToKeyValueReducer will cause an OOM (Java heap out of memory) when a row contains many large columns (in my case, 44 columns per row with an average row size of 4 KB).
What's more, this class is similar to the HBase bulk-load reducer class, KeyValueSortReducer, which means that an OOM may also happen when using KeyValueSortReducer in my case.
So, I have a question about KeyValueSortReducer: why does it need to sort all the KeyValues in a TreeSet first and only then write them to the context? If I remove the TreeSet sorting code and write all the KeyValues directly to the context, will the result be different or wrong?
I am looking forward to your reply. Best wishes to you!
Here is the source code of KeyValueSortReducer:
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class KeyValueSortReducer extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
  protected void reduce(ImmutableBytesWritable row, java.lang.Iterable<KeyValue> kvs,
      org.apache.hadoop.mapreduce.Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue>.Context context)
      throws java.io.IOException, InterruptedException {
    TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv : kvs) {
      try {
        map.add(kv.clone());
      } catch (CloneNotSupportedException e) {
        throw new java.io.IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (KeyValue kv : map) {
      context.write(row, kv);
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}
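For reference, the change being asked about would look roughly like the sketch below: the TreeSet is dropped and every KeyValue is written to the context in whatever order the shuffle delivers it. The class name UnsortedKeyValueReducer is made up for illustration; this is only a sketch of the proposed modification, not a recommendation, and the answer below explains why it does not work for bulk loading.

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that skips the per-row sort and forwards each
// KeyValue as it arrives from the shuffle.
public class UnsortedKeyValueReducer
    extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {

  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<KeyValue> kvs, Context context)
      throws IOException, InterruptedException {
    int index = 0;
    for (KeyValue kv : kvs) {
      // Written in arrival order, not in KeyValue.COMPARATOR order.
      context.write(row, kv);
      if (++index % 100 == 0) {
        context.setStatus("Wrote " + index);
      }
    }
  }
}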
Answer

Please have a look into this case: there is a requirement that the KeyValue pairs belonging to the same row be in sorted order when they are written into the HFile. The HFile format stores cells sorted by key, so the writer that the bulk-load output format sets up expects the KeyValues for each row to arrive already ordered by KeyValue.COMPARATOR (column family, then qualifier, then timestamp descending). The TreeSet in the reducer re-establishes that order; if you write the KeyValues to the context in arbitrary order instead, the HFile writer will reject any key that is not larger than the previous one and the bulk-load job will fail.
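As a minimal, self-contained illustration of that ordering (assuming the HBase client jars are on the classpath; the class name KeyValueOrderDemo is made up), the snippet below builds a few KeyValues for one row in arbitrary order and lets KeyValue.COMPARATOR sort them the same way the reducer does:

import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueOrderDemo {
  public static void main(String[] args) {
    byte[] row = Bytes.toBytes("row1");
    byte[] cf  = Bytes.toBytes("cf");

    // Cells created in arbitrary (insertion) order.
    KeyValue c2    = new KeyValue(row, cf, Bytes.toBytes("col2"), 1L, Bytes.toBytes("v2"));
    KeyValue c1    = new KeyValue(row, cf, Bytes.toBytes("col1"), 1L, Bytes.toBytes("v1"));
    KeyValue c1new = new KeyValue(row, cf, Bytes.toBytes("col1"), 2L, Bytes.toBytes("v1b"));

    // Same construct as in KeyValueSortReducer: a TreeSet ordered by KeyValue.COMPARATOR.
    TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    sorted.add(c2);
    sorted.add(c1);
    sorted.add(c1new);

    // Iteration order: col1 (ts=2), col1 (ts=1), col2 (ts=1).
    // Qualifiers ascend and timestamps descend within a qualifier,
    // which is the order the HFile writer expects to receive cells in.
    for (KeyValue kv : sorted) {
      System.out.println(kv);
    }
  }
}

The TreeSet in KeyValueSortReducer performs exactly this per-row sort before handing the cells on, which is why removing it would break HFile generation rather than merely change the output order.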