We currently import data to HBase tables via Spark RDDs (pyspark) by using saveAsNewAPIHadoopDataset().
Is this function using the HBase bulk loading feature via MapReduce? In other words, is saveAsNewAPIHadoopDataset(), which writes directly to HBase, equivalent to using saveAsNewAPIHadoopFile() to write HFiles to HDFS and then invoking org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load them into HBase?
Here is an example snippet of our HBase loading routine:
conf = {"hbase.zookeeper.quorum": config.get(gethostname(),'HBaseQuorum'),
"zookeeper.znode.parent":config.get(gethostname(),'ZKznode'),
"hbase.mapred.outputtable": table_name,
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
spark_rdd.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
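For context, a minimal sketch (not from our production code) of an RDD whose elements match what these two converters expect, following the hbase_outputformat.py example that ships with Spark: each element is a (row key, [row key, column family, qualifier, value]) pair. The rows and column names below are made up, and sc is the SparkContext.

# Hypothetical rows, shaped for StringToImmutableBytesWritableConverter /
# StringListToPutConverter: (row key, [row key, column family, qualifier, value]).
rows = [
    ("row1", ["row1", "cf", "col_a", "value_a"]),
    ("row2", ["row2", "cf", "col_b", "value_b"]),
]
spark_rdd = sc.parallelize(rows)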
Not exactly. RDD.saveAsNewAPIHadoopDataset and RDD.saveAsNewAPIHadoopFile do almost the same thing; their APIs are just a little different. Each offers a different 'mechanism vs. policy' choice: saveAsNewAPIHadoopFile takes the output path and the output format, key, and value classes as explicit arguments, while saveAsNewAPIHadoopDataset expects all of that to be supplied through the Hadoop conf.
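To make that concrete, here is a minimal sketch (not from the original answer) of the same write expressed through saveAsNewAPIHadoopFile, reusing config, table_name, keyConv, valueConv and spark_rdd from the snippet above. The output format and key/value classes move from the conf dictionary into arguments; the path argument is required by the method but is only a placeholder here, since TableOutputFormat writes to the table named in the conf, not to a file system path.

# Sketch only: the same TableOutputFormat write via saveAsNewAPIHadoopFile.
out_conf = {"hbase.zookeeper.quorum": config.get(gethostname(), 'HBaseQuorum'),
            "zookeeper.znode.parent": config.get(gethostname(), 'ZKznode'),
            "hbase.mapred.outputtable": table_name}

spark_rdd.saveAsNewAPIHadoopFile(
    "/tmp/unused-output-path",  # required argument, ignored by TableOutputFormat
    "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    keyClass="org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    valueClass="org.apache.hadoop.io.Writable",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=out_conf)

Note that neither call is a bulk load by itself: with TableOutputFormat the rows go in as ordinary Put operations, while the bulk-load route asked about in the question is the separate write-HFiles-then-run-LoadIncrementalHFiles workflow.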