
Question

As I understand it, rows inserted into an HBase table are stored as regions on different region servers. So the region servers store the data.

Similarly, in Hadoop, data is stored on the data nodes of the Hadoop cluster.

Let's say that I have HBase 0.90.6 configured on top of Hadoop 1.1.1 as follows:

2 nodes: master and slave

  1. The master node acts as:
    • Hadoop - NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker
    • HBase - Master, RegionServer and ZooKeeper
  2. The slave node acts as:
    • Hadoop - DataNode and TaskTracker
    • HBase - RegionServer

If table data is stored on the region servers, as I described above, then what are the roles of the data nodes and the region servers?

Solution

Data nodes store the data. Region servers essentially buffer I/O operations; the data itself is stored permanently on HDFS (that is, on the data nodes). Also, I do not think that putting a region server on your 'master' node is a good idea.

Here is a simplified picture of how regions are managed:

You have a cluster running HDFS (NameNode + DataNodes) with a replication factor of 3 (each HDFS block is copied to 3 different DataNodes).

You run RegionServers on the same servers as the DataNodes. When a write request reaches a RegionServer, it first writes the change to memory (the MemStore) and to the commit log; then at some point it decides that it is time to write the changes to permanent storage on HDFS. Here is where data locality comes into play: since the RegionServer and a DataNode run on the same server, the first replica of each HDFS block of the file is written to that same server. The two other replicas are written to, well, other DataNodes. As a result, the RegionServer serving the region will almost always have access to a local copy of the data.
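The write path described above can be sketched as a toy model. This is not HBase or HDFS code: `ToyHdfs`, `ToyRegionServer`, the host names and `flush_threshold` are all made up for illustration, and the replica placement is reduced to "first replica on the writer's host, the rest on other hosts".

```python
import random


class ToyHdfs:
    """Toy stand-in for HDFS: records which hosts hold each file's replicas."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.files = {}  # file name -> (rows, list of replica hosts)

    def write(self, name, rows, writer_host):
        # The first replica goes to the writer's own DataNode if it has one;
        # the remaining replicas go to other DataNodes.
        local = [writer_host] if writer_host in self.datanodes else []
        others = [d for d in self.datanodes if d != writer_host]
        count = min(self.replication - len(local), len(others))
        replicas = local + random.sample(others, count)
        self.files[name] = (dict(rows), replicas)
        return replicas


class ToyRegionServer:
    """Toy RegionServer: buffers writes, then flushes them to the toy HDFS."""

    def __init__(self, host, hdfs, flush_threshold=3):
        self.host = host
        self.hdfs = hdfs
        self.flush_threshold = flush_threshold  # hypothetical: rows per flush
        self.wal = []       # commit log (kept in memory only in this toy)
        self.memstore = {}  # in-memory buffer of recent writes
        self.flushed = []   # names of store files written to HDFS

    def put(self, row, value):
        self.wal.append((row, value))  # 1. record the change in the log
        self.memstore[row] = value     # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()               # 3. eventually persist to HDFS

    def flush(self):
        name = f"{self.host}-storefile-{len(self.flushed)}"
        replicas = self.hdfs.write(name, self.memstore, self.host)
        self.flushed.append(name)
        self.memstore = {}
        return replicas


hdfs = ToyHdfs(["node1", "node2", "node3", "node4"])
rs = ToyRegionServer("node2", hdfs)
for i in range(3):
    rs.put(f"row{i}", f"value{i}")

replicas = hdfs.files[rs.flushed[0]][1]
print(replicas[0])    # host of the first replica: the RegionServer's own host
print(len(replicas))  # number of replicas written
```

The point of the sketch is only the ordering (log, memory, then a flush to HDFS) and the fact that the flushing server ends up holding one replica of what it wrote.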

What if a RegionServer crashes, or the HBase Master decides to reassign a region to another RegionServer (to keep the cluster balanced)? The new RegionServer will be forced to perform remote reads at first, but as soon as a compaction is performed (rewriting the region's store files into a new file), the new file is written to HDFS by the new RegionServer, and a local copy is created on its server (again, because the DataNode and the RegionServer run on the same server).

Note: if a RegionServer crashes, the regions previously assigned to it will be spread across multiple RegionServers.
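To make the locality argument concrete, here is a small sketch that tracks what fraction of a region's files have a replica on the serving host before a move, after a move, and after a compaction. The helper names `locality` and `compact`, the file names and the host names are hypothetical, and replica placement is picked deterministically for simplicity.

```python
def locality(store_files, host):
    """Fraction of a region's store files that have a replica on `host`."""
    if not store_files:
        return 1.0
    local = sum(1 for replicas in store_files.values() if host in replicas)
    return local / len(store_files)


def compact(store_files, host, all_hosts, replication=3):
    """Rewrite all store files into one new file written by `host`.

    As on the write path, the first replica lands on the writer's host and
    the remaining replicas go to other hosts.
    """
    others = [h for h in all_hosts if h != host]
    return {"compacted": [host] + others[: replication - 1]}


hosts = ["node1", "node2", "node3", "node4"]

# Files written while node1 served the region, with their replica hosts.
region_files = {
    "file-a": ["node1", "node2", "node3"],
    "file-b": ["node1", "node3", "node4"],
}

before_move = locality(region_files, "node1")  # old server: everything local
after_move = locality(region_files, "node4")   # new server: some remote reads
region_files = compact(region_files, "node4", hosts)
after_compaction = locality(region_files, "node4")

print(before_move, after_move, after_compaction)  # prints: 1.0 0.5 1.0
```

In this toy run the new server starts with only half of the files local and gets back to full locality once it has rewritten them itself.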

Good reads:

  • Tom White, "Hadoop: The Definitive Guide" has a good explanation of the HDFS architecture. Unfortunately, I have not read the original Google GFS paper, so I cannot tell whether it is easy to follow.

  • The Google Bigtable article. HBase is an implementation of the Google Bigtable design, and I found the architecture description in this article the easiest to follow.

Here are the nomenclature differences between Google Bigtable and the HBase implementation (from Lars George, "HBase: The Definitive Guide"):

  • HBase - Bigtable
  • Region - Tablet
  • RegionServer - Tablet server
  • Flush - Minor compaction
  • Minor compaction - Merging compaction
  • Major compaction - Major compaction
  • Write ahead log - Commit log
  • HDFS - GFS
  • Hadoop MapReduce - MapReduce
  • MemStore - memtable
  • HFile - SSTable
  • ZooKeeper - Chubby

