Problem description
So here's my situation:
I have a mapreduce job that uses HBase. My mapper takes one line of text input and updates HBase. I have no reducer, and I'm not writing any output to disk. I would like the ability to add more processing power to my cluster when I'm expecting a burst of utilization, and then scale back down when utilization decreases. Let's assume for the moment that I can't use Amazon or any other cloud provider; I'm running in a private cluster.
One solution would be to add new machines to my cluster when I need more capacity. However, I want to be able to add and remove these machines without any waiting or hassle. I don't want to rebalance HDFS every time I need to add or remove a node.
So it would seem that a good strategy would be to have a "core" cluster, where each machine is running a tasktracker AND a datanode, and when I need added capacity, I can spin up some "disposable" machines that are running tasktrackers, but NOT datanodes. Is this possible? If so, what are the implications?
I realize that a tasktracker running on a machine with no datanode won't have the benefit of data locality. But in practice, what does this mean? I'm imagining that, when scheduling a job on one of the "disposable" machines, the jobtracker will send a line of input over the network to the tasktracker, which then takes that line of input and feeds it directly to a Mapper, without writing anything to disk. Is this what happens?
Oh, and I'm using Cloudera cdh3u3. Don't know if that matters.
Answer
Not quite. The JobTracker assigns a TaskTracker a map task to process an input split. The JobTracker does not pass the data to the TaskTracker; rather, it passes the serialized split information (file name, start offset, and length). The TaskTracker runs the MapTask, and it is the MapTask that instantiates the InputFormat and the associated RecordReader for that split, passing the resulting input key/value pairs to the Mapper.
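To make that concrete, here is a minimal sketch, using the old org.apache.hadoop.mapred API that cdh3u3 ships, of roughly what happens on the TaskTracker side. It is not the actual MapTask source, and the HDFS path, offset, and length are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitReadSketch {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf();

        // This is essentially all the JobTracker ships to a TaskTracker:
        // a description of the split, not the data itself.
        FileSplit split = new FileSplit(
                new Path("hdfs:///input/part-00000"), // hypothetical file
                0L,                 // start offset within the file
                64L * 1024 * 1024,  // split length in bytes
                (String[]) null);   // preferred hosts, used only for scheduling

        // The MapTask then opens the split itself; the bytes are streamed
        // straight from whichever datanode holds them (local or remote).
        TextInputFormat inputFormat = new TextInputFormat();
        inputFormat.configure(conf);
        RecordReader<LongWritable, Text> reader =
                inputFormat.getRecordReader(split, conf, Reporter.NULL);

        LongWritable key = reader.createKey();
        Text value = reader.createValue();
        while (reader.next(key, value)) {
            // this is where map(key, value, ...) would be invoked
        }
        reader.close();
    }
}
```

So only the split description crosses the wire from the JobTracker; the record bytes themselves are read by the RecordReader directly from HDFS, not written to the TaskTracker's local disk on the input side.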
In the case where you don't have a local datanode, or you do have a local datanode but the data is not replicated on it, the data will be read across the network from another datanode (hopefully one that is rack-local, but it could still come from somewhere else).
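You can see where those reads would land by asking the NameNode for a file's block locations; this is the same information the JobTracker consults when it tries to schedule a map task near its data. A small sketch (the input path is hypothetical):

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input path; substitute a file of your own.
        FileStatus status = fs.getFileStatus(new Path("/input/part-00000"));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // If none of these hosts is the machine running the map task,
            // the read for that block goes over the network.
            System.out.println(block.getOffset() + " -> "
                    + Arrays.toString(block.getHosts()));
        }
    }
}
```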
You can see how often a data block was local to the task (node-local) or local to the rack in the Hadoop counters output for the job.
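For instance, with the old JobClient API you can pull those counters programmatically. This is a sketch under stated assumptions: the 0.20-era counter group and names ("org.apache.hadoop.mapred.JobInProgress$Counter" with DATA_LOCAL_MAPS and RACK_LOCAL_MAPS, shown in the web UI as "Data-local map tasks" / "Rack-local map tasks") plus a hypothetical job id, so verify the identifiers against your own JobTracker:

```java
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class LocalityCountersSketch {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());

        // Hypothetical job id; use the one printed when your job was submitted.
        RunningJob job = client.getJob(JobID.forName("job_201201010000_0001"));
        Counters counters = job.getCounters();

        // Assumed group/counter names for Hadoop 0.20-era JobTrackers.
        String group = "org.apache.hadoop.mapred.JobInProgress$Counter";
        long dataLocal = counters.findCounter(group, "DATA_LOCAL_MAPS").getCounter();
        long rackLocal = counters.findCounter(group, "RACK_LOCAL_MAPS").getCounter();

        System.out.println("Data-local map tasks: " + dataLocal);
        System.out.println("Rack-local map tasks: " + rackLocal);
    }
}
```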