Is it possible to run Hadoop in pseudo-distributed operation without HDFS?

Problem description

I'm exploring the options for running a hadoop application on a local system.

As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.

The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.

The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.

So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
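
For illustration, the override the question has in mind would sit in conf/hadoop-site.xml roughly like this (a sketch of the idea only; note that the accepted answer below reaches the same goal by setting fs.default.name to file:/// rather than remapping fs.hdfs.impl):

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
</property>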

If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.

Has anyone tried this before? Can it work or is this idea much too far off the intended use?

Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?

Thank you for your insights.

Edit 2:

This is the conf/hadoop-site.xml config I created for Hadoop 0.18.3, using the answer provided by bajafresh4life.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:33301</value>
  </property>

  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>localhost:33302</value>
    <description>
    The job tracker http server address and port the server will listen on.
    If the port is 0 then the server will start on a free port.
    </description>
  </property>

  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>localhost:33303</value>
    <description>
    The task tracker http server address and port.
    If the port is 0 then the server will start on a free port.
    </description>
  </property>

</configuration>

Recommended answer

Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.

Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh. You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
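
As a concrete usage sketch, assuming the hadoop-site.xml shown above and the examples jar shipped with the 0.18.3 distribution (the input and output paths are placeholders):

bin/start-mapred.sh                                  # starts only the jobtracker and tasktracker
bin/hadoop jar hadoop-*-examples.jar wordcount \
    file:///data/input file:///data/output           # plain local (or NFS) paths, no upload step

Since fs.default.name is file:///, unqualified paths would also resolve against the local filesystem, so the file:// prefix is mostly there for clarity.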

We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.

