Problem description
I'm setting up a Hadoop cluster on EC2 and I'm wondering how to do the DFS. All my data is currently in S3, and all map/reduce applications use S3 file paths to access the data. Now I've been looking at how Amazon's EMR is set up, and it appears that for each jobflow, a namenode and datanodes are set up. Now I'm wondering if I really need to do it that way, or if I could just use s3(n) as the DFS? If doing so, are there any drawbacks?
Thanks!
Recommended answer
In order to use S3 instead of HDFS, fs.default.name in core-site.xml needs to point to your bucket:
<property>
  <name>fs.default.name</name>
  <value>s3n://your-bucket-name</value>
</property>
It's recommended that you use S3N and NOT the simple S3 implementation, because S3N stores files as ordinary S3 objects, so they remain readable by any other application and by yourself :)
Also, in the same core-site.xml file, you need to specify the following properties (a full sketch follows the list):
- fs.s3n.awsAccessKeyId
- fs.s3n.awsSecretAccessKey
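Putting these together, a minimal core-site.xml sketch might look like the following. The bucket name and the credential values are placeholders, not real keys:

<configuration>
  <!-- Default filesystem: point at your S3 bucket instead of HDFS -->
  <property>
    <name>fs.default.name</name>
    <value>s3n://your-bucket-name</value>
  </property>
  <!-- AWS credentials used by the s3n filesystem (placeholder values) -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
  </property>
</configuration>

With these settings in place, your map/reduce jobs can keep referring to their input and output with s3n:// paths directly.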