Question
I've got thousands of files to process, and each file consists of thousands of XML documents concatenated together.
I'd like to use Hadoop to split each file into its separate XML documents. What would be a good way of doing this?
NOTE: I am a total Hadoop newbie. I plan on using Amazon EMR.
Recommended answer
Check out Mahout's XmlInputFormat. It's a shame that this is in Mahout and not in the core distribution.
Are the concatenated XML documents at least in the same format? If so, set START_TAG_KEY and END_TAG_KEY to the root tag of your documents. Each embedded document will then show up as one Text record in your mapper, and you can use your favorite Java XML parser to finish the job.
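To make that concrete, here is a minimal sketch of a map-only job built on XmlInputFormat. The <record> root tag and the XmlSplitJob/XmlSplitMapper class names are assumptions for illustration; substitute your documents' actual root element. Note also that XmlInputFormat's package has moved between Mahout releases (org.apache.mahout.text.wikipedia in later versions), so adjust the import to match the version you depend on.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.mahout.text.wikipedia.XmlInputFormat;
import org.xml.sax.InputSource;

public class XmlSplitJob {

  // Each map() call receives the text of one complete XML document,
  // i.e. everything between the configured start and end tags inclusive.
  public static class XmlSplitMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text xml, Context context)
        throws IOException, InterruptedException {
      try {
        // Parse with a standard Java XML parser just to validate the record;
        // replace this with whatever per-document processing you need.
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.parse(new InputSource(new StringReader(xml.toString())));
        context.write(xml, NullWritable.get());
      } catch (Exception e) {
        // Count and skip malformed records rather than failing the job.
        context.getCounter("XmlSplit", "MALFORMED").increment(1);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell XmlInputFormat where each embedded document begins and ends.
    // "<record>" is an assumed root tag; use your documents' real one.
    conf.set(XmlInputFormat.START_TAG_KEY, "<record>");
    conf.set(XmlInputFormat.END_TAG_KEY, "</record>");

    Job job = Job.getInstance(conf, "xml-split");
    job.setJarByClass(XmlSplitJob.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(XmlSplitMapper.class);
    job.setNumReduceTasks(0); // map-only: one output record per document
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

On EMR you would package this as a JAR and run it as a custom JAR step. This sketch writes each document as one record in the job's part files; if you genuinely need one physical file per document, you could swap in MultipleOutputs, but for feeding downstream MapReduce jobs one record per document is usually enough.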