I'm having the following problem. I have 200k XML files in HDFS: 200 folders, each holding 2000 XML files. The structure is below:
RootFolder
  Folder001
    1.xml
    2.xml
    2000.xml
  Folder002
    2001.xml
I need to write a mapper program that reads the files and does some XPath processing.
If I give RootFolder as the input path, then each mapper should read one folder and process its XML files.
That is, there should be 200 tasks, and each folder should be read by a single mapper.
How do I process multiple folders?
From my understanding, you have 2 problems:
1: Need to map all files in a subfolder with a single map task:
Ans: You can make use of CombineFileInputFormat for this scenario. It will group files for a specified PathFilter (in your case, the filter should accept files from the same folder) and assign them to a single map task, i.e., one map task per folder can be achieved. For better control, extend CombineFileInputFormat and make it your own; that's what I did in my case.
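In Hadoop itself this means subclassing CombineFileInputFormat; the standalone sketch below only illustrates the grouping policy such a subclass enforces: bucket file paths by their parent folder so that each bucket corresponds to one map task. FolderGrouper and the sample paths are hypothetical names for illustration, not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the per-folder grouping a CombineFileInputFormat subclass
// would enforce: bucket file paths by parent directory, so each bucket
// can become the input of one map task. Plain strings stand in for
// HDFS Path objects here.
public class FolderGrouper {

    // Group full file paths by their parent folder, preserving order.
    public static Map<String, List<String>> groupByFolder(List<String> paths) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String path : paths) {
            int slash = path.lastIndexOf('/');
            String folder = slash >= 0 ? path.substring(0, slash) : "";
            groups.computeIfAbsent(folder, k -> new ArrayList<>()).add(path);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> paths = List.of(
            "/RootFolder/Folder001/1.xml",
            "/RootFolder/Folder001/2.xml",
            "/RootFolder/Folder002/2001.xml");
        Map<String, List<String>> groups = groupByFolder(paths);
        // One group (i.e., one would-be map task) per folder.
        System.out.println(groups.size()); // prints 2
    }
}
```

With 200 folders this yields 200 groups, matching the "one mapper per folder" requirement; in the real subclass, the PathFilter you hand to CombineFileInputFormat plays the role of the parent-folder key above.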
2: Need to include files inside the subfolders too as input for your map task(s), by specifying only the root folder.
Ans: In the newer API releases, FileInputFormat can pick up files recursively from subfolders at any depth. For more info, you can see the jira.
Or, if you want to do it yourself, subclass FileInputFormat and override the listStatus method.
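A recursive listStatus boils down to a depth-first walk that collects regular files and descends into directories. The standalone sketch below shows that walk against the local filesystem (RecursiveLister is a hypothetical name); in an actual override you would recurse with FileSystem.listStatus over HDFS Path objects instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of what a recursive listStatus override does: walk the
// directory tree depth-first and collect every regular file, however
// deeply nested.
public class RecursiveLister {

    public static List<Path> listFiles(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    result.addAll(listFiles(entry)); // descend into subfolder
                } else {
                    result.add(entry);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Rebuild a tiny version of the layout from the question.
        Path root = Files.createTempDirectory("RootFolder");
        Path f1 = Files.createDirectories(root.resolve("Folder001"));
        Path f2 = Files.createDirectories(root.resolve("Folder002"));
        Files.createFile(f1.resolve("1.xml"));
        Files.createFile(f1.resolve("2.xml"));
        Files.createFile(f2.resolve("2001.xml"));
        System.out.println(listFiles(root).size()); // prints 3
    }
}
```

In recent Hadoop releases you can usually avoid the subclass entirely and switch recursion on via configuration instead (the mapreduce.input.fileinputformat.input.dir.recursive flag, if I recall the property name correctly).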