I am trying to run a Weka classifier on MapReduce, and loading an entire ARFF file of even 200 MB causes a heap space error. I therefore want to split the ARFF file into chunks, but the problem is that each chunk must keep the ARFF header information (the attribute declarations) so that a classifier can be run in each mapper. This is the code I used to try to split the data, but I could not get it to work efficiently:
List<InputSplit> splits = new ArrayList<InputSplit>();
for (FileStatus file : listStatus(job)) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    // number of bytes in this file
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    // make sure this is actually a valid file
    if (length != 0) {
        // set the number of splits to make. NOTE: the value can be changed to anything
        int count = job.getConfiguration().getInt("Run-num.splits", 1);
        for (int t = 0; t < count; t++) {
            // split the file and add each chunk to the list
            // (every FileSplit here starts at offset 0 and spans the whole
            // file, so this adds count identical copies of the full file
            // rather than count distinct chunks)
            splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
        }
    } else {
        // create an empty hosts array for zero-length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
    }
}
return splits;
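The inner loop above never advances the file offset, which is why it does not actually chunk the data. As a rough sketch (illustrative, untested), the loop could compute real byte ranges instead; getBlockIndex() is the protected helper inherited from FileInputFormat:

long chunkSize = length / count;
for (int t = 0; t < count; t++) {
    long start = (long) t * chunkSize;
    // the last chunk absorbs any remainder bytes
    long chunkLength = (t == count - 1) ? (length - start) : chunkSize;
    // pick the HDFS block holding the chunk's first byte, for data locality
    int blkIndex = getBlockIndex(blkLocations, start);
    splits.add(new FileSplit(path, start, chunkLength, blkLocations[blkIndex].getHosts()));
}

Byte-range splits still leave the ARFF header (the @relation/@attribute lines) only in the first chunk, which is the other half of the problem. One workaround is to let every mapper re-read the header from the start of the original file in setup(), then parse only its own split's data rows against it. A hypothetical sketch (the class and field names are mine; it assumes a line-oriented input format such as TextInputFormat and builds the chunk once per split in cleanup()):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import weka.core.Instances;

// Hypothetical mapper: re-reads the shared ARFF header, then rebuilds a
// self-contained ARFF chunk from the rows in its own split.
public class ArffChunkMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final StringBuilder header = new StringBuilder();
    private final StringBuilder rows = new StringBuilder();

    @Override
    protected void setup(Context context) throws IOException {
        // read the header (everything up to and including @data) from the
        // start of the original file, regardless of which split this is
        Path arff = ((FileSplit) context.getInputSplit()).getPath();
        FileSystem fs = arff.getFileSystem(context.getConfiguration());
        try (BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(arff)))) {
            String line;
            while ((line = r.readLine()) != null) {
                header.append(line).append('\n');
                if (line.trim().equalsIgnoreCase("@data")) break; // header ends here
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String line = value.toString().trim();
        // skip header lines (seen again in the first split), comments and blanks
        if (line.isEmpty() || line.startsWith("@") || line.startsWith("%")) return;
        rows.append(line).append('\n');
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        // rebuild a complete ARFF chunk: shared header + this split's data rows
        Instances chunk = new Instances(new StringReader(header.toString() + rows));
        if (chunk.classIndex() < 0) chunk.setClassIndex(chunk.numAttributes() - 1);
        // ...build and train the Weka classifier on chunk here...
    }
}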
Best answer
Have you tried this first?
In mapred-site.xml, add the following property:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>

This sets the memory allocation for the MR job's task JVMs.
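Note that on Hadoop 2 (MR2/YARN), mapred.child.java.opts is superseded by separate per-task properties. A sketch of the equivalent settings (the -Xmx value is illustrative and must fit within the container memory limits of your cluster):

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2048m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2048m</value>
</property>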
About java - Splitting an input ARFF file into smaller chunks to handle a very large data set: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30080643/