Problem description
I have a job with only a mapper, PrepareData, which converts text data into a SequenceFile with VLongWritable as the key and DoubleArrayWritable as the value.
When I run it over 455000x90 (~384 MB) of data,
in local mode it takes on average:
- 51 seconds on an Athlon 64 X2 Dual Core 5600+, 2.79 GHz;
- 54 seconds on an Athlon 64 Processor 3700+, 1 GHz;
=> 52-53 seconds on average.
But when I run it on a real cluster made of these 2 machines (Athlon 64 X2 Dual Core 5600+ and 3700+), it takes 81 seconds in the best case.
The job was executed with 4 mappers (block size ~96 MB) and 2 reducers.
The cluster runs Hadoop 0.21.0 and is configured for JVM reuse.
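Mapper:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.VLongWritable;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PrepareDataMapper
            extends Mapper<LongWritable, Text, VLongWritable, DoubleArrayWritable> {

        // Number of values per row, read from the job configuration.
        private int size;

        // Output objects are allocated once and reused for every record.
        private DoubleWritable[] doubleArray;
        private DoubleArrayWritable mapperOutArray = new DoubleArrayWritable();
        private VLongWritable mapOutKey = new VLongWritable();

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            size = conf.getInt("dataDimSize", 0);
            doubleArray = new DoubleWritable[size];
            for (int i = 0; i < size; i++) {
                doubleArray[i] = new DoubleWritable();
            }
        }

        @Override
        public void map(
                LongWritable key,
                Text row,
                Context context) throws IOException, InterruptedException {
            // Each input line holds comma-separated doubles.
            String[] fields = row.toString().split(",");
            for (int i = 0; i < size; i++) {
                doubleArray[i].set(Double.parseDouble(fields[i]));
            }
            mapperOutArray.set(doubleArray);
            // The key is the byte offset of the line in the input file.
            mapOutKey.set(key.get());
            context.write(mapOutKey, mapperOutArray);
        }
    }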
DoubleArrayWritable:

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.DoubleWritable;

    public class DoubleArrayWritable extends ArrayWritable {

        public DoubleArrayWritable() {
            super(DoubleWritable.class);
        }

        public DoubleArrayWritable(DoubleWritable[] values) {
            super(DoubleWritable.class, values);
        }

        public void set(DoubleWritable[] values) {
            super.set(values);
        }

        public DoubleWritable get(int idx) {
            return (DoubleWritable) get()[idx];
        }

        // Copies the elements in the inclusive range [from, to] into a primitive array.
        public double[] getVector(int from, int to) {
            int sz = to - from + 1;
            double[] vector = new double[sz];
            for (int i = from; i <= to; i++) {
                vector[i - from] = get(i).get();
            }
            return vector;
        }
    }
Solution
My guess is that the difference lies in the job start-up time. In local mode it is a few seconds, while on a cluster it is usually tens of seconds.
To verify this assumption, you can feed in more data and check whether the cluster then performs better than a single node.
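To see why more data should tip the balance, here is a rough cost model sketched in plain Java. It assumes run time is a fixed start-up cost plus processing time that scales linearly with input size; the class `StartupModel`, the 40-second start-up cost, and the 2x cluster speedup are hypothetical illustration values, not measurements. Only the per-megabyte rate is derived from the question's local timings (about 52.5 seconds for 384 MB).

```java
// A minimal sketch of the start-up overhead argument; not a benchmark.
final class StartupModel {

    // Local mode: assumed to have negligible start-up cost.
    static double localSeconds(double dataMb, double secPerMb) {
        return dataMb * secPerMb;
    }

    // Cluster mode: fixed start-up cost plus work divided by the effective speedup.
    static double clusterSeconds(double dataMb, double secPerMb,
                                 double startupSec, double speedup) {
        return startupSec + dataMb * secPerMb / speedup;
    }

    // Input size at which both modes take the same time (requires speedup > 1).
    static double breakEvenMb(double secPerMb, double startupSec, double speedup) {
        return startupSec / (secPerMb * (1.0 - 1.0 / speedup));
    }

    public static void main(String[] args) {
        double secPerMb = 52.5 / 384.0; // from the local-mode timings in the question
        double startupSec = 40.0;       // hypothetical cluster start-up cost
        double speedup = 2.0;           // hypothetical effective parallel speedup

        System.out.printf("local, 384 MB:   %.1f s%n", localSeconds(384, secPerMb));
        System.out.printf("cluster, 384 MB: %.1f s%n",
                clusterSeconds(384, secPerMb, startupSec, speedup));
        System.out.printf("break-even:      %.0f MB%n",
                breakEvenMb(secPerMb, startupSec, speedup));
    }
}
```

Under these assumed numbers the cluster loses at 384 MB and only wins on larger inputs, which is the behaviour the experiment above would look for.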
Another possible cause: you might not have enough mappers to fully utilize your hardware. I would suggest trying a number of mappers equal to twice the number of cores you have.
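The mapper count suggested above can be turned into a target split size with simple arithmetic. The sketch below is plain Java with no Hadoop dependency; the class `SplitSizing` and its methods are hypothetical helpers, and the core count of 3 is an assumption (a dual-core 5600+ plus a single-core 3700+). In a real job, the resulting value would be a candidate for the maximum input split size setting, for example via `FileInputFormat.setMaxInputSplitSize` in the new MapReduce API.

```java
// A minimal sketch: derive a split size from a target number of mappers.
final class SplitSizing {

    // Rule of thumb from the answer: twice the total number of cores.
    static int targetMappers(int totalCores) {
        return 2 * totalCores;
    }

    // Split size in bytes, rounded up, so the input yields about `mappers` splits.
    static long splitSizeBytes(long inputBytes, int mappers) {
        return (inputBytes + mappers - 1) / mappers;
    }

    public static void main(String[] args) {
        long inputBytes = 384L * 1024 * 1024; // ~384 MB input from the question
        int totalCores = 3;                   // assumption: 2 cores + 1 core
        int mappers = targetMappers(totalCores);
        long split = splitSizeBytes(inputBytes, mappers);
        // prints "6 mappers, split size 64 MB"
        System.out.println(mappers + " mappers, split size "
                + split / (1024 * 1024) + " MB");
    }
}
```

With the current ~96 MB blocks the job gets 4 mappers; a 64 MB split would give 6, at the cost of a little more per-task overhead.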