Problem description
I have an RDD (combinerRDD) to which I applied the transformation below:
JavaPairRDD<String, Integer> counts = combinerRDD.mapToPair(
        new PairFunction<Tuple2<LongWritable, Text>, String, Integer>() {
            // Populated elsewhere; its initialization is not shown in this snippet.
            Message message;

            @Override
            public Tuple2<String, Integer> call(Tuple2<LongWritable, Text> tuple) throws Exception {
                String filename = "New_File";
                int count = 0;
                // Count the job steps named NEW_STEP in the message.
                for (JobStep js : message.getJobStep()) {
                    if (js.getStepName().equals(StepName.NEW_STEP)) {
                        count += 1;
                    }
                }
                return new Tuple2<String, Integer>(filename, count);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer count1, Integer count2) throws Exception {
                // Sum the per-record counts for each filename key.
                return count1 + count2;
            }
        });
My question is: when combinerRDD has some data in it, I get the right result. But when combinerRDD is empty, the result written to HDFS is only an empty _SUCCESS file. I was expecting two files from a transformation on an empty RDD, i.e. _SUCCESS and an empty part-00000 file. Am I right? How many output files should I get?
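As far as I can tell, the number of part-NNNNN files produced by a save matches the number of partitions of the saved RDD, while _SUCCESS is written by the Hadoop output committer either way. A minimal sketch illustrating this (the output path is made up for the example):

// sc is an existing JavaSparkContext.
// No elements, but explicitly 2 partitions: saving this RDD yields
// _SUCCESS plus two empty files, part-00000 and part-00001.
JavaRDD<String> empty = sc.parallelize(Arrays.<String>asList(), 2);
empty.saveAsTextFile("/tmp/empty-rdd-demo");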
The reason I am asking is that I got different results on two clusters: the code run on cluster 1 produced only a _SUCCESS file, while cluster 2 produced _SUCCESS and an empty part-00000. I am confused now. Is the result dependent on any cluster setup?
Note: I am doing a left join, newRDD.leftOuterJoin(combinerRDD), which gives me no result (when combinerRDD's output has only _SUCCESS) even though newRDD contains values.
OK, so I found a solution. I am using spark-1.3.0, which has the issue below, i.e. a left outer join with an emptyRDD gives an empty result:
https://issues.apache.org/jira/browse/SPARK-9236
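For reference, a hypothetical repro sketch of that symptom (assuming Spark 1.x, whose Java API uses Guava's Optional for outer joins; sc is an existing JavaSparkContext):

JavaPairRDD<String, Integer> left = sc.parallelizePairs(
        Arrays.asList(new Tuple2<String, Integer>("a", 1),
                      new Tuple2<String, Integer>("b", 2)));

// Pair RDD built from emptyRDD(), as in the question.
JavaRDD<Tuple2<String, Integer>> emptyRdd = sc.emptyRDD();
JavaPairRDD<String, Integer> right = JavaPairRDD.fromJavaRDD(emptyRdd);

// Expected: one row per left key, with Optional.absent() on the right.
// Observed on Spark 1.3.0 (SPARK-9236): an empty result.
List<Tuple2<String, Tuple2<Integer, Optional<Integer>>>> joined =
        left.leftOuterJoin(right).collect();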
I was creating the empty pair RDD like below:
JavaRDD<Tuple2<LongWritable, Text>> emptyRDD = context.emptyRDD();
myRDD = JavaPairRDD.fromJavaRDD(emptyRDD);
Now, using:
List<Tuple2<LongWritable, Text>> data = Arrays.asList();
JavaRDD<Tuple2<LongWritable, Text>> emptyRDD = context.parallelize(data);
myRDD = JavaPairRDD.fromJavaRDD(emptyRDD);
It works now, i.e. my RDD is no longer treated as empty and the join returns results. The fix is available in versions 1.3.2, 1.4.2, and 1.5.0 (see the link above).
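If my understanding of the difference is right, context.emptyRDD() produces an RDD with zero partitions, whereas parallelize() on an empty list produces an RDD with the default number of partitions (and still zero elements), which is what sidesteps the join bug. A quick way to check (sketch, assuming an existing JavaSparkContext sc):

JavaRDD<Tuple2<LongWritable, Text>> viaEmptyRDD = sc.emptyRDD();
System.out.println(viaEmptyRDD.partitions().size());    // prints 0

List<Tuple2<LongWritable, Text>> data = Arrays.asList();
JavaRDD<Tuple2<LongWritable, Text>> viaParallelize = sc.parallelize(data);
System.out.println(viaParallelize.partitions().size()); // prints sc.defaultParallelism()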