Question
I'm using Spark with Java, and I have an RDD of 5 million rows. Is there a way to count the number of rows in my RDD? I've tried RDD.count(), but it takes a lot of time. I've seen that I can use the function fold, but I couldn't find Java documentation for it. Could you please show me how to use it, or show me another way to get the number of rows in my RDD?
Here is my code:
JavaPairRDD<String, String> lines = getAllCustomers(sc).cache();
JavaPairRDD<String, String> CFIDNotNull = lines.filter(notNull()).cache();
JavaPairRDD<String, Tuple2<String, String>> join = lines.join(CFIDNotNull).cache();
// I want to get the counts of these three RDDs
double count_ctid = (double) join.count();
double all = (double) lines.count();
double count_cfid = all - CFIDNotNull.count();
System.out.println("********** :" + count_cfid * 100 / all + "% and now : " + count_ctid * 100 / all + "%");
Thanks.
Recommended answer
You had the right idea: use rdd.count() to count the number of rows. There is no faster way.
I think the question you should have asked is: why is rdd.count() so slow?
The answer is that rdd.count() is an "action": an eager operation, because it has to return an actual number. The RDD operations you performed before count() were "transformations": they lazily turned one RDD into another. In effect, the transformations were not actually performed, just queued up. When you call count(), you force all the previous lazy operations to run. The input files have to be loaded, the map() and filter() calls executed, the shuffles performed, and so on, until the data finally exists and Spark can say how many rows it has.
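The lazy/eager split described above can be made visible with a small local sketch. Everything in it is an illustrative assumption rather than part of the question's code: the class name LazyCountDemo, the six-element input, the local[2] master, and the use of a LongAccumulator (Spark 2.0 or later) to count how many times the filter function actually runs.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class LazyCountDemo {

    /** Returns {filter calls before count(), filter calls after count(), row count}. */
    static long[] run() {
        // Hypothetical local setup, chosen only so the sketch runs without a cluster.
        SparkConf conf = new SparkConf().setAppName("lazy-count-demo").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            LongAccumulator calls = sc.sc().longAccumulator("filter calls");

            // Transformation: only recorded in the RDD lineage, nothing runs yet.
            JavaRDD<Integer> evens = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6))
                    .filter(x -> {
                        calls.add(1);
                        return x % 2 == 0;
                    });

            long before = calls.value(); // read before any action has been triggered
            long rows = evens.count();   // action: the filter is executed now
            long after = calls.value();  // read after the job has finished

            return new long[] {before, after, rows};
        }
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println("filter calls before count(): " + r[0]);
        System.out.println("filter calls after count():  " + r[1]);
        System.out.println("rows: " + r[2]);
    }
}
```

In a local run without task retries, the counter should stay at zero until count() is called, which is the point of the sketch: the cost attributed to count() is really the cost of everything queued up before it.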
Note that if you call count() twice, all of this happens twice. After the count is returned, all the data is discarded! To avoid this, call cache() on the RDD. The second call to count() will then be fast, and RDDs derived from it will also be faster to compute. However, the RDD will then have to be stored in memory (or on disk).
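The effect of cache() on repeated counts can be sketched the same way. This is a minimal illustration under stated assumptions, not the asker's pipeline: the class CountTwiceDemo, the method filterCalls, the tiny input, and the local[2] master are all hypothetical, and the accumulator only measures how often the filter function is evaluated across two count() calls.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class CountTwiceDemo {

    /** Counts the same filtered RDD twice and returns how often the filter function ran. */
    static long filterCalls(boolean useCache) {
        // Hypothetical local setup, chosen only so the sketch runs without a cluster.
        SparkConf conf = new SparkConf().setAppName("count-twice-demo").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            LongAccumulator calls = sc.sc().longAccumulator("filter calls");

            JavaRDD<Integer> evens = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6))
                    .filter(x -> {
                        calls.add(1);
                        return x % 2 == 0;
                    });

            if (useCache) {
                // Marks the RDD for storage; it is materialized by the first action.
                evens.cache();
            }

            evens.count(); // first action: always evaluates the filter
            evens.count(); // second action: recomputes unless the RDD was cached

            return calls.value();
        }
    }

    public static void main(String[] args) {
        System.out.println("filter calls without cache: " + filterCalls(false));
        System.out.println("filter calls with cache:    " + filterCalls(true));
    }
}
```

Assuming the cached partitions fit in memory and no tasks are retried, the uncached run should evaluate the filter once per element per count(), while the cached run should evaluate it only during the first count(). The trade-off is the one the answer names: the cached RDD occupies memory (or disk) for as long as it stays cached.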