java - Apache Spark isEmpty returns false, but the collection is empty

I'm running into a problem with Apache Spark when calling isEmpty on a JavaRDD: it returns false even though the collection is empty.

Here is some sample code (modified, because it comes from my final-year project and I am not allowed to post the actual code):

sampleRdd = inputRdd.filter(someFilterFunction);
if (sampleRdd.isEmpty()) {
    return inputRdd.first();
} else {
    return sampleRdd.first(); // the stack trace points to this line
}

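The problem is that the condition is sometimes false: sampleRdd.isEmpty() returns false, meaning the RDD is not empty. Execution then reaches the return statement in the else branch, which tries to retrieve the first() element of the collection, but that call throws an exception: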

Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1314)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.first(RDD.scala:1311)
at org.apache.spark.api.java.JavaRDDLike$class.first(JavaRDDLike.scala:510)
at org.apache.spark.api.java.AbstractJavaRDDLike.first(JavaRDDLike.scala:47)
.
.
.

Am I missing something? I'm currently running this on my local machine, since it isn't fully developed yet.

Thanks

EDIT: To add more information: when I get this error, the JVM points to the line with sampleRdd.first(), so the initial inputRdd is not empty.

EDIT2: I added a few extra lines that print the size of inputRdd before the filter and the size of sampleRdd after the filter, like this:

System.out.println(inputRdd.count());  // Returns a nonzero positive int, e.g. 100
sampleRdd = inputRdd.filter(someFilterFunction);
System.out.println(sampleRdd.count()); // Returns an int, e.g. 1
System.out.println(sampleRdd.count()); // Sometimes returns a different int than the first call, e.g. 3
if (sampleRdd.isEmpty()) {
    return inputRdd.first();
} else {
    return sampleRdd.first(); // the stack trace points to this line
}

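I observed some very interesting behavior: sometimes inputRdd.count() returns 100, but the first sampleRdd.count() returns 1 and the second sampleRdd.count() returns 3, or some other number that differs from the first call. So it looks as if the size of sampleRdd changes between calls. I therefore suspect that it sometimes changes to 0 after the condition has been checked, so that the subsequent call to first() throws the error.

Any idea what could be causing this?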

Best answer

What if inputRdd is empty to begin with?

In that case, sampleRdd is empty as well. So sampleRdd.isEmpty() evaluates to true, and inputRdd.first() throws the exception.
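
The changing counts reported in EDIT2 hint at a second possibility. An RDD produced by filter is evaluated lazily and, unless it is cached, each action (count(), isEmpty(), first()) runs the filter again. If someFilterFunction is not deterministic (an assumption, since the question does not show it), isEmpty() can find a match while the following first() finds none. Below is a minimal plain-Java sketch of that effect, with no Spark dependency. LazyFiltered, unstablePredicate, checkThenAct and singleAction are hypothetical names made up for illustration, and the predicate is deliberately contrived so that the failure is reproducible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class LazyFilterSketch {

    /** Hypothetical stand-in for a filtered RDD: nothing is stored, every action re-runs the filter. */
    static final class LazyFiltered<T> {
        private final List<T> input;
        private final Predicate<T> predicate;

        LazyFiltered(List<T> input, Predicate<T> predicate) {
            this.input = input;
            this.predicate = predicate;
        }

        /** Re-evaluates the predicate over the input and returns at most n matches. */
        List<T> take(int n) {
            List<T> out = new ArrayList<>();
            for (T item : input) {
                if (out.size() >= n) break;
                if (predicate.test(item)) out.add(item);
            }
            return out;
        }

        boolean isEmpty() {
            return take(1).isEmpty();
        }

        T first() {
            List<T> head = take(1);
            if (head.isEmpty()) throw new UnsupportedOperationException("empty collection");
            return head.get(0);
        }
    }

    /** Contrived unstable predicate: accepts only on its very first invocation. */
    static Predicate<Integer> unstablePredicate() {
        AtomicInteger calls = new AtomicInteger();
        return x -> calls.getAndIncrement() == 0;
    }

    /** Check-then-act, as in the question: two actions, hence two evaluations of the filter. */
    static Integer checkThenAct(List<Integer> input) {
        LazyFiltered<Integer> sample = new LazyFiltered<>(input, unstablePredicate());
        if (sample.isEmpty()) {
            return input.get(0); // fails if the input itself is empty, as the answer notes
        }
        return sample.first(); // the second evaluation may find nothing
    }

    /** Single action: evaluate once, then branch on the materialized result. */
    static Optional<Integer> singleAction(List<Integer> input) {
        LazyFiltered<Integer> sample = new LazyFiltered<>(input, unstablePredicate());
        List<Integer> head = sample.take(1);
        if (!head.isEmpty()) {
            return Optional.of(head.get(0));
        }
        if (input.isEmpty()) {
            return Optional.empty(); // covers the empty-input case as well
        }
        return Optional.of(input.get(0));
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(10, 20, 30);
        try {
            checkThenAct(input);
        } catch (UnsupportedOperationException e) {
            System.out.println("checkThenAct failed: " + e.getMessage());
        }
        System.out.println("singleAction returned: " + singleAction(input));
    }
}
```

In Spark itself, the analogous approach would be to call sampleRdd.take(1), which returns a List, and branch on that list instead of calling isEmpty() followed by first(). Calling cache() on sampleRdd may also stabilize the result, though cached partitions can be evicted and recomputed, so it is not a guarantee.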
