Question
I'm on Spark 1.3.
I would like to apply a function to each row of a dataframe. This function hashes each column of the row and returns a list of the hashes.
dataframe.map(row => row.toSeq.map(col => col.hashCode))
I get a NullPointerException when I run this code. I assume that this is related to SPARK-5063.
I can't think of a way to achieve the same result without using a nested map.
Answer
This isn't an instance of SPARK-5063, because you're not nesting RDD transformations; the inner .map() is being applied to a Scala Seq, not an RDD.
My hunch is that some rows in your data set contain null column values, so some of the col.hashCode calls are throwing NullPointerExceptions when you try to evaluate null.hashCode. To work around this, you need to take nulls into account when computing hash codes.
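To see the failure mode in isolation, here is a minimal Java sketch (outside Spark) reproducing what happens when .hashCode is invoked on a null column value:

```java
public class NullHashDemo {
    public static void main(String[] args) {
        Object col = null;  // stands in for a null column value in a Row
        try {
            col.hashCode(); // dereferencing a null reference throws immediately
        } catch (NullPointerException e) {
            System.out.println("NPE when calling hashCode on a null column value");
        }
    }
}
```

Inside a Spark job, this exception is thrown on the executor while evaluating the map function, so it surfaces as a failed task rather than at the line that defined the transformation.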
If you're running on a Java 7 JVM or higher (source), you can do
import java.util.Objects
dataframe.map(row => row.toSeq.map(col => Objects.hashCode(col)))
Alternatively, on earlier versions of Java you can do
dataframe.map(row => row.toSeq.map(col => if (col == null) 0 else col.hashCode))
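Outside Spark, a quick sketch confirming that the two null-safe variants agree: Objects.hashCode returns 0 for null and delegates to the object's own hashCode otherwise, which is exactly what the manual null check does. The sample row values here are made up for illustration:

```java
import java.util.Objects;

public class NullSafeHashDemo {
    public static void main(String[] args) {
        // A made-up "row" with a null column, mimicking row.toSeq
        Object[] row = { "a", null, 42 };
        for (Object col : row) {
            int viaObjects = Objects.hashCode(col);            // 0 for null, col.hashCode() otherwise
            int viaCheck = (col == null) ? 0 : col.hashCode(); // manual pre-Java-7 equivalent
            System.out.println(viaObjects + " == " + viaCheck);
        }
    }
}
```

Either variant makes the map function total over rows containing nulls, so the Spark job no longer fails on such rows.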