Question
I'm on Spark 1.3.
I would like to apply a function to each row of a dataframe. This function hashes each column of the row and returns a list of the hashes.
dataframe.map(row => row.toSeq.map(col => col.hashCode))
I get a NullPointerException when I run this code. I assume that this is related to SPARK-5063.
I can't think of a way to achieve the same result without using a nested map.
Answer
This isn't an instance of SPARK-5063, because you're not nesting RDD transformations; the inner .map() is being applied to a Scala Seq, not an RDD.
My hunch is that some rows in your data set contain null column values, so some of the col.hashCode calls are throwing NullPointerExceptions when you try to evaluate null.hashCode. To work around this, you need to take nulls into account when computing hash codes.
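To see the failure mode in isolation, here is a minimal Java sketch (outside Spark) reproducing what happens when .hashCode is invoked on a null column value:

```java
public class NullHashDemo {
    public static void main(String[] args) {
        Object col = null;  // stands in for a null column value in a Row
        try {
            col.hashCode(); // dereferencing a null reference throws immediately
        } catch (NullPointerException e) {
            System.out.println("NPE when calling hashCode on a null column value");
        }
    }
}
```

Inside a Spark job, this exception is thrown on the executor while evaluating the map function, so it surfaces as a failed task rather than at the line that defined the transformation.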
If you're running on a Java 7 JVM or higher (source), you can do
import java.util.Objects
dataframe.map(row => row.toSeq.map(col => Objects.hashCode(col)))
Alternatively, on earlier versions of Java you can do
dataframe.map(row => row.toSeq.map(col => if (col == null) 0 else col.hashCode))
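Outside Spark, a quick sketch confirming that the two null-safe variants agree: Objects.hashCode returns 0 for null and delegates to the object's own hashCode otherwise, which is exactly what the manual null check does. The sample row values here are made up for illustration:

```java
import java.util.Objects;

public class NullSafeHashDemo {
    public static void main(String[] args) {
        // A made-up "row" with a null column, mimicking row.toSeq
        Object[] row = { "a", null, 42 };
        for (Object col : row) {
            int viaObjects = Objects.hashCode(col);            // 0 for null, col.hashCode() otherwise
            int viaCheck = (col == null) ? 0 : col.hashCode(); // manual pre-Java-7 equivalent
            System.out.println(viaObjects + " == " + viaCheck);
        }
    }
}
```

Either variant makes the map function total over rows containing nulls, so the Spark job no longer fails on such rows.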