This article describes how to select a specific set of columns from a Spark DataFrame.

Problem Description

I have loaded CSV data into a Spark DataFrame.

I need to slice this dataframe into two different dataframes, where each one contains a set of columns from the original dataframe.

How do I select a subset of columns into a new Spark dataframe?

Recommended Answer

If you want to split your dataframe into two different ones, do two selects on it, each with the columns you want in that dataframe.

 // Column names here are placeholders; substitute your actual column names.
 val sourceDf = spark.read.csv(...)
 val df1 = sourceDf.select("first column", "second column", "third column")
 val df2 = sourceDf.select("fourth column", "fifth column")

Note that this of course means that sourceDf would be evaluated twice, so if it can fit into distributed memory and you use most of the columns across both dataframes, it might be a good idea to cache it. If it has many extra columns that you don't need, then you can do a select on it first to keep only the columns you will need, so that it doesn't store all that extra data in memory.
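The pre-select-then-cache approach above can be sketched as follows. This is a minimal illustration, not code from the original answer: the CSV path, the `header` option, and the column names (`id`, `name`, `age`, `city`) are all assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("split-columns")
  .master("local[*]") // assumption: local mode for the sketch
  .getOrCreate()

// Keep only the columns that either resulting dataframe will need,
// then cache, so the CSV is read and parsed only once rather than
// once per downstream select.
val sourceDf = spark.read
  .option("header", "true")
  .csv("/path/to/data.csv") // hypothetical path
  .select("id", "name", "age", "city")
  .cache()

val df1 = sourceDf.select("id", "name")
val df2 = sourceDf.select("id", "age", "city")
```

Once both dataframes have been materialized, `sourceDf.unpersist()` frees the cached data.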
