Problem description
I have a problem with the following Spark scenario. I have a DataFrame in which one column contains an array holding a start and an end value, e.g.
[1000, 1010]
How can I create and compute another column containing an array that holds all the values in the given range? The generated range column should look like this:
+--------------+-------------+-----------------------------+
|   Description|     Accounts|                        Range|
+--------------+-------------+-----------------------------+
|       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
|       Range 2|   [200, 203]|         [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Thanks in advance.
Recommended answer
You can create a UDF for this. Suppose the DataFrame looks like this:
df.show
+-----------+----------+
|Description|  Accounts|
+-----------+----------+
|    Range 1|[100, 105]|
|    Range 2|[200, 203]|
+-----------+----------+
I have tried to cover a few of the possible edge cases here. You can add more if you see anything missing.
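import org.apache.spark.sql.functions.udf

// Expands an array [start, end] into the inclusive range start..end.
val createRange = udf { (xs: Seq[Int]) =>
  if (xs == null || xs.isEmpty) Array.empty[Int]   // null or empty array: nothing to expand
  else if (xs.length == 1) (0 to xs(0)).toArray    // single value: treat it as the end, start at 0
  else (xs(0) to xs(1)).toArray                    // start and end: inclusive range
}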
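The branching inside the UDF can also be checked without a Spark session. The following is a minimal sketch in plain Scala that mirrors the same cases; the names `RangeLogic` and `expandRange` are hypothetical, and the null guard is a defensive assumption rather than something the question requires. Note that a reversed pair such as [5, 1] yields an empty result, because `5 to 1` is an empty range in Scala.

```scala
object RangeLogic {
  // Hypothetical helper mirroring the UDF's branches, with no Spark dependency.
  def expandRange(xs: Seq[Int]): Seq[Int] =
    if (xs == null || xs.isEmpty) Seq.empty[Int]   // assumed guard: null or empty input
    else if (xs.length == 1) (0 to xs(0)).toList   // single value: range from 0 to that value
    else (xs(0) to xs(1)).toList                   // two values: inclusive start..end

  def main(args: Array[String]): Unit = {
    println(expandRange(Seq(101, 105))) // List(101, 102, 103, 104, 105)
    println(expandRange(Seq(3)))        // List(0, 1, 2, 3)
    println(expandRange(Seq.empty))     // List()
    println(expandRange(Seq(5, 1)))     // List() since 5 to 1 is empty
  }
}
```

If reversed ranges or arrays with more than two elements can occur in your data, this is the place to decide how they should be handled before wrapping the logic in a UDF.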
Call the createRange UDF on your DataFrame and pass it the Accounts array column:
df.withColumn("Range" , createRange($"Accounts") ).show(false)
+-----------+----------+------------------------------+
|Description|Accounts  |Range                         |
+-----------+----------+------------------------------+
|Range 1    |[100, 105]|[100, 101, 102, 103, 104, 105]|
|Range 2    |[200, 203]|[200, 201, 202, 203]          |
+-----------+----------+------------------------------+
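As an alternative to a UDF, Spark 2.4 and later ship a built-in `sequence` function in `org.apache.spark.sql.functions` that produces the same kind of array. The sketch below assumes such a Spark version and that every Accounts array holds exactly two elements; the object name `SequenceExample`, the method `addRange`, and the local-mode session settings are hypothetical choices made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sequence}

object SequenceExample {
  // Adds a "Range" column built from the first and second elements of "Accounts".
  // Assumes Spark 2.4+ and two-element arrays (start, end).
  def addRange(df: DataFrame): DataFrame =
    df.withColumn("Range", sequence(col("Accounts")(0), col("Accounts")(1)))

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sequence-example")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df = Seq(
      ("Range 1", Seq(100, 105)),
      ("Range 2", Seq(200, 203))
    ).toDF("Description", "Accounts")

    addRange(df).show(false)
    spark.stop()
  }
}
```

This version does not reproduce the UDF's special handling of empty or single-element arrays, so the UDF remains the more flexible choice when those cases matter.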