Problem description
I have a big dataframe with millions of rows, as follows:
A B C Eqn
12 3 4 A+B
32 8 9 B*C
56 12 2 A+B*C
How can I evaluate the expressions in the Eqn column?
Recommended answer
You could create a custom UDF that evaluates these arithmetic expressions:
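import org.apache.spark.sql.functions.{col, udf}

// Substitutes the column values into the equation, then folds over the tokens left to right.
val evalUDF = udf((a: Int, b: Int, c: Int, eqn: String) => {
  // Splitting on word boundaries turns "56+12*2" into List("56", "+", "12", "*", "2").
  val eqnParts = eqn
    .replace("A", a.toString)
    .replace("B", b.toString)
    .replace("C", c.toString)
    .split("""\b""")
    .toList

  // The accumulator is (running total, pending operator); "" means no operator is pending.
  val (sum, _) = eqnParts.tail.foldLeft((eqnParts.head.toInt, "")) {
    case ((runningTotal, "+"), num) => (runningTotal + num.toInt, "")
    case ((runningTotal, "-"), num) => (runningTotal - num.toInt, "")
    case ((runningTotal, "*"), num) => (runningTotal * num.toInt, "")
    case ((runningTotal, _), op)    => (runningTotal, op)
  }
  sum
})

// evalDf is the dataframe from the question.
evalDf
  .withColumn("eval", evalUDF(col("A"), col("B"), col("C"), col("Eqn")))
  .show()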
Output:
+---+---+---+-----+----+
| A| B| C| Eqn|eval|
+---+---+---+-----+----+
| 12| 3| 4| A+B| 15|
| 32| 8| 9| B*C| 72|
| 56| 12| 2|A+B*C| 136|
+---+---+---+-----+----+
As you can see, this works, but it is very fragile (spaces, unknown operators, and so on will break the code) and it does not respect the order of operations: the tokens are folded strictly left to right, so A+B*C in the last row is evaluated as (56+12)*2 = 136 rather than 56+(12*2) = 80.
So you could write all of that yourself, or perhaps find a library that already does it (for example https://gist.github.com/daixque/1610753).
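As a rough idea of what "writing it yourself" might look like, here is a minimal sketch of a recursive-descent evaluator in plain Scala that tolerates spaces and respects operator precedence. The names EqnEval, tokenize and evaluate are made up for this illustration and are not part of Spark or of the linked gist. The sketch assumes the expressions only use integer literals, column names, +, -, *, unary minus and parentheses, and it uses Long arithmetic.

```scala
object EqnEval {
  // One token: an integer literal, a column name, an operator or a parenthesis.
  private val TokenPattern = """\d+|[A-Za-z_]\w*|[-+*()]""".r

  // Splits "A + B*C" into List("A", "+", "B", "*", "C"); unknown characters are rejected.
  def tokenize(input: String): List[String] = {
    def loop(rest: String, acc: List[String]): List[String] = {
      val s = rest.trim
      if (s.isEmpty) acc.reverse
      else TokenPattern.findPrefixOf(s) match {
        case Some(tok) => loop(s.drop(tok.length), tok :: acc)
        case None      => throw new IllegalArgumentException(s"Unexpected input: ${s}")
      }
    }
    loop(input, Nil)
  }

  // Evaluates eqn, looking column names up in vars. Each helper returns (value, remaining tokens).
  def evaluate(eqn: String, vars: Map[String, Long]): Long = {
    // factor := number | name | "(" expr ")" | "-" factor
    def factor(ts: List[String]): (Long, List[String]) = ts match {
      case "(" :: rest =>
        expr(rest) match {
          case (v, ")" :: tail) => (v, tail)
          case _                => throw new IllegalArgumentException("Missing ')'")
        }
      case "-" :: rest =>
        val (v, tail) = factor(rest)
        (-v, tail)
      case tok :: rest if tok.head.isDigit   => (tok.toLong, rest)
      case tok :: rest if vars.contains(tok) => (vars(tok), rest)
      case tok :: _ => throw new IllegalArgumentException(s"Unexpected token: ${tok}")
      case Nil      => throw new IllegalArgumentException("Unexpected end of expression")
    }

    // term := factor ("*" factor)*  -- multiplication binds tighter than + and -
    def term(ts: List[String]): (Long, List[String]) = {
      def loop(acc: Long, rest: List[String]): (Long, List[String]) = rest match {
        case "*" :: tail =>
          val (v, next) = factor(tail)
          loop(acc * v, next)
        case _ => (acc, rest)
      }
      val (first, remaining) = factor(ts)
      loop(first, remaining)
    }

    // expr := term (("+" | "-") term)*
    def expr(ts: List[String]): (Long, List[String]) = {
      def loop(acc: Long, rest: List[String]): (Long, List[String]) = rest match {
        case "+" :: tail =>
          val (v, next) = term(tail)
          loop(acc + v, next)
        case "-" :: tail =>
          val (v, next) = term(tail)
          loop(acc - v, next)
        case _ => (acc, rest)
      }
      val (first, remaining) = term(ts)
      loop(first, remaining)
    }

    expr(tokenize(eqn)) match {
      case (value, Nil)  => value
      case (_, tok :: _) => throw new IllegalArgumentException(s"Unexpected token: ${tok}")
    }
  }

  def main(args: Array[String]): Unit = {
    // Values of the third row from the question.
    val row = Map("A" -> 56L, "B" -> 12L, "C" -> 2L)
    println(evaluate("A + B*C", row))   // 56 + (12 * 2) = 80
    println(evaluate("(A + B) * C", row)) // (56 + 12) * 2 = 136
  }
}
```

To apply something like this to the dataframe, the call could presumably be wrapped in a UDF in the same way as evalUDF above, building the Map from the A, B and C arguments. Malformed expressions throw an IllegalArgumentException here; in a real job you might prefer to return null or an Option instead, so that one bad row does not fail the whole task.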
The performance overhead may be very large (especially if you start using recursive parsers), but at least you can evaluate the expressions on the dataframe instead of collecting it first.