This article covers the question "Where do you need to use lit() in Pyspark SQL?" and a recommended answer; hopefully it is a useful reference for anyone facing the same problem.

Problem Description

I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation.

Take for example this udf, which returns the element at a given index of a SQL array column:

# Returns the element of an array column at the given index.
def find_index(column, index):
    return column[index]

If I were to pass an integer into this I would get an error. I would need to pass a lit(n) value into the udf to get the correct element of the array.
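A minimal sketch of what that looks like in practice (the DataFrame and the column name letters here are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3],)], ["letters"])

# Register the function above as a UDF returning an integer element.
find_index = udf(lambda column, index: column[index], IntegerType())

# df.select(find_index(col("letters"), 1))            # TypeError: 1 is not a string or Column
df.select(find_index(col("letters"), lit(1))).show()  # OK: lit(1) is a literal Column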

Is there a place I can better learn the hard and fast rules of when to use lit, and possibly col as well?

Recommended Answer

To keep it simple: you need a Column (which can be one created using lit, but that is not the only option) whenever the JVM counterpart expects a Column and there is no internal conversion in the Python wrapper, or when you want to call a Column-specific method.

In the first case, the only strict rule is the one that applies to UDFs: a UDF (Python or JVM) can be called only with arguments of Column type. This also typically applies to functions from pyspark.sql.functions. In other cases, it is always best to check the documentation and docstrings first and, if that is not sufficient, the docs of the corresponding Scala counterpart.
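As an illustration of the "check the docs first" advice: some functions in pyspark.sql.functions do perform internal conversion. when(), for instance, documents its value argument as "a literal value, or a Column expression", so the following two expressions are equivalent (a small sketch; foo is an assumed column name):

from pyspark.sql.functions import col, lit, when

when(col("foo") > 0, 1).otherwise(-1)            # OK: plain values are wrapped internally
when(col("foo") > 0, lit(1)).otherwise(lit(-1))  # OK: equivalent, with explicit literals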

In the second case, the rules are simple. If you, for example, want to compare a column to a value, then the value has to be on the RHS:

col("foo") > 0  # OK

or the value has to be wrapped with a literal:

lit(0) < col("foo")  # OK

In Python, many operators (<, ==, <=, &, |, +, -, *, /) can accept a non-Column object on the LHS:

0 < col("foo")

but such applications are not supported in Scala.
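The reason is Python's reflected operators, not anything Spark-specific: with a scalar on the LHS, int.__lt__ returns NotImplemented, so Python falls back to the Column method on the RHS. A quick sketch:

from pyspark.sql.functions import col

# int.__lt__(Column) returns NotImplemented, so Python evaluates the
# reflected operation Column.__gt__(0) instead:
expr = 0 < col("foo")
print(expr)  # roughly: Column<'(foo > 0)'> -- the same expression as col("foo") > 0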

It goes without saying that you have to use lit if you want to access any of the pyspark.sql.Column methods while treating a standard Python scalar as a constant column. For example, you'll need

c = lit(1)

not

c = 1

to call

c.between(0, 3)  # type: pyspark.sql.Column
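A short usage sketch of that difference (the failing variant is left commented out; df is the hypothetical DataFrame from above):

from pyspark.sql.functions import lit

lit(1).between(0, 3)                    # OK: builds a Column expression
# (1).between(0, 3)                     # AttributeError: ints have no 'between'
df.select(lit(1).between(0, 3)).show()  # a constant true column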
