Problem Description
I want to filter a PySpark DataFrame with a SQL-like IN clause, as in
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: "(" expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is that I can't manually write the values into a, as they are extracted from another job.
How would I filter in this case?
Recommended Answer
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure, so a Python variable name means nothing inside it. If you want to pass a variable, you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
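One caveat with the formatting trick above: it relies on the Python tuple's repr happening to look like a SQL list, which breaks for a single-element tuple (repr(("foo",)) is "('foo',)", with a trailing comma that SQL rejects). A minimal sketch of rendering the IN list explicitly instead; the helper name in_clause is hypothetical, not part of any PySpark API:

```python
def in_clause(values):
    """Render a Python iterable as a SQL IN list, e.g. (1, 2, 3).

    Formatting each value with repr() avoids the trailing-comma
    problem that str.format(tuple) has for single-element tuples.
    """
    rendered = ", ".join(repr(v) for v in values)
    return "({0})".format(rendered)

query = "SELECT * FROM df WHERE v IN {0}".format(in_clause(("foo",)))
print(query)  # SELECT * FROM df WHERE v IN ('foo')
```

This only sidesteps the syntax edge case; the security caveat below still applies to any string-built SQL.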
Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
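What isin computes is plain set membership per row, which is why it takes a Python collection directly and needs no string building. A minimal pure-Python sketch of the same filter over the example data (the rows and values are taken from the snippet above; this is an illustration of the semantics, not PySpark code):

```python
# The example rows from the answer, as (k, v) pairs.
rows = [(1, "foo"), (2, "x"), (3, "bar")]

# The dynamic values could come from any other job; here a set literal.
allowed = {"foo", "bar"}

# isin keeps a row when its value is a member of the collection.
matching = [r for r in rows if r[1] in allowed]
print(len(matching))  # 2
```

Because the collection is an ordinary Python object, values produced by another job can be passed straight in, with no quoting or formatting concerns.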
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.