Problem Description
I want to filter a PySpark DataFrame with a SQL-like IN clause, as in
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: "(" expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is that I can't manually write the values into a, as they are extracted from another job.
How would I filter in this case?
Recommended Answer
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure, so a Python variable name means nothing inside it. If you want to pass a variable, you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
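One caveat with the formatting trick above: it relies on the Python tuple's repr happening to look like a SQL list, which breaks for a single-element tuple (repr(("foo",)) is "('foo',)", with a trailing comma that SQL rejects). A minimal sketch of rendering the IN list explicitly instead; the helper name in_clause is hypothetical, not part of any PySpark API:

```python
def in_clause(values):
    """Render a Python iterable as a SQL IN list, e.g. (1, 2, 3).

    Formatting each value with repr() avoids the trailing-comma
    problem that str.format(tuple) has for single-element tuples.
    """
    rendered = ", ".join(repr(v) for v in values)
    return "({0})".format(rendered)

query = "SELECT * FROM df WHERE v IN {0}".format(in_clause(("foo",)))
print(query)  # SELECT * FROM df WHERE v IN ('foo')
```

This only sidesteps the syntax edge case; the security caveat below still applies to any string-built SQL.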
Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
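What isin computes is plain set membership per row, which is why it takes a Python collection directly and needs no string building. A minimal pure-Python sketch of the same filter over the example data (the rows and values are taken from the snippet above; this is an illustration of the semantics, not PySpark code):

```python
# The example rows from the answer, as (k, v) pairs.
rows = [(1, "foo"), (2, "x"), (3, "bar")]

# The dynamic values could come from any other job; here a set literal.
allowed = {"foo", "bar"}

# isin keeps a row when its value is a member of the collection.
matching = [r for r in rows if r[1] in allowed]
print(len(matching))  # 2
```

Because the collection is an ordinary Python object, values produced by another job can be passed straight in, with no quoting or formatting concerns.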
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.