Problem description
I have the following explode query, which works fine:
data1 = sqlContext.sql("select explode(names) as name from data")
I want to explode another field "colors", so the final output could be the cartesian product of names and colors. So I did:
data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")
But I got this error:
Only one generator allowed per select but Generate and and Explode found.;
Any ideas?
I can actually make it work by doing two steps:
# keep colors so it can be exploded in the second step
data1 = sqlContext.sql("select explode(names) as name, colors from data")
data1.registerTempTable('data1')
data1 = sqlContext.sql("select name, explode(colors) as color from data1")
But I am wondering if it is possible to do it in one step? Thanks a lot!
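For readers who want to reproduce this, here is a minimal sketch of the setup; the sample schema and values are assumptions, not part of the original question, and the SparkSession API shown is the modern equivalent of sqlContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

# Two array columns; exploding both should yield their cartesian product.
df = spark.createDataFrame(
    [(["alice", "bob"], ["red", "blue"])],
    ["names", "colors"],
)
df.createOrReplaceTempView("data")

# A single explode per select works fine:
spark.sql("select explode(names) as name from data").show()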
Recommended answer
The correct syntax is:
select name, color
from data
lateral view explode(names) exploded_names as name
lateral view explode(colors) exploded_colors as color
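Run against the sketch above, this answers the question in one step, e.g.:

data1 = spark.sql("""
    select name, color
    from data
    lateral view explode(names) exploded_names as name
    lateral view explode(colors) exploded_colors as color
""")
data1.show()

Each row of data is joined with one row per element of names, and each of those rows with one row per element of colors, which yields the cartesian product the question asks for.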
The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW.
Think of it this way: LATERAL VIEW works like an implicit JOIN, with an ephemeral table created for every row from the structs in the collection being "viewed". So, the way to parse the syntax is:
LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...
Multiple output columns
If you use a table generating function such as posexplode() then you still have one output table, but with multiple output columns:
LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
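A short sketch of this (the orders_data table and items column are assumptions for illustration):

df = spark.createDataFrame([(1, ["book", "pen"])], ["id", "items"])
df.createOrReplaceTempView("orders_data")

# posexplode() emits two columns per element: its position and its value.
spark.sql("""
    select id, item_number, item
    from orders_data
    lateral view posexplode(items) exploded_items as item_number, item
""").show()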
Nesting
You can also "nest" LATERAL VIEW by repeatedly exploding nested collections, e.g.,
LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
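A sketch of the nested case, assuming orders is an array of structs that each carry an items array (the alias order_row is used here to sidestep the ORDER keyword):

from pyspark.sql import Row

df = spark.createDataFrame(
    [(1, [Row(items=["book", "pen"]), Row(items=["mug"])])],
    ["id", "orders"],
)
df.createOrReplaceTempView("nested_data")

# The second lateral view explodes a collection produced by the first one.
spark.sql("""
    select id, order_number, item_number, item
    from nested_data
    lateral view posexplode(orders) exploded_orders as order_number, order_row
    lateral view posexplode(order_row.items) exploded_items as item_number, item
""").show()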
Performance considerations
While we are on the topic of LATERAL VIEW, it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame DSL, e.g., myDF.explode(). The reason is that SQL can reason accurately about the schema, while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in terms of performance, however, it gains in flexibility, as you can return any supported type from explode, which means that you can perform a more complicated transformation in one step.
In recent versions of Spark, row-level explode via df.explode() has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...)). There is also an explode_outer(), which will produce output rows even if the input to be exploded is null. Column-level exploding does not suffer from the performance issues of row-level exploding mentioned above, as Spark can perform the transformation entirely using internal row data representations.
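A sketch of the column-level style (the sample data is an assumption; note that Spark still allows only one generator per select, so the two explodes are chained across two selects):

from pyspark.sql.functions import explode, explode_outer

df = spark.createDataFrame(
    [(["alice"], ["red", "blue"]), (["bob"], None)],
    ["names", "colors"],
)

result = (
    df.select(explode("names").alias("name"), "colors")
      .select("name", explode_outer("colors").alias("color"))
)
# explode_outer() keeps the "bob" row with a null color instead of dropping it.
result.show()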