SparkSQL: Can I explode two different variables in the same query?

Question

I have the following explode query, which works fine:

data1 = sqlContext.sql("select explode(names) as name from data")

I want to explode another field, "colors", so the final output would be the cartesian product of names and colors. So I did:

data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")

But I got this error:

 Only one generator allowed per select but Generate and and Explode found.;

Does anyone have an idea?

I can actually make it work in two steps:

data1 = sqlContext.sql("select explode(names) as name from data")
data1.registerTempTable('data1')
data1 = sqlContext.sql("select explode(colors) as color from data1")

But I am wondering if it is possible to do it in one step. Thanks a lot!
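For concreteness, the desired one-step result is the cross product of the two arrays. A minimal pure-Python sketch of that semantics (the sample values are hypothetical, and this models the expected output rather than running Spark):

```python
from itertools import product

# Hypothetical sample values for the two array columns.
names = ["alice", "bob"]
colors = ["red", "green", "blue"]

# Exploding both columns in one query should yield every (name, color) pair.
pairs = list(product(names, colors))
```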

Answer

The correct syntax is:

select name, color 
from data 
lateral view explode(names) exploded_names as name 
lateral view explode(colors) exploded_colors as color

The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW.

Think of it this way: LATERAL VIEW works like an implicit JOIN, with an ephemeral table created for every row from the structs in the collection being "viewed". So the way to parse the syntax is:

LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...

Multiple output columns

If you use a table-generating function such as posexplode(), then you still have one output table, but with multiple output columns:

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
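In plain-Python terms, posexplode() pairs each element with its zero-based position, so a single generator call yields two output columns. A rough model (not Spark's implementation, and the order ids are invented):

```python
def posexplode(collection):
    # Model of Spark's posexplode(): one (position, element) output row
    # per element, i.e. a single generator producing two columns.
    return list(enumerate(collection))

orders = ["o-100", "o-101", "o-102"]  # hypothetical order ids
exploded = posexplode(orders)
```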

Nesting

You can also "nest" LATERAL VIEW by repeatedly exploding nested collections, e.g.,

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
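The nesting above behaves like two chained loops: the second LATERAL VIEW iterates over the items of each order produced by the first. A pure-Python sketch with hypothetical data:

```python
# Hypothetical orders, each holding a nested collection of items.
orders = [
    {"items": ["apple", "pear"]},
    {"items": ["plum"]},
]

# Outer posexplode over orders, inner posexplode over each order's items.
rows = [
    (order_number, item_number, item)
    for order_number, order in enumerate(orders)
    for item_number, item in enumerate(order["items"])
]
```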

Performance considerations

While we are on the topic of LATERAL VIEW, it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame DSL, e.g., myDF.explode(). The reason is that SQL can reason accurately about the schema, while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in performance, however, it gains in flexibility: you can return any supported type from explode, which means you can perform a more complicated transformation in one step.

In recent versions of Spark, row-level explode via df.explode() has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...)). There is also explode_outer(), which produces output rows even when the input to be exploded is null. Column-level explode does not suffer from the performance issues of row-level explode mentioned above, because Spark can perform the transformation entirely using its internal row data representation.

