问题描述
标题中几乎有问题.我找不到有关差异的详细文档.
Question is pretty much in the title. I can't find any detailed documentation regarding the differences.
我确实注意到了一个差异,因为在交换多维数据集和groupBy函数调用时,我得到了不同的结果.我注意到,对于使用多维数据集"的结果,我经常将其分组的表达式上有很多空值.
I do notice a difference because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using 'cube', I got a lot of null values on the expressions I often grouped by.
推荐答案
这些方法不能以相同的方式工作. groupBy
只是标准SQL中GROUP BY
子句的等效项.换句话说
These are not intended to work in the same way. groupBy
is simply an equivalent of the GROUP BY
clause in standard SQL. In other words
table.groupBy($"foo", $"bar")
等效于:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
cube
等效于GROUP BY
的CUBE
扩展名.它获取列列表,并将聚合表达式应用于分组列的所有可能的组合.假设您有这样的数据:
cube
is equivalent to CUBE
extension to GROUP BY
. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:
val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show
// +---+---+
// | x| y|
// +---+---+
// |foo| 1|
// |foo| 2|
// |bar| 2|
// |bar| 2|
// +---+---+
,并且您将cube(x, y)
计算为总数:
and you compute cube(x, y)
with count as an aggregation:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
与cube
相似的函数是rollup
,它从左到右计算层次结构小计:
A similar function to cube
is rollup
which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | foo|null| 2| <- count where x is fixed to foo
// | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2
// | foo| 1| 1| ...
// | foo| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | bar|null| 2| <- count where x is fixed to bar
// +----+----+-----+
为了比较,让我们看一下普通groupBy
的结果:
Just for comparison lets see the result of plain groupBy
:
df.groupBy($"x", $"y").count.show
// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+
总结:
- 使用普通
GROUP BY
时,每行在其相应的摘要中仅包含一次. -
对于
GROUP BY CUBE(..)
,每一行都包括在它表示的级别的每个组合的摘要中,包括通配符.从逻辑上讲,上面显示的内容等效于以下内容(假设我们可以使用NULL
占位符):
- When using plain
GROUP BY
every row is included only once in its corresponding summary. With
GROUP BY CUBE(..)
every row is included in summary of each combination of levels it represents, wildcards included. Logically, the shown above is equivalent to something like this (assuming we could useNULL
placeholders):
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y, COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
使用GROUP BY ROLLUP(...)
与CUBE
相似,但是通过从左到右填充列来进行分层工作.
With GROUP BY ROLLUP(...)
is similar to CUBE
but works hierarchically by filling colums from left to right.
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
ROLLUP
和CUBE
来自数据仓库扩展,因此,如果您想更好地了解其工作原理,还可以查看自己喜欢的RDMBS的文档.例如PostgreSQL在9.5和中都有介绍,这些文献相对完整.
ROLLUP
and CUBE
come from data warehousing extensions so if you want to get a better understanding how this works you can also check documentation of your favorite RDMBS. For example PostgreSQL introduced both in 9.5 and these are relatively well documented.
这篇关于多维数据集,汇总和groupBy运算符有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!