Problem description
I have skewed data in a large table that is joined with another, smaller table. I understand how salting works for joins: a random number from a fixed range is appended to each key in the big, skewed table, and each row of the small, unskewed table is duplicated once for every number in that same range. The match still happens because every salted key in the skewed table hits exactly one of the duplicated rows. I have also read that salting helps when performing a groupBy. My question is: when random numbers are appended to the key, doesn't that break the group? If it does, then the meaning of the group-by operation has changed.
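To make the join case described above concrete, here is a minimal sketch using plain Scala collections rather than Spark. The object name, the sample data, the seed, and the salt range `n = 4` are all made up for illustration.

```scala
import scala.util.Random

object SaltedJoin {
  val n = 4                  // number of salt values (an assumed setting)
  val rng = new Random(7)    // fixed seed only so the sketch is repeatable

  // Hypothetical big, skewed side: most rows share the key "a".
  val big: Seq[(String, Int)] = Seq.fill(100)(("a", 1)) ++ Seq(("b", 2))
  // Hypothetical small side: one row per key.
  val small: Seq[(String, String)] = Seq(("a", "alpha"), ("b", "beta"))

  // Big side: append one random salt in [0, n) to every key.
  val bigSalted: Seq[((String, Int), Int)] =
    big.map { case (k, v) => ((k, rng.nextInt(n)), v) }

  // Small side: replicate every row once per possible salt value.
  val smallSalted: Map[(String, Int), String] =
    (for { (k, name) <- small; s <- 0 until n } yield (k, s) -> name).toMap

  // Join on (key, salt): each big row matches exactly one replica.
  val joined: Seq[(String, Int, String)] =
    bigSalted.flatMap { case ((k, s), v) =>
      smallSalted.get((k, s)).map(name => (k, v, name))
    }
}
```

Every row of the big side still finds its partner, so the join result has the same number of rows as an unsalted join, while the rows for the hot key "a" are spread over up to `n` distinct salted keys.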
Recommended answer
Well, it does. To mitigate this, you can run the group-by operation twice: first with the salted key, then remove the salt and group again. The second grouping operates on partially aggregated data, which significantly reduces the impact of the skew. This relies on the aggregation being one that can be merged from partial results, such as a sum, count, min, or max.
For example:
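import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// n: number of salt buckets; groupByFields: Seq[String]; aggFields: Seq[Column]
// mergeAggFields (hypothetical name): Seq[Column] that merges the stage-1 results,
// e.g. a sum of the partial sums or a sum of the partial counts
df.withColumn("salt", (rand() * n).cast(IntegerType))
  .groupBy("salt", groupByFields: _*)                  // stage 1: partial aggregates per (salt, key)
  .agg(aggFields.head, aggFields.tail: _*)
  .groupBy(groupByFields.head, groupByFields.tail: _*) // stage 2: the salt is dropped here
  .agg(mergeAggFields.head, mergeAggFields.tail: _*)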
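The claim that the two-stage grouping leaves the final result unchanged can be checked on a small in-memory example. The following is a sketch with plain Scala collections instead of Spark; the object name, the sample rows, the seed, and the bucket count `n = 8` are assumptions made for illustration, and the aggregation is a simple sum.

```scala
import scala.util.Random

object SaltedGroupBy {
  // Hypothetical input: (key, value) pairs where key "a" is heavily skewed.
  val rows: Seq[(String, Int)] =
    Seq.fill(1000)(("a", 1)) ++ Seq(("b", 5), ("b", 7), ("c", 3))

  val n = 8                  // number of salt buckets (an assumed tuning knob)
  val rng = new Random(42)   // fixed seed only so the sketch is repeatable

  // Stage 1: group by (key, salt) and compute partial sums.
  val partial: Map[(String, Int), Int] =
    rows
      .map { case (k, v) => ((k, rng.nextInt(n)), v) }
      .groupBy(_._1)
      .map { case (saltedKey, vs) => saltedKey -> vs.map(_._2).sum }

  // Stage 2: drop the salt and merge the partial sums per original key.
  val salted: Map[String, Int] =
    partial.toSeq
      .groupBy { case ((k, _), _) => k }
      .map { case (k, ps) => k -> ps.map(_._2).sum }

  // Reference result: a plain group-by without salting.
  val direct: Map[String, Int] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
}
```

After stage 1 the hot key "a" is represented by at most `n` partial rows instead of 1000 raw rows, so stage 2 has very little data to shuffle per key, and `salted` equals `direct`. An aggregation such as an average would need to carry a sum and a count through stage 1 and divide only in stage 2.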