This article discusses how pseudo columns are handled in Spark JDBC and may be a useful reference for anyone facing the same problem.

Problem Description

I am using a query to fetch data from MySQL as follows:

var df = spark.read.format("jdbc")
         .option("url", "jdbc:mysql://10.0.0.192:3306/retail_db")
         .option("driver" ,"com.mysql.jdbc.Driver")
         .option("user", "retail_dba")
         .option("password", "cloudera")
         .option("dbtable", "orders")
         .option("partitionColumn", "order_id")
         .option("lowerBound", "1")
         .option("upperBound", "68883")
         .option("numPartitions", "4")
         .load() 

The question is: can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2) in the option where I specify the partitionColumn?

If not, can we specify a partition column which is not a primary key?

Recommended Answer

TL;DR Probably not.

While Spark doesn't consider constraints like PRIMARY KEY or UNIQUE, there is a very important requirement for partitionColumn which is not explicitly stated in the documentation: it has to be deterministic.

Each executor fetches its own piece of data using a separate transaction. If the numeric column is not deterministic (stable, preserved between transactions), the state of the data seen by Spark might be inconsistent, and records might be duplicated or skipped.
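
To make the requirement concrete, below is a minimal sketch (my own simplification, not Spark's actual code) of how lowerBound, upperBound and numPartitions are turned into one WHERE clause per partition on partitionColumn, using the numbers from the read above. Each clause is sent as a separate query, in a separate transaction, so a row's partitionColumn value must not change between those queries.

// Simplified sketch of how the partitioning options become per-partition filters.
// Not Spark's actual implementation; it only illustrates the shape of the queries.
def whereClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions          // width of each range
  (0 until numPartitions).map { i =>
    val start = lower + i * stride
    val end   = start + stride
    if (i == 0)                      s"$column < $end OR $column IS NULL"
    else if (i == numPartitions - 1) s"$column >= $start"
    else                             s"$column >= $start AND $column < $end"
  }
}

// With the values from the question (order_id, 1, 68883, 4) this yields filters like
// "order_id < 17221 OR order_id IS NULL", "order_id >= 17221 AND order_id < 34441", ...
whereClauses("order_id", 1L, 68883L, 4).foreach(println)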

Because ROWNUM implementations are usually volatile (they depend on unstable ordering and can be affected by features like indexing), they are not a safe choice for partitionColumn. For the same reason, you cannot use random numbers.
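
For illustration only, a read of the shape the answer warns against might look like the following; the Oracle URL, table, and column names here are hypothetical. Because ROWNUM is assigned anew by every query, the rn value for a given row can differ between the four partition queries, so rows can be duplicated or lost.

// Hypothetical example of what NOT to do: partitioning on a ROWNUM-derived column.
val unsafe = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db.example.com:1521/ORCL")      // hypothetical instance
  .option("dbtable", "(SELECT e.*, ROWNUM AS rn FROM employees e) t") // rn is re-assigned per query
  .option("partitionColumn", "rn")   // unsafe: not stable across the partition transactions
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "4")
  .load()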

Also, some vendors might further limit the allowed operations on pseudocolumns, making them unsuitable for use as a partitioning column. For example, Oracle documents that conditions testing for ROWNUM values greater than a positive integer are always false, so such a read might fail silently, leading to incorrect results.

Can we specify a partition column which is not a primary key?

Yes, as long as it satisfies the criteria described above.
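
For example, any stable numeric column will do. The sketch below assumes the orders table has a deterministic order_customer_id column and reuses the connection details from the question; the bounds are made up and would normally come from a MIN/MAX query.

// Partitioning on a column that is not the primary key; order_customer_id and the
// bounds below are assumptions for illustration.
val byCustomer = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://10.0.0.192:3306/retail_db")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "retail_dba")
  .option("password", "cloudera")
  .option("dbtable", "orders")
  .option("partitionColumn", "order_customer_id") // deterministic, but not the primary key
  .option("lowerBound", "1")
  .option("upperBound", "12435")                  // assumed MIN/MAX of the column
  .option("numPartitions", "4")
  .load()

Note that if the chosen column's values are skewed, the resulting partitions will be uneven, so a roughly uniformly distributed column gives the best parallelism.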

This concludes the article on pseudo columns in Spark JDBC. We hope the answer above is helpful.
