问题描述
我需要将数据插入给定的外部表,该表应按插入的日期进行分区.我的问题是Hive如何处理时间戳生成?当我为所有插入的记录选择时间戳时,如下所示:
I need to insert data to a given external table which should be partitioned by the inserted date.My question is how is Hive handling the timestamp generation? When I select a timestamp for all inserted records like this:
WITH delta_insert AS (
SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp
FROM target_table trg
)
SELECT *
FROM delta_insert;
即使查询要花费很多时间才能使所有记录的时间戳始终相同?
Will the timestamp always be identical for all records, even if the query takes a lot of time to un?
还是我应该只选择一个实际的时间戳,然后将其与之后选择的所有内容一起加入?
Or should I alternatively only select an actual timestamp and join it with everything that is selected afterwards?
WITH insert_timestamp AS (
SELECT from_unixtime(unix_timestamp()) AS insert_timestamp
),
delta_insert AS (
SELECT trg.*, insert_timestamp.insert_timestamp
FROM target_table trg, insert_timestamp
)
SELECT *
FROM delta_insert;
由于不建议在Hive中进行交叉联接,所以我想知道哪种方法最好,因为我没有足够的测试数据来模拟长时间运行的查询,以确保将它们插入到增量的同一分区中加载.
Since it's not recommended to do cross joins in Hive, I wonder what would be the best approach, since I don't have enough test data to simulate long running queries to be sure they get inserted within the same partition of a delta load.
推荐答案
否,使用unix_timestamp()生成的时间戳记对于所有记录都不总是相同的.此函数是不确定性的,会阻止查询的适当优化-自2.0版以来已弃用此函数,而推荐使用CURRENT_TIMESTAMP常量(对于查询仅计算一次).查看此文档:日期函数这不适用于带有参数的 unix_timestamp(字符串日期)
. unix_timestamp(字符串日期)
函数是确定性的.
No, the timestamp generated using unix_timestamp() WILL NOT always be identical for all records. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant (calculated only one time for the query). See this docs: Date functionsThis not applies to unix_timestamp(string date)
with argument. unix_timestamp(string date)
function is deterministic.
current_timestamp
返回查询评估开始时的当前时间戳(自Hive 1.2.0起).同一查询中对current_timestamp的所有调用均返回相同的值.如果需要日期,请使用 current_date
函数.
current_timestamp
returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.If you need date, use current_date
function.
这篇关于通过确切的插入时间戳在Hive中进行动态分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!