问题描述
当前,我们有一个DMS任务,该任务将获取MySQL DB的内容并将文件以拼花格式转储到S3。
Presently, we have a DMS task that will take the contents of a MySQL DB and dump files to S3 in parquet format.
镶木地板中时间戳的格式最终为TIMESTAMP_MICROS。
The format for the timestamps in parquet ends up being TIMESTAMP_MICROS.
这是Presto(雅典娜的底层实现)
This is a problem as Presto (the underlying implementation of Athena) does not support timestamps in microsecond precision and makes the assumption that all timestamps are in millisecond precision.
这不会直接导致任何错误,但会使时间显示为某个极端的未来日期,因为它解释了
This does not cause any errors directly but it makes the times display as some extreme future date as it is interpreting the number of microseconds as number of milliseconds.
我们目前正在通过在采用这些时间戳的Athena表之上创建Athena视图并将其转换为unix来解决此问题。时间戳(即数字),除以1000可转换为毫秒分辨率,然后转换回时间戳。
We are currently working around this by creating Athena views on top of the Athena tables that take these these timestamps, convert them to a unix timestamp (i.e. number), divide by 1000 to convert to millisecond resolution and then convert back to a timestamp.
这对于概念验证和确定问题很好,但是这对我们不起作用s我们有很多带有很多时间戳列的表,并且不想在顶部引入这个脆弱的视图层。
This was fine for a proof of concept and identifying the problem but this won't work for us as we have lots of tables with lots of timestamp columns and don't want to introduce this fragile view layer on top.
Athena和DMS都创建了支持票团队和双方都确认了问题并在其端创建了功能请求,以针对此不兼容的问题创建本机解决方案,但任何支持工程师都无法提供内置的解决方法。 DMS似乎不支持将列类型更改为毫秒精度或varchar字符串。对于雅典娜来说,这只是Presto记录在案的限制。
Support tickets were created for both the Athena and DMS teams and both confirmed the problem and created feature requests on their end to create a native solution for this incompatible but neither support engineer could offer a built-in workaround. DMS does not seem to support changing the column type to be millisecond precision or a varchar string. For Athena, this is simply a documented Presto limitation.
有人以前遇到过这种情况并找到了解决方案吗?
Has anyone run into this before and found a solution?
更改MySQL数据库的精度并不是真正的选择,因为许多客户端都在使用它,我们也不想冒险破坏接口。
Changing the precision in the MySQL database is not really an option as it is used by many clients and we don't want to risk breaking the interface.
我正在考虑创建一个作业,该作业将在创建新镶木地板时运行,以查找TIMESTAMP_MICRO字段并将其列为TIMESTAMP_MILLIS。不确定是否要执行此操作的最佳方法(胶水/火花工作?)
I am considering creating a job that will run anytime a new parquet is created to look for TIMESTAMP_MICRO fields and column them to TIMESTAMP_MILLIS. Not exactly sure the best way to go about this (Glue/Spark job?)
DMS非常棒,并且使转至拼花地板的超级简单,可能还有其他方法从MySQL中提取数据。无需其他大量定制开发就可以执行此操作的其他工具吗?
DMS was great and has made the dump to parquet super simple, there may be other ways to extract the data from MySQL. Any other tools that can do this without much custom development?
推荐答案
使用目标端点,可以设置额外的连接属性 。如果将 parquetTimestampInMillisecond
设置为 true
,则时间戳将以毫秒为单位。
With the target endpoint, you can set "Extra Connection Attributes". If you set parquetTimestampInMillisecond
to true
, the timestamps will be in milliseconds.
这篇关于如何在Presto / Athena中处理TIMESTAMP_MICROS拼花地板字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!