问题描述
当前,我使用的体系结构采用了一些数据源,因为其中的数据源托管在云中,因此这些数据源是在本地暂存的。无论如何,其他都在本地托管,因此我执行的ETL直接从源头获取。我看不出为其他来源创建舞台的意义。
Currently, the architecture I work with takes a few data sources out of which one is staged locally because it's hosted in the cloud. The others are hosted locally anyway, so the ETL I perform takes it directly from the source. I don't really see the point in creating a stage for the other sources.
1)将本地托管的源复制到本地阶段是否有明显的好处?
1) Is there a distinct benefit to duplicating the locally hosted source into a local stage?
2)将舞台托管在单独的计算机上还是与Warehouse托管在同一台计算机上是个更好的主意?
2) Is it a better idea to host the stage on a separate machine or the same one as the Warehouse?
3)如果我想减少ETL时间,那是什么?这样做的好方法?我正在考虑对数据进行分区,以便比存档数据更频繁地提取重要信息。这是一个好方法吗,我有什么选择?
3) If I'm trying to reduce my ETL time, what's a good way to do so? I was considering partitioning my data so that the important information is pulled more frequently than the "archived data". Is this a good approach, and what are my alternatives?
推荐答案
@omgitsdev我想澄清一些概念。
@omgitsdev There are a few concepts I would like to clarify.
您的文件可以托管在任何地方-本地或云中
文件被加载到一个临时表中,以加载到您的数据仓库中。这个过程称为暂存。
Your files can be hosted anywhere - locally or on cloudThe files are loaded into a temporary table to be loaded into your Data Warehouse. This process is called staging.
从概念上讲,您可以将暂存区放在任何地方;但是,为了减少连接性问题,我们在同一数据库中创建了一个单独的架构并进行了阶段化。
Conceptually you can have your staging area anywhere; however to reduce connectivity issues, we create a separate schema in the same database and stage them. This will ensure that your performance is not hampered by connectivity issues.
通常,您将事实表按保存日期的列进行分区;
you generally partition your fact table by the column which holds the date; this is easier and also the most recent partitions hold the latest data;
基于卷,您可以将其设为每月,每季度或每年的分区;在某些情况下,我们还会创建每日或每小时分区。
Based on the volume, you either make it a monthly, quarterly or yearly partition; there are situations where we also create daily or hourly partitions.
通过确保临时表与数据仓库表位于单独的磁盘中,还可以提高性能。
Your performance can also be accelerated by ensuring that the staging tables are in a separate disk from the data warehouse tables.
这篇关于在ETL中进行分阶段:最佳实践?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!