Problem Description
I am using the ExecuteSQLRecord processor to dump the contents of a large table (100 GB) with 100+ million records.
I have set up the properties as shown below. However, what I am noticing is that it takes a good 45 minutes before I see any flow files coming out of this processor.
What am I missing?
I am on NiFi 1.9.1. Thanks.
Recommended Answer
An alternative to ExecuteSQL(Record), or even GenerateTableFetch -> ExecuteSQL(Record), is to use QueryDatabaseTable without a Max-Value Column. It has a Fetch Size property that attempts to set the number of rows returned on each pull from the database. Oracle's default is 10, for example, so with 10,000 rows per flow file, ExecuteSQL has to make 1,000 trips to the DB, fetching 10 rows at a time. As a general rule, I recommend setting Fetch Size equal to Max Rows Per Flow File, so that one fetch is made per outgoing flow file.
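To make the round-trip arithmetic concrete, here is a minimal JDBC sketch of the same mechanism that NiFi's Fetch Size property configures under the hood. The JDBC URL, credentials, and table name `my_large_table` are placeholders for illustration, not values from the original question.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- substitute your own JDBC URL and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement stmt = conn.createStatement()) {

            // With Oracle's default fetch size of 10, reading 10,000 rows costs
            // 1,000 network round trips. Raising the fetch size to match the
            // batch size reduces that to a single round trip per batch.
            stmt.setFetchSize(10_000);

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_large_table")) {
                long count = 0;
                while (rs.next()) {
                    count++; // process each row here
                }
                System.out.println("Rows read: " + count);
            }
        }
    }
}
```

The same trade-off applies inside QueryDatabaseTable: a larger fetch size uses more driver-side memory per fetch, but matching it to Max Rows Per Flow File means each outgoing flow file is filled by one fetch instead of hundreds.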
The Fetch Size property should be available on the ExecuteSQL processors as well; I wrote up Apache Jira NIFI-6865 to cover this improvement.