问题描述
我需要一些帮助来提高查询速度。
I need some help to improve the speed of a query.
我有3个表:
1- pairTable2一个4列的表:
-genomic_accession:分组列(不在乎这个问题)
-汇编:分组列(不在乎这个问题)
- product_accession:用于在其他表中搜索的列
-tmpcol:用于在其他表中搜索的列
2- SBPDB一个1列的表:
-product_accession :用于搜索其他表的列
3- cacheDB一个1列的表:
-product_accession:用于搜索其他表的列
I have 3 tables:
1- pairTable2 a 4 columns table:
- genomic_accession: grouping column (don't care for this question)
- assembly: grouping column (don't care for this question)
- product_accession: column used to search in other table
- tmpcol: column used to search in other table
2- SBPDB an 1 column table:
- product_accession: column used to search in other table
3- cacheDB an 1 column table: - product_accession: column used to search in other table
这个想法是在表1中创建一个名为SBP的布尔列,如果
列中的值 product_accession $,则该列为
TRUE
。 c $ c>和/或 tmpcol
在 SBPDB
的唯一列中;
并且,在表1中创建一个名为SBP的布尔列,如果
列中的值 produc,则该列为
和/或 TRUE
t_accession tmpcol
在 cacheDB
的唯一列中。
The idea is to create a boolean column called SBP in table 1 which is TRUE
ifthe value on columns product_accession
and/or tmpcol
is within the unique column in SBPDB
;
And, create a boolean column called SBP in table 1 which is TRUE
ifthe value on columns product_accession
and/or tmpcol
is within the unique column in cacheDB
.
我正在将R与 DBI
和 dplyr
一起使用作为后端,则查询可能看起来很奇怪。
但是,我要执行的查询是:
I'm using R with DBI
and dplyr
as backend, then the query may look weirdo.But, the query I'm trying to do is:
SELECT "genomic_accession",
"assembly",
"product_accession",
"tmpcol",
"product_accession" IN (SELECT product_accession
FROM "cachedb")
OR "tmpcol" IN (SELECT product_accession
FROM "cachedb") AS "CACHE",
"product_accession" IN (SELECT product_accession
FROM "sbpdb")
OR "tmpcol" IN (SELECT product_accession
FROM "sbpdb") AS "SBP"
FROM (SELECT *
FROM "pairtable2"
LIMIT 500000) "dbplyr_031";
(检查解释)
QUERY PLAN
----------------------------------------------------------------------------------------
Subquery Scan on dbplyr_031 (cost=3242.27..3846856408.45 rows=500000 width=59)
-> Limit (cost=0.00..10666.17 rows=500000 width=57)
-> Seq Scan on "pairTable2" (cost=0.00..781515.16 rows=36635216 width=57)
SubPlan 1
-> Seq Scan on "cacheDB" (cost=0.00..1394.91 rows=90491 width=14)
SubPlan 2
-> Seq Scan on "cacheDB" "cacheDB_1" (cost=0.00..1394.91 rows=90491 width=14)
SubPlan 3
-> Materialize (cost=0.00..7001.57 rows=276838 width=14)
-> Seq Scan on "SBPDB" (cost=0.00..4265.38 rows=276838 width=14)
SubPlan 4
-> Materialize (cost=0.00..7001.57 rows=276838 width=14)
-> Seq Scan on "SBPDB" "SBPDB_1" (cost=0.00..4265.38 rows=276838 width=14)
(13 rows)
因此,这只是50万行的示例,并且在运行1小时后仍然如此。
行的总数是:
So, this is just a sample of 500k rows, and after 1 hour still running.The total of rows is:
genomes=> select count(*) from "pairTable2";
count
----------
36633962
(1 row)
我至少需要一些建议来找出更好的查询来加快我的需求。
I need at least some suggestions to figure out a better query to speed up my needs.
表格示例:
(1)
Example of the tables:
(1)
genomic_accession | assembly | product_accession | tmpcol
-------------------+-----------------+-------------------+----------------
NC_007777.1 | GCF_000013345.1 | WP_011437108.1 | WP_011437109.1
NC_007777.1 | GCF_000013345.1 | WP_011437109.1 | WP_011437110.1
NC_007777.1 | GCF_000013345.1 | WP_011437110.1 | WP_011437113.1
NC_007777.1 | GCF_000013345.1 | WP_011437113.1 | WP_011437114.1
NC_007777.1 | GCF_000013345.1 | WP_011437114.1 | WP_011437116.1
NC_007777.1 | GCF_000013345.1 | WP_011437116.1 | WP_011437117.1
NC_007777.1 | GCF_000013345.1 | WP_011437117.1 | WP_011437118.1
NC_007777.1 | GCF_000013345.1 | WP_011437118.1 | WP_011437120.1
NC_007777.1 | GCF_000013345.1 | WP_011437120.1 | WP_011437121.1
NC_007777.1 | GCF_000013345.1 | WP_011437121.1 | WP_011437123.1
(10 rows)
(2)
product_accession
-------------------
WP_005887071.1
WP_005913801.1
WP_002804432.1
WP_010366489.1
WP_012444785.1
NP_636898.1
WP_046342269.1
WP_074057745.1
WP_039420813.1
WP_005932253.1
(10 rows)
(3 )
product_accession
-------------------
ABG85315.1
ABG85570.1
ABG86033.1
ABG86301.1
ABG87594.1
ACX82524.1
ACX83274.1
ACX83416.1
ADX79866.1
ADX79880.1
(10 rows)
输出示例:
genomic_accession | assembly | product_accession | tmpcol | CACHE | SBP
-------------------+-----------------+-------------------+----------------+-------+-----
NC_007899.1 | GCF_000009945.1 | WP_011457581.1 | WP_011457582.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457582.1 | WP_011457583.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457583.1 | WP_011457584.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457584.1 | WP_011457585.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457585.1 | WP_011457586.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457586.1 | WP_011457587.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457587.1 | WP_011457588.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457588.1 | WP_011457589.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457589.1 | WP_011457590.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457590.1 | WP_011457592.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457592.1 | WP_011457593.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457593.1 | WP_011457594.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457594.1 | WP_011457596.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457596.1 | WP_011457597.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457597.1 | WP_011457598.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457598.1 | WP_011457600.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457600.1 | WP_011457601.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457601.1 | WP_011457602.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457602.1 | WP_011457603.1 | f | f
NC_007899.1 | GCF_000009945.1 | WP_011457603.1 | WP_011457604.1 | f | f
预先感谢
推荐答案
这是您的查询:
SELECT "genomic_accession", "assembly", "product_accession", "tmpcol",
("product_accession" IN ( SELECT product_accession FROM "cacheDB" ) OR
"tmpcol" IN ( SELECT product_accession FROM "cacheDB")
) AS "CACHE",
("product_accession" IN ( SELECT product_accession FROM "SBPDB" ) OR
"tmpcol" IN ( SELECT product_accession FROM "SBPDB" ) AS "SBP"
FROM (SELECT * FROM "pairTable2" LIMIT 500000) "dbplyr_031";
我会删除所有双引号。不要创建列名以及需要转义的表名。然后,具有正确索引的 EXISTS
通常会表现更好:
I would get rid of all the double quotes. Don't create column names and table names that need to be escaped. Then, EXISTS
with the right indexes often performs better:
SELECT "genomic_accession", "assembly", "product_accession", "tmpcol",
(EXISTS (SELECT 1
FROM "cacheDB" c
*WHERE c.product_accession IN (pt.product_accession, pt.tmpcol )
)
) AS CACHE,
(EXISTS (SELECT 1
FROM "SBPDB" s
WHERE s.product_accession IN (pt.product_accession, pt.tmpcol )
)
) AS SBP
FROM (SELECT * FROM "pairTable2" LIMIT 500000) pt;
然后,为了提高性能,您希望在 cachedb(product_accession)上建立索引
和 sbpdb(product_accession)
。
Then, for performance, you want indexes on cachedb(product_accession)
and sbpdb(product_accession)
.
这篇关于加快PostgreSQL查询(检查另一个表中是否存在条目)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!