问题描述
我有以下psql表.它总共约有20亿行.
I have the following psql table. It has roughly 2 billion rows in total.
id word lemma pos textid source
1 Stuffing stuff vvg 190568 AN
2 her her appge 190568 AN
3 key key nn1 190568 AN
4 into into ii 190568 AN
5 the the at 190568 AN
6 lock lock nn1 190568 AN
7 she she appge 190568 AN
8 pushed push vvd 190568 AN
9 her her appge 190568 AN
10 way way nn1 190568 AN
11 into into ii 190568 AN
12 the the appge 190568 AN
13 house house nn1 190568 AN
14 . . 190568 AN
15 She she appge 190568 AN
16 had have vhd 190568 AN
17 also also rr 190568 AN
18 cajoled cajole vvd 190568 AN
19 her her appge 190568 AN
20 way way nn1 190568 AN
21 into into ii 190568 AN
22 the the at 190568 AN
23 home home nn1 190568 AN
24 . . 190568 AN
.. ... ... .. ... ..
我想创建一个下表,该表并排显示所有"way"结构以及"source","lemma"和"pos"列中的一些数据.
I would like to create the following table, which shows all "way"-constructions with the words side-by-side and some data from the columns "source", "lemma" and "pos".
source word word word lemma pos word word word word word lemma pos word word
AN lock she pushed push vvd her way into the house house nn1 . she
AN had also cajoled cajole vvd her way into the home home nn1 . A
AN tried to force force vvi her way into the palace palace nn1 , officials
在这里您可以看到我使用的代码:
Here you can see the code I use:
copy(
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM
orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11
WHERE
c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%'
AND
c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id
ORDER BY c1.id
)
TO
'/home/postgres/Results/OUTPUT.csv'
DELIMITER E'\t'
csv header;
对于20亿行,该查询将花费近9个小时来执行(结果大约有19,000行).
The query takes almost 9 hours to execute for the two billion rows (the result has about 19,000 rows).
我该怎么做才能提高性能?
What could I do to improve performance?
word,pos和lemma列已经具有btree索引.
The word, pos and lemma columns already have btree indices.
我应该坚持我的代码,仅使用功能更强大的服务器,具有更多的内核/更快的CPU和更多的RAM(我的RAM只有8 GB,只有2个内核和2.8 GHz)吗?还是您会建议使用其他更有效的SQL查询?
Should I stick to my code and simply use a more powerful server with more cores/a faster CPU and more RAM (mine has only 8 GBs of RAM, a mere 2 cores and 2.8 GHz) ? Or would you recommend a different, more efficient SQL query?
谢谢!
推荐答案
我建议使用现代的连接语法,这很可能会解决问题:
I recommend using modern join syntax, which may well fix the problem:
SELECT
c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus AS c1
JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id
JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id
JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id
JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id
JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id
JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id
JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id
JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id
JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id
JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id
WHERE c4.pos LIKE 'v%'
AND c5.pos = 'appge'
AND c6.lemma = 'way'
AND c7.pos LIKE 'i%'
AND c8.word = 'the'
AND c9.pos LIKE 'n%'
注意:
- 多余的
LIKE
已删除 已删除 -
ORDER BY
,因为它非常昂贵. CSV(如表格行)不需要排序即可生效.如果您绝对需要排序,请在执行查询后使用命令行工具对其进行排序.
- redundant
LIKE
s removed ORDER BY
removed, because it's very expensive. CSV (like table rows) don't need ordering to be valid. If you absolutely need ordering, use command line tools to order it after the execution of the query.
这篇关于在许多相邻行上使用WHERE的Postgres查询痛苦得很慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!