




I have the following PostgreSQL table. It has roughly 2 billion rows in total.

 id  word      lemma     pos              textid  source
 1  Stuffing   stuff      vvg             190568  AN
 2  her        her        appge           190568  AN
 3  key        key        nn1             190568  AN
 4  into       into       ii              190568  AN
 5  the        the        at              190568  AN
 6  lock       lock       nn1             190568  AN
 7  she        she        appge           190568  AN
 8  pushed     push       vvd             190568  AN
 9  her        her        appge           190568  AN
10  way        way        nn1             190568  AN
11  into       into       ii              190568  AN
12  the        the        appge           190568  AN
13  house      house      nn1             190568  AN
14  .                     .               190568  AN
15  She        she        appge           190568  AN
16  had        have       vhd             190568  AN
17  also       also       rr              190568  AN
18  cajoled    cajole     vvd             190568  AN
19  her        her        appge           190568  AN
20  way        way        nn1             190568  AN
21  into       into       ii              190568  AN
22  the        the        at              190568  AN
23  home       home       nn1             190568  AN
24  .                     .               190568  AN
..  ...        ...        ..              ...     ..


I would like to create the following table, which shows all "way"-constructions with the words side-by-side and some data from the columns "source", "lemma" and "pos".

source     word   word       word       lemma      pos        word       word     word       word       word       lemma      pos        word       word
AN         lock   she        pushed     push       vvd        her        way      into       the        house      house      nn1        .          she
AN         had    also       cajoled    cajole     vvd        her        way      into       the        home       home       nn1        .          A
AN         tried  to         force      force      vvi        her        way      into       the        palace     palace     nn1        ,          officials


Here you can see the code I use:

SELECT   c1.source, c1.word,  c2.word, c3.word,  c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word

FROM

orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11

WHERE

c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%'

AND

c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id

ORDER BY c1.id
csv header;


The query takes almost 9 hours to execute for the two billion rows (the result has about 19,000 rows).


What could I do to improve performance?


The word, pos and lemma columns already have btree indices.

Should I stick with my code and simply use a more powerful server with more cores, a faster CPU and more RAM (mine has only 8 GB of RAM and a mere 2 cores at 2.8 GHz)? Or would you recommend a different, more efficient SQL query?




I recommend using modern join syntax, which may well fix the problem:

SELECT c1.source, c1.word,  c2.word, c3.word,  c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus AS c1
JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id
JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id
JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id
JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id
JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id
JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id
JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id
JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id
JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id
JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id
WHERE c4.pos LIKE 'v%'
AND c5.pos = 'appge'
AND c6.lemma = 'way'
AND c7.pos LIKE 'i%'
AND c8.word = 'the'
AND c9.pos LIKE 'n%'


  • redundant LIKEs removed
  • ORDER BY removed, because it's very expensive. CSV rows (like table rows) don't need to be ordered to be valid. If you absolutely need ordering, sort the output with command-line tools after the query has finished.
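
If you want to try the rewritten join before running it on the full table, here is a small sketch. It assumes a hypothetical scratch table called corpus_demo (not part of the original setup), filled with rows 5 to 15 of the sample data from the question. It also assumes that id is indexed, here through a primary key; the question only mentions indexes on word, pos and lemma. Storing the blank lemma of '.' as an empty string is an assumption too.

```sql
-- Hypothetical scratch table mirroring the columns shown in the question.
CREATE TEMP TABLE corpus_demo (
    id     bigint PRIMARY KEY,  -- assumption: id is indexed (here via the primary key)
    word   text,
    lemma  text,
    pos    text,
    textid integer,
    source text
);

-- Rows 5 to 15 of the sample data: exactly one 11-token window.
INSERT INTO corpus_demo (id, word, lemma, pos, textid, source) VALUES
    (5,  'the',    'the',   'at',    190568, 'AN'),
    (6,  'lock',   'lock',  'nn1',   190568, 'AN'),
    (7,  'she',    'she',   'appge', 190568, 'AN'),
    (8,  'pushed', 'push',  'vvd',   190568, 'AN'),
    (9,  'her',    'her',   'appge', 190568, 'AN'),
    (10, 'way',    'way',   'nn1',   190568, 'AN'),
    (11, 'into',   'into',  'ii',    190568, 'AN'),
    (12, 'the',    'the',   'appge', 190568, 'AN'),
    (13, 'house',  'house', 'nn1',   190568, 'AN'),
    (14, '.',      '',      '.',     190568, 'AN'),  -- blank lemma stored as '' (assumption)
    (15, 'She',    'she',   'appge', 190568, 'AN');

-- Same join as above, pointed at the scratch table.
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos,
       c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos,
       c10.word, c11.word
FROM corpus_demo AS c1
JOIN corpus_demo AS c2  ON c1.id + 1  = c2.id
JOIN corpus_demo AS c3  ON c1.id + 2  = c3.id
JOIN corpus_demo AS c4  ON c1.id + 3  = c4.id
JOIN corpus_demo AS c5  ON c1.id + 4  = c5.id
JOIN corpus_demo AS c6  ON c1.id + 5  = c6.id
JOIN corpus_demo AS c7  ON c1.id + 6  = c7.id
JOIN corpus_demo AS c8  ON c1.id + 7  = c8.id
JOIN corpus_demo AS c9  ON c1.id + 8  = c9.id
JOIN corpus_demo AS c10 ON c1.id + 9  = c10.id
JOIN corpus_demo AS c11 ON c1.id + 10 = c11.id
WHERE c4.pos LIKE 'v%'
AND c5.pos = 'appge'
AND c6.lemma = 'way'
AND c7.pos LIKE 'i%'
AND c8.word = 'the'
AND c9.pos LIKE 'n%';
```

On this sample the query should return a single row, the "pushed her way into the house" window that starts at id 5. On the real table, putting EXPLAIN in front of the SELECT shows whether the planner starts from a selective condition such as lemma = 'way' and reaches the neighbouring rows through index lookups on id. The plan for a table this tiny will not be representative.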


08-22 16:53