
I recently asked a question about speeding up trailing-wildcard text lookups such as SELECT a, b, c FROM t WHERE a LIKE 'abcde%' in Postgres. By creating the following index, I was finally able to get between 200 ms and 800 ms per query:

CREATE INDEX idxa ON t (lower(a) varchar_pattern_ops);

I am now interested in speeding up the query much further, if possible; perhaps to between 200 and 800 microseconds. Can this be done?

The entire table is about 1 GB of raw text (8+ million rows) and can be made even smaller, so it could easily fit in memory. Could I implement a cache on top of Postgres, one that warms up over time? Perhaps memcached or something similar. Since most caches only support exact key lookups, how would I do a wildcard search against a cache?

By the way, as a point of information: I did load the entire table into MongoDB, and while I got very fast lookups on exact searches (a = 'abcdefg'), MongoDB's wildcard search for the pattern above was actually inferior to that of Postgres.

punkish

1 Answer


You can still squeeze out some more performance.

Firstly, I would generally advise using the data type text instead of varchar, so text_pattern_ops instead of varchar_pattern_ops. This won't affect performance, though.


Next, as your column holds up to 100 characters but you only search the first n (20?) characters, the index will be much smaller with lower(left(a, 20)) instead of lower(a), as I already suggested in my answer to your prequel question.

The index search itself performs the same either way, but with the bigger index the server has to visit many more pages on disk or in RAM: fewer index entries fit per page, so more pages have to be visited for every lookup, and pages drop out of your cache sooner. This is especially important with big tables like yours. Limit the number of leading characters that can be searched to the required minimum. This leaves you with something like:

CREATE INDEX t_a_lower_left_idx ON t (lower(left(a, 20)) text_pattern_ops);
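For the expression index to be used, the query has to match the indexed expression. A minimal sketch, assuming search terms of at most 20 characters:

SELECT a, b, c
FROM   t
WHERE  lower(left(a, 20)) LIKE 'abcde%';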

Also, you can use the special operators ~>=~ and ~<~ in your query, as I demonstrate in the answer linked above:

SELECT * FROM tbl WHERE lower(a) ~>=~ 'abcde' AND lower(a) ~<~ 'abcdf';

Note the 'f' instead of the 'e' in the second expression. The question is: how do you get the "next" character according to locale 'C'?

SELECT chr(ascii('e') + 1);  -- returns 'f'

So you can write:

SELECT * FROM tbl WHERE lower(a) ~>=~ 'abcde'
                    AND lower(a) ~<~ ('abcd' || chr(ascii('e')+1));
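If you run such queries a lot, a tiny helper function keeps them readable. This is only a sketch; the name next_prefix is my invention, and it assumes single-byte characters as in locale 'C':

-- Hypothetical helper: increment the last character of a prefix to get
-- an exclusive upper bound. Assumes single-byte (ASCII) characters.
CREATE OR REPLACE FUNCTION next_prefix(p text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$$
SELECT left(p, -1) || chr(ascii(right(p, 1)) + 1)
$$;

SELECT * FROM tbl
WHERE  lower(a) ~>=~ 'abcde'
AND    lower(a) ~<~  next_prefix('abcde');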

I ran a test on a table with real-world data holding half a million rows. A search term yielding 650 rows took 4 ms with the first query and 3 ms with the second. It very much depends on how many rows are found: a search term yielding only 1 row takes 0.044 ms here.
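To reproduce such measurements on your own data, EXPLAIN (ANALYZE, BUFFERS) shows the actual plan, runtime, and pages touched. For instance:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl
WHERE  lower(a) ~>=~ 'abcde'
AND    lower(a) ~<~  'abcdf';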


Therefore, enforce a minimum length for the search term to prohibit useless queries that would yield too many rows anyway. Something like 3 or 4 characters minimum.
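One way to enforce that is a wrapper function that rejects short terms. Again only a sketch: search_t is a hypothetical name, and it builds on the next_prefix helper sketched above:

-- Hypothetical wrapper: refuse search terms shorter than 3 characters,
-- then search via the expression index on lower(left(a, 20)).
CREATE OR REPLACE FUNCTION search_t(term text)
  RETURNS SETOF t
  LANGUAGE plpgsql STABLE AS
$$
BEGIN
   IF length(term) < 3 THEN
      RAISE EXCEPTION 'Search term too short: %', term;
   END IF;
   RETURN QUERY
   SELECT *
   FROM   t
   WHERE  lower(left(a, 20)) ~>=~ lower(term)
   AND    lower(left(a, 20)) ~<~  next_prefix(lower(term));
END
$$;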


Next, you can cluster your table like this:

CLUSTER tbl USING t_a_lower_left_idx;

After that, my testcase took 2.5 ms instead of 3 ms.
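Bear in mind that CLUSTER is a one-time operation: rows written afterwards are not kept in index order. Postgres remembers the index, so you can re-cluster periodically with just:

CLUSTER tbl;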


Of course, all the basic advice for performance optimization applies.


If the above is not enough, you might want to think about creating a tablespace on a ramdisk or tmpfs partition (Linux) and creating indexes there, or even putting your whole table there. I am sure you are aware of the security implications of a volatile medium for a database: only do this if you can afford to lose all your data.
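For illustration, creating such a tablespace might look like this; the mount point is an assumption and must be a tmpfs or ramdisk mount you have prepared yourself:

-- Hypothetical path; point it at your tmpfs / ramdisk mount.
CREATE TABLESPACE indexspace LOCATION '/mnt/pg_tmpfs';

Then create the index there: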

CREATE INDEX t_a_lower_left_idx ON t (lower(left(a, 20)) text_pattern_ops)
TABLESPACE indexspace;

If your database is set up properly, your machine has enough RAM, and the table is read heavily, the standard caching algorithms may provide most of the performance gain automatically, and you won't gain much with this.
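To check whether that is already happening, you can look at the block hit statistics Postgres tracks per table, for example:

SELECT heap_blks_read, heap_blks_hit,
       idx_blks_read,  idx_blks_hit
FROM   pg_statio_user_tables
WHERE  relname = 't';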

Erwin Brandstetter