Best way to exclude outdated data from a search in PostgreSQL

Question

I have a table containing the following columns:

an integer column named id
a text column named value
a timestamp column named creation_date

Currently, indexes have been created for the id and value columns.

I must search this table for a given value and want to make search as fast as I can. But I don't really need to look through records that are older than one month. So, ideally I would like to exclude them from the index.

What would be the best way to achieve this:

Perform table partitioning. Only search through the subtable for the appropriate month.
Create a partial index including only the recent records. Recreate it every month.
Something else?

(PS.: "the best solution" means the solution that is the most convenient, fast and easy to maintain)

Maybe a composite index on `creation_date, value`? It will be the simplest one (but not the fastest one). — Ihor Romanchenko, Apr 23 '13 at 14:07
Why include creation_date into index? I search for value, not for creation date. — boqapt, Apr 23 '13 at 14:43
`creation_date` in composite index can be used by postgresql to exclude records older than one month. Indexes are used not only for the fields you search. They are usefull for the filter and join fields too. — Ihor Romanchenko, Apr 23 '13 at 19:14
@IgorRomanchenko: If you make it a composite index (good enough for small to medium tables), make it on `(value, creation_date)`. The rule of thumb is: *equality first — ranges last*. [More in this related anser on dba.SE.](http://dba.stackexchange.com/questions/33196/multicolumn-index-and-performance/33220#33220) But the Q specifically asks for `search as fast as I can`. — Erwin Brandstetter, Apr 23 '13 at 19:43
@ErwinBrandstetter Yes, I wrote the fields of composite index fields in wrong order. As for the question - it asks for `the most convenient and simple and fastest` (last sentence of the question). + I've mentioned, that composite index isnt the fastest way. — Ihor Romanchenko, Apr 23 '13 at 19:55
@IgorRomanchenko: Well you got a point there, a plain multicolumn index is certainly simpler and more convenient to create. Partial will make it faster. *Even if you create it once and never update.* But come to think of it, the best solution would be a partial multicolumn index to take care of false positives. I'll update my answer. — Erwin Brandstetter, Apr 23 '13 at 20:05

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

Partial index

A partial index would be perfect for that, or even a partial multicolumn index. But your condition

don't need to search value in records older than one month

is not stable. The condition of a partial index can only work with literals or IMMUTABLE functions, i.e., constant values. You mention Recreate it every month, but that would not agree with your definition older than one month. You see the difference right?

If you should only need a the current (or last) month, index recreation as well as the query itself become quite a bit simpler!

I'll got with your definition "not older than one month" for the rest of this answer. I had to deal with situations like this before. The following solution worked best for me:

Base your index conditions on a fixed timestamp and use the same timestamp in your queries to convince the query planner it can use the partial index. This kind of partial will stay useful over an extended period of time, only its effectiveness deteriorates as new rows are added and older rows drop out of your time frame. The index will return more and more false positives that an additional WHERE clause has to eliminate from your query. Recreate the index to update its condition.

Given your test table:

CREATE TABLE mytbl (
   value text
  ,creation_date timestamp
);

Create a very simple IMMUTABLE SQL function:

CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
  RETURNS timestamp AS
$func$
SELECT '2013-01-01 0:0'::timestamp
$func$ LANGUAGE sql IMMUTABLE;

Use the function in the condition of the partial index:

CREATE INDEX mytbl_start_ts_idx ON mytbl(value, creation_date)
WHERE (creation_date >= f_mytbl_start_ts());

value comes first. Explanation in this related answer on dba.SE.
Input from @Igor in the comments made me improve my answer. A partial multicolumn index should make ruling out false positives from the partial index faster - it's in the nature of the index condition that it's always increasingly outdated (but still a lot better than not having it).

Query

A query like this will make use of the index and should be perfectly fast:

SELECT value
FROM   mytbl
WHERE  creation_date >= f_mytbl_start_ts()            -- !
AND    creation_date >= (now() - interval '1 month')
AND    value = 'foo';

The only purpose of the seemingly redundant WHERE clause: creation_date >= f_mytbl_start_ts() is to make the query planner use the partial index.

You can drop and recreate function and index manually.

Full automation

Or you can automate it in a bigger scheme with possibly lots of similar tables:

Disclaimer: This is advanced stuff. You need to know what you are doing and consider user privileges, possible SQL injection and locking issues with heavy concurrent load!

This "steering table" receives a line per table in your regime:

CREATE TABLE idx_control (
   tbl text primary key  -- plain, legal table names!
  ,start_ts timestamp
);

I would put all such meta objects in a separate schema.

For our example:

INSERT INTO idx_control(tbl, value)
VALUES ('mytbl', '2013-1-1 0:0');

A "steering table" offers the additional benefit that you have an overview over all such tables and their respective settings in a central place and you can update some or all of them in sync.

Whenever you change start_ts in this table the following trigger kicks in and takes care of the rest:

Trigger function:

CREATE OR REPLACE FUNCTION trg_idx_control_upaft()
  RETURNS trigger AS
$func$
DECLARE
   _idx  text := NEW.tbl || 'start_ts_idx';
   _func text := 'f_' || NEW.tbl || '_start_ts';
BEGIN

-- Drop old idx
EXECUTE format('DROP INDEX IF EXISTS %I', _idx);

-- Create / change function; Keep placeholder with -infinity for NULL timestamp
EXECUTE format('
CREATE OR REPLACE FUNCTION %I()
  RETURNS timestamp AS
$x$
SELECT %L::timestamp
$x$ LANGUAGE SQL IMMUTABLE', _func, COALESCE(NEW.start_ts, '-infinity'));

-- New Index; NULL timestamp removes idx condition:    
IF NEW.start_ts IS NULL THEN 
   EXECUTE format('
   CREATE INDEX  %I ON %I (value, creation_date)', _idx, NEW.tbl);
ELSE
   EXECUTE format('
   CREATE INDEX  %I ON %I (value, creation_date)
   WHERE  creation_date >= %I()', _idx, NEW.tbl, _func);
END IF;

RETURN NULL;

END
$func$ LANGUAGE plpgsql;

Trigger:

CREATE TRIGGER upaft
AFTER UPDATE ON idx_control
FOR EACH ROW
WHEN (OLD.start_ts IS DISTINCT FROM NEW.start_ts)
EXECUTE PROCEDURE trg_idx_control_upaft();

Now, a simple UPDATE on the steering table calibrates index and function:

UPDATE idx_control
SET    start_ts = '2013-03-22 0:0'
WHERE  tbl = 'mytbl';

You can run a cron job or call this manually.
Queries using the index don't change.

-> SQLfiddle.
I updated the fiddle with a small test case of 10k rows to demonstrate it works. PostgreSQL will even do an index-only scan for my example query. Won't get any faster than this.

Did you compare the query time and index size for such partial indexes with composite index time and size ? — Ihor Romanchenko, Apr 23 '13 at 19:17
With *big tables* with lots of outdated rows (which is the common situation) a partial index should be *much* faster, because the size of the index is only a small fraction of a comparable composite index on the whole table. For small tables, a composite index on `(value, creation_date)` (sequence of columns is relevant) would be almost as fast. And yes, I ran extensive tests. Note, I fixed some typos in the answer and in the fiddle. — Erwin Brandstetter, Apr 23 '13 at 19:33
@ErwinBrandstetter Having exactly the same question: is your proposed solution still the most efficient now in postgresql 9.5 or any new tools/strategies possible now ? — Mathieu, Jan 12 '16 at 18:45
@Mathieu: This is still good in pg 9.5. But also consider the new [BRIN indexes](http://www.postgresql.org/docs/9.5/static/brin-intro.html) that may offer a simpler alternative. — Erwin Brandstetter, Jan 13 '16 at 08:47

Best way to exclude outdated data from a search in PostgreSQL

1 Answers1

Partial index

Query

Full automation

Linked