3

How would you search for the longest match within a varchar variable? For example, table GOB has entries as follows:

magic_word |  prize
===================
         sh|  $0.20
        sha|  $0.40
       shaz|  $0.60
      shaza|  $1.50

I would like to write a plpgsql function that takes amongst other arguments a string as input (e.g. shazam), and returns the 'prize' column on the row of GOB with the longest matching substring. In the example shown, that would be $1.50 on the row with magic_word shaza.

All the function format I can handle, it's just the matching bit. I can't think of an elegant solution. I'm guessing it's probably really easy, but I am scratching my head. I don't know the input string at the start, as it will be derived from the result of a query on another table.

Any ideas?

2 Answers2

5

Simple solution

SELECT magic_word
FROM   gob
WHERE  'shazam' LIKE (magic_word || '%')
ORDER  BY magic_word DESC
LIMIT  1;

This works because the longest match sorts last - so I sort DESC and pick the first match.

I am assuming from your example that you want to match left-anchored, from the beginning of the string. If you want to match anywhere in the string (which is more expensive and even harder to back up with an index), use:

...
WHERE  'shazam' LIKE ('%' || magic_word || '%')
...

SQL Fiddle.

Performance

The query is not sargable. It might help quite a bit if you had additional information, like a minimum length that you could base an index on, to reduce the number of rows to consider. It needs to be criteria that gets you less than ~ 5% of the table to be effective. So, initials (a natural minimum pick) may or may not be useful. But two or three letters at the start might help quite a bit.

In fact you could optimize this iteratively. Something along the line of:
Try a partial index of words with 15 letters+
If not found, try 12 letters+
If not found, try 9 letters+
...

A simple case of what I outlined in this related answer on dba.SE:

Another approach would be to use a trigram index. You'd need the additional module pg_trgm for that. Normally you would search with a short pattern in a table with longer strings. But trigrams work for your reverse approach, too, with some limitations. Obviously you couldn't match a string with just two characters in the middle of a longer string using trigrams ... Test for corner cases.
There are a number of answers here on SO with more information. Example:

Advanced solution

Consider the solution under this closely related question for a whole table of search strings. Implemented with a recursive CTE:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • You seen you know your stuff, so Which is LIKEly to be quicker: the nested SELECT below, or this one? I can test it, but the size of the table makes the difference negligible. However, eventually the table will be significantly larger, and such performance issues may make a difference. – Bren McGuire May 04 '13 at 14:38
  • @BrenMcGuire: As long as indexes are not involved the simplest form wins, which should be my first query. But testing >> guessing. The key to better performance is index usage. – Erwin Brandstetter May 04 '13 at 15:14
1

How about

1

     select max(FOO.matchingValue)
     from
      (
        select magic_word as matchingValue
        from T
        where substr( "abracadabra", 1, length(magic_word)) = magic_word 
      )
      as FOO

2

select prize from
T
  join
  (
  select max(FOO.matchingValue) as MaxValue
     from
      (
         select magic_word as matchingValue
        from T
        where substr( "abracadabra", 1, length(magic_word)) = magic_word 
      )
      as FOO
) as BAR
on BAR.MaxValue = T.magic_word
Community
  • 1
  • 1
Tim
  • 8,669
  • 31
  • 105
  • 183
  • How do I retrieve the `cost` field from that result set? – Bren McGuire May 04 '13 at 01:46
  • Just turn the above into an inline view following the same principle. I'll edit in a second. – Tim May 04 '13 at 01:47
  • It does the job okay, but something about it says 'non-optimal' to me. – Bren McGuire May 04 '13 at 02:15
  • As to cost: presumably in a real-world application you'd have another indexed column, e.g. [CONTEST], that could be used to reduce the innermost set of candidate magic_words to an easily managed number. The gathering of the innermost candidate set is where the optimzation, if any, must occur. – Tim May 04 '13 at 10:21