
I have a table notifications which contains a payload column of type jsonb, with a GIN index on that column. The table currently contains 2,742,691 rows.

The table looks something like this:

id  payload                                                         created_at
1   {"customer": {"email": "foo@example.com", "externalId": 111}}   2022-06-21
2   {"customer": {"email": "foo@example.com", "externalId": 222}}   2022-06-20
3   {"customer": {"email": "bar@example.com", "externalId": 333}}   2022-06-20
4   {"customer": {"email": "baz@example.com", "externalId": 444}}   2022-04-14
5   {"customer": {"email": "baz@example.com", "externalId": 555}}   2022-04-12
6   {"customer": {"email": "gna@example.com", "externalId": 666}}   2022-06-10
7   {"customer": {"email": "gna@example.com", "externalId": 666}}   2022-06-11

I am trying to query a list of email addresses that match the following condition:

  • multiple rows for the same email address exist
  • at least one of those rows has a different externalId than one of the other rows
  • created_at is within the last month

For the example table contents, this should only return foo@example.com because

  • bar@example.com only appears once
  • baz@example.com doesn't have a row that was created within the last month
  • gna@example.com has multiple rows but all of them have the same externalId

What I was trying is a LEFT JOIN LATERAL like this:

select
  n.payload -> 'customer' -> 'email'
from
  notifications n
  left join lateral (
    select
      n2.payload -> 'customer' ->> 'externalId' tid
    from
      notifications n2
    where
      n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('email', n.payload -> 'customer' -> 'email')
      )
      and not (n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
      ))
      and n2.created_at > NOW() - INTERVAL '1 month' 
    limit
      1
  ) sub on true
where
  n.created_at > NOW() - INTERVAL '1 month'
  and sub.tid is not null;

However, this is taking ages to run. The query plan for this looks like https://explain.depesz.com/s/mriB

QUERY PLAN
Nested Loop  (cost=0.17..53386349.38 rows=259398 width=32)
  ->  Index Scan using index_notifications_created_at on notifications n  (cost=0.09..51931.08 rows=259398 width=514)
        Index Cond: (created_at > (now() - '1 mon'::interval))
  ->  Subquery Scan on sub  (cost=0.09..205.60 rows=1 width=0)
        Filter: (sub.tid IS NOT NULL)
        ->  Limit  (cost=0.09..205.60 rows=1 width=32)
              ->  Index Scan using index_notifications_created_at on notifications n2  (cost=0.09..53228.33 rows=259 width=32)
                    Index Cond: (created_at > (now() - '1 mon'::interval))
                    Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Any pointers on what I'm doing wrong here / how to optimize this?

Pascal
    Did you try an `EXISTS` condition instead of the left join? Do you have an index on the `payload` column? –  Jun 22 '22 at 12:56
  • Do both of the different externalId need to be within the last month, or just one of them in the last month and the other possibly older? – jjanes Jun 22 '22 at 16:00
  • Force it to not use the index_notifications_created_at index on n2 by rewriting part of the query to be `n.created_at + INTERVAL '1 month' > NOW()`, and then show is what it does instead. Also, please replace your current plan with the output of `EXPLAIN (ANALYZE, BUFFERS)` – jjanes Jun 22 '22 at 16:11
  • @jjanes @a_horse_with_no_name `"index_notifications_payload" gin (payload)` – Pascal Jun 22 '22 at 17:38

4 Answers


Here is my suggestion. It uses the function array_unique by @klin from this SO post.

select email from
(
  select 
         payload -> 'customer' ->> 'email' email, 
         array_agg(payload -> 'customer' ->> 'externalId') externalid_arr
  from notifications 
  where created_at >= current_date - interval 'P1M'
  -- created_at is within the last month
  group by email
  having count(*) > 1 
  -- multiple rows for the same email address exist
) t 
where array_length(array_unique(externalid_arr), 1) > 1; 
-- one of those rows does have a different externalId than ...
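
array_unique is not a built-in function; a minimal sketch of such a helper (the definition in the linked answer by @klin may differ in its details) could look like this:

-- Minimal sketch of an array_unique helper: returns the input array
-- with duplicate elements removed.
create or replace function array_unique(arr anyarray)
returns anyarray
language sql immutable
as $$
    select array(select distinct unnest(arr))
$$;

Alternatively, count(distinct payload -> 'customer' ->> 'externalId') > 1 in the HAVING clause would avoid the helper and the intermediate array altogether.
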
Stefanov.sm

Simple nested queries will do the trick without any need for joining or functions:

SELECT email
FROM (
    SELECT email, exid
    FROM (
        SELECT payload -> 'customer' -> 'email' AS email,
               payload -> 'customer' -> 'externalId' AS exid
        FROM notifications
        WHERE created_at > CURRENT_DATE - INTERVAL '1 month' ) recent 
    GROUP BY 1, 2 ) emails
GROUP BY 1
HAVING count(*) > 1

This will use your index on created_at and should thus be reasonably quick.

Patrick
  • The `WHERE` condition should be moved outside the subquery; otherwise you ignore all rows outside the range, which could have a different `externalId` value and should be included when counting unique `externalId` values – Alexey Jun 22 '22 at 13:47

Your main mistake is that, when you join notifications laterally, you build 4 JSON objects per row of the table limited by the condition created_at > NOW() - INTERVAL '1 month'. That condition limits the row count to 259,398, so your subquery needs to build 259,398 * 4 = 1,037,592 JSON objects. And only then is LIMIT used to keep a single row, after all rows in the join have been processed.

You should refactor your query.

You can use a CTE to obtain the email, the count of distinct externalId values, and the maximum created_at per email, like this:

WITH cte(email, ext_id_count, max_created_at) AS (
    SELECT
        DISTINCT payload -> 'customer' -> 'email', 
        COUNT(DISTINCT payload -> 'customer' -> 'externalId'), 
        MAX(created_at)
    FROM notifications
    GROUP BY payload -> 'customer' -> 'email'
)
SELECT email FROM cte 
WHERE ext_id_count > 1 AND max_created_at > CURRENT_DATE - INTERVAL '1 month'

Please check a demo

And it can be made even simpler, like this, so that the index on created_at can be used:

SELECT
    DISTINCT payload -> 'customer' -> 'email'
FROM notifications
WHERE EXISTS (
    SELECT 1 FROM notifications n
    WHERE n.created_at > current_date - interval '1 month' 
        AND n.payload->'customer'->>'email' = notifications.payload->'customer'->>'email'
)
AND EXISTS (
    SELECT 1 FROM notifications n
    WHERE n.payload->'customer'->>'email' = notifications.payload->'customer'->>'email'
    AND n.payload->'customer'->>'externalId' != notifications.payload->'customer'->>'externalId'
)

Please check this demo

Meanwhile, I agree with @jjanes that if you extract the email and externalId values from payload into separate columns and create indexes on them, that can improve query performance.

Alexey

PostgreSQL's planner has no real insight into the internals of the JSON values, so it doesn't know how many rows from n2 are expected to have the same email as some row from n does. For that matter, it doesn't even know that that is the question being considered, as it doesn't understand how @> interacts with the inner workings of jsonb_build_object. So it just uses some very generic row estimates for the planning, and probably overestimates the number of rows substantially.

Your best bet is probably to pull the email and externalId out of the JSONB and put them into real columns. This both makes better plans available and gives the planner better information with which to choose those better plans.

The better plan would be to use a composite index on (email, created_at) so it can jump directly to the part of the index that satisfies both the email equality condition and the created_at inequality simultaneously.
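
For illustration only, assuming PostgreSQL 12 or later, that refactoring could be sketched with stored generated columns plus the composite index (column and index names here are made up):

-- Hypothetical refactoring: materialize the JSON fields as real columns
-- (requires PostgreSQL 12+ for generated columns).
ALTER TABLE notifications
    ADD COLUMN customer_email text
        GENERATED ALWAYS AS (payload -> 'customer' ->> 'email') STORED,
    ADD COLUMN customer_external_id text
        GENERATED ALWAYS AS (payload -> 'customer' ->> 'externalId') STORED;

-- Composite index so the email equality and created_at inequality
-- can be satisfied in one index scan.
CREATE INDEX notifications_customer_email_created_at_idx
    ON notifications (customer_email, created_at);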

If you can't refactor the data, then you could at least use a multi-column expressional index to get much of the benefit, for example on notifications ((payload -> 'customer' ->> 'email'), created_at). Then you would need to rewrite the first @> condition in your query to look like:

n2.payload->'customer'->>'email' = n.payload -> 'customer' ->> 'email'

You could also include the externalId expression as the 3rd column in the index, in which case the other @> condition would need to be rewritten in an analogous way.

After building an expressional index, you should immediately ANALYZE the table manually; otherwise the planner won't have the expression statistics it might need to make the best choice.
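
A sketch of that expressional index and the follow-up ANALYZE might look like this (the index name is illustrative):

-- Hypothetical expressional index on the email expression plus created_at.
CREATE INDEX notifications_payload_email_created_at_idx
    ON notifications ((payload -> 'customer' ->> 'email'), created_at);

-- Collect statistics for the new expression so the planner can use them.
ANALYZE notifications;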

jjanes