0

I have a permissions table with three columns: id, tenantId, and actionServiceIds, which is a text[] array column with ~100 values on average.

id     tenantId actionServiceIds
247584 748391   ["azure","aws","kfc","facebook"]
247584 748384   ["google","kfc","facebook"]

The combination of tenantId and id is my primary key, and there are around ~100 different tenants. I want to get the distinct count of the service IDs for all the ids while filtering on the tenantId column. So for the above example, filtering on tenantId = '748384', I should get:

id count
247584     3

A naïve version would be something like:

    select id, count(distinct actionServiceId)
    from permissions, unnest(permissions.actionServiceIds) AS actionServiceId
    where tenantId in ('748384')
    group by id


The problem is that most of the time, the required tenants make up a large portion of all tenants, so when the table grows to ~100k records the unnest and the distinct count with group by become extremely slow. I tried creating a separate table with the result of the unnest, but still with poor results. What is the best approach to this problem? Is the unnest + group by the way to go? I can't create a materialized view because of the different tenantId combinations, and I need this table to be around 2-3 million rows long.

Attached is the EXPLAIN (ANALYZE) output for an ~800K-row table.

"GroupAggregate  (cost=35819.11..149428.28 rows=81335 width=28) (actual time=769.575..103786.215 rows=86002 loops=1)"
"  Group Key: permissions5.id"
"  ->  Nested Loop  (cost=35819.11..133225.28 rows=3077930 width=52) (actual time=767.163..15702.908 rows=34721705 loops=1)"
"        ->  Gather Merge  (cost=35819.11..71666.68 rows=307793 width=76) (actual time=767.111..1312.921 rows=308032 loops=1)"
"              Workers Planned: 2"
"              Workers Launched: 2"
"              ->  Sort  (cost=34819.09..35139.70 rows=128247 width=76) (actual time=673.584..801.961 rows=102677 loops=3)"
"                    Sort Key: permissions5.id"
"                    Sort Method: external merge  Disk: 8776kB"
"                    Worker 0:  Sort Method: external merge  Disk: 8744kB"
"                    Worker 1:  Sort Method: external merge  Disk: 8776kB"
"                    ->  Parallel Seq Scan on permissions5  (cost=0.00..18236.75 rows=128247 width=76) (actual time=0.165..165.462 rows=102677 loops=3)"
"                          Filter: ((tenantid)::text = ANY ('{d697dcc6-313c-4923-abe8-32b919932488d8861516-b846-49ae-9dda-30c1a3589fef,89a7b752-a12c-4a82-a5de-4b897762b387,7ef84b4c-8c08-43aa-8ee8-522ebd69c617,2a87b660-97e1-4229-8b97-a796874830d0,012fce86-f73e-4358-82ce-b040ffb44067}'::text[]))"
"                          Rows Removed by Filter: 142381"
"        ->  Function Scan on unnest actionserviceid  (cost=0.00..0.10 rows=10 width=32) (actual time=0.031..0.036 rows=113 loops=308032)"
"Planning Time: 0.194 ms"
"Execution Time: 103802.397 ms"
Amir Harari
  • Which column is the primary key? Your question explicitly shows neither tenantId nor id is the primary key. – jian Feb 27 '22 at 15:33
  • I have changed the example; the combination of id and tenantId is the primary key – Amir Harari Feb 27 '22 at 16:31
  • 2
    Your database schema is not good (`actionServiceIds` should be a table), don't expect good performance from it. – Blag Feb 28 '22 at 13:37
  • @Blag meaning something like `create table actionServiceIds AS select id, tenantId, actionServiceId from permissions, unnest(actionServiceIds) as actionServiceId`? – Amir Harari Feb 28 '22 at 13:48
  • I don't get this: "*I want to get the distinct count of the service ids for all the ids while filtering on the tenantId column*" - the expected output you are showing includes service IDs from a different tenant as well. Tenant ID = 748384 only has three distinct service IDs. Why do you expect 5? – Mar 03 '22 at 12:23
  • Go back to the basics. Store `["google","kfc","facebook"]` as three rows. Just because the rdbms supports arrays does not mean you have to use them. – Salman A Mar 04 '22 at 13:44

4 Answers

2

Change your database design. Store the data like so:

id tenantId actionServiceId
247584 748391 azure
247584 748391 aws
247584 748391 kfc
247584 748391 facebook
247584 748384 google
247584 748384 kfc
247584 748384 facebook

A unique key should be created consisting of all three columns (see below).

Now, since you are using tenantId in the where clause to locate all corresponding ids, you should create these indexes:

create index ix1 on permissions
(tenantId, id); -- pgsql seems to prefer this one

create index ix2 on permissions
(tenantId, id, actionserviceid); -- and this could be your unique index

When tested on dummy data, PostgreSQL used the first index and the time was reduced ~10 times.

Even if you stick with your original design (array + unnest), the first index should help.
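
For reference, a minimal migration sketch of this redesign (the permission_services table name and the column types are assumptions, not from the question):

-- one service per row; types guessed from the sample data
CREATE TABLE permission_services (
    id              int  NOT NULL,
    tenantId        text NOT NULL,
    actionServiceId text NOT NULL
);

-- backfill from the array column; DISTINCT guards against duplicates
-- inside a single array, which would break the unique index below
INSERT INTO permission_services (id, tenantId, actionServiceId)
SELECT DISTINCT p.id, p.tenantId, s.actionServiceId
FROM permissions AS p,
     unnest(p.actionServiceIds) AS s(actionServiceId);

-- doubles as the three-column index ix2 above
CREATE UNIQUE INDEX ux_permission_services
    ON permission_services (tenantId, id, actionServiceId);

The count query then runs against the new table without any unnest:

SELECT id, count(DISTINCT actionServiceId)
FROM permission_services
WHERE tenantId IN ('748384')
GROUP BY id;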

Salman A
  • Just curious, how much improvement was made? Did you try the same index with array of IDs? – Salman A Mar 06 '22 at 09:09
  • 1
    Yes, on an array of IDs it's still very slow. When I tried your solution, it was about ~10 times faster, but with around 120M rows the distinct count still takes around 40 seconds. I also tried converting the strings to hashed ints (because I only care about the distinct count), but it still takes around 40 seconds. I guess this is the best Postgres can do; it looks like it can't use parallel aggregation because of the distinct. Can you think of another optimization I could try? – Amir Harari Mar 06 '22 at 12:36
  • After filtering (which would use indexes), it has to sort the ids for grouping, then sort the actionserviceid to count distinct. I thought it could use the three column index for this but it does not. – Salman A Mar 07 '22 at 12:37
1

This is it:

select permissions.id, count(distinct mn)
from permissions, unnest(permissions."actionServiceIds") as mn
where permissions."tenantId" in ('748384')
group by permissions.id;


Nikhil S
1

I think I understand your intention.
Since an array is a container type, I guess that first aggregating the matched rows' array column and then doing the distinct count would be faster.
Related: How to find the size of an array in postgresql
Related: How to get distinct array elements with postgres?

  1. Create a function that counts the unique elements in the array.

    create or replace function public.array_unique_count(arr anyarray)
        returns bigint
        language sql
    as $function$
        select cardinality(array_agg(distinct elem))
        from unnest(arr) as arr(elem)
    $function$;
    
  2. Aggregate all the matched results:

    select id, array_agg(actionServiceIds)
    from permissions
    where tenantId in ('748384')
    group by id;

Final version:

with a as (
    select id, array_agg(actionServiceIds) as thearray
    from permissions
    where tenantId in ('748384')
    group by id)
select id, array_unique_count(a.thearray) from a;
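
As a quick sanity check (with made-up values, not from the question), the helper counts each distinct element once:

select array_unique_count(array['google','kfc','facebook','kfc']);
-- returns 3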
jian
  • thanks for the answer, in the creation of `a` you can't use array_agg on the actionServiceIds because it's an array. Have you missed unnest? – Amir Harari Feb 28 '22 at 13:25
  • https://www.postgresql.org/docs/current/functions-aggregate.html array_agg can aggregate the existing array. – jian Feb 28 '22 at 15:17
  • @AmirHarari did my answer solve your problem? – jian Mar 02 '22 at 12:26
-1

First, PostgreSQL is well known to have the worst performance of all RDBMSs on COUNT queries. Have a look at the test I did comparing PostgreSQL and Microsoft SQL Server: "PostGreSQL vs Microsoft SQL Server – Comparison part 2 : COUNT performances"

Second, there is no possibility of getting any kind of performance when you are dealing with non-atomic values... This is why the first normal form, which every relational database user should know, says: do not use non-atomic values or datatypes like ARRAYs, SETs, ROWs... Those datatypes cannot be indexed, which results in poor query performance.

Especially using an ARRAY, which has the same strict behavior as a table, is nonsense, because tables can be indexed while ARRAYs cannot.

The only permanent solution to getting good performance for your problem is to remove the "actionServiceIds" column and create a table as follows:

CREATE TABLE permission_services
(id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
 tenant_id INT NOT NULL, -- a REFERENCES permissions (tenant_id) would require a unique constraint there
 service_name VARCHAR(64) NOT NULL);

CREATE INDEX ix_permission_services ON permission_services (tenant_id) INCLUDE (service_name);

INSERT INTO permission_services (tenant_id, service_name)
SELECT tenant_id, actionServiceId
FROM permissions, unnest(permissions.actionServiceIds) AS actionServiceId;

Then your query will be fast using:

SELECT p.id, count(DISTINCT ps.service_name)
FROM   permission_services AS ps
       JOIN permissions AS p
         ON ps.tenant_id = p.tenant_id
WHERE ....
GROUP BY p.id

PS: you did not follow the posting guidelines, by omitting the DDL of the table.

SQLpro