
I'm currently running this query. However, when it is run outside phpMyAdmin it causes a 504 timeout error. I suspect this has to do with how efficiently the rows are returned or accessed by the query.

I'm not extremely experienced with MySQL and so this was the best I could do:

SELECT
    s.surveyId,
    q.cat,
    SUM(sac.answer_id * q.weight) / SUM(q.weight) AS score,
    user.division_id,
    user.unit_id,
    user.department_id,
    user.team_id,
    division.division_name,
    unit.unit_name,
    dpt.department_name,
    team.team_name
FROM survey_answers_cache sac
    JOIN surveys s ON s.surveyId = sac.surveyid
    JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
    JOIN cluster c ON sc.cluster_id = c.cluster_id
    JOIN user ON user.user_id = sac.user_id
    JOIN questions q ON q.question_id = sac.question_id
    JOIN division ON division.division_id = user.division_id
    LEFT JOIN unit ON unit.unit_id = user.unit_id
    LEFT JOIN department dpt ON dpt.department_id = user.department_id
    LEFT JOIN team ON team.team_id = user.team_id
WHERE c.cluster_id = ? AND sc.subcluster_id = ? AND s.active = 0 AND s.prepare = 0
GROUP BY user.team_id, s.surveyId, q.cat
ORDER BY s.surveyId, user.team_id, q.cat ASC

The problem with this query is that when it returns a correct result it runs quickly (roughly 500 ms), but when the result has twice as many rows it takes more than 5 minutes and then causes a 504 timeout. The other problem is that I didn't create this database myself, so I didn't set the indexes. I'm thinking of improving these, and therefore I used the EXPLAIN command:


[Screenshot: EXPLAIN command result]


I see a lot of primary keys and a couple of two-column indexes, but I'm not sure whether these would affect performance this greatly.
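As a reference, the indexes that already exist on a table can be listed directly in MySQL (shown here for one table from the query; the same works for the others):

```sql
-- Lists every index on the table: key name, column(s), uniqueness, cardinality
SHOW INDEX FROM survey_answers_cache;
```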

EDIT: This piece of code takes up all the execution time:

$start_time = microtime(true);
$stmt = $conn->query($query); // $query is simply the query above
while ($row = $stmt->fetch_assoc()) {
    $resultSurveys["scores"][] = $row;
}
$stmt->close();
$end_time = microtime(true);
$duration = $end_time - $start_time; // typically holds (almost) the whole execution time; really high
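To narrow down whether that time is spent executing the query on the server or transferring and fetching the rows, one option is session profiling. This is a sketch; `SHOW PROFILES` is deprecated in newer MySQL versions in favor of the Performance Schema, but it is still available in 5.x:

```sql
-- Enable profiling for this session (MySQL 5.x; deprecated but still functional)
SET profiling = 1;

-- ... run the slow query here ...

-- List recent statements with their server-side durations
SHOW PROFILES;

-- Break down where a given statement (here: query 1) spent its time
SHOW PROFILE FOR QUERY 1;
```

If the server-side duration is small but the PHP loop is slow, the problem is in transfer or fetching, not in the query plan.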

So my question: Is it possible to (greatly?) improve the performance of the query by altering the database keys or should I divide my query into multiple smaller queries?

  • I don't understand. This is a *tiny* dataset! – Strawberry Jan 25 '18 at 23:23
  • I don't believe the timeout is because of your database. Even the 500ms for a positive result is way too much time for a dataset this size. Sure, you can add an index to the `user` table, but this will most definitely not solve your problem. Go hunt for bugs inside the php code. – rollstuhlfahrer Jan 25 '18 at 23:32
  • @Strawberry my thoughts exactly... Although the complete number of rows is a little bigger without the GROUP BY (approx. 1800 rows), it is still a small request... Therefore I thought something might be wrong with the database. Another thing, as mentioned in the question: the query **does** work in the phpMyAdmin environment. However, when called from PHP it seems to slow down; any thoughts on that? – Evochrome Jan 25 '18 at 23:41
  • 1800 rows is still a small data size. Have you tried to run the query on its own and just retrieve data and see how long that takes without the rest of the PHP code being executed? – Ice76 Jan 26 '18 at 00:37
  • @Ice76 does the recent question update answer your question? It takes up the (minus a few ns) whole execution time. – Evochrome Jan 26 '18 at 00:40
  • Sorting is probably killing you, often you can do a sort in an outer query, select only the ID's you need then join and sort in an outer query, this minimizes the size of the temp table that is created. – ArtisticPhoenix Jan 26 '18 at 00:53
  • Try using a tool like MySQL Workbench. This way you don't have to deal with web requests and can directly test the database. – Ibu Jan 26 '18 at 01:12
  • How long does it take? How long does it take without either the `GROUP BY` or the `ORDER BY`? – Rick James Jan 29 '18 at 22:51

3 Answers


You can try something like this (although it's not practical for me to test it):

SELECT
    sac.surveyId,
    q.cat,
    SUM(sac.answer_id * q.weight) / SUM(q.weight) AS score,
    user.division_id,
    user.unit_id,
    user.department_id,
    user.team_id,
    division.division_name,
    unit.unit_name,
    dpt.department_name,
    team.team_name
FROM survey_answers_cache sac
    JOIN
    (
        SELECT
            s.surveyId,
            sc.subcluster_id
        FROM
            surveys s
            JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
            JOIN cluster c ON sc.cluster_id = c.cluster_id
        WHERE
            c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0
    ) AS v ON v.surveyid = sac.surveyid
    JOIN user ON user.user_id = sac.user_id
    JOIN questions q ON q.question_id = sac.question_id
    JOIN division ON division.division_id = user.division_id
    LEFT JOIN unit ON unit.unit_id = user.unit_id
    LEFT JOIN department dpt ON dpt.department_id = user.department_id
    LEFT JOIN team ON team.team_id = user.team_id
GROUP BY user.team_id, v.surveyId, q.cat
ORDER BY v.surveyId, user.team_id, q.cat ASC

So I hope I didn't mess anything up.

Anyway, the idea is that in the inner query you select only the rows you need, based on your WHERE condition. This creates a smaller temporary table, as it pulls only two fields, both integers.

Then in the outer query you join to the tables that you actually pull the rest of the data from, and order and group there. This way you are sorting and grouping on a smaller dataset, and your WHERE clause can run in the most optimal way.

You may even be able to omit some of these tables, as you're only pulling data from a few of them, but without seeing the full schema and how it's related, that's hard to say.

But generally speaking, this part (the sub-query)

SELECT
    s.surveyId,
    sc.subcluster_id
FROM
    surveys s
    JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
    JOIN cluster c ON sc.cluster_id = c.cluster_id
WHERE
    c.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0

is what is directly affected by your WHERE clause. So we can optimize this part first, then use it to join the rest of the data you need.

An example of removing tables can easily be deduced from the above; consider this:

SELECT
    s.surveyId,
    sc.subcluster_id
FROM
    surveys s
    JOIN subcluster sc ON s.subcluster_id = sc.subcluster_id
WHERE
    sc.cluster_id=? AND sc.subcluster_id=? AND s.active=0 AND s.prepare=0

The cluster table c is never used to pull data from, only in the WHERE clause. So, given the join condition, isn't

    JOIN cluster c ON sc.cluster_id = c.cluster_id
 WHERE
    c.cluster_id=?

the same as, or equivalent to,

WHERE
    sc.cluster_id=?

It is, and therefore we can eliminate that join completely.

ArtisticPhoenix

The EXPLAIN result shows signs of trouble:

`Using temporary; Using filesort`: the ORDER BY needs to create temporary tables to do the sorting.

On the 3rd row, for the user table, type is ALL and key and ref are NULL: this means MySQL needs to scan the whole table each time to retrieve results.

Suggestions:

  1. Add indexes on user.cluster_id and all fields involved in the ORDER BY and GROUP BY clauses. Keep in mind that the user table seems to be in the changein database (a cross-database query).
  2. Add indexes on the user columns involved in the JOINs.
  3. Add an index on s.surveyId.
  4. If possible, keep the same column sequence in the GROUP BY and ORDER BY clauses.
  5. According to the accepted answer in this question, move the JOIN on the user table to the first position in the join queue.
  6. Carefully read this official documentation. You may need to optimize the server configuration.
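As a sketch of suggestions 1–3, the statements below use column names taken from the query above, but the index names are made up and the best column choice depends on the real schema, so treat these as assumptions to verify with EXPLAIN:

```sql
-- Hypothetical indexes covering the join and filter columns used by the query
ALTER TABLE user ADD INDEX idx_user_division (division_id);
ALTER TABLE survey_answers_cache ADD INDEX idx_sac_lookup (surveyid, user_id, question_id);
ALTER TABLE surveys ADD INDEX idx_surveys_filter (subcluster_id, active, prepare);
```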

PS: query optimization is an art that requires patience and hard work; there is no silver bullet. Welcome to the fine art of optimizing MySQL!

LMC
  • I agree with you. However, after I added these indices (except for the one at q.cat because that one is certainly not an index). The "NULL" values still remain. – Evochrome Jan 25 '18 at 23:56
  • Why does q.cat not need an index? Also, try adding indexes on the other user columns referenced in JOINs. Post the new explain result, keeping the original, and mention the added indexes, so it helps others with the process and also works as documentation for you. – LMC Jan 26 '18 at 00:01
  • I've added indices to all JOIN, ORDER BY and GROUP BY colums, also an updated query. The result is shown in my question, please have a look. Any suggestions? The execution time unfortunately hasn't changed. Thanks for your help in any case by the way! – Evochrome Jan 26 '18 at 00:32
  • don't give up ;) your explain has changed for better. You might be close to what you want. – LMC Jan 26 '18 at 00:43
  • I just noticed you mentioned that it seems to be cross database. This shouldn't be, though. All tables are within `changin`. Also, unfortunately the sequence within the order by and group by need to be this way for the correct result. Lastly, I can't manage the server this specifically, as it is managed by another company :/ – Evochrome Jan 26 '18 at 00:50
  • After a bit of fiddling around, I found that the established connection provided by my host (which seems to be terrible...) in combination with the NULL values caused the problem. Thanks for the help! – Evochrome Jan 26 '18 at 01:21
  • you are welcome. Please update your question with your findings since the optimization was successful ;). – LMC Jan 26 '18 at 01:22

I think the problem happens when you add this:

JOIN user ON user.cluster_id = sc.subcluster_id 
JOIN survey_answers_cache sac ON (sac.surveyId = s.surveyId AND sac.user_id = user.user_id)

The extra condition sac.user_id = user.user_id can easily be inconsistent.

Can you try doing a second join with the user table?

P.S.: can you add a `SHOW CREATE TABLE`?

Sergio Susa
  • I'm sorry, but I am not creating tables here, so I can't use SHOW CREATE TABLE. Or what do you mean? What do you specifically need? – Evochrome Jan 25 '18 at 23:49
  • Also, what do you mean by can you try a "second join with the user table". I would like to try but I don't fully understand what you mean. – Evochrome Jan 26 '18 at 00:01
  • I altered my query so it doesn't need the double join. Ref: my question – Evochrome Jan 26 '18 at 00:22
  • I mean something like `SHOW CREATE TABLE user`; this shows the SQL statement used to create that table. – Sergio Susa Jan 30 '18 at 00:40