
I want to GROUP BY b.hash, but it is too costly. Is there any way to keep this query, add the GROUP BY, and still make it fast?

SELECT STRAIGHT_JOIN b.hash, b.page, n.favorite
FROM behaviour b, new_table n
WHERE b.timestamp >= NOW() - INTERVAL 20 SECOND
  AND b.hash = n.hash

This is the EXPLAIN:

[EXPLAIN output attached as a screenshot]

And the PROFILING output:

[PROFILING output attached as a screenshot]

  • Add example data and expected output – juergen d Feb 02 '17 at 12:54
  • If I'm not mistaken, there are ~5.6 million records returned by your query. What's costly is to 1) find the records, 2) read them from disk/memory, 3) send them through the network. Which step exactly is slow for you? Use `SET PROFILING = 1;`, run your query, then `SHOW PROFILE FOR QUERY 1;` – Mjh Feb 02 '17 at 13:00
  • @Mjh thank you for this. I have updated my question; it seems `Sending data` is the slow step for me. What measures can I take to speed this up? – EnexoOnoma Feb 02 '17 at 13:15
  • Well, none, sadly. You are sending 5.6 million rows, and that's quite a big chunk of data. First you need to send it, and the receiving end needs to buffer it as well. The solution is to send less data. Do you need 5.6 million rows sent back? – Mjh Feb 02 '17 at 13:20
  • @Mjh certainly not. The rows returned because of the `WHERE` clause are about 30. However, my limited knowledge (and several trial-and-error attempts) led me to the above query. If you think there is a better way (and I am 100% sure there is), I would appreciate it. – EnexoOnoma Feb 02 '17 at 13:23
  • There's actually a small bug in MySQL's profiler output: it displays the time taken by the *previous* step. That means the query was executing for some ~7 seconds. The reason why: to find the records and then reduce them, it had to inspect 5.6 million rows. Now we get back to the "it has to find the records on the disk" story, and that part is slow because the disk is slow. When that happens, we know I/O is at fault and we want to shift I/O to RAM, which is where the famous `innodb_buffer_pool_size` variable kicks in. Sometimes you simply have to inspect all the data. – Mjh Feb 02 '17 at 13:29
  • In our line of work, there is a limit to how good your code / SQL can be. If you hit that limit, if you wrote an excellent piece of code or SQL and it's *still* slow, then that's it: you either hope for better hardware or you try to split the work across more machines (if possible). If that isn't possible, you succumb to the physical limitations of what's possible and what you need to do. In your case, I'd shift I/O to RAM by increasing `innodb_buffer_pool_size`, if it's possible of course. – Mjh Feb 02 '17 at 13:31
  • @Mjh thank you for all this educational info. One question: I have my table set to MyISAM instead of InnoDB... – EnexoOnoma Feb 02 '17 at 13:34
  • I think STRAIGHT_JOIN is far more likely to be your problem than the grouping. http://stackoverflow.com/questions/512294/when-to-use-straight-join-with-mysql – GordonM Feb 02 '17 at 13:36
  • Well, that's a bit of a problem then. MyISAM, while it was excellent for its time, is sadly a bit out of date for today's servers and users. I'm afraid I can't really help you do this with MyISAM, but [I'd suggest you give this question a read](http://dba.stackexchange.com/questions/14775/performance-settings-for-myisam-tables-keep-everything-in-memory) – Mjh Feb 02 '17 at 13:37
  • @Mjh I have added an index on `timestamp` for testing purposes, and the result is now returned in under 1 second. Shall I keep it like this, or convert to InnoDB (and remove the index)? The `behaviour` table is one the system inserts into/updates continuously. – EnexoOnoma Feb 02 '17 at 13:41
  • I don't want to advise anything that might be dangerous, since we all lack a lot of information. Your system apparently works, and you use it for something. If you added an index and you're getting results within ~1 second, I'd say that's OK. If I were you, I'd create a copy of the database, convert it to InnoDB, tune it, and then test whether it operates faster. I'd never make changes to the actual live system that's being used. From this point on, you need to measure the performance and come to a conclusion on your own. Good luck! – Mjh Feb 02 '17 at 13:43
  • @Mjh thank you. If you want, post something as an answer so I can accept it. – EnexoOnoma Feb 02 '17 at 13:45
  • Thanks, but I don't think I posted anything good enough to be an answer. The main point is that you came to a satisfactory solution on your own :) – Mjh Feb 02 '17 at 13:48
  • For future reference, you may wish to read this note about how to write an answerable SQL question. http://meta.stackoverflow.com/a/271056/ – O. Jones Feb 02 '17 at 14:05
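
Taken together, the comments above sketch a diagnose-then-tune workflow. A minimal sketch of it in SQL follows; the 4 GB buffer pool figure is purely illustrative, not a recommendation, and dynamic resizing of the buffer pool requires MySQL 5.7+ (on older versions, set it in my.cnf and restart):

    -- 1. Profile the slow query to see which step dominates
    SET PROFILING = 1;
    -- ... run the slow SELECT here ...
    SHOW PROFILES;                 -- lists profiled statements with their ids
    SHOW PROFILE FOR QUERY 1;      -- per-step timings for statement id 1

    -- 2. On a COPY of the database (never the live system), try InnoDB
    ALTER TABLE behaviour ENGINE = InnoDB;

    -- 3. Shift I/O to RAM; the value below is illustrative (4 GB)
    SET GLOBAL innodb_buffer_pool_size = 4294967296;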

2 Answers

I'm guessing you want this query:

SELECT b.hash, b.page, n.favorite
  FROM behaviour b
  JOIN new_table n ON b.hash = n.hash
 WHERE b.timestamp >= NOW() - INTERVAL 20 SECOND
 ORDER BY b.hash

I don't understand your remark about GROUP BY in your question. It looks like you want the most recent third of a minute's worth of items.

At any rate, if you create a compound index on your behaviour table containing the following columns, your query will probably start running acceptably fast.

 (timestamp, hash, page)

Why? MySQL's query planner can random-access the index to start at the timestamp value in the query. That's O(log n) quick. Then it can scan the index sequentially for the information needed by your query. That's almost instantaneous.
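
For example, a sketch of creating that compound index (the index name here is an assumption; use whatever fits your naming scheme):

    -- All three columns the query touches live in the index, so MySQL can
    -- answer from the index alone (a covering index): it seeks to the first
    -- entry with timestamp >= NOW() - INTERVAL 20 SECOND, then reads
    -- hash and page sequentially from the index entries.
    ALTER TABLE behaviour
      ADD INDEX idx_behaviour_ts_hash_page (timestamp, hash, page);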

Is it possible you want this?

SELECT b.hash, b.page, MAX(n.favorite)
  FROM behaviour b
  JOIN new_table n ON b.hash = n.hash
 WHERE b.timestamp >= NOW() - INTERVAL 20 SECOND
 GROUP BY b.hash, b.page

Or this?

SELECT b.hash, b.page, GROUP_CONCAT(n.favorite)
  FROM ...
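
For completeness, here is that last variant filled out with the same join and time filter as the queries above (my guess at the elided clauses, not the author's exact intent):

    SELECT b.hash, b.page, GROUP_CONCAT(n.favorite)
      FROM behaviour b
      JOIN new_table n ON b.hash = n.hash
     WHERE b.timestamp >= NOW() - INTERVAL 20 SECOND
     GROUP BY b.hash, b.page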

– O. Jones

Please provide SHOW CREATE TABLE for both tables. Meanwhile, I will guess...

b does not have an index starting with timestamp, but it needs one.
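
A minimal sketch of that guess as DDL (index name assumed; the compound index suggested in the other answer would satisfy this too):

    -- Any index whose leftmost column is timestamp lets MySQL range-scan
    -- the last 20 seconds instead of inspecting every row.
    ALTER TABLE behaviour ADD INDEX idx_behaviour_timestamp (timestamp);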

As usual, Profiling is 'useless' -- 99% of the indicated time is spent in the nebulous "Sending data".

– Rick James