
I have a medium-sized database (400,000 rows at the time of writing) containing a Measurements table with the following schema:

CREATE TABLE `Measurements` (
`timestamp` timestamp,
`timetick`  INTEGER,
`Sensor1`   REAL,
`Sensor2`   REAL,
PRIMARY KEY(timestamp));

As timestamp increases (the increments are not constant; there are gaps and delays, but timestamps are guaranteed to be monotonic), timetick normally increases too, but there are cases where it resets to a small but unpredictable value. I need to find all such rows. I have used the following query (inspired by Finding the difference in rows in query using SQLite):

select r0,r1,a,b,rd,d from 
(select M0.rowid as r0,
  M1.rowid as r1,
  M0.timestamp as a,
  M1.timestamp as b,
  min(M1.timestamp)-M0.timestamp as rd,
  M1.timetick-M0.timetick as d
  from Measurements M0,Measurements M1
  where M1.timestamp>M0.timestamp group by M0.timestamp
) where d<0;

This works but takes hours, while the same job in Python finishes in 30 seconds. Yet this is a very common task: scientists calculate derivatives all the time, and financial professionals calculate price differences. There should be an efficient way to do it.
I would appreciate your help and comments.
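For reference, the Python job mentioned above amounts to a single ordered scan comparing each row's timetick with its predecessor's, which is O(n) rather than a self-join. A minimal sketch, using the schema from the question (the sample data and the helper name `find_resets` are invented for illustration):

```python
import sqlite3

# Build a small in-memory example with the schema from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Measurements (
    timestamp TIMESTAMP PRIMARY KEY,
    timetick  INTEGER,
    Sensor1   REAL,
    Sensor2   REAL)""")
conn.executemany("INSERT INTO Measurements VALUES (?,?,?,?)",
                 [(1.0, 100, 20.1, 20.2),
                  (2.0, 110, 20.3, 20.4),
                  (3.0, 5,   20.5, 20.6),   # timetick resets here
                  (4.0, 15,  20.7, 20.8)])

def find_resets(conn):
    """One pass over the rows in timestamp order, comparing adjacent timeticks."""
    resets = []
    prev = None
    for ts, tick in conn.execute(
            "SELECT timestamp, timetick FROM Measurements ORDER BY timestamp"):
        if prev is not None and tick < prev[1]:
            resets.append((prev[0], ts))   # (timestamp before reset, after)
        prev = (ts, tick)
    return resets

print(find_resets(conn))   # [(2.0, 3.0)]
```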

NameOfTheRose
  • Show the output of [EXPLAIN QUERY PLAN](http://www.sqlite.org/eqp.html). – CL. Feb 12 '16 at 08:28
  • @cl 0|0|0|SCAN TABLE Measurements AS M0 USING INDEX sqlite_autoindex_Measurements_1 (~1000000 rows) 0|1|1|SEARCH TABLE Measurements AS M1 USING INDEX sqlite_autoindex_Measurements_1 (timestamp>?) (~250000 rows) – NameOfTheRose Feb 12 '16 at 08:42
  • 1
    In SQL you'd use `LAG` for this with which you can look into a previous record. SQLite doesn't feature this analytic function though. That means it has no means to work on a sorted list and compare adjacent rows easily. It is possible, but would involve joins with large intermediate results, thus being hard work for the DBMS. You are better off using a programming language hence (as you already noticed). – Thorsten Kettner Feb 12 '16 at 08:56
  • @ThorstenKettner: Although in my case @CL offered an efficient query, while trying to understand it I noticed that slight variations create queries that the db engine is unable to optimize and that therefore have very long execution times. So I fully agree with you that it is better to stick with the language I know better. – NameOfTheRose Feb 12 '16 at 15:29

1 Answer


A join with a GROUP BY is hard to optimize.

Better use a correlated subquery to find the respective next row:

SELECT m0.rowid AS r0,
       m1.rowid AS rn,
       m0.timestamp AS a,
       m1.timestamp AS b,
       m1.timestamp - m0.timestamp AS rd,
       m1.timetick - m0.timetick AS d
FROM (SELECT rowid,     -- This is the core query attaching to each row
             timestamp, -- the rowid of its next
             timetick,
             (SELECT rowid
              FROM measurements
              WHERE timestamp > m.timestamp
              ORDER BY timestamp
              LIMIT 1
             ) AS r1
      FROM Measurements AS m
     ) AS m0
JOIN measurements AS m1 ON m0.r1 = m1.rowid
WHERE m1.timetick - m0.timetick < 0;

If the timestamp is an integer, make that column an INTEGER PRIMARY KEY to avoid an extra index lookup.
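(A later note: SQLite 3.25, released in 2018, added window functions, so on current versions the adjacent-row comparison can be written directly with `LAG` instead of a correlated subquery. A sketch run through Python's sqlite3 module; the sample data is invented, and it assumes the underlying SQLite library is at least 3.25:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Measurements (
    timestamp TIMESTAMP PRIMARY KEY,
    timetick  INTEGER,
    Sensor1   REAL,
    Sensor2   REAL)""")
conn.executemany("INSERT INTO Measurements VALUES (?,?,?,?)",
                 [(1.0, 100, 0.0, 0.0),
                  (2.0, 110, 0.0, 0.0),
                  (3.0, 5,   0.0, 0.0)])   # reset between 2.0 and 3.0

# LAG(timetick) OVER (ORDER BY timestamp) fetches the previous row's timetick,
# so a reset is simply a row whose difference d is negative.
query = """
SELECT timestamp, timetick,
       timetick - LAG(timetick) OVER (ORDER BY timestamp) AS d
FROM Measurements
"""
resets = [r for r in conn.execute(query) if r[2] is not None and r[2] < 0]
print(resets)   # [(3.0, 5, -105)]
```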

CL.
  • Thank you, timestamps are not integers. This query is very fast but does not produce the correct result, probably because rows with ascending rowids do not always have increasing timestamps; measurements are collected on different computers and merged. I would prefer not to copy the whole table to a new one each time measurements are added to ensure that increasing rowids imply increasing timestamps (unless I have to, or maybe do this in Python) – NameOfTheRose Feb 12 '16 at 09:15
  • Edit the question to show some example data and the desired result. – CL. Feb 12 '16 at 09:23
  • I feel I need to include all the data to show the issue, I can do it (they are room temperature measurements) but the db is 40MB. There are just 14 cases where timetick resets – NameOfTheRose Feb 12 '16 at 09:31
  • I apologize, the query is working properly, the seemingly incorrect result was due to rows with missing timetick values. Execution time is less than 5sec, 5 times faster than the python version. I was also able to filter out rows missing timeticks. – NameOfTheRose Feb 12 '16 at 14:32
  • Will you please explain what purpose the definition `AS r1` serves in the innermost select, and whether it is related to r1 in the top select. Also, the name m0 is given to 2 different entities; is this necessary? Thank you. – NameOfTheRose Feb 12 '16 at 14:38
  • The `AS r1` is necessary because otherwise, the column would be named after the entire subquery. Using `m0` two times is probably not a good idea, but I was too lazy to make too many changes to the original query. – CL. Feb 12 '16 at 16:15
  • Thank you again so much. Also the r1 in the second line is not the same as the r1 in the other lines, it could be renamed to, say, rn (for Rowid of Next row in timestamp sequence). The core is: `SELECT rowid, timestamp, timetick, (SELECT rowid FROM measurements WHERE timestamp > m0.timestamp ORDER BY timestamp LIMIT 1 ) AS r1 FROM Measurements as m0;` which attaches to each row two additional fields containing its rowid and the rowid of the next row in ascending timestamp sequence – NameOfTheRose Feb 12 '16 at 19:37
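The "core" query described in the last comment can be checked on a toy table. A sketch (sample data invented; the rowids are deliberately out of timestamp order to mimic the merged-database situation from the earlier comment), showing that each row still gets the rowid of its successor in timestamp order, with NULL for the last row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Measurements (timestamp REAL PRIMARY KEY, timetick INTEGER)")
# rowids deliberately do not follow timestamp order
conn.executemany(
    "INSERT INTO Measurements(rowid, timestamp, timetick) VALUES (?,?,?)",
    [(1, 3.0, 5), (2, 1.0, 100), (3, 2.0, 110)])

# The core subquery: attach to each row the rowid of the next row
# in ascending timestamp order.
core = """
SELECT rowid, timestamp, timetick,
       (SELECT rowid FROM Measurements
        WHERE timestamp > m0.timestamp
        ORDER BY timestamp LIMIT 1) AS r1
FROM Measurements AS m0
ORDER BY timestamp
"""
for row in conn.execute(core):
    print(row)
# (2, 1.0, 100, 3)
# (3, 2.0, 110, 1)
# (1, 3.0, 5, None)
```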