Getting a lineage of linked rows with details

Question

I'm trying to get a "lineage" or similar, and also information about the first and last links (at least; all would be good), out of a table that has self-referential links between rows that have been "replaced" and rows that have replaced them. The table has a structure along these lines:

CREATE TABLE Thing (
    Id         INT PRIMARY KEY,
    TStamp     DATETIME,
    Replaces   INT NULL,
    ReplacedBy INT NULL
);

I'm stuck with this structure. :-) It's sort of doubly-linked (yes, it's a bit silly): Each row has a unique Id, and then a row that has been "replaced" by another will have a non-NULL ReplacedBy giving the Id of the replacement row, and the replacement row will also have a link back to what it replaces in Replaces. So we can use either Replaces or ReplacedBy (or both) if we like.

Here's some sample data:

INSERT INTO Thing
(Id, TStamp,       Replaces, ReplacedBy)
VALUES
(1,  '2017-01-01', NULL,       11),
(2,  '2017-01-02', NULL,       12),
(3,  '2017-01-03', NULL,     NULL),
(4,  '2017-01-04', NULL,     NULL),
(11, '2017-01-11',    1,     NULL),
(12, '2017-01-12',    2,       22),
(22, '2017-01-22',   12,     NULL);

So 1 was replaced by 11, 2 was replaced by 12, and 12 was replaced by 22.

I'd like to get the following information for each chain of links from this table in a reasonable way:

Details of the row that started the chain
Details of the final row in the chain
Details of the links in-between or at least how many links (total) there are in the chain

...filtered by a date range applied to the last row in the chain.

In an ideal universe, I'd get back something like this:

+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links |   TStamp   |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+
|       1 |     11 |  1 |     2 | 2017−01−01 |
|       1 |     11 | 11 |     2 | 2017−01−11 |
|       2 |     22 |  2 |     3 | 2017−01−02 |
|       2 |     22 | 12 |     3 | 2017−01−12 |
|       2 |     22 | 22 |     3 | 2017−01−22 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+

So far I have this query, which I could post-process to get the above:

WITH Data AS (
    SELECT  Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
    FROM    Thing
    UNION ALL
    SELECT  Thing.Id, Thing.TStamp, Thing.Replaces, Thing.ReplacedBy, Depth + 1
    FROM    Data
    JOIN    Thing
    ON      Thing.Replaces = Data.Id
)
SELECT  *
FROM    Data
WHERE   ReplacedBy IS NOT NULL OR Depth > 0
ORDER BY
        Id, Depth;

That gives me:

+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+
| Id | TStamp     | Replaces | ReplacedBy | Depth |
+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+
|  1 | 2017−01−01 |     NULL |         11 |     0 |
|  2 | 2017−01−02 |     NULL |         12 |     0 |
| 11 | 2017−01−11 |        1 |       NULL |     1 |
| 12 | 2017−01−12 |        2 |         12 |     0 |
| 12 | 2017−01−12 |        2 |         12 |     1 |
| 22 | 2017−01−13 |       12 |       NULL |     1 |
| 22 | 2017−01−13 |       12 |       NULL |     2 |
+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+

And I could use something like this to figure out (for instance) the final row of each chain:

WITH Data AS (
    SELECT  Id, Replaces, ReplacedBy, 0 AS Depth
    FROM    Thing
    UNION ALL
    SELECT  Thing.Id, Thing.Replaces, Thing.ReplacedBy, Depth + 1
    FROM    Data
    JOIN    Thing
    ON      Thing.Replaces = Data.Id
),
MaxData AS (
    SELECT  Data.Id, Data.Depth
    FROM    Data
    JOIN    (
        SELECT  Id, MAX(Depth) AS MaxDepth
        FROM    Data
        GROUP BY Id
    ) j ON data.Id = j.Id AND Data.Depth = j.MaxDepth
    WHERE   Depth > 0
)
SELECT  *
FROM    MaxData
ORDER BY
        Id;

...which gives me:

+−−−−+−−−−−−−+
| Id | Depth |
+−−−−+−−−−−−−+
| 11 |     1 |
| 12 |     1 |
| 22 |     2 |
+−−−−+−−−−−−−+

...but I've lost the starting point and the points along the way.

I have the strong feeling I'm missing something really straight-forward — but clever — that would let me get this largely with the query rather than post-processing, some kind of join with a "min" and "max" query (but not like my one above). What would it be?

The table doesn't have any indexes on Replaces or ReplacedBy, but we could add any needed. The table is only lightly used (roughly 300k rows and probably only a couple of hundred updates/inserts a day).

I'm limited to SQL Server 2008 features.

[This](http://stackoverflow.com/a/42495125/92546) answer demonstrates one way to handle replaced/replaces relationships in a recursive CTE. I think most of the other baggage that you want is a matter of adding columns, e.g in the anchor you can `select Id, Id as FirstId, ...` and just carry `FirstId` along through the recursion. — HABO, Mar 14 '17 at 16:28
@HABO: Thanks. I couldn't quite make out that answer, but the comment about grabbing and propagating `FirstId` was really helpful (I'd thought of it and tried to do it, but hadn't quite managed to). — T.J. Crowder, Mar 14 '17 at 17:33

score 4 · Accepted Answer · edited May 23 '17 at 12:17

Inspired by Gordon Linoff's answer and HABO's comment which highlighted something Gordon was doing that was critical, I:

Removed the SQL Server 2012+ FIRST_VALUE function, replacing it with a CROSS JOIN on an "overview" query of the data
Included the Links count in the overview query
Removed the reliance on t in Gordon's WHERE NOT EXISTS (SELECT 1 FROM Thing t2 WHERE t2.ReplacedBy = t.id), which (at last on SQL Server 2008) wasn't bound to anything
Filtered out rows that weren't replaced

Below, I also add the date filtering mentioned in the question

...filtered by a date range applied to the last row in the chain.

...which Gordon didn't cover at all, and changes our approach, but only in terms of the arrow of time.

So, first, without the date criteria, sticking fairly close to Gordon's answer:

WITH Data AS (
    SELECT  Id AS FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
    FROM    Thing
    WHERE   Replaces IS NULL AND ReplacedBy IS NOT NULL
    UNION ALL
    SELECT  d.FirstId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth + 1
    FROM    Data d
    JOIN    Thing t ON t.Replaces = d.Id
),
Overview AS (
    SELECT  FirstId, MAX(Id) AS LastId, COUNT(*) AS Links
    FROM    Data
    GROUP BY
            FirstId
)
SELECT  d.FirstId, o.LastId, d.Id, o.Links, d.Depth, d.TStamp
FROM    Data d
CROSS APPLY (
    SELECT  LastId, Links
    FROM    Overview
    WHERE   FirstId = d.FirstId
) o
ORDER BY
        d.FirstId, d.Depth
;

The critical parts of that are grabbing the seed Id as FirstId here:

SELECT  Id AS FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM    Thing
WHERE   Replaces IS NULL AND ReplacedBy IS NOT NULL

and then propagating it through the results of the recursive join:

SELECT  d.FirstId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth + 1
FROM    Data d
JOIN    Thing t ON t.Replaces = d.Id

Just adding that to my original query gives us most of what I wanted. Then we add a second query to get the LastId for each FirstId (Gordon did it as a FIRST_VALUE over a partition, but I can't do that in SQL Server 2008) and using an overview query also lets me grab the number of links. We cross-apply that on the basis of the FirstId value to get the overall results I wanted.

The query above returns the following for the sample data:

+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | Depth | TStamp     |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
|       1 |     11 |  1 |     2 |     0 | 2017-01-01 |
|       1 |     11 | 11 |     2 |     1 | 2017-01-11 |
|       2 |     22 |  2 |     3 |     0 | 2017-01-02 |
|       2 |     22 | 12 |     3 |     1 | 2017-01-12 |
|       2 |     22 | 22 |     3 |     2 | 2017-01-13 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+

...e.g., exactly what I wanted, plus Depth if I want (so I know what order the intermediary links were in).

If we wanted to include rows that were never replaced, we'd just change

WHERE   Replaces IS NULL AND ReplacedBy IS NOT NULL

to

WHERE   Replaces IS NULL

Giving us:

+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | Depth | TStamp     |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
|       1 |     11 |  1 |     2 |     0 | 2017-01-01 |
|       1 |     11 | 11 |     2 |     1 | 2017-01-11 |
|       2 |     22 |  2 |     3 |     0 | 2017-01-02 |
|       2 |     22 | 12 |     3 |     1 | 2017-01-12 |
|       2 |     22 | 22 |     3 |     2 | 2017-01-13 |
|       3 |      3 |  3 |     1 |     0 | 2017-01-03 |
|       4 |      4 |  4 |     1 |     0 | 2017-01-04 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+

But we've ignored the date criteria required by the question:

...filtered by a date range applied to the last row in the chain.

To do that without building a massive temporary result set, we have to work backward: Instead of selecting the starting point (the first entry in a chain, Replaces IS NULL), we need to select the ending point (the last entry in a chain, ReplacedBy IS NULL), and then invert our logic working back through the chain. It's largely a matter of:

Swapping FirstId with LastId
Swapping Replaces with ReplacedBy (convenient the table had both!)
Using MIN to get the first ID in the chain rather than MAX to get the last
Using d.Depth - 1 rather than d.Depth + 1
Then fixing-up Depth based on Links once we know it in our final select, to get those nice values where 0 = first link rather than some varying negative number: o.Links + d.Depth - 1 AS Depth

All of which gives us:

WITH Data AS (
    SELECT  Id AS LastId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
    FROM    Thing
    WHERE   ReplacedBy IS NULL AND Replaces IS NOT NULL
    -- Filtering by date of last entry would go here
    UNION ALL
    SELECT  d.LastId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth - 1
    FROM    Data d
    JOIN    Thing t ON t.ReplacedBy = d.Id
),
Overview AS (
    SELECT  LastId, MIN(Id) AS FirstId, COUNT(*) AS Links
    FROM    Data
    GROUP BY
            LastId
)
SELECT  o.FirstId, d.LastId, d.Id, o.Links, o.Links + d.Depth - 1 AS Depth, d.TStamp
FROM    Data d
CROSS APPLY (
    SELECT  FirstId, Links
    FROM    Overview
    WHERE   LastId = d.LastId
) o
ORDER BY
        o.FirstId, d.Depth
;

So for instance, if we used

AND     TStamp BETWEEN '2017-01-12' AND '2017-02-01'

where I have

-- Filtering by date of last entry would go here

above, with our sample data we'd get this result:

+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | Depth |   TStamp   |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
|       2 |     22 |  2 |     3 |     0 | 2017−01−02 |
|       2 |     22 | 12 |     3 |     1 | 2017−01−12 |
|       2 |     22 | 22 |     3 |     2 | 2017−01−13 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+

...because the last link the Id = 1 chain is outside the date range, so we don't include it.

Gordon Linoff · Answer 2 · 2017-03-15T02:27:31.577

2

This is a little tricky. Arrange the CTE to start at the beginning of each list. That makes the subsequent processing easier:

WITH Data AS (
      SELECT Id as FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
      FROM Thing t
      WHERE NOT EXISTS (SELECT 1 FROM Thing t2 WHERE t2.ReplacedBy = t.id)
      UNION ALL
      SELECT  d.FirstId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth + 1
      FROM Data d JOIN
           Thing t
           ON t.Replaces = d.Id
     )
SELECT d.*,
       FIRST_VALUE(id) OVER (PARTITION BY FirstId ORDER BY Depth DESC) as LastId
FROM Data d;

Then, you can use FIRST_VALUE() with a reverse sort to get the last value in the chain.

This returns chains that have no links. You can add a filter to remove these.

edited Mar 15 '17 at 02:27

answered Mar 14 '17 at 16:36

Gordon Linoff

1,242,037
58
646
786

Thanks for this. I'm off to see what an SQL Server 2008 equivalent for `FIRST_VALUE` might be... :-) (If you happen to know an older formation...) – T.J. Crowder Mar 14 '17 at 16:40
I think I've worked out the `FIRST_VALUE` part, but there's no `t` in scope for `WHERE NOT EXISTS (SELECT 1 FROM Thing t2 WHERE t2.ReplacedBy = t.id)`. – T.J. Crowder Mar 14 '17 at 16:48
The problem now is the *"...filtered by a date range applied to the last row in the chain."* part of the question. I can do it if I build up the whole lineage **first**, but even at 300k rows, that seems...wrong... – T.J. Crowder Mar 14 '17 at 18:52

Getting a lineage of linked rows with details

2 Answers2