I'm trying to get a "lineage" or similar, and also information about the first and last links (at least; all would be good), out of a table that has self-referential links between rows that have been "replaced" and rows that have replaced them. The table has a structure along these lines:
CREATE TABLE Thing (
Id INT PRIMARY KEY,
TStamp DATETIME,
Replaces INT NULL,
ReplacedBy INT NULL
);
I'm stuck with this structure. :-) It's sort of doubly-linked (yes, it's a bit silly): Each row has a unique Id
, and then a row that has been "replaced" by another will have a non-NULL
ReplacedBy
giving the Id
of the replacement row, and the replacement row will also have a link back to what it replaces in Replaces
. So we can use either Replaces
or ReplacedBy
(or both) if we like.
Here's some sample data:
INSERT INTO Thing
(Id, TStamp, Replaces, ReplacedBy)
VALUES
(1, '2017-01-01', NULL, 11),
(2, '2017-01-02', NULL, 12),
(3, '2017-01-03', NULL, NULL),
(4, '2017-01-04', NULL, NULL),
(11, '2017-01-11', 1, NULL),
(12, '2017-01-12', 2, 22),
(22, '2017-01-22', 12, NULL);
So 1 was replaced by 11, 2 was replaced by 12, and 12 was replaced by 22.
I'd like to get the following information for each chain of links from this table in a reasonable way:
- Details of the row that started the chain
- Details of the final row in the chain
- Details of the links in-between or at least how many links (total) there are in the chain
...filtered by a date range applied to the last row in the chain.
In an ideal universe, I'd get back something like this:
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+ | FirstId | LastId | Id | Links | TStamp | +−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+ | 1 | 11 | 1 | 2 | 2017−01−01 | | 1 | 11 | 11 | 2 | 2017−01−11 | | 2 | 22 | 2 | 3 | 2017−01−02 | | 2 | 22 | 12 | 3 | 2017−01−12 | | 2 | 22 | 22 | 3 | 2017−01−22 | +−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+
So far I have this query, which I could post-process to get the above:
WITH Data AS (
SELECT Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing
UNION ALL
SELECT Thing.Id, Thing.TStamp, Thing.Replaces, Thing.ReplacedBy, Depth + 1
FROM Data
JOIN Thing
ON Thing.Replaces = Data.Id
)
SELECT *
FROM Data
WHERE ReplacedBy IS NOT NULL OR Depth > 0
ORDER BY
Id, Depth;
That gives me:
+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+ | Id | TStamp | Replaces | ReplacedBy | Depth | +−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+ | 1 | 2017−01−01 | NULL | 11 | 0 | | 2 | 2017−01−02 | NULL | 12 | 0 | | 11 | 2017−01−11 | 1 | NULL | 1 | | 12 | 2017−01−12 | 2 | 12 | 0 | | 12 | 2017−01−12 | 2 | 12 | 1 | | 22 | 2017−01−13 | 12 | NULL | 1 | | 22 | 2017−01−13 | 12 | NULL | 2 | +−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+
And I could use something like this to figure out (for instance) the final row of each chain:
WITH Data AS ( SELECT Id, Replaces, ReplacedBy, 0 AS Depth FROM Thing UNION ALL SELECT Thing.Id, Thing.Replaces, Thing.ReplacedBy, Depth + 1 FROM Data JOIN Thing ON Thing.Replaces = Data.Id ), MaxData AS ( SELECT Data.Id, Data.Depth FROM Data JOIN ( SELECT Id, MAX(Depth) AS MaxDepth FROM Data GROUP BY Id ) j ON data.Id = j.Id AND Data.Depth = j.MaxDepth WHERE Depth > 0 ) SELECT * FROM MaxData ORDER BY Id;
...which gives me:
+−−−−+−−−−−−−+ | Id | Depth | +−−−−+−−−−−−−+ | 11 | 1 | | 12 | 1 | | 22 | 2 | +−−−−+−−−−−−−+
...but I've lost the starting point and the points along the way.
I have the strong feeling I'm missing something really straight-forward — but clever — that would let me get this largely with the query rather than post-processing, some kind of join with a "min" and "max" query (but not like my one above). What would it be?
The table doesn't have any indexes on Replaces
or ReplacedBy
, but we could add any needed. The table is only lightly used (roughly 300k rows and probably only a couple of hundred updates/inserts a day).
I'm limited to SQL Server 2008 features.