How to eliminate duplicate of subquery in ".. where X in (S) or Y in (S)"?

Question

I have a query where I need to get rows from a table where any of two foreign keys exists in another query. Here is the simplified SQL:

Select MainID From MainTable Where
Key1 In (Select SubID From SubTable Where UserID=@UserID) Or
Key2 In (Select SubID From SubTable Where UserID=@UserID)

As you can see, the sub-query is duplicated. Is the SQL compiler intelligent enough to recognize this and run the sub-query once only or does it run twice?

Is there a better way I can write this SQL?

Update: I should have mentioned this originally - SubID is the primary key on SubTable.

score 3 · Accepted Answer · answered Feb 18 '15 at 11:37

3

You would replace the IN clause with an EXISTS clause:

Select MainID From MainTable 
Where Exists
(
  Select * 
  From SubTable 
  Where UserID = @UserID 
  And SubID in (MainTable.Key1, MainTable.Key2)
);

answered Feb 18 '15 at 11:37

Thorsten Kettner

89,309
7
49
73

I would avoid changing it to a *dependent* subquery .. SQL Server is smart and will likely undo this inversion; but the conceptual inversion remains. – user2864740 Feb 18 '15 at 17:12
Well *I* wouldn't :-) To ask for records where some data *exists* for a record's keys with an EXISTS clause seems the straight forward way. Easy to read and understand and usually very fast on execution (provided there is an index on `SubTable.SubID` or even better on `SubTable.UserID + SubTable.SubID`). Few physical reads and no big intermediate result. Seems perfect to me. – Thorsten Kettner Feb 18 '15 at 18:04
It's only straightforward when not thinking in RA - it's effectively a query that mimics "calling a procedure for each row" - eg. `WHERE exists_in_subtable(MainTable.Key1, MainTable.Key2)` - instead of a query written in terms of "joining sets". It's only as fast as the Query Planner will untangle (and I give full provision that SQL Server will do such, as it has a solid query planner/engine) the dependent nature. – user2864740 Feb 18 '15 at 18:07
To a "non-expert", the first impression this query gives that it would run the sub-query for each row in the main table. But after testing this, I see it runs equally well as the original query (if not better). I guess the Query Planner does its job well on figuring out what the user wants. But still, from the point of code readability and maintainability, I'd rather use a simpler query of two if it was just marginally slower. (Though I have only about 200 thousand rows on the Main Table and just 10 thousand on SubTable.) Marked as Answer! – navigator Feb 19 '15 at 05:59
@navigator SQL Server is smart (which is why I love it) - however not all database query planners can/will make such a transformation (and SQL Server itself can't always make such a transformation).. as always, look at the actual query plan to see the selected process. Even an "expert" should do this as relevant: (ideally) SQL DQL says *what* to do, not *how* to do it. – user2864740 Feb 19 '15 at 16:01

score 1 · Answer 2 · answered Feb 18 '15 at 11:35

1

You can use a common table expression:

with subid_data as (
  Select SubID 
  From SubTable 
  Where UserID=@UserID
)
Select MainID 
From MainTable 
Where Key1 In (select SubID from subid_data)  
   Or Key2 In (select SubID from subid_data);

answered Feb 18 '15 at 11:35

The CTE is fundamentally no different than leaving the identical query in both locations - but it *does* eliminate the human-duplication (and accidental changes of only one subquery, etc). If continuing to use IN / EXISTS this is probably the approach I would take as it does not change the subquery to a dependent query. – user2864740 Feb 19 '15 at 16:04

score 0 · Answer 3 · answered Feb 18 '15 at 11:26

I don't think compiler is intelligent enough to do a table scan or index seek once.

If you have a complicated where clause then you can push the sub-query results into temp table. Now use the temp table in where clause which will have a better performance.

SELECT SubID
INTO   #SubTable
FROM   SubTable
WHERE  UserID = @UserID

SELECT MainID
FROM   MainTable M
WHERE  EXISTS (SELECT 1
               FROM   #SubTable
               WHERE  M.Key1 = S.SubID)
        OR EXISTS (SELECT 1
                   FROM   #SubTable
                   WHERE  M.Key2 = S.SubID)

This forces an explicit non-required materialization: it requires additional overhead and will likely be slower. It also cannot be used in a VIEW or otherwise composed with other RA DQL. — user2864740, Feb 18 '15 at 17:07

score 0 · Answer 4 · edited Feb 18 '15 at 22:41

0

Please try the following query:

Select MainID 
From MainTable m
Where exists 
( select 1 from SubTable s Where s.UserID=@UserID and s.sub_id in (m.key1,m.Key2))

edited Feb 18 '15 at 22:41

kenorb

155,785
88
678
743

answered Feb 18 '15 at 15:57

A ツ

1,267
2
9
14

score 0 · Answer 5 · edited May 23 '17 at 12:28

tldr; both the original and the following JOIN proposal, with less "looks redundant", should generate equivalent query plans. View the actual query plans if there are any doubts as to how SQL Server is [currently] treating a query. (See IN vs. JOIN vs. EXISTS for a taste of the magic.)

Is the SQL compiler intelligent enough to recognize this and run the sub-query once only or does it run twice?

Yes, SQL Server is smart enough to handle this. It does not need to "run twice" (nit: the subquery does not "run" at all in a procedural sense). That is, there is no mandated explicit materialization stage - much less two. The JOIN transformation below shows why such is not required.

Since these are independent (or non-correlated) sub-queries¹, as they do not depend on the outer query, then they can - and I dare say will - be optimized as they can be freely, and easily, moved under Relational Algebra (RA) rules.

As you can see, the sub-query is duplicated .. Is there a better way I can write this SQL?

However it still "looks redundant" visually because it is written so. SQL Server doesn't care - but a human might. Thus the following is how I would write it and what I consider "better".

I am a big fan of using JOINs over subqueries; once a JOIN approach is adopted it often "fits better" with RA. This simple transformation to a JOIN is possible because of the non-correlated nature of the original subqueries - the [SQL Server] query planner is capable of doing such RA rewrites internally; view the actual query plans to see what differences there are, if any.

Rewriting the query would then be:

Select MainID
From MainTable 
Join (
    Select Distinct SubID -- SubId must be unique from select
    From SubTable
    Where UserID=@UserID
) t
-- Joining on "A or B" may indicate an ARC relationship
-- but this obtains the original results
On    Key1 = t.SubID
   Or Key2 = t.SubID

The DISTINCT is added to the derived table query because of the unknown (to me) multiplicity of SubId column - it can be treated as a redundant qualifier by SQL Server if SubId is bound by a Unique Constraint so it's either required or "free". See IN vs. JOIN with large rowsets for why it matters that the joined table keys are unique.

Note: SQL Server does not necessarily rewrite an IN to the join as shown above, as discussed in IN vs. JOIN vs. EXISTS; but the fundamental concept of being able to move the RA operation (and being able to treat the query as a what and not a how) is still used.

¹ Some of the answers change the original subquery to a dependent/correlated subquery which is going the wrong way. It may still result in a respectable (or even equivalent) query plan as SQL Server will try to "undo" the changes - but that's going a step away from a clean RA model and JOINs! (And if SQL Server can't "undo" the added correlation then the query will be far inferior.)

DISTINCT is often a costly thing. And this query can still get duplicate MainIDs (say when Key1 is 1 and Key2 is 2 and both exist as SubID in SubTable for the user). You need one more DISTINCT (for the main query) to get your query straight. (Of course you can then go without a derived table at all and simply join SubTable.) Anyhow, this is a rather clumsy way to look for existence. — Thorsten Kettner, Feb 18 '15 at 18:24
If SubId is in a Unique Constraint (PK?) then there is no additional cost - a query plan would tell (and none of these require materialization). There is no additional distinct required on the main table to have the same multiplicities as the original. As far as "clumsy" .. that's subjective at best without a specific scenario to the contrary. It's also easily amendable to related queries. — user2864740, Feb 18 '15 at 18:28
@ThorstenKettner The IN vs JOIN link provides insight into generated query plans and, rightly so, defends *against* "avoiding IN" for performance. It points out the "less than ideal" case for a JOIN .. DISTINCT ("In fact, JOIN queries are less efficient *on non-indexed tables*, since Semi Join methods allow aggregation and matching against a single hash table, while a JOIN needs to do these two operations in two steps."), but I don't think such readily distracts from following an RA-centric approach. That's what query plan results are for: finding the actual problems (like missing indexes). — user2864740, Feb 18 '15 at 18:43
@ThorstenKettner And *if* leaving with an IN or EXISTS (not my preference), then leave it as the original - or create a VIEW and/or use a CTE to remove human-visible duplication - *without* turning it into a dependent subquery. — user2864740, Feb 18 '15 at 18:47

How to eliminate duplicate of subquery in ".. where X in (S) or Y in (S)"?

5 Answers5