Most Efficient Many-t-Many Query

Question

Let's say I'm implementing articles with article tags. I'm using SQL Server 2008.

TABLE Articles
ArtID INT
...

TABLE Tags
TagID INT
TagText VARCHAR(10)

TABLE ArticleTags
ArtID INT
TagID INT

I'm trying to figure out the most efficient way to query all articles with a specific tags. Here's two options, both of which I've read are most efficient.

Method A:

SELECT a.* FROM Articles
WHERE EXISTS (
    SELECT * FROM ArticleTags at
    INNER JOIN Tags t ON at.TagID = t.TagID
    WHERE at.ArtID = a.ID
    AND t.TagText IN ('abc', 'def')
)

Method B:

SELECT a.* FROM Articles a
INNER JOIN ArticleTags at ON a.ArtID = at.ArtID
INNER JOIN Tags t ON at.TagID = t.TagID
WHERE t.TagText IN ('abc', 'def')
GROUP BY a.ArtID

Can any SQL experts suggest which is more efficient and why? Or maybe I'm on the wrong track.

Take a look at the query plans for both. That should tell you which one is better for your particular situation. — Oded, Jan 21 '11 at 16:01
You should go to this SO question, http://stackoverflow.com/questions/1200295/sql-join-vs-in-performance, and take a look at @Quassnoi answer. And then you can go to the link he provided for a very comprehensive and detailed explanation and examples of this behavior.http://explainextended.com/2009/06/16/in-vs-join-vs-exists/ — Lamak, Jan 21 '11 at 16:17
@Oded: I'm an experienced developer but really am not an expert with SQL. I don't know how to see the query plan and probably wouldn't understand it if I could. — Jonathan Wood, Jan 21 '11 at 16:18
@Patrick: No, after thinking about it, it seems like full-text search wouldn't work for tags like "c++" and "c#" because the punctuation would be dropped. I'm willing to explore that approach if it could be made to work. — Jonathan Wood, Jan 21 '11 at 16:20
@Jonathan Wood - in SSMS there is an "estimated query plan" button. Try it... — Oded, Jan 21 '11 at 16:23
@Lamak: Thanks. I'm not sure how much of the article applies to my question. But it does seem to suggestion that my Method A is at least as efficient as my Method B. — Jonathan Wood, Jan 21 '11 at 16:26
@Oded: It seems like I looked at something like that once before and it didn't have meaning to me. But I will look again. — Jonathan Wood, Jan 21 '11 at 16:28
Jonathan - even if it means nothing, there is lots of documentation. It's worth trying to understand and learning about this. It will make you a better dev. — Oded, Jan 21 '11 at 16:43
@Oded: Sure, I agree. It's just that I need to balance that with MVC, WPF, jQuery, Entity Framework, HTML5/CSS3, etc. that I also need to improve at. Time is so limited, but I'll take a look. — Jonathan Wood, Jan 21 '11 at 16:54

score 2 · Accepted Answer · answered Jan 21 '11 at 19:52

As with almost all SQL performance questions, the answer is not the query, the answer is the data schema. What indexes you have, that is what drives the performance of your queries.

Usually many-to-many relations require two complementary indexes, one as (ID1, ID2) and the other as (ID2, ID1). One of them is clustered, it doesn't really matter which one. So lets create a test DB (100k articles, 1K tags, 1-10 tags per article):

:setvar dbname testdb
:setvar articles 1000000
:setvar tags 1000
:setvar articletags 10
:on error exit

set xact_abort on;
go

use master;
go

if db_id('$(dbname)') is not null
begin
    alter database [$(dbname)] set single_user with rollback immediate;
    drop database [$(dbname)];
end
go

create database [$(dbname)];
go

use [$(dbname)];
go

create TABLE Articles (
    ArtID INT not null identity(1,1),
    name varchar(100) not null, 
    filler char(500) not null default replicate('X', 500),
    constraint pk_Articles primary key clustered (ArtID));
go

create table Tags (
    TagID INT not null identity(1,1),
    TagText VARCHAR(10) not null,
    constraint pk_Tags primary key clustered (TagID),
    constraint unq_Tags_Text unique (TagText));
go

create TABLE ArticleTags (
    ArtID INT not null,
    TagID INT not null,
    constraint fk_Articles 
        foreign key (ArtID)
        references Articles (ArtID),
    constraint fk_Tags
        foreign key (TagID)
        references Tags (TagID),
    constraint pk_ArticleTags
        primary key clustered (ArtID, TagID));
go

create nonclustered index ndxArticleTags_TagID
    on ArticleTags (TagID, ArtID);
go          

-- populate articles
set nocount on;
declare @i int =0, @name varchar(100);
begin transaction
while @i < $(articles)  
begin
    set @name = 'Name ' + cast(@i as varchar(10));
    insert into Articles (name) values (@name);
    set @i += 1;
    if @i %1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d articles', 0, 1, @i);
        begin transaction;
    end
end
commit
go


-- populate tags
set nocount on;
declare @i int =0, @text varchar(100);
begin transaction
while @i < $(tags)  
begin
    set @text = 'Tag ' + cast(@i as varchar(10));
    insert into Tags (TagText) values (@text);
    set @i += 1;
    if @i %1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d tags', 0, 1, @i);
        begin transaction;
    end
end
commit
go

-- populate article-tags
set nocount on;
declare @i int =0, @a int = 1, @cnt int, @tag int;
set @cnt = rand() * $(articletags) + 1;
set @tag = rand() * $(tags) + 1;
begin transaction
while @a < $(articles)  
begin
    insert into ArticleTags (ArtID, TagID) values (@a, @tag);
    set @cnt -= 1;
    set @tag += rand()*10+1;
    if $(tags)<=@tag 
    begin
        set @tag = 1;
    end
    if @cnt = 0
    begin
        set @cnt = rand() * $(articletags) + 1;
        set @tag = rand() * $(tags) + 1;
        set @a += 1;
    end
    set @i += 1;
    if @i %1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d article-tags', 0, 1, @i);
        begin transaction;
    end
end
commit
raiserror (N'Final: %d article-tags', 0, 1, @i);
go

Now lets compare the two queries:

set statistics io on;
set statistics time on;

select a.ArtID
from Articles a
where exists (
    select * 
    from ArticleTags at
    join Tags t on at.TagID = t.TagID
    where at.ArtID = a.ArtID
    and t.TagText in ('Tag 10', 'Tag 12'));

SELECT a.ArtID FROM Articles a
INNER JOIN ArticleTags at ON a.ArtID = at.ArtID
INNER JOIN Tags t ON at.TagID = t.TagID
WHERE t.TagText IN ('Tag 10', 'Tag 12')
GROUP BY a.ArtID

Result:

Table 'Articles'. Scan count 0, logical reads 3561, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ArticleTags'. Scan count 2, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tags'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Articles'. Scan count 0, logical reads 3561, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ArticleTags'. Scan count 2, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tags'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Surprise! (well, not really). They're IDENTICAL. In fact, they have exactly the same execution plan.

I'm guessing the reason for the question might be because of the [SQLCMD](http://www.simple-talk.com/sql/sql-tools/the-sqlcmd-workbench/) syntax? — Martin Smith, Jan 21 '11 at 20:09
Keep in mind though that this kind of queries are notoriously tricky. The execution plan could be perfect for the case when you seek the tag that occurs on 1% of the articles, but fail miserable when you seek for the ubiquitous tag that is on 99% of the articles. There are many dragons under that bridge, but going over the topic is beyond my time now. Quassnoi's site, linked by Lamak, has a *lot* of nice tricks. — Remus Rusanu, Jan 21 '11 at 20:11
see http://msdn.microsoft.com/en-us/library/ms174187.aspx for how to enable the SQLCMD extensions (`:setvar`) in Management Studio — Remus Rusanu, Jan 21 '11 at 20:12

ThomasMcLeod · Answer 2 · 2011-01-21T16:50:22.430

1

Your Method B has a GROUP BY clause, but your returning all columns from Articles, even presumably non-aggregateable columns. This would throw an error. The GROUP BY is probably unnecessary.

Without the GROUP BY, the queries have roughly the same execution plan. However Method B is a more standard SQL query statement.

Edit: DISTINCT is usually preferable to a GROUP BY in this case, and has the same function

SELECT DISTINCT 
    a.* 
FROM 
    Articles a 
INNER JOIN 
    ArticleTags at 
ON 
    a.ArtID = at.ArtID 
INNER JOIN 
    Tags t 
ON 
    at.TagID = t.TagID 
WHERE 
    t.TagText IN ('abc', 'def')

edited Jan 21 '11 at 16:50

answered Jan 21 '11 at 16:32

ThomasMcLeod

7,603
4
42
80

Without the GROUP BY, it would return the same article multiple times if the article contained more than one tag being sought. This would not be the results I want and would hog bandwidth as well. – Jonathan Wood Jan 21 '11 at 16:35
I see, that's fine as long as you can ensure ArtID is unique. I would put a PRIMARY KEY constraint on it. – ThomasMcLeod Jan 21 '11 at 16:45
ArtID would definitely be the PK. – Jonathan Wood Jan 21 '11 at 16:52

score 1 · Answer 3 · answered Jan 21 '11 at 17:12

1

I would create an indexed view based on the 3 tables on the columns artID and TagText. That way you can use:

SELECT * 
FROM Articles 
WHERE artID IN 
(SELECT artID 
FROM ArticleTagTextView 
WHERE TagText IN ('abc', 'def'))

answered Jan 21 '11 at 17:12

Tweety123

11
1

Dercsár · Answer 4 · 2011-01-21T16:53:16.420

0

Shortly: no difference. Both will be translated to the same execution plan.

Edit: haven't noticed the GROUP BY. This way the query most likely wont' compile. Remove the GROUP BY clause or list all fields of the table like GROUP BY Id, Name, ...

edited Jan 21 '11 at 16:53

answered Jan 21 '11 at 16:19

Dercsár

1,636
2
14
26

3

@ThomasMcLeod Only in MySQL not SQL Server! – Martin Smith Jan 21 '11 at 18:06

Most Efficient Many-t-Many Query

4 Answers4