There is a table in our SQL Server 2012 database that is used to generate and send emails. Its simplified structure is as follows:

CREATE TABLE [dbo].[EmailRequest]
(
    [EmailRequestID] [int] NOT NULL,
    [EmailAddress] [varchar](1024) NULL,
    [CCEmailAddress] [varchar](1024) NULL,
    [EmailReplyToAddress] [varchar](128) NULL,
    [EmailReplyToName] [varchar](128) NULL,
    [EmailSubject] [varchar](max) NULL,
    [EmailBody] [varchar](max) NULL,
    [Attachments] [varchar](max) NULL,
    [CreateDateTime] [datetime] NULL,
    [_EmailSent] [varchar](1) NULL,
    [_EmailSentDateTime] [datetime] NULL,

    CONSTRAINT [PK_EmailRequest] 
        PRIMARY KEY CLUSTERED ([EmailRequestID] ASC)
) 

I don't have any control over that table or the database where it sits; it is provided "as is".

Different programs and scripts insert records into the table at random intervals. I suspect most of them use queries like this:

INSERT INTO [dbo].[EmailRequest] ([EmailRequestID], ... <other affected columns>)
    SELECT MAX([EmailRequestID]) + 1, <constants somehow generated in advance>
    FROM [dbo].[EmailRequest];

I run a big SQL script which, under certain conditions, must send emails as well. In my case the part responsible for emails looks like this:

INSERT INTO [dbo].[EmailRequest] ([EmailRequestID], ... <other affected columns>)
    SELECT MAX([EmailRequestID]) + 1, <values collected from elsewhere>
    FROM [dbo].[EmailRequest]
    JOIN db1.dbo.table1 ON ...
    JOIN db1.dbo.table2 ON ... and so on;

The "select" part takes its time, so when it actually inserts data the calculated MAX([EmailRequestID]) + 1 value may become redundant and cause primary key violation (rare event, but nevertheless annoying one).

The question: is there a way to design the query so that it calculates MAX([EmailRequestID]) + 1 later, just before the insert?

One of the options might be:

INSERT INTO [dbo].[EmailRequest] ([EmailRequestID], ... <other affected columns>)
    SELECT
        (SELECT MAX([EmailRequestID]) + 1 
         FROM [dbo].[EmailRequest]), <values collected from elsewhere>
    FROM db1.dbo.table1 
    JOIN db1.dbo.table2 ON ... and so on;

but I am not sure if it brings any advantages.

So there may be another question: is there a way to see a "time-lapse" of query execution?
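One way to get per-statement timings (assuming that is what "time-lapse" means here) is SET STATISTICS TIME; a sketch:

```sql
-- A low-tech way to time each statement in a batch. The timings appear
-- in the SSMS "Messages" tab as CPU time / elapsed time per statement,
-- which shows how long the "select" part runs before the insert happens.
SET STATISTICS TIME ON;

-- ... the long-running INSERT ... SELECT goes here ...

SET STATISTICS TIME OFF;
```

For a row-by-row view of where time goes, the actual execution plan in SSMS (or Live Query Statistics in later versions) gives a coarser but visual breakdown.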

Testing is a challenge, because no one sends requests to the test database, so I will never get a PK violation there.

Thank you.

Some amazing results from testing the accepted answer: the elapsed time for the original (real) query is 2000...2800 ms; the same query without the "insert" part takes 1200...1800 ms. Note: the "select" statement collects information from three databases.

The test query retains the real "select" statement (removed below):

DECLARE @mailTable TABLE
(
    mt_ID int,
    mt_EmailAddress varchar(1024),
    mt_CCEmailAddress varchar(1024),
    mt_EmailSubject varchar(max),
    mt_EmailBody varchar(max)
);

INSERT INTO @mailTable
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
    am.ul_EMail, ... -- EmailAddress - the rest is removed
FROM <real live tables>;

INSERT INTO dbo.EmailRequest
    (EmailRequestID, _MessageID, EmailType, EmailAddress, CCEmailAddress,
     BulkFlag, EmailSubject, EmailBody, EmailReplyToAddress,
     CreateDateTime, SQLServerUpdated, SQLServerDateTime, _EmailSent)
SELECT (SELECT MAX(EmailRequestID) + 1 FROM dbo.EmailRequest),
    0, '*TEXT',          -- _MessageID, EmailType
    mt_EmailAddress,
    mt_CCEmailAddress,
    'N',                 -- BulkFlag
    mt_EmailSubject,
    mt_EmailBody,
    '', GETDATE(), '0', GETDATE(), '0'
FROM @mailTable;

Elapsed time over 10 runs for the first part: 48 ms (worst), 8 ms (best); elapsed time for the second part, where a collision may occur: 85 ms (worst), 1 ms (best).

  • use the expensive guid, or just [auto-increment](https://stackoverflow.com/questions/10991894/auto-increment-primary-key-in-sql-server-management-studio-2012) (so the db handles the `primary key` creation) – Bagus Tesa Mar 21 '18 at 01:01
  • Use an identity or a sequence object to auto-increment for you instead of incrementing it yourself. Alternatively you could use lock hints to ensure no other processes can read that value while you're working on it – Xedni Mar 21 '18 at 02:01
  • So if this table doesn't have an identity, how do other processes which insert into this table do it without running into the issue? – Xedni Mar 21 '18 at 02:13
  • The column in question is an `int` - is there any chance you can use negative numbers for the ID values you generate? If so you could use a sequence to generate your negative IDs and there should not be any collision. – David Faber Mar 21 '18 at 03:04
  • I cannot lock anything, because I don't know how it may affect other processes. @David Faber. The idea of out-of-order ID may be good for prevention of collisions, but I am afraid my emails can be missed and not sent at all. – user6150786 Mar 21 '18 at 20:56
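David Faber's negative-ID suggestion from the comments could be sketched like this (hypothetical names; it assumes a sequence can be created somewhere you do control):

```sql
-- Sketch of the negative-ID idea. A sequence counting down from -1 hands
-- out values that can never collide with the positive
-- MAX(EmailRequestID) + 1 values the other writers generate.
CREATE SEQUENCE dbo.NegativeEmailRequestID
    AS int
    START WITH -1
    INCREMENT BY -1;

INSERT INTO dbo.EmailRequest (EmailRequestID /* , ... other columns */)
SELECT NEXT VALUE FOR dbo.NegativeEmailRequestID /* , ... */
FROM db1.dbo.table1;
```

As noted in the comment above, though, this only helps if whatever sends the emails does not skip out-of-range IDs.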

2 Answers

You don't have any good options if you cannot fix the table. The table should be defined as:

CREATE TABLE [dbo].[EmailRequest](
    [EmailRequestID] [int] identity(1, 1) NOT NULL PRIMARY KEY,
    . . . 

Then the database will generate a unique id for each row.

If you didn't care about performance, you could lock the table to prevent other threads from writing to it. But that's a lousy idea.

Your best bet is to capture the error and try again. There is no guarantee of when things will finish, and you could end up with different threads all deadlocking.
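A minimal sketch of that catch-and-retry idea (column lists abridged; error 2627 is SQL Server's primary key violation):

```sql
-- Retry the MAX()+1 insert a few times, but only when the failure is a
-- duplicate-key error; anything else is re-raised immediately.
DECLARE @tries int = 0;

WHILE @tries < 5
BEGIN
    BEGIN TRY
        INSERT INTO dbo.EmailRequest (EmailRequestID /* , ... */)
            SELECT MAX(EmailRequestID) + 1 /* , ... */
            FROM dbo.EmailRequest;

        BREAK;                  -- insert succeeded: leave the loop
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() <> 2627
            THROW;              -- not a duplicate key: re-raise
        SET @tries += 1;        -- duplicate key: try again
    END CATCH
END;
```

If the long "select" part is staged into a table variable first (as in the accepted answer), only this short insert needs to be retried.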

Wait, there is one thing you could do: use a sequence instead of the max id. If you control all the inserts into the table, you could create a sequence and insert its next value rather than reading the max from the table. This would solve both the performance problem and the need for a unique id. To really effect this, you would want to take the database down, bring it back up with all the code using the sequence, and then let 'er rip.
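A hedged sketch of the sequence approach (hypothetical sequence name; it assumes every writer is converted to use it):

```sql
-- NEXT VALUE FOR hands out unique numbers atomically, so there is no
-- MAX() race and no PK collision between concurrent writers.
CREATE SEQUENCE dbo.EmailRequestIDSeq
    AS int
    START WITH 1000000      -- seed above the current MAX(EmailRequestID)
    INCREMENT BY 1;

INSERT INTO dbo.EmailRequest (EmailRequestID /* , ... other columns */)
SELECT NEXT VALUE FOR dbo.EmailRequestIDSeq /* , ... */
FROM db1.dbo.table1;
```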

That said, much the better solution is an identity primary key.

Gordon Linoff
  • Unfortunately I can neither alter the table nor add anything to the database. And I don't control all inserts, my part is a tiny percentage of what is being pushed through this table. – user6150786 Mar 21 '18 at 01:35
I know this might not be the most ideal solution, but I wanted to add it for completeness' sake. Unfortunately, sometimes we don't have much of a choice in how we deal with certain problems.

Let me preface this with a disclaimer:

This may not work well in extremely high concurrency scenarios since it will hold an exclusive lock on the table. In practice, I've used this approach with up to 32 concurrent threads interacting with the table across 4 different machines and this was not the bottleneck. Make sure that the transaction here runs separately if at all possible.

The basic idea is that you perform your complex query first and store the results somewhere temporarily (a table variable in this example). You then take a lock on the table while locating the max ID, insert your records based on that ID, and then release the lock.

Assuming your table is structured like this:

CREATE TABLE EmailRequest (
    EmailRequestID INT,
    Field1 INT,
    Field2 VARCHAR(20)
);

You could try something like this to push your inserts:

-- Define a table variable to hold the data to be inserted into the main table:
DECLARE @Emails TABLE(
    RowID INT IDENTITY(1, 1),
    Field1 INT,
    Field2 VARCHAR(20)
);

-- Run the complex query and store the results in the table variable:
INSERT INTO @Emails (Field1, Field2)
    SELECT Field1, Field2
    FROM (VALUES
            (10, 'DATA 1'),
            (11, 'DATA 2'),
            (15, 'DATA 3')
        ) AS a (Field1, Field2);

BEGIN TRANSACTION;

-- Determine the current max ID, and lock the table:
DECLARE @MaxEmailRequestID INT = (
    SELECT ISNULL(MAX(EmailRequestID), 0)
    FROM [dbo].[EmailRequest] WITH(TABLOCKX, HOLDLOCK)
);

-- Insert the records into the main table:
INSERT INTO EmailRequest (EmailRequestID, Field1, Field2)
    SELECT
        @MaxEmailRequestID + RowID,
        Field1,
        Field2
    FROM @Emails;

-- Commit to release the lock:
COMMIT;

If your complex query returns a large number of rows (thousands), you might want to consider using a temp table instead of a table variable.

Honestly, even if you remove the BEGIN TRANSACTION, COMMIT, and locking hints (WITH(TABLOCKX, HOLDLOCK)), this still has the potential to dramatically reduce the frequency of the issue you described. In that case, the disclaimer above would no longer apply.

Soukai
  • Thanks! I came to the same conclusion: break the statement into a long-running "select into" query, which gives me a single-row dataset (I need one email at a time), and a quick "insert ... from [dbo].[EmailRequest] JOIN ON 1 = 1". I am not going to apply any locks, as I have no idea how other applications insert their data into the email table and what issues they may run into. I needed the idea, and your answer is that idea, so it is marked as accepted. – user6150786 Mar 21 '18 at 20:49