Retrieving Identity Values When Inserting Multiple Rows

Mon Jul 2, 2007 by Jeff Smith in t-sql, database-design, imports-exports

Suppose you have the following tables:

create table Customers
(
    CustomerID int identity primary key,
    CustomerName varchar(100) not null
)

create table AddressTypes
(
    AddressType varchar(10) primary key
)

create table CustomerAddress
(
    CustomerID int references Customers(CustomerID),
    AddressType varchar(10) references AddressTypes(AddressType),
    Street varchar(100),
    City varchar(100),
    State varchar(2),
    ZIP varchar(20),
    primary key (CustomerID, AddressType)
)

This is a simple schema for which a Customer can have multiple addresses, one per AddressType. Notice that the CustomerID column, the customer's primary key, is an identity.

Now, suppose you are importing data from a flat file or some other system, and what you get is data with the following columns:

CustomerName, BillingStreet, BillingCity, BillingState, BillingZIP, ShippingStreet, ShippingCity, ShippingState, ShippingZIP

You'd like to take this un-normalized data, of course, and put it into your tables. Ideally, you'd like to use simple set-based INSERT statement as well. However, this is often considered tricky because the CustomerID column is an identity that is generated at the time of insert, and we need to know the CustomerID before we can insert related rows into the CustomerAddress table. It is commonly known that if we insert a single row, we can use scope_identity() to get back the identity value created, but how do you get back a set of the identity values for multiple rows that are inserted all at once? Is a dreaded cursor required?

Here's the number one consideration that you must remember when inserting data from any outside system into your database, especially if your database generates identity primary keys:

Your import data must have a primary key and there must be a relation between your import data and your destination data.

This is absolutely crucial! Otherwise, you will just be adding row after row blindly to your data, and identity primary keys will happily be generated, and your data will be a big, redundant mess. Even if the system that generates these files cannot provide any clean primary key for the data, you need to decide upon a specification that dictates what happens when the data is imported into your system.

For example, suppose the import file contains the following:

CustomerName

-----------

ABC Company

ABC Company

XYZ Organization

ABC Company

What does this mean? Should the system blindly import this data and create 3 Customer rows, all labeled "ABC Company", with 3 different CustomerIDs? What happens if the 3 different ABC Company rows in the import data all have different addresses? Which one do we use? Is this good data to add to our system?

It is up to you to determine and document a rule that states how the import data should be integrated into your system. Perhaps you can alter the export to retrieve more columns, which either include a primary key from the external system, or contain enough data to uniquely identify each row. But often, we cannot control the data we receive; it is what it is, so the best we can do is come up with a specification that clearly indicates how our data will be imported. As I often preach here, you cannot write a single line of code until what you are trying to accomplish is clearly and completely defined.

In our case, let us define the following rule:

"The CustomerName column must be unique for all Customers in the system. The import file will only contain unique Customer names -- if it doesn't, the import process will return an error. If a customer Name already exists in the system, it will not be imported but the address information will be synchronized."

Now, that is a clear and simple specification, and with that established, we can write our code and easily handle the multiple-identity situation. Again, it bears repeating: when you have trouble writing code, it may be because you are a poor programmer or that you don't understand the tools, but often it is because you simply do not have clear, logical specifications that completely define what you need to do.

With these specifications in place, let's create a formal definition for our import staging table:

create table Staging_Customers
(
    CustomerName varchar(100) primary key,
    BillingStreet varchar(100),
    BillingCity varchar(100),
    BillingState varchar(100),
    BillingZIP varchar(100),
    ShippingStreet varchar(100),
    ShippingCity varchar(100),
    ShippingState varchar(100),
    ShippingZIP varchar(100)
)

Notice that we have defined a primary key of CustomerName in our staging table, since our specifications allow us to do so. As I mentioned before, it is crucial to establish a primary key to any imported data, since if it doesn't have a clear, logical primary key before you import it, how can you logically normalize the data into your tables? In short, you can't. Thus, we must have a primary key on our staging table, and there must be some relation to our final Customers table as well. The primary key of the Customers table, however, is CustomerID -- not CustomerName. So, we must also establish a unique index on the CustomerName column to ensure data integrity:

create unique index Customers_alt_pk on Customers(CustomerName)

Inserting new Customer rows to our Customer table is now just a matter of using EXISTS() or a LEFT OUTER JOIN to insert rows that aren't already in our table. We can use our CustomerName columns to determine this:

insert into Customers (CustomerName)
select s.CustomerName
from staging_Customers s
left outer join Customers c on s.CustomerName = c.CustomerName
where c.CustomerName is null

That only adds new customers, determined by their name, to our Customers table. So far so good, but how can we add our address rows? We need to know the identity values generated by the insert. Well, I know what you are thinking: it's about time you got the point! But everything I mentioned before was mentioned for a reason, and because we established a clear primary key in our imported data, and we clearly indicated how that imported data should be merged into our system, we can just a simple JOIN from our staging table to our Customers table to return the CustomerID for all of the rows we just added.

In other words, because we have a clear relation between the import and destination data, we can "magically" display the newly created CustomerID column for the import data:

select c.customerID, s.*
from customers c
inner join staging_Customers s on c.customerName = s.customerName

Notice that the join is on CustomerName, since we have previously established that column as determining the relation between our import data on our final data. If we had a key column from the external system, we would store that in our Customers table and relate on that, instead. It all depends on how you can map your external data to your system. But the key is, in order to do things in a set-based manner, and for your data to have meaning, there must be some way to map to the import data. If not, then you have no choice but to use cursors and to blindly INSERT one row at a time, and your data will be garbage.

So, let's synch up the addresses. Since we are going to add all new addresses and also update any existing addresses, we can use a simple DELETE statement to remove all addresses that are present in our staging data like this:

delete
from CustomerAddress
where CusomerID in
    (select c.CustomerID from
    Customers c
    inner join Staging_Customers s on c.CustomerName = s.CustomerName
    )

Simply deleting all matching addresses and then adding them all back in is easier than adding only new ones, updating changed ones, and deleting ones that are no longer valid. It's up to you to determine the easiest and best way to ultimately synch the data, this is just one approach.

We can now insert all of our Billing addresses like this:

insert into CustomerAddress (CustomerID, AddressType, Street, City, State, ZIP)
select C.CustomerID, 'Billing', s.BillingStreet, s.BillingCity, s.BillingState, s.BillingZIP
from customer c
inner join staging_Customers s on c.CustomerName = s.CustomerName
where s.BillingStreet is not null

Using almost the exact same code, we can insert our Shipping addresses as well:

insert into CustomerAddress (CustomerID, AddressType, Street, City, State, ZIP)
select C.CustomerID, 'Shipping', s.ShippingStreet, s.ShippingCity, s.ShippingState, s.ShippingZIP
from customer c
inner join staging_Customers s on c.CustomerName = s.CustomerName
where s.ShippingStreet is not null

The key to everything working is that we established a clear relation between our staging table in our Customer table. Sometimes, you might need to use a combination of CustomerName and a city or state, or perhaps a primary contact name. Other times, if you are lucky, your import data will already contain a unique key value that you can store in your Customers table as "ExternalKey" or something like that, and you can use that to relate the data.

Overall, when you want to do things in a set-based manner, you must always remember the golden rule: Your data must have integrity, it must have meaning, and it must have relations. With all those in place, importing data and retrieving identity values that are auto-generated is just a matter of a simple JOIN. Without those in place, you are stuck using cursors, and even worse than that you end up with bad, poorly-defined redundant data.

Legacy Comments


Mladen 2007-07-03	re: Retrieving Identity Values When Inserting Multiple Rows > we can use scope_identity() to get back the identity value created, but how do you get back a set of the identity values for > multiple rows that are inserted all at once? Is a dreaded cursor required? i see you've completly forgotten about using output clause :)

Jeff 2007-07-03	re: Retrieving Identity Values When Inserting Multiple Rows >>i see you've completly forgotten about using output clause :) You don't need it, it's irrelevant; the key point is that you need some way to join the newly inserted data to your imported data to add related rows to the other tables. Just getting a "list" of identity values created does no good if there is no way to relate those values to your import data.

srikar 2007-08-18	how to retrieve values radomly from a table how to do it

manjula 2007-09-10	re: Retrieving Identity Values When Inserting Multiple Rows I am little bit confused by seeing this

Lev 2007-09-10	books I think you managed to confuse everybody by placing a misleading title on the page :P You are not 'Retrieving' and Identity values, your just ensuring/inforcing (somewhat) the data consitsency.

Alexander Grau 2008-05-14	re: Retrieving Identity Values When Inserting Multiple Rows I have to insert a flat CSV file with 1 million customer records with serial numbers splitted into 2 tables (customers and serials) and have exactly the problem you gave me solution for! I wonder why your approach did not come to my mind :-) Thank you!

drew 2008-09-13	re: Retrieving Identity Values When Inserting Multiple Rows man, from the title, i thought this article would solve my issue. still i am left wondering how to retrieve identities when inserting multiple rows. damn

Benjamin Eidelman 2009-05-06	re: Retrieving Identity Values When Inserting Multiple Rows > you don't need it, it's irrelevant; the key point is that you need some way to join the newly inserted data to your > imported data to add related rows to the other tables. Just getting a "list" of identity values created does no good > if there is no way to relate those values to your import data. Jeff, the OUTPUT clause is not just for obtainng a "list" of identity values, take a look at this article: http://technet.microsoft.com/en-us/library/ms177564.aspx with the OUTPUT clause you can obtain a table that relates identity values, and any set of fields from the original/deleted/insert table. look this example from the link above: USE AdventureWorks; GO DECLARE @MyTableVar table( ScrapReasonID smallint, Name varchar(50), ModifiedDate datetime); INSERT Production.ScrapReason OUTPUT INSERTED.ScrapReasonID, INSERTED.Name, INSERTED.ModifiedDate INTO @MyTableVar VALUES (N'Operator error', GETDATE()); --Display the result set of the table variable. SELECT ScrapReasonID, Name, ModifiedDate FROM @MyTableVar; --Display the result set of the table. SELECT ScrapReasonID, Name, ModifiedDate FROM Production.ScrapReason; GO

Tim 2010-07-01	re: Retrieving Identity Values When Inserting Multiple Rows Agree. Change the title of the article. You do not solve the problem you present, you simply work around it for a very specific use case.