Surrogate Keys….The Devil’s spawn (OK Not really)

Wed Jun 9, 2004 by Brett Kaiser (x002548) in sql-server

Always seems to spark a very heated debate…So I’d like to make a list of Pros and Cons that I’d like to keep Adding to. All comments appreciated. So I’ll start with the CON’s, mostly because I believe there are a few reasons for a surrogate key, but it is usually WAY overused and also gets dbas/developers in more trouble .

EDIT: Ok a couple of response so far. I think I'd like to start a Poll (If that's possible). I'll add another pro for the devolpers in item #4.

EDIT: It's been suggested that surrogates are ANSI. I thought they were extensions. Anyone?

EDIT: OK, it seems everyone is “on the fence“, but the original question is preferance, not wether you wouldn't use them.

POLL:

Prefer to avoid surrogates: 2

On the Fence: 6

Prefer to only use surrogates: 4

CON

1. It’s not ANSI

2. It’s not defined anywhere in the rules for a relational model or Codd’s Rules

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=33785

http://en.wikipedia.org/wiki/Ted_Codd

3. Even SQL Server itself gets in trouble because of it. Had a bunch of fun with SELECT * in a view. When the table is altered problems arise in the catalog. See item 8:

http://weblogs.sqlteam.com/brettk/archive/2004/01/30/825.aspx

4. Excessive Need to Join to obtain values (None more glaring then one guy asking to set up identities for a zip code table.)

5. Outside of IDENTITY, all the problems of getting the next ID for an int (The locking alone…ok, there’s always the next id table solution…but why bother)

6. You need to add constraints of attributes that should not be duplicated anyway for data integrity.

7. People think that you can “clean up“ a table by reseting the seed...a conceptually bad idea in the relational model...AND it actualyy can be done...

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=30084

8. Peopel do stuff like

USE Northwind
GO

SET NOCOUNT ON
CREATE TABLE myPhone99 (PhoneID int IDENTITY(1,1), PhoneNumber varchar(10))
CREATE TABLE myLastName99 (LastNameID int IDENTITY(1,1), LastName varchar(50))
CREATE TABLE myFirstName99 (FirstNameID int IDENTITY(1,1), FirstName varchar(10))
CREATE TABLE myGender99 (GenderID int IDENTITY(1,1), Gender varchar(10))

CREATE TABLE myClient99(
ClientID int IDENTITY(1,1)
, PhoneID int
, LastNameID int
, FirstNameID int
, GenderId int)
GO

INSERT INTO myGender99(Gender) SELECT 'Male' UNION ALL SELECT 'Female'
INSERT INTO myFirstName99(FirstName) SELECT 'Brett' UNION ALL SELECT 'Boy5' UNION ALL SELECT 'Tara'
INSERT INTO myLastName99(LastName) SELECT 'Kaiser' UNION ALL SELECT 'CSE' UNION ALL SELECT 'Duggan'
INSERT INTO myPhone99(PhoneNumber) SELECT '9735551212' UNION ALL SELECT '800551212'
GO

INSERT INTO myClient99(PhoneID, LastNameID, FirstNameID, GenderID)
SELECT 1,1,1,1 UNION ALL
SELECT 2,2,2,1 UNION ALL
SELECT 3,3,3,2 UNION ALL
SELECT 4,3,3,2 UNION ALL
SELECT 3,4,3,2 UNION ALL
SELECT 3,4,4,2 UNION ALL
SELECT 3,3,4,4
GO

-- Very useful, huh
SELECT * FROM myClient99
GO

-- Fun with surrogates...

   SELECT c.ClientID, p.PhoneNumber, l.LastName, f.FirstName, g.Gender
     FROM myClient99 c
LEFT JOIN myPhone99 p ON c.PhoneID     = p.PhoneID
LEFT JOIN myLastName99 l ON c.LastNameID = l.LastNameID
LEFT JOIN myFirstName99 f ON c.FirstNameID = f.FirstNameID
LEFT JOIN myGender99 g ON c.GenderID    = g.GenderID
GO

SET NOCOUNT OFF
DROP TABLE myPhone99, myLastName99, myFirstName99, myGender99, myClient99
GO

PRO

1. Addresses. What else would be a good key? I haven’t seen one.

2. In a response it was suggested that for navigating many levels of parent child relationships it would be a good idea to use surrogate keys. The suggested number of columns was 5. I don’t consider that too unwieldy, my number would be much higher. But how deep have people seen these relationships…4, 5 or 6? I did remember long ago a logical model that a group tried to implement. Went to 15 levels. But only drives home my IDENTITY Point. A code for a code? Anyway I was honored by this reference. Please visit his site.

http://www.xml-blog.com/archives/000234.html

3. I guess I have to put that a paper thin Index would also be desirable.

4. Another response from developers indicating their preference. So a PRO I guess is the overwhelming fact the people prefer this method. I have yet to be convinced however...and this is what these blogging are all about anyway.

Please take a peek at his bolg

http://odetocode.com/Blogs/scott/archive/2004/06/09/275.aspx?Pending=true

4 1/2. Robert adds that the surogate changes so that's a plus. To me that's a debatable point...when was the last time a State changed it's name. And even if it did, isn't that a new state? So I'll Add it, but I have to make it item 4 1/2.

Legacy Comments


Christian Romney 2004-06-09	re: Surrogate Keys….The Devil’s spawn (OK Not really) Brett, Thanks for the nod! Christian

Robert 2004-06-10	re: Surrogate Keys….The Devil’s spawn (OK Not really) Pro: it never changes, so you don't have to worry about propagating updates.

DavidM 2004-06-10	re: Surrogate Keys….The Devil’s spawn (OK Not really) CON: If it has the IDENTITY property it can't be changed.

Brett 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) Peter, Well there's a lot of, how should we say this... OK Nonsense. My goal is to put a list of good reasons, on both sides of the coin, for the use or not use of surrogates... It is definetly worth the read..if you got an hour or so... David: Could you elborate? Peter: Where would you stand in the poll?

Peter Cook 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) Don't know yet how you do a trackback so heres my take on it http://dotnetjunkies.com/WebLog/petedoesstuffdotnet/archive/2004/06/11/16228.aspx Generally I am in favour of them - but its horses for courses as in depends on the situation.

Darrell 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) Chalk one up for on the fence! I use surrogate keys where they are useful, avoid use of them for lookup tables like state abbreviations, etc.

graz 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) Sounds like a good sqlteam poll!

Brett 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) ok bill...but where do I put you?

Satish Balusa 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) I work in the world of Finance and Banking Industry and one of our main tables is "Account". Out of few other columns there are three vital keys in the account table [Accountnumber(21 Bytes), AccountType (16 Bytes) & RTN(9 Bytes)] which uniquely identifies an Account. The Account Key is a FK in atleast 30 other tables. Now if I don't use a surrogate key for this I may have to carry 46 Bytes as opposed to a 4 or 8 byte integer to all the 30 tables. Also, there is tremendous amount CPU that consume each time I join the Account table with any other table if I am using the whole key, instead of a Surrogate Key.

Brett 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) I'll take that as a Pro vote. Thanks.

Enigma 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) Put me on the fence Brett ... use surrogate keys but only where i need more than two columns to form a primary key.

rudy 2004-06-11	re: Surrogate Keys….The Devil’s spawn (OK Not really) whaddya mean it's not ANSI? put me on the fence with one foot on either side of the issue i always design with natural keys always except when i don't ;o) usually, i just turn over my logical design to the DBA for implementation, and let him apply whatever clusterey raidable striped distributory replicatable partitionizing magical wizardry he has to perform in order to implement it as a physical database, and if he feels like declaring surrogates, fine, but it won't affect the application code, right? because the application code has already been written with natural keys heh heh

Brett 2004-06-14	re: Surrogate Keys….The Devil’s spawn (OK Not really) Rudy, Any link that you can provide us showing that surrogates are a standard, and/or are described by normal forms would be helpful... Let me ask you this...if you turn over your logical model, and let someone have free reign over your model...wouldn't you comprise that model? On the fence it is. But remember...the original poll whas preference...there are NEVER any absolutes (is that an absolute?) when it comes to these things...

Brett 2004-08-26	We don't use Natural keys....ever Well, what can I say? Except happy joining.... Do you have an identity column for your state code table? Do store and forward the stateId column in your address table? Someone give Donald Trump a ring...

Scott Jameson 2005-06-22	re: Surrogate Keys….The Devil’s spawn (OK Not really) In dimensional modeling a Type 2 dimensions absolutely requires the surrogate keys. But type 1 seems to be a judgement call. If you have large facts, more efficient surrogate keys can help considerable with performance though space savings and though index efficiency.

Brett Kaiser 2005-06-23	re: Surrogate Keys….The Devil’s spawn (OK Not really) Scott, Thanks for the comment. Can you elaborate as to what you mean by a Type 2 Dimension?

Dale Preston 2005-08-16	re: Surrogate Keys….The Devil’s spawn (OK Not really) The best primary key for an address is.....(drum roll)... The ADDRESS. If you create an artificial key with an identity column, what is to prevent you from adding the same address again? Primary keys, by definition, uniquely identify entities. If you want to use numbers in your foreign keys, go ahead and add an identity column with a unique index and have fun. You will still have to search the address itself before adding a new address to make sure it doesn't already exist. Therefore, in actual fact, the address is the primary key regardless of which column you clicked in your database manager.

Dale Preston 2005-08-17	re: Surrogate Keys….The Devil’s spawn (OK Not really) I am not sure why you would use the primary key in an address for a join. Wouldn’t the address table have a foreign key relationship to the person, business, or other entity to which the address belonged? In a foreign key relationship, it is the child table that has the reference to the parent, not the other way around. And that holds true in a one to one, or a one to many relationship. So the address table would have a person column. The person table would not have a reference to the address table. The unique constraint and index in your example are the proper items to be your primary key and, as I stated in my blog, they are the defacto primary key, regardless of what you configured in your table structure, because they are what you actually use to maintain data uniqueness in your data. If you really need to narrow your joins, and you have done performance testing to prove that, you have to violate good design to eek out a little better performance (am I editorializing too much here? LOL) - and there are times that has to be done - then don't use the identity column as your primary key; use your real primary key, the one that guarantees uniqueness, as the primary key and use the identity column to point your foreign key relationship in the child tables to. To use the two (the identity column and the unique constraint) in any other order is just plain backwards. I did work on a project for an organization that had an address table which the people table had a foreign key relationship to. Their idea was that if multiple people in their database shared the same address, they only had to store the address once. As I pointed out to them, multiple people share the first name Mike, too. Perhaps they should have a FirstName table and the people table could have a foreign key relationship to it. This whole thing about narrow indexes is just ludicrous. My coworker went on about the effects of wide indexes in joins and the effect on the indexes in those joins. My response to that should have been, but I didn't think of it until too late, “What is the purpose of a join?” The answer, almost without a doubt would have been, “to combine data.” Well, that is absolutely backwards from the purpose of a join. The purpose of a join is not at all to combine data. The purpose of the join is to allow you to separate data and, therefore, not have to store multiple copies. If all you want to do is to combine data, then put it in a flat file, or in a single, but wide, table. There you have it all combined, and no skinny little indexes to worry about. If you take a million row person table and a separate table of what kind of car they drive, the cars table only has a few thousand rows. Violating data integrity rules and making the data unreadable in order to shave a disk access or two off of reading a table of a few thousand rows is just not necessary. The leaf tables in a join, by the very definition of a join, will usually contain significantly less rows than the root table except in the case where the join is just a vertical partition - to which I go back to my first answer, put it back in a flat table if you are worried about disk access. So what difference does it make? Other than in perhaps a few very large databases, practically none. In my experience in troubleshooting and optimizing databases for clients, it is poorly written procedures, poorly designed tables, and no indexing or poor indexing, that always solves the performance problems. And most databases in the world are much too small to ever need to worry about whether or not they can get a few percent improvement in performance by narrowing keys or by merging tables. All this bad design is in order to solve a problem that almost never exists. Thanks for your reply. Dale

Brett 2005-08-18	WOW Dale, WOW, Thanks for the comments. ALl I would have to say, is that I have an address table I just looked up, and the address is around 150 bytes...and I didn't include the mailstop. Now we don't use an IDENTITY (It's DB2..but hey they even have it now), rather we use an ADDR_ID, which for our office buildings is a famaliar "common" name, and for employees homes or Agent offices, they are generated. And we carry a type to determine what type of address it is. In any case, it's still a surrogate (although I might make a case for the offices "common" name..but I won't). I would have a hard time selling a 150 byte PK, and I wouldn't. You would have to admit that a paper thin idex will outperform a 150 byte index, wouldn't you? I guess I could unload that and build both in SQL Server and perform a test with both methods and benchmark them. Again, thanks for reading, and the thorough reply.

TJ 2006-07-27	re: Surrogate Keys….The Devil’s spawn (OK Not really) The comparison of multiple people (physical entities) sharing the same name to multiple people sharing the same address (single physical location) is invalid. The same address always points to the same physical entity (the location), the same first name points to different physical entities (different people that share a common name). Arguably, finding people at the same location is of much greater and common use than looking up people with the same first names.