Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data
A Simple Single-Child Model
Consider the following logical data model:
- There are multiple Companies
- Each Company has many Projects
- Each Project has many Tasks
- Each Task has a Status, selected from a global list of pre-defined Statuses.
Status Companies
\ |
\ |
\ Projects
\ |
\ |
Tasks
\ |
\ |
\ Projects
\ |
\ |
Tasks
Let us say that we decide that the primary key of Companies, Projects, Tasks, and Status are all Identity (auto-number) columns, since we would like to auto-generate primary keys for those tables.
So, we have something like this:
Table Primary Key Foreign Key References
--------- ----------- ----------------------
Status StatusID n/a
Companies CompanyID n/a
Projects ProjectID Companies(CompanyID)
Task TaskID Projects(ProjectID), Status(StatusID)
--------- ----------- ----------------------
Status StatusID n/a
Companies CompanyID n/a
Projects ProjectID Companies(CompanyID)
Task TaskID Projects(ProjectID), Status(StatusID)
At this point, things work fine -- we can enforce full referential integrity with our identity primary keys. We can't necessarily eliminate duplicate data, but suppose that is not an issue and/or we have unique constraints in place. Choosing to follow the "every primary key must be an identity rule" did not have any adverse affects and we have a nice, clean data model.
Parents with Multiple Children
Now, allow me to add one little wrinkle. Suppose that the available Statuses for each Task is not defined globally, but at the Company level. That is, each Company has its own list of Statuses to which a Task can be assigned.
Companies
/ \
/ \
Status Projects
\ /
\ /
Tasks
/ \
/ \
Status Projects
\ /
\ /
Tasks
This means that the Status table now needs a Foreign Key reference to the Companies table (to indicate which Company each Status belongs to), giving us:
Table Primary Key Foreign Key References
--------- ----------- ----------------------
Companies CompanyID n/a
Status StatusID Companies(CompanyID)
Projects ProjectID Companies(CompanyID)
Task TaskID Projects(ProjectID), Status(StatusID)
--------- ----------- ----------------------
Companies CompanyID n/a
Status StatusID Companies(CompanyID)
Projects ProjectID Companies(CompanyID)
Task TaskID Projects(ProjectID), Status(StatusID)
Are there any other changes we need to make to this data model? Or is simply adding a CompanyID column to the Status table enough to facilitate this change? Remember, our goal as always is full referential integrity using primary and foreign key constraints wherever possible.
Well, there is an issue:
Nothing in this data model stops us from assigning a Status to a Task that is not defined for that Task's parent company. We have no way of enforcing this right now with our current constraints. Our physical data model is flawed. (This is also described a bit here.)
Identities Are Part of the Solution
This is very easily fixed, but it can only be done by violating the "all tables just need an identity primary key" rule.
First, remember this: just because an identity column is unique does not mean that that column cannot be part of a primary key. So, even though maybe you decide that you really like integer primary keys for your Status table (when probably a simple CHAR(2) code is all you need), if Status codes are child entities of Companies then we can instead create a composite primary key like this:
Table Primary Key Foreign Key References
--------- ----------- ----------------------
Status CompanyID,StatusID Companies(CompanyID)
So now, with the two column primary key and one of those columns referencing another table, Statuses are truly a child entity of Companies. Even though it seems that having a two-column primary key is not necessary since the StatusID column in itself is unique, you will see that to physically enforce the very simple logical model we are working with, it's required!--------- ----------- ----------------------
Status CompanyID,StatusID Companies(CompanyID)
Now, just as Statuses are child entities of Companies, so are Projects. And as with the StatusID column, we have a nice unique identity ProjectID column that is unique enough that it could be our primary key -- but it shouldn't be. Once again, being a child entity of a Company, and with the goal of wishing to enforce integrity throughout our entire model, we find that the true primary key of our Project table needs to be a composite of CompanyID/ProjectID:
Table Primary Key Foreign Key References
--------- ----------- ----------------------
Projects CompanyID,ProjectID Companies(CompanyID)
None of this seems necessary yet -- why are we messing with our perfectly fine keys? -- but soon it will all make sense.--------- ----------- ----------------------
Projects CompanyID,ProjectID Companies(CompanyID)
Looking at Tasks, we know we have a foreign key reference to Projects, and we know the primary key of Projects is now (CompanyID, ProjectID). Also, remember that we changed our Status table as well to have a composite primary key including the CompanyID column. Thus, we now know that we need to add a CompanyID column to our Tasks table to enforce these FK constraints:
Table Primary Key Foreign Key References
--------- ----------- ----------------------
Task TaskID Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)
OK, so we are making our primary keys bigger when it doesn't seem necessary, and now our foreign key constraints are getting more complicated as well since they now involve multiple columns.--------- ----------- ----------------------
Task TaskID Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)
Why did we bother doing all this?
Because now we have full, complete referential integrity, that's why!
Examining the Results
With these changes in place, only Status codes belonging to a Task's Company can be assigned to that Task. By propagating the CompanyID column down to our child tables, we have the column necessary to enforce full referential integrity as dictated by our logical model.
So, in the end, it looks like this:
Table Primary Key Foreign Key References
--------- ----------- ----------------------
Companies CompanyID n/a
Status CompanyID,StatusID Companies(CompanyID)
Projects CompanyID,ProjectID Companies(CompanyID)
Task TaskID Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)
--------- ----------- ----------------------
Companies CompanyID n/a
Status CompanyID,StatusID Companies(CompanyID)
Projects CompanyID,ProjectID Companies(CompanyID)
Task TaskID Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)
Just this little tweak in our logical model -- that each Company has their own list of Statuses -- required all of these changes in the physical database schema! What gives?
Well, remember: if the logical model changes, the physical data model should change. If it doesn't, then you are probably not modeling it correctly or accurately to maintain strict data integrity in the first place.
"All-identity-all-the-time" data modelers often claim that only using identities makes things easier and "more flexible". As we can see from this example, it can work in some cases (back when Statuses were global), but not in every case. Sometimes, if your goal is to enforce strict data integrity, you simply cannot create your physical schema that way. If you want to approximately model your data, then all identities can be nice. If you want to accurately model your data, then it simply is not always possible with only single-column identity keys as clearly demonstrated here.
"Identifying" When Using Only Identities Will Not Work
The Companies table originally had just one child: Projects. But once we determined that Status codes are also a child entity of Companies, we now had multiple children for the Companies table: Projects and Statuses.
Now, that in of itself is fine; at this point, there is no reason to change anything. We can have those entities both references the Companies table via CompanyID, and both of those tables can have single column identity primary keys.
However, also in our data model, we see that at some point a child entity of Projects (Tasks) requires that we relate Projects to Statuses. Now, suddenly the fact that Parents and Statuses have a common parent -- CompanyID -- is very important. This is because we cannot allow a Project from one company to be assigned to a Status from another. Once we identify this situation, we realize that the way to handle this is that the primary keys of both the Projects table and Status table must also include the CompanyID column.
With the composite primary keys to those children with the common parent in place, we can now relate them to each other as needed and we will have full data intregrity between those two entities and any of their children. You can see this in our new data model, as it will not allow you to assign a Status to a Task unless it has been defined for the Task's company. We would not be able to enforce this without ensuring that the Task table had a full normalized, non-redundant CompanyID column, and the way to do that was to ensure that the primary key of Projects was CompanyID/ProjectID and that the primary key of Status was CompanyID/Status. The rest follows naturally when assigning the foreign key constraints.
Another Example
Suppose we have Stores, and each Store has Departments, and each Department has Employees. That is a straight-forward, simple data model where no two entities share the same parent, and we can just use single-column identities all the way throughout and be fine.
But what if each Store has multiple Locations? And each Employee works at a Location. We cannot juse use LocationID as a primary key of Locations, otherwise we can assign any random Employee to any random Location, regardless of the Store they belong to.
So, our Location table needs to have a primary key of StoreID/LocationID. Now we also need the StoreID column in the Employee table to relate Employees to Locations. To get this, we need to also include StoreID in the primary key of the Department table. With that in place, we can now properly relate Employees to Locations and we are guaranteed they will be for the same Store.
Once again, once we saw that two entities had the same parent (Locations, Departments), and that somewhere "down the line" those two entities would be related (via Employees), it becomes clear that we need to incorporate the common parent's primary key (StoreID) into the primary key of those two child entities (Locations and Departments) to enable that relation (in the Employees table) to function properly.
Summary
There are 3 main points I am trying to make here today:
- The "all-identities-all-the-time" approach can work well in some cases, but not in all. If you insist on only using all identities for all of your primary keys, you're not always going to have strictly accurate data integrity.
- Don't forget that Identity columns can be part of a composite primary key.
- When a table has multiple child entities in the database, and there are any relations between any of those child entities, then all of those child entities should have a composite primary key that includes the parent table's primary key columns to facilitate those relations.
Legacy Comments
Ramon Leon
2008-08-14 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "Because now we have full, complete referential integrity, that's why!" And that's the big disconnect, you think that matters too much. I don't want to accurately model my data because for most systems there's *no such thing*, the model is constantly changing to meet ever changing and often illogical but none the less *required* business rules. I'd rather flexibly model my data with auto keys and use constraints, triggers, or the application itself to impose such *temporary* business rules upon the db. Making the relationship between projects companies and tasks and statuses explicit in the keys makes permanent something that very likely will change. I've worked with large schema modeled your way and they're just a real pain in the ass that require constant changing and tons of complex manually written queries that are very brittle to change. Your managers don't give a rats ass about relational integrity or proper models. When they decide it'd be neat if two people from two different projects can coordinate on a single task that appears to them to be in each of their projects, they want it done, and that task is made vastly easier by not making *proper modeling* the goal, but by making *flexibility* the goal, which is all the auto-keyers are after. Flexibility vastly outweighs referential integrity in many if not most cases. I'll trade relational integrity for flexibility and each of programming with a data model that's flexible to biz rule changes any day. Multi column keys infect the whole database and solidify the schema in way that just doesn't jive with the change heavy world many of us live in. |
Paul Keeble
2008-08-14 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data This is a great example of why the relational model is becoming increasingly difficult to deal with. Either you accept that the model itself doesn't perfectly represent what you need or you have to fudge the identity to implement a model. Objects are a better way to think about problems and the relational model will one day have to be retired. In the meantime putting the contraint in the code is much better than fudging the database identity model to compensate, its simplier cleaner and clearer that way. |
jeff
2008-08-14 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Ramon -- great points all around, thank you. Like I said -- it all depends if you want to "approximately" model your data or not. |
jeff
2008-08-14 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Paul -- I'm afraid I don't understand what "fudging the database identity model" means. You do understand that identities are just a tiny part of relational database design, right? Many large, complicated, well-designed relational databases have very few identities at all; some have none. And they all function fine. It is a big mistake to think that "a database is something with lots of tables that all have identities for primary keys". |
Mike Desjardins
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data I think another thing driving the "identity column all the time" mentality is the proliferation of ORM frameworks, which are often designed to work very will with identity columns, and not so well with composite primary keys or business primary keys. |
Andy
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Why not just introduce a new constraint system that allows a table to join another table based on a second column's value? # SELECT * FROM addresses WHERE (addresses.addressable_id = 7 AND addresses.addressable_type = 'User') # SELECT * FROM addresses WHERE (addresses.addressable_id = 17 AND addresses.addressable_type = 'Order') |
jw
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data What I see happened here is an incorrect implementation of a many-to-many relationship between the companies and the statuses. The way you described here would be fine until you get a lot of company records. After that, I think you'll have waaaay too much redundant data in the statues table to be easy to manage. Why not just added a relational table (Companies_Status) with two columns (CompanyId, StatusId [both as a composite primary key]) to manage all of this? |
Jeff S
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data jw -- it's a simple example; yes, you could have a global table of Statuses and then have a table of CompanyStatuses. That doesn't change anything, though, and really has no relevance to the overall topic discussed in the article. Everything else still applies, including the fact that you need to propagate CompanyID to other tables in order to enforce FK constraints on the CompanyStatus table. |
bd
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data JW makes a great point. Tables like Company_Status are what us old timers sometimes call 'relief tables' and they are a very nice compromise between data integrity and flexibility. |
Jeff S
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data bd -- again, that has no absolutely bearing on anything I am discussing. The Status table I present in my simple example is exactly the same as the "Company_Status" table you are proposing, except it doesn't have a foreign key reference to a master table of Statuses. Nothing else is affected; everything else is completely, 100%, exactly the same. I am missing something? Please feel free to elaborate. |
jw
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data I guess I don't agree with main point #3: "When a table has multiple child entities in the database, and there are any relations between any of those child entities, then all of those child entities should have a composite primary key that includes the parent table's primary key columns to facilitate those relations" Seems like a waste to me when in the systems descibed here. |
Jeff S
2008-08-15 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data jw -- did you read the article at all or just skim it? Did you see *why* point #3 is relevant? That's kinda what the entire thing is about -- ensuring referential integrity and that a Status from one Customer isn't assigned to a Task from another. And the way we do that is by following point #3. |
Sean T
2008-08-17 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Careful, careful. Storing redundant data is a deadly sin according to the Church of the Holy Surrogate Key. "Withhold not correction from thy data: for if thou strike it with the trigger, it shall not die. Thou shalt beat it with the trigger, and deliver thy data from hell." (II Identities 23:13-14) |
John H
2008-08-17 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Seems we have forgotten a thing or two about entity modeling and normalisaton. Sure status within a task (or even a project for that mater) is dependant on the company, but when you look at the simplest example you will find that you have broken normalisation rules. e.g. A status of Closed. Say all companies want closed - then you have multiple repeating data in the status name and description that breaks normalisation rules. What is my solution, renormalise - a status is singular of statuses it contains all available statuses that you can assign to a company. A new table CompanyStatus would be created with a link to status and company and additionaly have any other company constraining or defining fields (extra descriptions, active/inactive flags etc). This table could have either a single identity as its pk with a unique constraint on company and status or both of the above foreign keys, in either case uniqueness would be kept. In the task or project you then reference by the pk whether that is the identity or two part key. |
Jeff S
2008-08-17 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Sean T / John H -- Thanks for your feedback, but I recommend that you read the other comments! That's been already covered, and your point, while appreciated and valid, it has absolutely no bearing whatsoever in any way on the focus of the article. I guess I picked an example that is too overly simplified. |
Jeff S
2008-08-17 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data John H -- I just noticed this line from your comment: >>In the task or project you then reference by the pk whether that is the identity or two part key. The entire point of the article is why you should NOT reference by the identity!! You completely missed the entire point of what I was writing if you think that you should; if you do, you cannot maintain referential integrity between CompanyStatus and Tasks, as explained. Perhaps try reading again? |
Sean T
2008-08-18 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "The entire point of the article is why you should NOT reference by the identity!! You completely missed the entire point of what I was writing if you think that you should; if you do, you cannot maintain referential integrity between CompanyStatus and Tasks, as explained." What I was trying to get across in my attempt at humor is that referential integrity is easily enforced with a simple trigger that validates that the status company and the project company are the same, and without the need for breaking normalization. The problem you're describing here is a limitation in the implementation of foreign key constraints, not a weakness in using identity values. OFC, the real challenge is accommodating the seventeen quadrillion possible real-world situations that won't comply with your business logic because people, regardless of how much we hate them for it, will always try to put square pegs in round holes. |
jeff
2008-08-18 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Thanks, Sean, but if you can simply enforce cascading updates and deletes with RI, that is far superior to writing add/update/delete triggers for BOTH the primary table and the referenced table and doing it all manually. In short, I'll take a simple foreign-key constraint over 6 triggers any day of the week. >>OFC, the real challenge is accommodating the seventeen quadrillion possible real-world situations that won't comply with your business logic because people, regardless of how much we hate them for it, will always try to put square pegs in round holes. And that's why we create data models to not allow them to do that. Should we allow users to put the name of their dog in a "birthday" column? No. So, is it a bad design if we create something that doesn't allow the user to do that just because they "want to"? Of course not. |
Sean T
2008-08-18 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "In short, I'll take a simple foreign-key constraint over 6 triggers any day of the week." If the foreign key is the simplest way to handle it, yes. In this case, a trigger or a check constraint carried out by a udf (the latter probably being the more RI way to do it) is more appropriate. We're into territory where the database can't guess what we want from pure relational logic - add in an additional rule that the Cancelled status can't be assigned by anyone but certain users, and your FK logic starts becoming difficult for someone else to look at and understand. In this case, I firmly believe that logic outside of a FK constraint is the most appropriate. "And that's why we create data models to not allow them to do that." I'm talking about more than just an incorrect data type. The business rules state that cash-only customers cannot charge sales to their account. Customer 123 is a cash-only customer, but the store manager decides to ignore the rule and writes up a manual charge ticket anyway, then calls IT at 5:00 on Friday afternoon because the system won't let him input the ticket. Come 8:00 on Monday morning the executives are going to be much more upset that inventory, sales, and AR are off than because the store manager is a bonehead. |
jeff
2008-08-18 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Sean T -- Check constraints don't handle cascading updates or deletes, so you need triggers for that. You also need check constraints on BOTH tables -- the primary table AND the foreign key table -- to maintain your integrity because data can change in either. So, we have at least 2 triggers and 2 check constraints, or one simple FK constraint. Seems like an easy call to me. As for your business rule, I'm afraid I just don't understand what your example has to do with anything I am talking about. What you described is strictly an application issue, not a database issue, and has nothing to do with database design. Should your system accommodate a "bonehead" user who wants to back-date orders? Who wants to put in non-integer values for quantities? Who wants to put into 2 billing addresses for a single order? Where do you draw the line? |
Sean T
2008-08-18 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data It has everything to do with database design, because you somehow have to get that information in the database, regardless of whether it's right or not. Back-dating orders and incorrect data types can be handled at the user-interface; somebody handing a customer some product and expecting the system to just accept the transaction after the fact can't be handled by the app - you have to get that data in there, no matter how wrong it is. To relate it to your sample project - what happens when a project gets reassigned to a different company by upper management (or was simply created by user error under the wrong company to begin with)? Option A is to disable the foreign key constraints so you can perform the updates, and Option B is to duplicate the data. "Check constraints don't handle cascading updates or deletes" Nor does your sample. You can't enable cascading on both the reference to Status and the reference to Products - you can only do one or the other (in SQL2005, at any rate). It doesn't mean much for your sample, but if you depend on cascading, that would break RI in a larger application where you may have multiple tables involved in the relationship. |
Jeff S
2008-08-18 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data >>To relate it to your sample project - what happens when a project gets reassigned to a different company by upper management (or was simply created by user error under the wrong company to begin with)? Option A is to disable the foreign key constraints so you can perform the updates, and Option B is to duplicate the data. Or Option C: Copy the data to the new customer, delete the old data. Or Option D: Drop all data dependent on the old company, since by definition it is not valid under the new company (right?), then change the companyID to the new companyID, and then re-create the data dependent on Company. >> Nor does your sample. You can't enable cascading on both the reference to Status and the reference to Products - you can only do one or the other (in SQL2005, at any rate). Good point -- I had forgotten that in SQL 2005, you can't do both. Annoying, but true. However, >>It doesn't mean much for your sample, but if you depend on cascading, that would break RI in a larger application where you may have multiple tables involved in the relationship. Actually -- no! It doesn't. Why would the RI break? Sure, maybe we just cannot Add/Delete data and have it cascade automatically -- but that's because the system will STOP us from adding/deleting the data due to enforcing the RI. This is an odd debate. I'm not sure what your point is -- are you an "always use identities all the time no matter what for your primary key and relations" person? Do you think databases should never have composite keys? |
Ramon Leon
2008-08-19 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Chiming back in... Mostly, yes, with the exception of join tables I've completely banned composite keys from any database I work on and have gotten back much needed flexibility and ease of programming needed to actually accomplish any work in the change heavy real world. Keys are just too permanent to ever try and enforce any kind of biz rule with them, they exist for the sole purpose of serving the *physical* need to easily join tables and select/delete particular rows. Anything else belongs in code (app/trigger/constraint) that can be easily changed without requiring a change to the physical model. It's easier to update 20 triggers than it is to reorganize a schema because you changed a key and rewrite the program to deal with the new schema. Entity tables get an auto incremented integer key and join tables a multi column key to the tables it's joining. It's not the ideal relational model, but it sure as hell is the ideal flexible to change and easy to program to model that reality requires. If there is a natural candidate key, it gets a unique constraint and is considered a logical key, it is never allowed to be the physical primary key, that's simply suicide. Programs *require* stable physical models to survive without every change in rules causing a bunch of query rewrites. If you consider the keys part of the physical model, then it's pretty simple to settle on a stable generic style of datamodel that above all is flexible while not making too many compromises. |
Sean T
2008-08-19 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "This is an odd debate. I'm not sure what your point is -- are you an "always use identities all the time no matter what for your primary key and relations" person? Do you think databases should never have composite keys?" I think it's odd because of the approach. The 'solution' we have here treats the surrogate keys as though they were natural keys - but that's not consistent with the use of surrogate keys. Composite keys are consistent with the use of natural keys. When using surrogate keys based on identity or guid values, you accept this sort of referential weakness and avoid it through code to get greater flexibility and performance. When you use natural keys, you accept the complexity of composite keys and the inflexibilty of change in order to gain greater integrity and user-meaningful data. This sample mixes the two - the result is an interesting exercise in referential logic, but I'll go out on a limb and say that most people that use identity values would rather enforce RI through other means and keep their single-value keys. I won't get into the surrogate vs. natural keys debate - I've done enough of both to only care that it's consistent. |
Sean T
2008-08-19 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Oh, I never came right out and said what my point was! In a case like this, you're going to break a rule either way - either you lack referential integrity or you violate normalization. I'm personally going to favor normalization and let other code be responsible for enforcing the rule. |
Jeff
2008-08-19 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Sean -- Nothing about what I have written violates normalization in any way. What if we forgo ProductId as an identity, but make it a company-specific "product code"? so, the pk is companyid/productcode - right? Then everything else still applies and we are perfectly, 100% normalized. I think you must be getting hung up on using an identity as *part* of a PK -- and there's nothing wrong with doing that as I tried to state several times in the blog post. Just because we happen to use an identity to help autogenerate unique product "codes" doesn't mean we *must* use that identity as a single column PK! |
Sean T
2008-08-20 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data For both the Projects and Status tables, you have an attribute in the primary key that has a full functional dependency on another attribute (CompanyID in both cases). Those two tables don't satisfy 1NF. So, yes - I am hung up on the identity being part of the primary key. More explicitly: In the relation Projects, ProjectID->CompanyID, therefore (ProjectID,CompanyID) does not satisfy the requirements of being a candidate key because there is a subset of (ProjectID,CompanyID) that uniquely identifies each record. Since (ProjectID,CompanyID) is not a candidate key, it therefore cannot be a primary key for the purposes of satisfying 1NF. 1NF requires a such a primary key. All other normal forms explicitly or implicitly require 1NF. Therefore, the Projects table is not in any normal form if (ProjectID,CompanyID) is defined as the primary key. "Just because we happen to use an identity to help autogenerate unique product "codes" doesn't mean we *must* use that identity as a single column PK!" An identity doesn't have to be involved in the primary key (it can be a candidate key without being the primary key), but if it is, then it has to be the only participant in the PK to satisfy normalization rules. If we weren't using identities, this would be a non-issue - but because we are, we have to choose between referential integrity constraints or normalization. Because I can enforce referential integrity by other means, I would choose to normalize the tables to be consistent with the rest of the database and enforce RI from code. Obviously, you can enforce RI explicitly here and approximate normalization through other means to achieve the same result. I'd personally rather be consistent in table design and add some code to enforce rules or generate exceptions. |
jeff
2008-08-20 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Sean T -- I am sorry but, no, you are 100% wrong about your definition of normalization. Just because a column happens to *be* unique doesn't mean that you *must* use that as your PK! There are no laws of normalization that says an identity, if it exists in a table, MUST always be the primary key of that table. That is simply incorrect. A DateStamp column might also be unique for all rows in a table -- but that doesn't mean it needs to be the PK! That is exactly the same as using an identity. The identity is just like any other column, and we are just happening to use the fact that it auto-increments to make things easy, as a helpful feature. Again, it seems like you perhaps cannot get past the fact that an identity is just a database feature that helpfully can provide auto-generated values in a column; it is NOT a fundamental part of the relational database concepts! If I use a trigger to put a unique, auto-incrementing value in my ProductID, instead of using an identity, is that OK in your book? If I use application code to generate a unique, auto-incrementing *character* value in the productID column, what about that? I really appreciate your feedback, but you seem to be a bit confused about what an identity column actually is and how you can make use of it. Mayeb this will help you: If SQL Server provided a special type of IDENTITY feature where you could say "Increment for each CustomerID", we might very well use that feature instead. But, it doesn't have that, so we just use a standard IDENTITY, because it WORKS JUST AS WELL. We just don't get 1,2,3 for each CustomerID, but we do get nice, unique ProductID's for each Customer just the same. Does that help clear things up for you? |
Sean T
2008-08-20 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "I am sorry but, no, you are 100% wrong about your definition of normalization." You have a full functional dependency in one attribute of your primary key. That is not normalized by any definition of normalization. "Just because a column happens to *be* unique doesn't mean that you *must* use that as your PK!" I said exactly that - it's a candidate key, and a candidate key does not have to be the primary key. BUT, the primary key IS a candidate key - and your particular composite key, by virtue of being an identity, does not qualify. Fortunately for us, SQL Server does not enforce any particular kind of normalization rules on us - so we're free to fudge. "If SQL Server provided a special type of IDENTITY feature where you could say "Increment for each CustomerID", we might very well use that feature instead. But, it doesn't have that, so we just use a standard IDENTITY, because it WORKS JUST AS WELL." If 'works just as well' is the criteria, why do we need the inconsistency of a composite primary key at all? If we're not striving for the perfect referential integrity and normalization solution, then why wouldn't I just set it up this way?: (note, writing it in SQL instead of trying to do ASCII art without a preview) [code] -- make some tables create table Companies ( CompanyID int not null primary key, InsertMe char(1) not null ) go create table Statuses ( StatusID int not null identity primary key, CompanyID int not null references Companies(CompanyID) ) go create table Projects ( ProjectID int not null identity primary key, CompanyID int not null references Companies(CompanyID) ) go create table Tasks ( TaskID int not null identity primary key, ProjectID int not null, StatusID int not null, CompanyID int not null ) go -- mimic composite key functionality create unique index IX_statuses_composite on Statuses(StatusID,CompanyID) create unique index IX_projects_composite on Projects(ProjectID,CompanyID) go -- relational constraints alter table Tasks add constraint RI_task_status_company foreign key (StatusID,CompanyID) references Statuses(StatusID,CompanyID) go alter table Tasks add constraint RI_task_project_company foreign key (ProjectID,CompanyID) references Projects(ProjectID,CompanyID) go [/code] Same net result, but with identity PKs. Or if the requirements change and require an intermediate table... [code] create table Subprojects ( SubprojectID int not null identity primary key, ProjectID int not null references Projects(ProjectID) ) go create unique index IX_subprojects_composite on Subprojects(SubprojectID,ProjectID) go alter table Tasks add SubprojectID int go alter table Tasks add constraint RI_task_subproject foreign key (SubprojectID,ProjectID) references Subprojects(SubprojectID,ProjectID) go [/code] So, I've used the same relational logic, but with an identity PK in every table - and it works. I'm not finding the stated insufficiency of an identity primary key based on your example scenario - especially if it can't clear up the normalization bumps that are present in both. |
Jeff S
2008-08-20 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Sean -- Your Tasks table isn't normalized in your example, because you have the redundant CompanyID column there. CompanyID is fully dependent on ProjectID, right? I really appreciate your feedback, but I think you are just not going to quite understand that the identity column is just an arbitrary example of a column that has values we are using as part of our PK. Yes ,it is unique, because that's the SQL implementation of identities and we can't declare it as IDENTITY(1,1) PER CUSTOMERID, we just don't have that ability. So, maybe we WANT CustA, 1 CustA, 2 CustB, 1 CustB, 2 CustB, 3 CustC, 1 .. and so on, because that would be nice and neat, but it is JUST AS GOOD for our purposes if we can use a simple, existing SQL Server feature that gives us: CustA, 1 CustA, 2 CustB, 3 CustB, 4 CustB, 5 CustC, 6 Because that ALSO is unique per combo of Cust/ProdID. And the reason why we use a composite key is clearly outlined in the article. If this final explanation doesn't work, then I can't think of how else I can try to rephrase this. |
Michael Valentine Jones
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data There is another way to setup the FK reference without changing the PK of the parent table. The way I handle it is to leave the PK on the ID Identity column, and add a unique constraint on the ID, CompanyID column combination. The FK does not have to reference the PK of the parent table; it can reference any unique constraint. This eliminates the objection to the partial key dependency of the PK since it remains as only the ID column. Also, FK references that do not care about restrictions on CompanyID are not forced to carry the CompanyID column. You might object that it uses more disk space for the additional constraint. That's true, but I want the PK to be only the ID column, and in most situations the additional index on the parent table in not large. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Hi MVJ -- yes, you can do that; in the post I linked to, that's what I did. However, as I mentioned a few "feedback" posts back, now you are not normalized; CompanyID is fully dependant on ProjectID, so CompanyID should not be in the tasks table. |
Michael Valentine Jones
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data I am not sure I understand why you think the tables would be denormalized in my solution, especially since the columns in the Tasks, Status, and Projects table in my solution would be exactly the same as the tables in your solution. The only real difference is that the ID columns of the Status and Projects tables would remain as the primary keys of those tables, and Status and Projects would each have an additional unique constraint containing the ID and the CompanyID. This additional unique constraint is obviously the same as the PK in your solution. The FK reference from Tasks to the Status and Projects tables would be exactly that same as the FK reference in you solution, except it is not referencing the PK, it is referencing the unique constraint. This solution overcomes a possible objection to your solution: You might have another table that references Status that could reference any row in that table. In that case, you would be forced to carry the redundant CompanyID in the new child table, even though it is not required. In the case of the Tasks table, CompanyID is not redundant, since it is constraining the values in that row to a particular subset of rows in the Status and Projects tables. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data MVJ -- >> You might have another table that references Status that could reference any row in that table. No! Status is a fully dependent child entity of Company. I should have renamed it CompanyStatus, perhaps, to make this more clear. That is the whole point of this exercise. Now, if it is named CompanyStatus and you have a parent table of Status codes, then you would reference that parent table directly if the company is not important for that relation, right? Also, don't you see that by having CompanyID in the tasks table that it is not normalized? CompanyID is fully dependent on ProjectID, since ProjectID is the PK of the Projects table and CompanyID is just an attribute. So, copying down CompanyID to the tasks table, which has the ProjectID column in it, would denormalize your schema. |
Sean T
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "Sean -- Your Tasks table isn't normalized in your example, because you have the redundant CompanyID column there. CompanyID is fully dependent on ProjectID, right?" That's exactly the point. It's YOUR example, but with the declarative primary key changed - SQL Server doesn't care that what declared primary key isn't *really* the primary key, and doesn't force us to relate by it. So, by just moving the words "primary key", you can see the lack of normalization? Why would it be normalized just by moving the arbitrary words "primary key" to the columns we're actually using for the relationships? It wouldn't be. I took it a step further to highlight that it isn't normalized by adding the Subprojects table - not only does it require an additional redundant value in Tasks, it also demonstrates that for any table that doesn't have multiple relations to the Companies table, ProjectID is the obvious primary key - not the composite. That's by virtue of the identity being an identity - or rather, that's the virtue of a surrogate key being a surrogate key. So, neither of our solutions is normalized. They can't be. If neither of them are normalized, and every other table in the database has an identity PK, why would we give this one type of table a composite key for RI when we can accomplish the exact same thing while the identity is the declared PK? "Yes ,it is unique, because that's the SQL implementation of identities and we can't declare it as IDENTITY(1,1) PER CUSTOMERID, we just don't have that ability." What would be the purpose of such an identity? It no longer uniquely represents a row, and isn't a single-column candidate key. The whole point of this article (I thought) was to keep the surrogate key while still enforcing a certain type of referential integrity. "I really appreciate your feedback, but I think you are just not going to quite understand that the identity column is just an arbitrary example of a column that has values we are using as part of our PK." Your article supposes that the identity already exists and is being used for its intended purpose, which is hardly arbitrary. |
Michael Valentine Jones
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data I am afraid I really do not understand what you are saying about my design. My solution has column CompanyID and so does yours, so if it is redundant in my design, then it is redundant in your design. My design has exactly the same FK references as your design, so we are still the same there. Having CompanyID in the Tasks table is not redundant, since it’s purpose it to constrain the row to being able to reference only a combination to Status and Project that have the same CompanyID. This is unique information fully dependant only on the primary key of the Tasks table, since without it there is the possibility of the referenced Status and Project rows having different CompanyIDs. If the point of this is to accurately model the data, then my design does that. An objection I have to having CompanyID be part of the PK of the Status and Project tables is that it is physically possible to insert a row in these tables where the ID is a duplicate of another row, as long as they have a different CompanyID. This throws away a big plus of the surrogate key design, since you can no uniquely reference a row by the single column ID. Of course, you could add a unique constraint to Status and Project on the ID columns to prevent this. In that case, you end up with a design that is the same as mine except with the definitions of the PK and unique constraints reversed. So really in the end, these designs are logically the same, with the only difference being in the physical implementations of the PK and unique constraints. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data >> The whole point of this article (I thought) was to keep the surrogate key while still enforcing a certain type of referential integrity. No, that is completely NOT the point of the article. It's simply how you can enforce referential integrity by simply using composite keys in situations that require triggers and/or addtional (and redundant) unqiue constraints if you only ever use single-column keys. That's it! Once again, using an identity is just OPTIONAL and the only point is that they can be useful to help generate unique values as PART of a composite key, just as they can be useful to generate unique values for a single column primary key. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data MVJ -- No. In my design, ComanyID is NOT dependent on ProjectID. They are a composite primary key that TOGETHER defines a single project. In your design, a project is completely defined by the ProjectID column, and CompanyID is just an attribute of the Project. Surely you can see the difference? |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data >>Your article supposes that the identity already exists and is being used for its intended purpose, which is hardly arbitrary. Again, no ... the article is not about "updating an existing schema and maintaining the existing keys". The entire premise is CHANGING the primary keys completely! It is simply about designing a physical data model, and how one logical model may work great with single-column keys, but another might not. And we can greatly simplify and get full RI without triggers/check constraints/additional unique constraints by using composite keys. |
Sean T
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data We're going to have to agree to disagree then. You're suggesting a way to do it without a surrogate key, but the title and content of the article are misleading. Anyone reading it with great interest (as I did) will likely be someone who has run into this particular RI problem before, but didn't have the option to NOT use a surrogate key as the PK. Or if they happen to be a fan of surrogate keys, they're going to use them anyway. However, by breaking normalization a tiny bit in the Tasks table, it is possible to achieve the same level of declarative referential integrity while maintaining full use of surrogate keys (as per my example, which might be the same that Michael is proposing). The drawback is that cascading doesn't work because of SQL Server's cycle detection, and that duplication of data is necessary if a project is reassigned to another Company - the same issues that occur if a composite key is used, as per our earlier discussion. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Sean T -- I think the only thing we disagree about is that it is OK to use composite keys when modeling data. So, again, to summarize, we have two choices: 1) Use composite keys; have full RI; be fully normalized; relate tables using primary keys or 2) Use only single-column surrogate keys; need to also add unique constraints; have full RI; store redundant data and be de-normalized; ignore primary keys and use unique constraints when relating tables I'll choose option #1. |
Michael Valentine Jones
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Jeff: "In your design, a project is completely defined by the ProjectID column, and CompanyID is just an attribute of the Project." If you want to allow the ID on those tables to be non-unique, then your design is different. Sorry, I didn't get that from your blog post. However, that does not make my design invalid. CompanyID in the Tasks table is not redundant, because it serves the purpose of specifying that this Task is only for a specific company, and that information is not contained anywhere else in the database. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data >>However, that does not make my design invalid. CompanyID in the Tasks table is not redundant, because it serves the purpose of specifying that this Task is only for a specific company, and that information is not contained anywhere else in the database. It's already there. A Tasks is a child entity of a Project. Projects have a CompanyID attribute. Thus, Tasks are assigned to a Company via their Project. |
Sean T
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Jeff: There's nothing wrong with using composite keys - the mistake in the example is the participation of an identity value in a composite key if you're not intending to use it as a surrogate key. It serves no purpose when you treat it as a natural key - use the project name or some user-assigned identifier. It is not possible to achieve normalization with an identity value in a composite key, and worse, it's confusing. Michael: It should be related to the CompanyID via the ProjectID and the StatusID, which is why it's not perfectly normalized. However, since the DBMS doesn't have nested FK constraint we can use, most poeple will say it's good enough and move on. |
jeff
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data >> It is not possible to achieve normalization with an identity value in a composite key Again, that is wrong. The identity column doesn't define an entity, it is just ... argh ... never mind. This is going nowhere, I'm just repeating myself. >> , and worse, it's confusing. I guess I can't argue with you there. |
Michael Valentine Jones
2008-08-21 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data "Michael: It should be related to the CompanyID via the ProjectID and the StatusID, which is why it's not perfectly normalized. However, since the DBMS doesn't have nested FK constraint we can use, most poeple will say it's good enough and move on." In my design it is related to the CompanyID via ProjectID and the StatusID exactly the same way yours is. My design has the identical FK constraints that your design has; there is no direct reference to the Company table. The only difference is that I have retained ProjectID and the StatusID as the single column primary keys of those tables and added a unique constraint to allow a FK reference to Projects on (ProjectID, CompanyID) and a FK reference to Status on (StatusID, ProjectID). This is the exact same FK reference that you have defined. Logically, this is identical to your design with a slightly different physical implementation. |
Paul Brown
2008-08-26 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Hey Jeff, just a quick note to say thank you for this outstanding blog. I've been using SQL Server for years (though I happily consider myself something of a perpetual beginner) and everytime I come here I learn something new. |
CPS
2008-10-09 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Jeff, Is there any kind of constraint that says that a Task's Status and Project need to be of the same Company? For example: Adding a task with status of company ID = 2 and status ID =1 and then company ID = 1 and project ID = 1. After reading through the article and feedback, I haven't been able to come to a conclusion to this. I assume that this would need to be handled by the application using this database, and not the DBMS itself. Aside from that, I agree with your article wholeheartedly because it solves the problem of full referential integrity using primary and foreign key constraints. |
Fabiano
2008-10-23 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Jeff I found your blog last week. I’m surprised with the subjects and articles. Very nice. About this one, I think there are some great points, but agree with Ramon too. We really need more flexibility in OLTP databases. Fabiano Saffi |
Brian Tkatch
2009-02-09 |
re: Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data Jeff, Excellent article. There is a clear difference between using a database as the data authority, or using it as a data store. The former is what you described, the latter is described in some of the comments. To be a data authority, it requires rigidity, which means some things may be harder. To be a data store means giving up guaranteed integrity, but things are much easier. I think it should be noted that to use a database as a data store, is to ignore most of the database, and probably hints that the optimizer could have made use of. There is a comment above about duplication of data, The author is incorrect. Data duplication is only bad when it can be different or is redundant. In you example, the composite FK keeps the sub-id connect to the parent id. Thus it cannot be different. It is also useful for RI, meaning it is not redundant. |
Bugeo
2010-09-10 |
some danger on using compound keys? Hi, suppose that today i have and application that relate with a large db. In that Db i have a table that use a compound key (2 fields). My application can update, delete (and can do other many stufs) the record in this table using queryes that use the 2 fields that compose the compound key. Tomonrow the DBA add a field to the compound key. Now the compound key is mad of 3 fields (2 old ones + 1 new field) When the user of my application pres the button "Delete that record" and the software call the function func_delete_thatRecord(key1, key2) the software not give him a error! May be that hi update, delete, (or something else) more that 1 record! |