SQL Data Modeling: Entities versus Attributes

Tue Jun 19, 2007 by Jeff Smith in techniques, database-design, datetime-data

Introduction

<o:p>There’s a handy little rule of thumb I use when developing my database models. It comes up often in situations like this:

If we are storing phone numbers in the database, should AreaCode be a “free-form” char(3) column? Or should it be a foreign key reference to an AreaCodes table?

Well, like most things in life, it depends. There is no universal correct answer. Some people will say that if you are storing the same piece of information more than once in your tables, you don’t have a normalized design. Again, that is sometimes true, sometimes not. For example, I store the number 0 and the first name “John” lots of places in my data – should I have tables defining all of these values? Should every column in all of your tables, then, be a foreign key relation to some table that defines everything? Of course not.

It gets confusing because every database is different, depending on what it is modeling. When trying to determine how to handle the "area code situation", I have found that asking the following question is useful:

Is it an entity or an attribute?

The answer to that is the key to developing your model. Is the piece of information that you are storing simply an attribute of an entity, or is it an entity itself? A entity should be defined as a table, but an attribute should simply be a column of a table.

Area Codes

Let’s look two scenarios, using area codes:

Scenario #1: In an accounting database, we are keeping track of Vendors. For each vendor, we store contact information such as their mailing address and phone number.

In the above scenario, our accounting database doesn’t need to formally define a concept of “Area Codes”. There’s no reporting of area codes, no tables storing distances or town names or billing rates per area code. It is simply an attribute of a phone number, nothing more. Thus, an area code is simply an attribute, not an entity, and it doesn’t make any sense to have an Area Codes table.

Scenario #2: In a phone billing database, we track all phone calls made, the cost rates, whether or not it is long distance, and so on. We want to return the city of each phone call in the reports, and provide usage stats for different cities and area codes.

Here, area code is important – it is an entity. We might store tables with rates from area code to the next, or a list of towns per area code, and so on. Any phone number stored in this system should have a strong relation to the area code table for reporting and data integrity. Thus, we would certainly create an Area Codes table in this database.

Cities and States

Cities and States are other examples. We often want a State to be a formal entity because it is useful to reference them by either the full name or abbreviation, we’d like to provide lists of states in drop-down lists, there's a short, clearly defined list of them, and so on. Thus, it usually makes sense to have a table of States and to formally define them as entities, not attributes, in your data.

Cities, however, are much like area codes – it depends what you are modeling and if the concept of a “City” entity is important to your data. Often, it is just a simple attribute. But if you are modeling city populations and mayors and locations and distances and what state they belong in, then a City is a formal entity that you are tracking and relating, and it belongs in a formal City table in your database.

Dates

Even dates can work the same way. If we just want to store transaction dates, or birth dates or effective dates and things like that, it is just an attribute. But if you have accounting periods or if you need to store the first day of the month to represent that entire month as a whole, then you have some meaning and you’ll need to decide where it makes sense to set up entity tables that defines what these dates are.

For example, suppose we have a Budget table that stores one row per month per project, and the month is stored as a datetime value, with the convention that it is always the first day of the month:

create table Budget (ProjectID int, Month datetime, Amount currency)

You could add a CHECK constraint that ensures that only the first day of a month is ever used, but I would argue that instead you should have a foreign key relation to table of Months:

create table Months (Month datetime primary key, MonthNumber int, Year int, ….)

The reason is that the concept of a Month is important to your data; that datetime value in the budget table isn’t just a random datetime value, it represents a Month entity. That month might have cut-off dates or a "closed?" status, or other important attributes; it is also useful to have a Months table defined for drop down lists and reporting purposes (i.e., to use in OUTER or CROSS JOINs). Months also might be useful for other relations for other entities in your data, such as accounting dates, billing periods, and so on. Thus, in this case, I feel it makes sense to formally define Month entities to keep your data clean and consistent.

Conclusion

Data modeling is an art, not a science, and there are no hard and fast rules that always apply. (Well, except for normalization) In general, however, I find the "attribute or entity?" check to be a helpful guideline to follow when developing your data models.

see also:

Legacy Comments


Travis Laborde 2007-06-20	re: SQL Data Modeling: Entities versus Attributes Interesting post, thanks! But what about the more basic yet still arguable question "does the area code deserve it's own column or should the whole phone number be in one column, including the area code?" I'd say the same guidelines apply - if you depend on the Area Code for anything then make it a column by itself. And, if you store the whole phone number together, should it have the () and the - stuff, or be just 10 digits? Obviously this only deals with US phone numbers. Still, a discussion of these topics would make the post even better :)

Andrew Beaupre 2007-06-21	re: SQL Data Modeling: Entities versus Attributes In regards to phone numbers, I feel it's best to strip out all formatting whatsoever. Data consistency is important. One can then have a separate column which holds the formatting in the way which the user inputted the data. This gives you the ability to format it in a standard way if you like, or rely on the user for the correct formatting (which varies from country to country). Area codes are a North American thing. While Europe does have common prefixes (02 - UK landlines vs 07 - UK mobiles), these aren't regarded in the same way as North American area codes. If it's necessary to maintain international compatibility, a dedicated area code field wouldn't seem to apply. You could always extrapolate the area code from the phone number, provided the data was in a consistent format. But this is all part of the article's topic - design your schema to fit the circumstances. All too often I see people go normalization crazy which, in the end, harms performance and over complicates things.

Daniel 2010-01-15	A similar thinking I often think of it this way: "Will it depend on other data, or will other data depend on it?" It's the same thing as you said, but entity and attribute are kind of abstract notations, and this question already contains the explanation of those 2 terms. But that's just me... :) If it depends on other data -> attribute, else it's an entity. Excellent blog btw.