Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

The problem isn't the poor database model; It's that external code is bound to the schema

Dealing with poorly designed databases is a simple and common fact of life for programmers.  It happens, sometimes due to a lack of experience or education, and sometimes because business requirements were never analyzed properly or changed along the way.  We've all inherited poor database schemas that we've been forced to work with, and of course we've all created those poor schemas ourselves over the years.

For this discussion, I will classify the database model problems we encounter into the following two categories:

Physical Design Problems – The physical design of the schema – tables, columns, data types, constraints, etc. – is causing inefficiencies in performance and/or code.  The design is technically storing the data it needs to, but perhaps it is not normalized properly, or there are lots of duplicates, incorrect data types, columns that contain CSV data, and so on.  It "works", but you have to write large amounts of complicated SQL to do basic tasks.
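
For instance, here is a minimal sketch of what such a schema might look like (the table and column names are hypothetical, purely for illustration):

    -- A schema that "works" but is physically poor: consultant
    -- assignments crammed into a CSV column
    CREATE TABLE Projects
    (
        ProjectID int PRIMARY KEY,
        ProjectName varchar(100) NOT NULL,
        ConsultantIDs varchar(500) NULL  -- e.g., '12,45,307'
    )

    -- Even "which projects is consultant 45 assigned to?" requires
    -- string tricks instead of a simple equality test:
    SELECT ProjectID, ProjectName
    FROM Projects
    WHERE ',' + ConsultantIDs + ',' LIKE '%,45,%'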

Logical Design Problems – The logical design is incapable of storing the necessary data; we simply cannot model what we need to store.  For example, the model might allow for only one address per Vendor, but we need to store multiple addresses for billing versus shipping; or perhaps we can only assign one consultant per Project, but in reality we need to store many of them, all assigned particular roles.
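
To make the Vendor example concrete (hypothetical names again): with a single address column, there is simply nowhere to put a second address, and no query trick will fix that; the model itself has to change:

    -- The logical limitation: exactly one address per Vendor
    CREATE TABLE Vendors
    (
        VendorID int PRIMARY KEY,
        VendorName varchar(100) NOT NULL,
        Address varchar(200) NULL  -- billing? shipping? we can only store one
    )

    -- The fix is a new table, not a cleverer query:
    CREATE TABLE VendorAddresses
    (
        VendorID int NOT NULL REFERENCES Vendors(VendorID),
        AddressType varchar(10) NOT NULL,  -- 'Billing' or 'Shipping'
        Address varchar(200) NOT NULL,
        PRIMARY KEY (VendorID, AddressType)
    )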

(Of course, many data models have both problems!)

I actually feel that logical design problems are better to deal with.  Why? Because they force us to stop development and redesign the database and/or applications, since the design simply is not working.  Sure, it can be a lot of effort, but in the end, if we did things well, the benefit is that we now have a good data model that stores what we need.

Physical design problems occur very frequently, and they are a far worse situation.  They rarely get addressed, fixed, or changed, because the schema is getting the job done.  The result is that, over time, we keep piling on more and more code and hacks and tricks to get data in and out; the complexity of the code grows greatly and performance degrades, but it all still "works".  There is never really any motivation to fix things until performance or code maintenance is so bad we have no other choice.  At that point, of course, it is a huge project, since we have to deal with large volumes of convoluted code.  If we had refactored the design properly earlier on, we'd be better off, but there was never truly that need or motivation to do it.

The reason I bring this up is that we often do need to refactor our designs; in fact, in some cases we should do it even more often than we do (i.e., for physical design problems).  When the logical design doesn't work, we have no choice: it needs to change.  If we try to use "tricks" to make a perfectly good schema work for an entirely different logical model, we end up with a really poor schema, and now we have both physical and logical problems with our model. (Only one place to store the ConsultantID for each Project?  Simple – change that column to a CSV list!)  The worst of both worlds!

In the forums, we often see people asking for help with SQL that ends up being very difficult to write because the database design is so poor; the best solution is always simply to fix the design.  But, of course, that rarely happens.  The reasons given are usually:

  • We have lots of reports that access the schema and we'd need to rewrite them all
  • We'd have to go through our applications and re-write all of the data access code
  • There are DTS packages with SQL statements that would need to be rewritten
  • Other external processes access the database schema, and it would be impossible to track them all down

Now, let's be clear: If your logical model doesn't match your schema, all of these things need to change.  If your code or reports or processes are written to allow only one consultant per project but now you need many, it is not just a physical database change; you'll need to go through and change user interface forms, report output, and so on.  There's really no quick and easy fix to accomplish this.

If only your physical model needs to be altered, that's a pretty compelling list of reasons to leave things "as is".  And that's the real problem here, not the schema!  The issue isn't that changes are necessary; it's that making those changes seems nearly impossible, so it is rarely done.

The poor physical model is one thing, but it is not the worst design decision that needs to be dealt with.  The real issue, the one that makes it difficult or nearly impossible to quickly fix or adjust a schema for better performance or a more normalized design, is that the following rule was broken:

Don't bind code outside of your database to your schema!

Suppose you need to fix that poor physical design and change that CSV column into a related table that stores the values properly in multiple rows.  Yet every report, application, and process out there directly accesses that CSV column and expects things to be stored in that particular way.  Normalizing the schema becomes very difficult, since everything that ever accesses that column must be identified and rewritten.
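
To sketch that refactoring with the hypothetical Projects table from earlier (and assuming a Numbers utility table of sequential integers, a common T-SQL helper), the CSV column becomes a proper child table:

    -- The normalized replacement for the CSV column:
    CREATE TABLE ProjectConsultants
    (
        ProjectID int NOT NULL REFERENCES Projects(ProjectID),
        ConsultantID int NOT NULL,
        PRIMARY KEY (ProjectID, ConsultantID)
    )

    -- One-time migration: split each CSV list into rows via the Numbers table
    INSERT INTO ProjectConsultants (ProjectID, ConsultantID)
    SELECT p.ProjectID,
           CAST(SUBSTRING(',' + p.ConsultantIDs + ',', n.Number + 1,
                CHARINDEX(',', ',' + p.ConsultantIDs + ',', n.Number + 1)
                  - n.Number - 1) AS int)
    FROM Projects p
    JOIN Numbers n
      ON n.Number < LEN(',' + p.ConsultantIDs + ',')
     AND SUBSTRING(',' + p.ConsultantIDs + ',', n.Number, 1) = ','

    -- Once everything that reads the old column is fixed, it can finally go:
    ALTER TABLE Projects DROP COLUMN ConsultantIDs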

Instead, what if we restricted access through stored procedures?  Since only the stored procedures stored in the database directly reference the CSV column, if we change the column we only need to use sp_depends to find the objects that reference it and alter those stored procedures.  None of the applications or processes or reports outside of the database requires any changes!  We would simply alter our schema, then go through the list of stored procedures one by one and fix them.  Our inputs and outputs can remain the same, but the SQL in the middle can now be greatly improved, since we now have a much simpler and more efficient schema.  Once those procedures are altered, we are done!  No reports to change, no processes to alter, no applications to rewrite.
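
Here's a minimal sketch of how that plays out, using the hypothetical objects from earlier; the procedure's name, parameter, and result set are the contract, and only the body changes:

    -- First, find everything in the database that references the table:
    EXEC sp_depends 'Projects'

    -- Before the refactoring, the procedure parses the CSV column:
    CREATE PROCEDURE GetProjectsForConsultant
        @ConsultantID int
    AS
        SELECT ProjectID, ProjectName
        FROM Projects
        WHERE ',' + ConsultantIDs + ','
              LIKE '%,' + CONVERT(varchar(10), @ConsultantID) + ',%'
    GO

    -- After the refactoring: same name, same parameter, same result set,
    -- so nothing outside the database ever notices the schema change:
    ALTER PROCEDURE GetProjectsForConsultant
        @ConsultantID int
    AS
        SELECT p.ProjectID, p.ProjectName
        FROM Projects p
        JOIN ProjectConsultants pc ON pc.ProjectID = p.ProjectID
        WHERE pc.ConsultantID = @ConsultantID
    GO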

My primary point is this:  Database schema refactoring is often required.  And it is almost always a good idea to do it -- the longer you deal with a bad database model, the worse your code and performance become.  If your logical model needs to change, then by definition your reports and applications and processes need to change, too -- that's a fact of life.  But if only your physical model needs to change -- i.e., data types, constraints, normalization, etc. -- and you've restricted access to your database through stored procedures only, then only your stored procedures will need to change.  And that makes fixing those broken schemas much, much easier.

It's not a crime to create a poor database design; it happens all the time, and modeling is truly an art, not a science, so a good idea one moment might not be so good the next.  The idea of not binding outside code to your database schema, however, is simple science and easy to follow; if you take this approach, you will find that your life is much, much easier when physical database refactoring becomes necessary.

Legacy Comments


Rick O
2007-08-15
re: The Real Advantage of Stored Procedures: Physical Database Model Refactoring
I tend to tackle those kinds of problems with Views instead of SPs. Can you go into more detail about why SPs would be more appropriate than views?

Jeff
2007-08-15
re: The Real Advantage of Stored Procedures: Physical Database Model Refactoring
Rick O -- good point, Views can work fine as well; I use them all the time.  My wording of the article and title is perhaps a little too narrow -- Views and Stored Procs are both good tools for avoiding binding outside code to your DB schema.
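
For example (hypothetical names), a view can keep presenting the old shape of the data even after the underlying tables are normalized:

    -- Outside code keeps querying "VendorList" as if nothing changed:
    CREATE VIEW VendorList
    AS
    SELECT v.VendorID, v.VendorName, a.Address
    FROM Vendors v
    LEFT JOIN VendorAddresses a
      ON a.VendorID = v.VendorID
     AND a.AddressType = 'Billing'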

Mat Hobbs
2007-08-18
re: The problem isn't the poor database model; It's that external code is bound to the schema
A stored proc (or a UDF returning a result) is more flexible about what it can do inside, so there will be times you need procs rather than views. The csv-in-a-field could be done readily as a proc or UDF, but wouldn't be feasible as a regular view?
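
Something like this, say -- a parameterized inline UDF (hypothetical names, and assuming the usual Numbers utility table of sequential integers), which a plain view has no way to express:

    -- A view cannot take a parameter; an inline table-valued UDF can:
    CREATE FUNCTION dbo.GetConsultantsForProject (@ProjectID int)
    RETURNS TABLE
    AS
    RETURN
        SELECT CAST(SUBSTRING(',' + p.ConsultantIDs + ',', n.Number + 1,
                    CHARINDEX(',', ',' + p.ConsultantIDs + ',', n.Number + 1)
                      - n.Number - 1) AS int) AS ConsultantID
        FROM Projects p
        JOIN Numbers n
          ON n.Number < LEN(',' + p.ConsultantIDs + ',')
         AND SUBSTRING(',' + p.ConsultantIDs + ',', n.Number, 1) = ','
        WHERE p.ProjectID = @ProjectID

    -- Usage:
    SELECT ConsultantID FROM dbo.GetConsultantsForProject(42)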

If you start with all views and later find you have to start using procs, then maybe start with procs up front? Not much of an argument, I know, but my other suspicion is that if you start with views or UDFs, you encourage application coders to join views/UDFs together, thereby creating a dependency from the application to the relationship between two views/UDFs. With a single proc, however, the dependency is only on the proc itself and not on any schema or view relationship. This is probably not a major issue in general, I suspect.

Michael
2007-08-20
re: The problem isn't the poor database model; It's that external code is bound to the schema
I completely agree. I've recently started moving all business logic to the database instead of having the data access layer between it and the data itself. It's worked out really well. It's allowed me to make quick changes to the database without even having to touch the project code itself. I've just written an article on Data Access And Business Logic comparing what I call the "Logic First" design to the more common 3-tier design.

georgev
2007-08-20
re: The problem isn't the poor database model; It's that external code is bound to the schema
Here's a similar argument that's worth a read: http://weblogs.asp.net/alex_papadimoulis/archive/2005/05/25/408925.aspx

Great articles Jeff, keep them coming! :D

Heiko Hatzfeld
2007-08-24
re: The problem isn't the poor database model; It's that external code is bound to the schema
I think views are a great way to keep things going if you need to refactor your DB. You are able to keep the old style and implement something completely different under the hood... But I only consider them a temporary solution, since you usually have to resort to some trickery (a.k.a. triggers) to make them work correctly, especially if you start to split up a single table into multiple ones... My personal threshold for the maximum number of allowed triggers in a large database is exactly zero.
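
(For illustration, the sort of trigger trickery in question, with hypothetical names: an INSTEAD OF trigger that lets a view over the split tables accept writes.)

    -- Making a view over split tables writable requires an INSTEAD OF trigger:
    CREATE TRIGGER trg_VendorList_Insert ON VendorList
    INSTEAD OF INSERT
    AS
    BEGIN
        INSERT INTO Vendors (VendorID, VendorName)
        SELECT VendorID, VendorName FROM inserted

        INSERT INTO VendorAddresses (VendorID, AddressType, Address)
        SELECT VendorID, 'Billing', Address
        FROM inserted
        WHERE Address IS NOT NULL
    END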

Manitra
2007-10-10
re: The problem isn't the poor database model; It's that external code is bound to the schema
Hi,

I think that the database abstraction should be done outside of the database, in some kind of data access layer, and then the database should never be accessed directly.

The reasons are:
- Creating a data access layer in a language like Java or C# is easier to maintain (strongly typed and object-oriented, so refactoring is easy, and you can use object-relational mapping tools like Hibernate)
- Such a data access layer can be used from anywhere if you just add a web service layer over it
- Your data access layer could use multiple data sources of multiple kinds (LDAP, OLAP, XML ...) and not only SQL Server databases

Sure, it will be slower, but not as slow as most people think.


Some links that may help you:
- NHibernate: http://www.nhibernate.org
- Active record pattern: http://en.wikipedia.org/wiki/Active_record_pattern

Manitra