Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

The Golden Rule of Data Manipulation


There is a very simple rule when it comes to storing (and returning) data, which I see violated all the time, making life so much more complicated for everyone involved.  In case you haven't noticed, that's a common theme I discuss here on this blog – different ways programmers make life more difficult for themselves, instead of simply following good practices and doing things the easy way.  This is yet another example of that situation.

The "Golden Rule of Data Manipulation" is a simple, but important rule that you should always follow when designing a database,  writing database code, or really writing any application code at all for that matter:

"It is always easier and more flexible to combine data elements rather than to break them apart"

In other words: Concatenation is easy. Parsing is hard.  Often, very hard -- or even impossible depending on the data.

Problems with Parsing

It is amazing how often I see people struggling with "difficult SQL problems" such as:
  • Working with CSV lists of values in a single column, such as "1,3,56,2"
  • Breaking out a FirstName/MiddleName/LastName/Suffix from a single "Name" column
  • Parsing address strings into City/State/ZIP, or Number/Street/Unit
  • Parsing Phone Numbers to get just an area code, or to take different phone formats and present them all uniformly
  • Figuring out how to calculate the Day,Month, and/or Year from different string values such as "23-Jan-08", "2008-02", "20070303", "03032007"
And on and on it goes....

Now, sometimes you inherit or import data that needs to be parsed -- that's a fact of life.  You've got to figure out how to do it, and the key in those cases is to accept that because the data itself is essentially random, nothing you can write will perfectly work 100% of the time on all of it.  Often, the best you can do is handle most of the data, and then do some manual clean up. 

Parsing strings can be a very difficult task for any programmer, and the challenge isn't writing the code, it's coming up with the algorithm (another common theme on this blog).  Consider my new favorite example of why parsing a single Name column into a First/Middle/Last is not as easy as it seems:

    Oscar De La Hoya

How would your algorithm parse that one?  Never mind prefixes such as "Dr." and suffixes such as "Jr."! 

Please don't interpret what I am saying as a programming challenge -- I understand that it is possible to write long code with a list of exceptions or rules and have that algorithm work pretty well in most cases.  The point is that writing that algorithm is a lot of work, running it will be inefficient, and it will never be exact because the data itself that is being processed is essentially random.  It's just like the old saying: "garbage in, garbage out".  Still one of my favorites, after all these years, and it still applies!

A Data Model that requires Parsing = A Poor Data Model

So, we need to accept that sometimes you've got to parse data like this.  And that's OK; it happens, it can be done, even if some manual work is often involved.

However, there's no excuse when you design your database, your SQL code, or applications so that free-form data must be parsed, when you can simply design it correctly in the first place and store your data already broken out into the smallest possible units with the correct data types.

If breaking out a contact's name into First, Last, Middle, etc is important to your application, then you should force the point of data entry to accept input broken out into those columns.  The same goes for phone numbers, addresses, and so on.  Any time you have the option of accepting  input as clean, short, raw, segments of data you should always do it.   Once you have data at that smaller resolution, it is trivial to combine it any way that you want for presentation, formatting, filtering, and so on.

It may seem like overkill to break out a phone number into 4 columns:

And, in fact, it might be more complicated than that if you need to deal with international phone numbers.  You may look at your tables, and your code, and even the UI that accepts these fields and think "that is way too precise and unnecessary, breaking out phone numbers like this sure makes things complicated!"

But by doing this, and only accepting user input that follows precise rules of what is allowed in these fields, and storing each of them in their own column, you can now easily and efficiently:
  1. Sort these numbers any way you want, without worrying about extra characters like parenthesis or dashes, or leading 1s, messing things up
  2. Filter quickly on an area code without the need to use LIKE, and again worrying about extra characters getting in the way
  3. Present the phone number quickly and easily any way you want without any parsing, be it as 123.123.1345 x123 or "(123) 123-1345 extension 123", or anything you want.
  4. Validate your phone numbers, ensuring you have all the necessary parts and they are the proper length, without worrying about parsing strings
Considering doing any of those things if your data is stored in random strings like:

(123) 124-1234
123-124-1234, ex. 123
123.124.1234 x123
(123)124.1234 ext. 123
1 123 124 1234 123
and so on ...  Not so easy in that case, just as parsing simple "Name" columns into First/Last, or addresses into  Number/Street/Unit is not so easy as well. 

Again, this is not a programming challenge -- I am sure it can done. (In fact, phone numbers are generally the easiest because you can usually just ignore anything other than digits.)  Most of us have done it before.  But designing something in such a way that parsing is required to do simple filtering, sorting, or formatting, is a bad design

It's Not About the UI

As I wrote here, you should never think "I want to display phone numbers like 123.123.1234, so I should store them and return them that way."  You should always think "How can I break this down into small, concrete parts that are easily validated and easy to combine any way I want at any time?"

So, what if you need fine detail when storing addresses, but you don't want your UI to present Street Number, Street Name, Unit Type, Unit Number as different data entry fields for usability or aesthetic reasons?  That's fine, but that doesn't mean you should not set up your database properly.  Your UI can certainly still present that one single "Address" text box for the user to fill out, parse that text at data entry, show the user the parsed result in multiple fields, and ask "Please verify for your address" or something along those lines.   Then, if not, the user can tweak the results and save it.  If you do things along those lines, and focus on getting the data parsed and stored correctly at the earliest point possible, every other part of your code will be that much more efficient.

All of this applies not only to data storage, but to how data is returned and passed between tiers as well.  Again, if you just return separate columns to your client application, instead of focusing on making them "look nice" in your database code by returning nothing but long, "pre-formatted" strings, your client can simply concatenate and format those columns any way it needs.  And, different clients can format that same database output in different ways -- all without ever altering any database code! 


In short, remember that writing concatenation is easy, efficient, and exact.  Writing a parsing routine, on the other hand, is often none of those things.   You may not always be able to control the design of the data you are working with, but be sure that when you can, you do it right.  If you find yourself using lots of LIKE expressions, or string parsing for simple data retrieval operations, something is wrong.   Time to fix up your database and your code, store the parsed and validated data permanently, and make things easier and cleaner for everyone.

Whether you are designing a schema, writing a SELECT, or writing code in any other programming language, remember that the Golden Rule of Data Manipulation always applies.  Accept this rule, learn from it, and practice it, and you might be surprised to find that programming isn't quite as hard as you thought it was.

Legacy Comments

Jacques Chester
re: The Golden Rule of Data Manipulation
If I'm not mistaken, this is a key part of first normal form: every piece of data is stored atomically. It can't be further subdivided.

Jim Johannsen
re: The Golden Rule of Data Manipulation
I see this all the time. It drives me nuts cleaning up the mess of someone trying to "make it simpler".

re: The Golden Rule of Data Manipulation
Ugh, I felt this pain at my first job, the guy before me was writing code that stored these huge blocks of CSV data, and then was parsing it out. He had never heard of the terms lookup table or junction table.

re: The Golden Rule of Data Manipulation
I lost count of all the csv and access db files that I have had to clean up and parse out so that we could import the client's information into our own system. I've gotten quite good at it too! Including scanning the table quickly myself to see if there are any exceptions. phew...

Linda Wenglikowski
re: The Golden Rule of Data Manipulation
I see this mistake all the time and it has been the key point of failure to many enterprise data and data warehouse projects. Defining discrete data elements, capturing this data consistently, and the relationships between data would prevent the failure of many applications and data warehouse projects.

re: The Golden Rule of Data Manipulation
No sooner than I post on here, then I have to take a client's excel spreadsheet that had first, last, AND spouse or partner name all in one column.... (bang head on desk)

Mark Walters
re: The Golden Rule of Data Manipulation
I have seen this so often I recently had a chat with a senior lecturer in Melbourne about what is taught to programmers. He agreed with me that most of these problems seam to be created when traditional application programers design and develop database applications. My first rule of thumb is that databases are all about what you can get out of them, not what you put in them. Always applying this rule seams to work well for me.

Wayne West
re: The Golden Rule of Data Manipulation
I had to recently clean up an address file to validate it against a GIS system. The problem was that Unit/Apt/Suite/Whatever was inconsistently coded. I had some fun with it: use the REVERSE() function to find where the last element was, put that element into another field, then validated that field against a street extension lookup table (ST, AVE, LOOP, PKWY, etc).

A challenging little project, I should do a blog entry on it.

re: The Golden Rule of Data Manipulation
In my work I have to do a lot of data manipulation, I get annoyed when I receive complete names including prefix and suffix all in one column. It is required of me to split them up in to Prefix, first, last and suffix. Of course I have a tool to do it with that generally does a good job but I have to skim through for the complex ones that it gets wrong. It tends to be time consuming. Even worse is inconsistent naming conventions so pre-made queries are of no use.