There is a very simple rule when it comes to storing (and returning) data, which I see violated all the time, making life so much more complicated for everyone involved. In case you haven't noticed, that's a common theme I discuss here on this blog -- different ways programmers make life more difficult for themselves, instead of simply following good practices and doing things the easy way. This is yet another example of that situation.
The "Golden Rule of Data Manipulation" is a simple, but important rule that you should always follow when designing a database, writing database code, or really writing any
application code at all for that matter:
"It is always easier and more flexible to combine data elements rather than to break them apart"
In other words: Concatenation is easy. Parsing is hard. Often, very
hard -- or even impossible depending on the data.
Problems with Parsing
It is amazing how often I see people struggling with "difficult SQL problems" such as:
- Working with CSV lists of values in a single column, such as "1,3,56,2"
- Breaking out a FirstName/MiddleName/LastName/Suffix from a single "Name" column
- Parsing address strings into City/State/ZIP, or Number/Street/Unit
- Parsing Phone Numbers to get just an area code, or to take different phone formats and present them all uniformly
- Figuring out how to calculate the Day,Month, and/or Year from different string values such as "23-Jan-08", "2008-02", "20070303", "03032007"
And on and on it goes....
Now, sometimes you inherit or import data that needs to be parsed -- that's a fact of life. You've got to figure out how to do it, and the key in those cases is to accept that because the data itself is essentially random, nothing you can write will perfectly work 100% of the time on all of it. Often, the best you can do is handle most of the data, and then do some manual clean up.
Parsing strings can be a very difficult task for any programmer, and the challenge isn't writing the code, it's coming up with the algorithm (another common theme on this blog). Consider my new favorite example of why parsing a single Name column into a First/Middle/Last is not as easy as it seems:
Oscar De La Hoya
How would your algorithm parse that one? Never mind prefixes such as "Dr." and suffixes such as "Jr."!
Please don't interpret what I am saying as a programming challenge -- I understand that it is possible to write long code with a list of exceptions or rules and have that algorithm work pretty well in most cases. The point is that writing that algorithm is a lot of work, running it will be inefficient, and it will never be exact because the data itself that is being processed is essentially random. It's just like the old saying: "garbage in, garbage out". Still one of my favorites, after all these years, and it still applies!
A Data Model that requires Parsing = A Poor Data Model
So, we need to accept that sometimes you've got to parse data like this. And that's OK; it happens, it can be done, even if some manual work is often involved.
However, there's no excuse when you design your database, your SQL code, or applications so that free-form data must
be parsed, when you can simply design it correctly in the first place and store your data already broken out into the smallest possible units with the correct data types.
If breaking out a contact's name into First, Last, Middle, etc is important to your application, then you should force the point of data entry to accept input broken out into those columns. The same goes for phone numbers, addresses, and so on. Any time you have the option of accepting input as clean, short, raw, segments of data you should always do it. Once you have data at that smaller resolution, it is trivial to combine it any way that you want for presentation, formatting, filtering, and so on.
It may seem like overkill to break out a phone number into 4 columns:
And, in fact, it might be more complicated than that if you need to deal with international phone numbers. You may look at your tables, and your code, and even the UI that accepts these fields and think "that is way too precise and unnecessary, breaking out phone numbers like this sure makes things complicated!"
But by doing this, and only accepting user input that follows precise rules of what is allowed in these fields, and storing each of them in their own column, you can now easily and efficiently:
- Sort these numbers any way you want, without worrying about extra characters like parenthesis or dashes, or leading 1s, messing things up
- Filter quickly on an area code without the need to use LIKE, and again worrying about extra characters getting in the way
- Present the phone number quickly and easily any way you want without any parsing, be it as 123.123.1345 x123 or "(123) 123-1345 extension 123", or anything you want.
- Validate your phone numbers, ensuring you have all the necessary parts and they are the proper length, without worrying about parsing strings
Considering doing any of those things if your data is stored in random strings like:
123-124-1234, ex. 123
(123)124.1234 ext. 123
1 123 124 1234 123
and so on ... Not so easy in that case, just as parsing simple "Name" columns into First/Last, or addresses into Number/Street/Unit is not so easy as well.
Again, this is not a programming challenge -- I am sure it can done. (In fact, phone numbers are generally the easiest because you can usually just ignore anything other than digits.) Most of us have done it before. But designing something in such a way that parsing is required
to do simple filtering, sorting, or formatting, is a bad design
It's Not About the UI
As I wrote here
, you should never think "I want to display phone numbers like 123.123.1234, so I should store them and return them that way." You should always think "How can I break this down into small, concrete parts that are easily validated and easy to combine any way I want at any time?"
So, what if you need fine detail when storing addresses, but you don't want your UI to present Street Number, Street Name, Unit Type, Unit Number as different data entry fields for usability or aesthetic reasons? That's fine, but that doesn't mean you should not set up your database properly. Your UI can certainly still present that one single "Address" text box for the user to fill out, parse that text at data entry
, show the user the parsed result in multiple fields, and ask "Please verify for your address" or something along those lines. Then, if not, the user can tweak the results and save it. If you do things along those lines, and focus on getting the data parsed and stored correctly at the earliest point possible, every other part of your code will be that much more efficient.
All of this applies not only to data storage, but to how data is returned and passed between tiers as well. Again, if you just return separate columns to your client application, instead of focusing on making them "look nice" in your database code by returning nothing but long, "pre-formatted" strings, your client can simply concatenate and format those columns any way it needs. And, different clients can format that same database output in different ways -- all without ever altering any database code!
In short, remember that writing concatenation is easy, efficient, and exact. Writing a parsing routine, on the other hand, is often none
of those things. You may not always be able to control the design of the data you are working with, but be sure that when you can, you do it right. If you find yourself using lots of LIKE expressions, or string parsing for simple data retrieval operations, something is wrong. Time to fix up your database and your code, store the parsed and validated data permanently, and make things easier and cleaner for everyone.
Whether you are designing a schema, writing a SELECT, or writing code in any other programming language, remember that the Golden Rule of Data Manipulation always
applies. Accept this rule, learn from it, and practice it, and you might be surprised to find that programming isn't quite as hard as you thought it was.