Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

Distinguishing data from code

I recently submitted my Data in tables versus data in code post from a ways back to reddit, and I was surprised that there ended up being quite a bit of discussion about it. There was even an, um, interesting reply from another blogger.   Without a doubt there were some great and thought-provoking comments all around.

A few comments were made on one particular point, which is a great one and important to address: What is data, and what is code?  How do we define the difference, and decide what goes where?  It is great to say keep data out of your code, but what if that data is integral to the application itself?  Isn't it therefore code, and not data?

As always, the answer is: it depends on specifications.  I think it is very easy to distinguish between the two, it all depends on how you design your application and how you are specifying that it will function.

Here's some examples.

"The XYZ Customer system allows you to store and track customers.  Each customer can also be assigned one of 4 status codes:  'A' means that the customer is active, 'I' means inactive and they will not display on reports and cannot be edited, 'X' means they are archived and cannot be edited, and 'P' means they are preliminary and can be edited but will not show up on reporting.  The XYZ Customer system allows you to define the labels for these status code any way you want, so that 'A' can be labeled 'Active' or 'Current' or whatever your business uses."

Let's assume that the system is written to take these specifications literally.  Now, in the XYZ Customer system, what are these statuses -- code or data?  The answer is: code.  If you design you system like this, and spec it up like this, and these code values are hard-coded into your data and what they do is also hard-coded and fixed, then of course they are tightly integrated with your application code, and they are indeed code. You can store the labels in the database, and you might even store a status code table in your database well, but with this design and these specifications, these statuses are part of the code and not data.

Let's consider another system, instead.

"The ABC Customer system allows you to store and track customers. You may also assign customers a status code that determines whether they can be edited, viewed, deleted and so on.  These status codes can be defined by the administrator using the 'Admin' feature. Each code is assigned a label and also attributes that define how the application reports and processes customers with that status assigned."

This is how I choose to design my systems.  In this case, your status codes are clearly data.  These two systems, with the same set of status codes, will technically work exactly the same way. But the difference is that the ABC system doesn't have code like "IF Status=A Then display it" hard-coded throughout; it simply checks to see if the current status code of the current customer means that it should be displayed on reports.

While I firmly believe that this design is preferable, of course I will not say that it is right for everyone and all situations.  But for me, it works beautifully.  If you implement it properly, your code is much, much shorter and very easy to read and edit, yet your application is much more flexible and powerful.

Here's another.

"The Three-Shift application is designed to track all of your employees and their hours. It allows you to assign employees to one of three shifts: Midnight-8AM, 8Am-4PM, or 4PM-Midnight. Any time an employee works, you can determine what shift they worked by assigning them one of the 3 shift codes."

Here, you can argue that the shifts are code, not data. The limit of 3 is defined in the application and hard-coded; the entire system and probably all of the code written depends on those 3 shifts being defined and limited to those time periods.  Sure, we can release a new version next year that adds a 4th shift option, or maybe it allows you to store the time periods for the shifts in a config file or something. But, overall, as defined, you could argue that if you took these specifications literally and had no need to design the application with flexibility, these shifts are code.  I'll leave it up to you to decide if this is a good specification or application design.

And finally:

"The Multi-Shift application is designed to track all of your employees and their hours.  You can define as many shifts as you need and historically track changes in those shifts.  Each shift can be assigned a title, a shift manager, a start/end time, and an effective date."

I hope it is clear that in this case, by these specifications, shifts are simply data.  Again, I am sure it will not surprise you to hear that I design my applications this way.  The code is actually shorter, simpler and easier to read and maintain. You lose nothing but gain a lot by designing your system to be flexible and to grow easily by simply defining things as data and not hard-coding everything into your applications.  Reporting is consistent and easier, and historical reporting for different shift definitions is easily done as well which was not possible or easy if the application itself changes those definitions via code changes.

I hope this helps to clear things up a little. The only rule is this:  The specifications themselves along with how you decide to write your code determine what is code and what is data, not some generic rule of thumb.  Each case is different.  You can always choose to hard-code values and data into your application code, that is certainly your right and it certainly will work; indeed, if you do that, your data does literally become code and your application is now less flexible.  

If you decide to define entities and elements of your application as data and write your code in a way that it simply processes the data and does what the data defines it should, I think you might find that your code just got a lot shorter and more flexible and powerful at the same time.  That's usually a good thing, right?

One final note: I think a lot of this also depends on the programmer's philosophy in terms of making users self-sufficient.  When I write code for my users, I tend to expose every detail I can to them and hide no calculations, values, data or settings from them -- they are all there to see and potentially change.  Maybe lots of this is read-only or requires a super-user or administrator account to change, and many settings might have effective dates and/or audit history, but it is all out there.  Changes that are necessary, sometimes even significant ones, are often easily done without even needing to pick up the phone and call me.   The system can often live on for years and adjust to new business processes without changing a single line of code if I did my job right and planned appropriately. It is also quite self-documenting. (What does status code "Q" mean?  Look it up in the status codes table; it is all there, defined in one place.)  I like that; I like my users to be happy and self-sufficient and to really understand how an application works and to be able to configure it to meet their needs. 

I have found that other programmers take the opposite approach: They prefer to hard-code and hide important attributes and settings in the code, and that any even minor tweaks or changes require their assistance and "expertise".  Maybe it's their way of obtaining perceived job security?  It could be.  But I think it is more often the lack of planning ahead and not carefully designing flexible applications that correctly separate data from code.

Legacy Comments


gwenhwyfaer
2007-07-24
re: Distinguishing data from code
> My Data in tables versus data in code post from a ways back has been discussed quite a bit at reddit recently

Wouldn't it be more forthright to mention that you submitted it yourself a day or two ago, rather than implying by omission that it just happened to find its way there after a few months' wandering?

> I have found that other programmers take the opposite approach: They prefer to hard-code and hide important attributes and settings in the code

Favourably comparing yourself to the worst representatives of a group only tells us that you aren't the worst. There's an awful lot of ground between that and mediocre.

Jeff
2007-07-24
re: Distinguishing data from code
gwenhwyfaer --

>Wouldn't it be more forthright to mention that you submitted it yourself a day or two ago, rather than implying by omission that it just happened to find its way there after a few months' wandering?

sure thing! If that bothers you, I'll will update that.

>Favourably comparing yourself to the worst representatives of a group only tells us that you aren't the worst. There's an awful lot of ground between that and mediocre.

I'm actually not telling you anything other than what I have found and how I choose tod o things. You can judge me any way that you want, if you choose. If you are telling me I am not the worst programmer out there and that's about all we can say for my skills, then I'd have to agree with you! I never claimed to be an expert.

Thanks for your feedback!

Ted Z
2007-07-24
re: Distinguishing data from code
Configuration files are a middle ground between full tables and code-embedded configurations. They have lots of advantages over either approach. Not all data needs to be in tables!

There are also domain-specific languages, which often express the problem's solution in a simpler way than tables or code. The effort of implementing the DSL is very often worthwhile.

Jeff
2007-07-24
re: Distinguishing data from code
Good point, Ted. I thought about addressing config files but decide my rambling was long enough.

Here's my take:

Your application is defined to work a specific way on a specific set of data. You can write your application so that it uses config files to change or define the way it fundamentally works. But those are just settings, things that don't require a re-code or a re-compile, but that would otherwise if the setting was contained in the source code. (separate from run-time settings like database connection strings and network paths and settings of course).

Data is data. Again, looking at my example, the "Three-Shift" applicaiton probably uses config files to store the shifts. Changing the shifts = changing config files which is good, no re-compile/re-distribute needed, but it changes the way your application. There is no history, no way to report historical data, and no SQL statements or exernal reports or applications will recognize that change. Same with the Status codes that are hard-coded. And this is fine, if you choose to do things this way, just recognize the drawbacks, it all depends on your specifications.

BUt if you define shifts or status codes in data, then you can have historical changes with effective dates, all reports and all applications are consistent, and so on. Changing shifts != changing the way your applicaiton works; it is DESIGNED to handle shift changes, it is not a code or configuration change.

Probably still didn't describe things that well, but that's my attempt anyway.

Jeff
2007-07-24
re: Distinguishing data from code
To add to that last comment: of course config files are great, and of course they should be used. Lots of settings and options should be put there. Just not data! :)

Ted Z
2007-07-24
re: Distinguishing data from code
I understand about the three-shift application. The advantage there is clearly with the table-based approach if you know from the beginning that you'll need historical data. Configuration files simply make it easy to store *unstructured* pieces of data, as opposed to the *structured* data you want to store in tables.

Jeff
2007-07-24
re: Distinguishing data from code
>Configuration files simply make it easy to store *unstructured* pieces of data, as opposed to the *structured* data you want to store in tables.

Well said, a good way of looking at it, but I still would argue that the more data you can structure, the better. If only one single application needs that setting, put it in the config file. If other applications, reports, interfaces and systems need that data, put it in the database.

A lot of it is also based on one's experience and skill set: if you are stronger in data modeling than me, you might be able to store more data intelligently and efficiently in your database, while I would prefer to put it in config files because I am more comfortable working that way.

Great comments, thanks Ted.

Peter Goodman
2007-07-24
re: Distinguishing data from code
I disagree (partly) with the first example. A status is a very good example, but I think that the actual representation of the status, ie: Current/Active, Inactive, etc belongs as data whereas the implementation of the status itself belongs in the code.

For example, there might be a feature that someone with a status of I cannot do but someone with a status of A can do. Obviously this is an implementation issue--therefore a coding issue. However, the title to apply to the status says nothing about that status' abilities, just how we *refer* to that status and hence it belongs as data.

Jeff
2007-07-24
re: Distinguishing data from code
Peter -- agreed, the title or label of each status could still be stored in the database as data in that first example, as mentioned.

David
2007-07-24
re: Distinguishing data from code
> "the ABC system doesn't have code like IF Status=A Then display it" hard-coded throughout; it simply checks to see if the current status code of the current customer means that it should be displayed on reports

Um... care to give us some pseudo-code to illustrate how your app does 'simply check' whether the status code means it should display/allow editing/whatever of customer details?

Jeff
2007-07-24
re: Distinguishing data from code
David -- sure. Check out example #4 from post linked in the first sentence; that should explain it all.

here's the URL:

http://weblogs.sqlteam.com/jeffs/archive/2006/02/10/9002.aspx

If it's still not clear, let me know.




Ted Z
2007-07-24
re: Distinguishing data from code
Other issues I should mention:

It's harder to corrupt or reverse-engineer code than data in tables. It's harder to secure external data sources than just the application.

It's harder to deploy an application with external data sources. Versioning database contents is annoying for small sizes, and painful for large databases. You can partition and so on, but it's all hacky.

Connecting to external data sources is often painful, depending on your platform. Spring+Hibernate, no problem. Embedded C on a microprocessor board, not so hot.

Good database design requires a lot more than ad hoc table design. You need to consider server load, failover, caching responses, and so on on the system side. You need to consider referential integrity and query performance.

So while it's nice to store data externally from the application, sometimes it just doesn't make sense. I think it comes down to experience, familiarity, suitability of the data (structured data is better in a table), and most of all your platform's capabilities. It's always easier to say "Hibernate does this for me" than "let's set up a storage layer to make the code prettier."

Jeff
2007-07-24
re: Distinguishing data from code
>>Good database design requires a lot more than ad hoc table design. You need to consider server load, failover, caching responses, and so on on the system side. You need to consider referential integrity and query performance.

Why would you think I am advocating "ad hoc table design" ?! Where did I imply that? My goodness, let me know, I will clarify immediately -- that goes against every word I've written in this post and all my others the past 4 years or so .... My entire blog is dedicated to things like database design, referential integrity and query performance!

Analyzing your needs, designing a flexible system and separating data from code is the opposite of "ad hoc table design" !

Reg Braithwaite
2007-07-24
re: Distinguishing data from code
JM2C:



For me, the answer is surprisingly easy: data is not tested as part of a release, code is tested as part of a release. Configuration is nearly always code because it nearly always must be tested.

I will put configuration or even code in a table, but I will not provide an "Admin" function to change it at run time, because that subverts the release process. Instead, I will recommend a utility be built that writes scripts to update the database.

The Administrator runs the utility, builds new scripts, and we test the scripts as part of the next release.

So for me, I don't ask if something should be hard coded because of the spec, I ask the customer whether something needs robust quality control in the form of release management.

p.s. Your script says my email is invalid, yet it IS valid. The problem is that it contains a "+" character.

Ricky Clarkson
2007-07-24
re: Distinguishing data from code
What you're experiencing is the simple fact that code is data (and hence data is code).

http://blog.cleartrip.com/journal/2007/7/7/lisp-is-sin-and-all-data-is-code.html expands on this.

You only need to look at ant, xsl, JS inside HTML, JSON (executable data), you don't have to understand Lisp to see it.

Ted Z
2007-07-25
re: Distinguishing data from code
>> Good database design requires a lot more than ad hoc table design. You need to consider server load,
>> failover, caching responses, and so on on the system side. You need to consider referential integrity
>> and query performance.

> Why would you think I am advocating "ad hoc table design" ?! Where did I imply that?
> My goodness, let me know, I will clarify immediately -- that goes against every word
> I've written in this post and all my others the past 4 years or so ....
> My entire blog is dedicated to things like database design, referential integrity
> and query performance!

I did not say you, Jeff, were personally doing ad hoc table design. I was addressing the reader of the comment. Sorry for the misunderstanding.

I just wanted to bring up the issues that often trip up people who want to use external data sources, especially RDBMS sources. I don't think you, Jeff, are advocating bad practices, but you didn't mention all the difficult issues with external data sources, just like the advocates of code-internal data sources don't mention the problems with their approach (code is brittle, etc.). At least I didn't see those issues mentioned in your posts, perhaps I missed them.

Jeff
2007-07-25
re: Distinguishing data from code
Sorry for jumping the gun, Ted, thanks for clarifying.

Matt Cox
2007-07-25
re: Distinguishing data from code
>>Wouldn't it be more forthright to mention that you submitted it yourself a day or two ago, rather than implying by omission that it just happened to find its way there after a few months' wandering?

>sure thing! If that bothers you, I'll will update that.

Why would you change your post for an nitwit comment like that. Who cares how it got there, tell em to stick it where the sun don't shine.

To gwenhwyfaer:
Wouldn't it be more pertinent to reveal that you are a biased jerk-off who doesn't like the author rather then claim to speak for me or *some* of my peers?

PS I don't know the author I only found this blog today from DZone.com. Awesome site!