Jeff's SQL Server Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server
Welcome to my weblog. My name is Jeff Smith; I am a software developer in Boston, MA, and I was recently named a 2009 SQL Server MVP. Check in frequently for tips, tricks, commentary, and ideas on SQL Server and .NET programming.


Friday, May 11, 2012

Why you should avoid SELECT * in your Views

-- clean up any messes left over from before:
if OBJECT_ID('AllTeams') is not null
 drop view AllTeams
go

if OBJECT_ID('Teams') is not null
 drop table Teams
go

-- sample table:
create table Teams
(
 id int primary key,
 City varchar(20),
 TeamName varchar(20)
)

go

-- sample data:
insert into Teams (id, City, TeamName )
select 1,'Boston','Red Sox' union all
select 2,'New York','Yankees'

go

create view AllTeams
as
 select * from Teams

go

select * from AllTeams

--Results:
--
--id          City                 TeamName
------------- -------------------- --------------------
--1           Boston               Red Sox
--2           New York             Yankees


-- Now, add a new column to the Teams table:

alter table Teams
add League varchar(10)
go

-- put some data in there:
update Teams
set League='AL'

-- run it again

select * from AllTeams

--Results:
--
--id          City                 TeamName
------------- -------------------- --------------------
--1           Boston               Red Sox
--2           New York             Yankees

-- Notice that League is not displayed!
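
-- (An aside, for what it's worth: you could rebind the view's metadata so
-- that League does show up -- without dropping the view -- by running:
--
--     exec sp_refreshview 'AllTeams'
--
-- We'll leave the view stale here, though, to demonstrate the next problem.)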

-- Here's an even worse scenario, when the table gets altered in ways beyond adding columns:
drop table Teams
go

-- recreate table putting the League column before the City:
-- (i.e., simulate re-ordering and/or inserting a column)
create table Teams
(
 id int primary key,
 League varchar(10),
 City varchar(20),
 TeamName varchar(20)
)
go

-- put in some data:
insert into Teams (id,League,City,TeamName)
select 1,'AL','Boston','Red Sox' union all
select 2,'AL','New York','Yankees'

-- Now, Select again for our view:
select * from AllTeams

--Results:
--
--id          City       TeamName
------------- ---------- --------------------
--1           AL         Boston
--2           AL         New York

-- The column labeled "City" in the View actually contains the League, and the column labeled "TeamName" actually contains the City!

go
-- clean up:
drop view AllTeams
drop table Teams

posted @ Friday, May 11, 2012 10:10 AM | Feedback (2) | Filed Under [ T-SQL Techniques SQL Server 2008 ]

Monday, August 30, 2010

How to calculate Median in SQL Server

Nothing earth-shattering here, I was just helping out a colleague with this so I thought I'd post up the example I gave him.


-- sample table:
create table People
(
    Person varchar(1) primary key,
    City varchar(10),
    Age int
)

go

-- with some sample data:

insert into People
select 'A','Boston',23 union all  -- odd #
select 'B','Boston',43 union all
select 'C','Boston',29 union all

select 'D','Chicago',15 union all -- single #

select 'E','NY',12 union all  -- even #
select 'F','NY',55 union all
select 'G','NY',57 union all
select 'H','NY',61


go

-- here's our query, showing median age per city:

select city,
    AVG(age) as MedianAge
from
(
    select City, Person, Age,
        ROW_NUMBER() over (partition by City order by Age ASC) as AgeRank,
        COUNT(*) over (partition by City) as CityCount
    from
        People
) x
where
    x.AgeRank in (x.CityCount/2+1, (x.CityCount+1)/2)   
group by
    x.City    
   

go

-- clean it all up
drop table People

And here's the result:


city       MedianAge
---------- -----------
Boston     29
Chicago    15
NY         56

(3 row(s) affected)

Simply remove "City" from the SELECT clause and the GROUP BY clause -- and drop the PARTITION BY clauses -- to get the median age across all rows.
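
That variation would look something like this:

select AVG(age) as MedianAge
from
(
    select Age,
        ROW_NUMBER() over (order by Age ASC) as AgeRank,
        COUNT(*) over () as TotalCount
    from
        People
) x
where
    x.AgeRank in (x.TotalCount/2+1, (x.TotalCount+1)/2)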

There may be more efficient tricks out there, but this is certainly the shortest and simplest technique I am aware of.

 

posted @ Monday, August 30, 2010 12:00 AM | Feedback (1) | Filed Under [ T-SQL Techniques SQL Server 2005 GROUP BY ]

Monday, November 16, 2009

An interesting take on "Stored Procedures" (link)

Another great SQL-related post over at TheDailyWTF regarding a, umm, "creative" use of "Stored Procedures":

http://thedailywtf.com/Articles/For-the-Ease-of-Maintenance.aspx


posted @ Monday, November 16, 2009 1:42 PM | Feedback (3) | Filed Under [ Humor Links ]

Monday, November 02, 2009

Programming is like dreaming?

This is from March, so it's a little old, but I just stumbled upon it and found it a bit interesting. I never thought of programming this way before, but it does make sense:

Programming is like a dream

posted @ Monday, November 02, 2009 2:52 PM | Feedback (1) | Filed Under [ Links ]

Tuesday, October 27, 2009

Is it just me ...

... or is it about time I got back to some blogging?

posted @ Tuesday, October 27, 2009 12:50 PM | Feedback (7) |

Tuesday, December 23, 2008

The Mailbag: More on Ordering by Distinct Values ...

I promise to get back to writing articles on a more regular basis soon, but in the meantime, here's a comment from Nathan A. on using DISTINCT and ORDER BY:

Jeff,

This is actually a problem I have been puzzling over for quite a while now, and I actually need to do that sort. I wonder if I may have to create another column that holds the list of ordering values in increasing order. So, for my example above -- assuming a letter table and a number table that contains the numbers for the letters -- the number table would not change, but I would add this second column to the letter table:

B 0
A 0, 1
C 2
B 3

But I don't like that idea as it is a denormalization and would require extra maintenance.

I can think of a way to do it with a specified number of subsequent rows to sort by. In your example you order by the minimum value of the number column. If we wanted to order by the minimum, then by the second minimum, then by the third minimum, we could use nested queries to select each of those values in different columns and order by them. I have developed a query like this, but there are some issues. First of all, getting the second and third lowest values requires nested queries themselves, so this approach requires many nested queries (not sure if that is a problem). The number of nested queries increases based on how many levels down you want to sort. The other problem with this method is that you have to specify how many levels down you want to sort; you can't just sort by a concatenation of all the numbers.

The real world example of this problem actually seems like it would be useful in many situations. Consider a task table, that has a list of tasks for people to accomplish and an assignee table that has a list of people assigned to the task. A task can have many people assigned to it. I want to get a list of tasks sorted by their assignees in alphabetical order so if the joined table looked like this:

Task Assignee
1 John
1 Mark
2 John
3 Mark

The result would be in this order:
2 John
1 John, Mark
3 Mark


Let me know what you think the best approach is.


Hi Nathan --

Well, one way to handle this is to write a User Defined Function that returns a string concatenating the distinct Assignees for the Task provided as a parameter.  We can use a UDF similar to this one as an example.  UDFs such as these are the simplest and most efficient way I have seen to handle concatenation at the database layer, though there are other methods you can try.

So, let's say we have this for a schema and sample data:

create table Tasks (TaskID int primary key, TaskName varchar(10))
go

create table TaskAssignees
(   TaskID int references Tasks(TaskID), Assignee varchar(10), primary key (TaskID, Assignee))
go

insert into Tasks
select 1,'Task A' union all
select 2,'Task B' union all
select 3,'Task C'

insert into TaskAssignees
select 1, 'John' union all
select 1, 'Mark' union all
select 2, 'John' union all
select 3, 'Mark' union all
select 3, 'Ed'

We can create a UDF like this:
   
create Function GetTaskAssignees(@TaskID int)
returns varchar(100)
as
    begin
    declare @ret varchar(100)
    set @ret = ''

    select @ret= @ret + ', ' + Assignee
    from TaskAssignees
    where TaskID = @TaskID
    order by Assignee

    return substring(@ret,3,100)
    end

... and get the output you are looking for like this:

select TaskID, TaskName, dbo.GetTaskAssignees(TaskID) as Assignees
from Tasks
order by Assignees

TaskID      TaskName   Assignees
----------- ---------- --------------
3           Task C     Ed, Mark
2           Task B     John
1           Task A     John, Mark

So, this should actually work well, though for large sets of data performance may be an issue.  As an added bonus, it handles the presentation of the names assigned to each task for you as well.

Depending on the data, however, you may need to concatenate items at a fixed length, padded with spaces, instead of simply comma-separating them.   This would apply if you are sorting by numeric values, such as:

1,23,45
1,3
6,12,4

Notice that "1,23,25" sorts before  "1,3" in the example above, since it is just comparing two strings.  To solve this, you'd have to write the UDF to output like this:

 1, 3
 1,23,45
 6,12, 4

That way, " 3" (padded on the left with a space) sorts correctly before "23".  You could also pad with leading zeroes, or pad after the value instead of before.  The trick here is to identify how much padding you need.
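
As a rough sketch of that padded approach -- using a hypothetical TaskNumbers table with an integer Num column, and assuming the values are never wider than two digits:

create function GetTaskNumbersPadded(@TaskID int)
returns varchar(100)
as
    begin
    declare @ret varchar(100)
    set @ret = ''

    -- pad each value on the left to a fixed width of 2 before concatenating:
    select @ret = @ret + ',' + right('  ' + cast(Num as varchar(2)), 2)
    from TaskNumbers
    where TaskID = @TaskID
    order by Num

    return substring(@ret, 2, 100)
    end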

If you  need assistance with altering the example UDF shown to pad the output, let me know.

posted @ Tuesday, December 23, 2008 9:25 AM | Feedback (5) | Filed Under [ T-SQL Sorting User Defined Functions ]

Monday, October 27, 2008

Processing an OLAP cube with a T-SQL Stored Procedure

Here's a simple SQL Server stored procedure that you can call to process an OLAP cube using T-SQL.  The parameters should be self-explanatory.  To me, this is a little easier and more flexible than processing cubes through DTS packages.

create procedure ProcessCube
    @Database varchar(100),
    @Cube varchar(100),
    @Partition varchar(100) = null, -- If NULL, process the entire Cube
    @Server varchar(100) = 'localhost'
as

/*
    Author:     Jeff Smith
    Version:    10/27/2008
*/

/* variables used to store object handles */
declare @o_svr int, @o_db int, @o_cube int, @o_part int, @o_mds int
declare @hr int

/* different cube processing options. This SP uses "default" */
declare @PROCESS_DEFAULT int
declare @PROCESS_FULL int
declare @PROCESS_REFRESH_DATA int

set @PROCESS_DEFAULT = 0
set @PROCESS_FULL = 1
set @PROCESS_REFRESH_DATA = 2

-- create a DSO.Server object:
exec @hr = sp_OACreate 'DSO.Server', @o_svr out
if @hr <> 0
    begin
        print 'Error at create server:'
        exec sp_OAGetErrorInfo @o_svr
        goto cleanup
    end

-- Connect to the server:
exec @hr = sp_OAMethod @o_svr, 'Connect', null, @Server
if @hr <> 0
    begin
        print 'Error at connect to server:'
        exec sp_OAGetErrorInfo @o_svr
        goto cleanup
    end

-- Get the MDStores property from the Server:
exec @hr = sp_OAGetProperty @o_svr, 'MDStores', @o_mds OUT
if @hr <> 0
    begin
        print 'Error at getting Server MDStores:'
        exec sp_OAGetErrorInfo @o_svr
        goto cleanup
    end

-- Get the database from the MDStores:
exec @hr = sp_OAGetProperty @o_mds, 'Item', @o_db OUT, @Database
if @hr <> 0
    begin
        print 'Error at get database:'
        exec sp_OAGetErrorInfo @o_mds
        goto cleanup
    end

-- get the MDStores property from the database:
exec sp_OADestroy @o_mds
exec @hr = sp_OAGetProperty @o_db, 'MDStores', @o_mds OUT
if @hr <> 0
    begin
        print 'Error at get database MDStores:'
        exec sp_OAGetErrorInfo @o_db
        goto cleanup
    end

-- get the Cube from the MDStores:
exec @hr = sp_OAGetProperty @o_mds, 'Item', @o_cube OUT, @Cube
if @hr <> 0
    begin
        print 'Error at get Cube:'
        exec sp_OAGetErrorInfo @o_mds
        goto cleanup
    end

if @Partition is null -- Process the entire Cube, not just a single partition
    begin
        exec @hr = sp_OAMethod @o_cube, 'Process', null, @PROCESS_DEFAULT
        if @hr <> 0
            begin
                print 'Error at process Cube:'
                exec sp_OAGetErrorInfo @o_cube
                goto cleanup
            end
    end
else    -- just process the specified Partition
    begin
        -- Get the MDStores property of the Cube:
        exec sp_OADestroy @o_mds
        exec @hr = sp_OAGetProperty @o_cube, 'MDStores', @o_mds OUT
        if @hr <> 0
            begin
                print 'Error at get Cube MDStores:'
                exec sp_OAGetErrorInfo @o_cube
                goto cleanup
            end

        -- Get the partition to process:
        exec @hr = sp_OAGetProperty @o_mds, 'Item', @o_part OUT, @Partition
        if @hr <> 0
            begin
                print 'Error at get Partition:'
                exec sp_OAGetErrorInfo @o_mds
                goto cleanup
            end

        -- Process the partition:
        exec @hr = sp_OAMethod @o_part, 'Process', null, @PROCESS_DEFAULT
        if @hr <> 0
            begin
                print 'Error at process Partition:'
                exec sp_OAGetErrorInfo @o_part
                goto cleanup
            end
    end

-- And unlock all objects on the server:
exec @hr = sp_OAMethod @o_svr, 'UnlockAllObjects'
if @hr <> 0
    begin
        print 'Error at unlock all server objects:'
        exec sp_OAGetErrorInfo @o_svr
        goto cleanup
    end

cleanup:

if @o_mds is not null exec sp_OADestroy @o_mds
if @o_part is not null exec sp_OADestroy @o_part
if @o_cube is not null exec sp_OADestroy @o_cube
if @o_db is not null exec sp_OADestroy @o_db
if @o_svr is not null exec sp_OADestroy @o_svr
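
To run it, you'd call it with your own names, something like this (the database, cube, and partition names here are just placeholders):

-- process an entire cube:
exec ProcessCube @Database = 'SalesDB', @Cube = 'SalesCube'

-- or just a single partition of it:
exec ProcessCube @Database = 'SalesDB', @Cube = 'SalesCube', @Partition = 'Sales2008'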

posted @ Monday, October 27, 2008 3:26 PM | Feedback (12) | Filed Under [ T-SQL OLAP ]

Thursday, October 16, 2008

I'm back ... with some news!

I apologize for not posting any new content in quite some time, but now I am back and will soon start posting on a (hopefully!) regular basis once again.

The reason for my hiatus was a pretty good one, though: I recently became a father with the birth of my son Benjamin on October 2, 2008!   He is doing great and already knows that he should always do his formatting at the client and never within the database.  He's a natural!

As if that wasn't enough, I also found out recently I have been named as a 2009 SQL Server MVP!  Now I can finally get a decent seat at a restaurant by pulling the "do you know who I am?" routine.  In all seriousness, though, I am very honored and excited by the award and I intend to live up to it in 2009 by continuing to update this blog and help users out in the SQLTeam forums.

Thank you to everyone who reads this blog and participates in the discussions, and I promise more updates are coming soon.  If you have any specific topics or ideas for a blog post that you think I may be able to cover effectively, just let me know.

Best Regards,

Jeff Smith
SQL Server MVP
 
(Hey -- I like the way that looks!)

posted @ Thursday, October 16, 2008 4:04 PM | Feedback (9) | Filed Under [ Miscellaneous ]

Wednesday, August 13, 2008

Why Single-Column Primary/Foreign Keys Can't Always Accurately Model/Constrain Your Data

A Simple Single-Child Model

Consider the following logical data model:
  • There are multiple Companies
  • Each Company has many Projects
  • Each Project has many Tasks
  • Each Task has a Status, selected from a global list of pre-defined Statuses.

  Status   Companies
      \       |
       \      |
        \  Projects
         \    |
          \   |
           Tasks

Let us say that we decide that the primary keys of Companies, Projects, Tasks, and Status are all Identity (auto-number) columns, since we would like to auto-generate primary keys for those tables.

So, we have something like this:

Table       Primary Key     Foreign Key References
---------   -----------     ----------------------
Status      StatusID        n/a
Companies   CompanyID       n/a
Projects    ProjectID       Companies(CompanyID)
Task        TaskID          Projects(ProjectID), Status(StatusID)

At this point, things work fine -- we can enforce full referential integrity with our identity primary keys. We can't necessarily eliminate duplicate data, but suppose that is not an issue and/or we have unique constraints in place.  Choosing to follow the "every primary key must be an identity" rule did not have any adverse effects, and we have a nice, clean data model.

Parents with Multiple Children

Now, allow me to add one little wrinkle.  Suppose that the available Statuses for each Task are not defined globally, but at the Company level.  That is, each Company has its own list of Statuses to which a Task can be assigned.

     Companies
      /     \
     /       \
  Status   Projects
     \       /
      \     /
       Tasks

This means that the Status table now needs a Foreign Key reference to the Companies table (to indicate which Company each Status belongs to), giving us:

Table       Primary Key     Foreign Key References
---------   -----------     ----------------------
Companies   CompanyID       n/a
Status      StatusID        Companies(CompanyID)
Projects    ProjectID       Companies(CompanyID)
Task        TaskID          Projects(ProjectID), Status(StatusID)

Are there any other changes we need to make to this data model?  Or is simply adding a CompanyID column to the Status table enough to facilitate this change?  Remember, our goal as always is full referential integrity using primary and foreign key constraints wherever possible.

Well, there is an issue:

Nothing in this data model stops us from assigning a Status to a Task that is not defined for that Task's parent company.   We have no way of enforcing this right now with our current constraints.  Our physical data model is flawed.  (This is also described a bit here.)

Identities Are Part of the Solution

This is very easily fixed, but it can only be done by violating the "all tables just need an identity primary key" rule.

First, remember this: just because an identity column is unique on its own does not mean it must be the entire primary key -- it can be one column of a composite primary key.  So, even though you might decide that you really like integer primary keys for your Status table (when a simple CHAR(2) code is probably all you need), if Status codes are child entities of Companies then we can instead create a composite primary key like this:

Table          Primary Key             Foreign Key References
---------      -----------             ----------------------
Status         CompanyID,StatusID      Companies(CompanyID)

So now, with the two-column primary key and one of those columns referencing another table, Statuses are truly a child entity of Companies.  Even though it seems that a two-column primary key is unnecessary since the StatusID column is unique on its own, you will see that to physically enforce the very simple logical model we are working with, it's required!

Now, just as Statuses are child entities of Companies, so are Projects.  And as with StatusID, we have a nice identity ProjectID column that is unique enough on its own that it could be our primary key -- but it shouldn't be.  Once again, because Projects is a child entity of a Company, and with the goal of enforcing integrity throughout our entire model, we find that the true primary key of our Projects table needs to be a composite of CompanyID/ProjectID:

Table       Primary Key             Foreign Key References
---------   -----------             ----------------------
Projects    CompanyID,ProjectID    Companies(CompanyID)

None of this seems necessary yet -- why are we messing with our perfectly fine keys? -- but soon it will all make sense.

Looking at Tasks, we know we have a foreign key reference to Projects, and we know the primary key of Projects is now (CompanyID, ProjectID).  Also, remember that we changed our Status table as well to have a composite primary key including the CompanyID column.  Thus, we now know that we need to add a CompanyID column to our Tasks table to enforce these FK constraints:

Table       Primary Key     Foreign Key References
---------   -----------     ----------------------
Task        TaskID          Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)

OK, so we are making our primary keys bigger when it doesn't seem necessary, and now our foreign key constraints are getting more complicated as well since they now involve multiple columns.

Why did we bother doing all this?

Because now we have full, complete referential integrity, that's why!

Examining the Results

With these changes in place, only Status codes belonging to a Task's Company can be assigned to that Task.  By propagating the CompanyID column down to our child tables, we have the column necessary to enforce full referential integrity as dictated by our logical model.

So, in the end, it looks like this:

Table       Primary Key            Foreign Key References
---------   -----------            ----------------------
Companies   CompanyID              n/a
Status      CompanyID,StatusID     Companies(CompanyID)
Projects    CompanyID,ProjectID    Companies(CompanyID)
Task        TaskID                 Projects(CompanyID,ProjectID), Status(CompanyID,StatusID)
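
For what it's worth, a T-SQL sketch of that final schema might look like this (the non-key name columns are placeholders, and I'm using column-level references for brevity):

create table Companies
(
    CompanyID int identity primary key,
    CompanyName varchar(50)
)

create table Status
(
    CompanyID int references Companies(CompanyID),
    StatusID int identity,
    StatusName varchar(50),
    primary key (CompanyID, StatusID)
)

create table Projects
(
    CompanyID int references Companies(CompanyID),
    ProjectID int identity,
    ProjectName varchar(50),
    primary key (CompanyID, ProjectID)
)

create table Tasks
(
    TaskID int identity primary key,
    CompanyID int,
    ProjectID int,
    StatusID int,
    TaskName varchar(50),
    foreign key (CompanyID, ProjectID) references Projects(CompanyID, ProjectID),
    foreign key (CompanyID, StatusID) references Status(CompanyID, StatusID)
)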

Just this little tweak in our logical model -- that each Company has its own list of Statuses -- required all of these changes in the physical database schema!  What gives?

Well, remember: if the logical model changes, the physical data model should change. If it doesn't, then you are probably not modeling it correctly or accurately to maintain strict data integrity in the first place.

"All-identity-all-the-time" data modelers often claim that only using identities makes things easier and "more flexible". As we can see from this example, it can work in some cases (back when Statuses were global), but not in every case.  Sometimes, if your goal is to enforce strict data integrity, you simply cannot create your physical schema that way.  If you want to approximately model your data, then all identities can be nice. If you want to accurately model your data, then it simply is not always possible with only single-column identity keys as clearly demonstrated here.

"Identifying" When Using Only Identities Will Not Work

The Companies table originally had just one child: Projects.  But once we determined that Status codes are  also a child entity of Companies, we now had multiple children for the Companies table: Projects and Statuses.

Now, that in and of itself is fine; at this point, there is no reason to change anything.  We can have both of those entities reference the Companies table via CompanyID, and both of those tables can have single-column identity primary keys.

However, also in our data model, we see that at some point a child entity of Projects (Tasks) requires that we relate Projects to Statuses.  Suddenly, the fact that Projects and Statuses have a common parent -- Companies -- is very important.   This is because we cannot allow a Project from one company to be assigned a Status from another.  Once we identify this situation, we realize that the way to handle it is that the primary keys of both the Projects table and the Status table must also include the CompanyID column.

With the composite primary keys in place on those children with the common parent, we can now relate them to each other as needed, and we will have full data integrity between those two entities and any of their children.  You can see this in our new data model, as it will not allow you to assign a Status to a Task unless that Status has been defined for the Task's company.  We could not enforce this without ensuring that the Task table has a fully normalized, non-redundant CompanyID column, and the way to do that was to ensure that the primary key of Projects was CompanyID/ProjectID and that the primary key of Status was CompanyID/StatusID.   The rest follows naturally when assigning the foreign key constraints.

Another Example

Suppose we have Stores, and each Store has Departments, and each Department has Employees. That is a straightforward, simple data model where no two entities share the same parent, and we can just use single-column identities all the way through and be fine.

But what if each Store has multiple Locations, and each Employee works at a Location?  We cannot just use LocationID as the primary key of Locations; otherwise we can assign any random Employee to any random Location, regardless of the Store they belong to.

So, our Location table needs a primary key of StoreID/LocationID.  Now we also need a StoreID column in the Employee table to relate Employees to Locations.  To get it there in a non-redundant way, we also include StoreID in the primary key of the Department table, so that the Employee table picks it up via its foreign key to Departments. With that in place, we can properly relate Employees to Locations, and we are guaranteed they will be for the same Store.

Once again, once we saw that two entities (Locations and Departments) had the same parent, and that somewhere "down the line" those two entities would be related (via Employees), it became clear that we needed to incorporate the common parent's primary key (StoreID) into the primary keys of those two child entities to enable that relation (in the Employees table) to function properly.

Summary

There are 3 main points I am trying to make here today:
  1. The "all-identities-all-the-time" approach can work well in some cases, but not in all.  If you insist on only using all identities for all of your primary keys, you're not always going to have strictly accurate data integrity.
  2. Don't forget that Identity columns can be part of a composite primary key.
  3. When a table has multiple child entities in the database, and there are any relations between any of those child entities, then all of those child entities should have a composite primary key that includes the parent table's primary key columns to facilitate those relations.
This is not to say that the "all-identities-all-the-time" approach is always wrong, just that it is not always right.  And it certainly is not always "easiest", unless you find continually scrubbing invalid data to be an easy task.

posted @ Wednesday, August 13, 2008 11:06 AM | Feedback (51) | Filed Under [ Database Design ]

SQL Server 2005 Foreign Key Constraints: SET NULL and SET DEFAULT

Most people know about cascading updates and deletes, but did you know there are two other foreign key constraint options you can use to maintain referential integrity?

Read all about them in my latest article over at SQLTeam.com.

These features, introduced with SQL Server 2005, haven't gotten a lot of publicity, but they can be very useful.  I just used the SET NULL option recently for the first time (inspiring me to put together an article on it) and it works great.
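
For example, here's a quick sketch of the SET NULL option (the table and column names are made up for illustration):

create table Customers (CustomerID int primary key)

create table Orders
(
    OrderID int primary key,
    CustomerID int null
        references Customers(CustomerID)
        on delete set null
)

-- Deleting a Customer now sets CustomerID to NULL on that customer's
-- Orders, rather than failing the delete or cascading it.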

posted @ Wednesday, August 13, 2008 9:56 AM | Feedback (3) | Filed Under [ Database Design Links Joins/Relations ]

Wednesday, August 06, 2008

Database Column Names != Report Headings

Always remember that the column names returned in a result set do not have to be the same as what you eventually output at your presentation layer.  

For example, suppose you have a stored procedure that accepts a @CurrentYear parameter and returns the sales variance between the current year and the previous year for each customer.  I often see programmers struggling with writing dynamic SQL to produce output like this:

CustomerID   2008 Total    2007 Total   Variance
----------   ----------    ----------   --------
ABC          $100          $50          $50
DEF          $200          $250         -$50

That is, the names of the columns vary based on the data; that is not a good way to return data from your database!  A much better result set to return is simply this:

CustomerID   CurrentYear   PrevYear   Variance
----------   ----------    ---------  --------
ABC          $100          $50        $50
DEF          $200          $250       -$50

Notice that with that set of columns, no dynamic SQL is needed, and the column names returned are always constant regardless of the value of the @CurrentYear parameter.
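
For example, the procedure behind that result set might look something like this sketch (the Sales table and its SaleYear/Amount columns are hypothetical stand-ins for your own schema):

create procedure CustomerVariance
    @CurrentYear int
as
select CustomerID,
    sum(case when SaleYear = @CurrentYear   then Amount else 0 end) as CurrentYear,
    sum(case when SaleYear = @CurrentYear-1 then Amount else 0 end) as PrevYear,
    sum(case when SaleYear = @CurrentYear   then Amount else 0 end) -
    sum(case when SaleYear = @CurrentYear-1 then Amount else 0 end) as Variance
from Sales
where SaleYear in (@CurrentYear, @CurrentYear-1)
group by CustomerID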

As mentioned, the fact that your data set has columns labeled "CurrentYear" and "PrevYear" does not mean that you cannot re-label them any way you like on your report or web page.

If your client code called the stored procedure and provided a @CurrentYear parameter, then it knows exactly what "CurrentYear" and "PrevYear" represent, and you can easily label the columns in the final result exactly as needed with simple formulas or a few lines of code. 

Remember that in the world of relational database programming, table names and column names should be constant -- only the data itself should change.   Focus on returning consistently structured data from your database, and let your client applications handle the labeling of columns to make them look nice. 

posted @ Wednesday, August 06, 2008 11:43 AM | Feedback (3) | Filed Under [ Techniques Efficiency Report Writing ]

Thursday, July 31, 2008

How To Calculate the Number of Week Days Between Two Dates

If the start date and end date are both week days, then the total number of week days in between is simply:

(total difference in days) - (total difference in weeks) * 2

or
 DateDiff(dd, @start, @end) - DateDiff(ww, @start, @end)*2

... since the DateDiff() function with weeks returns the number of week "boundaries" that are crossed; i.e., the number of weekends.
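
For example, here's a quick sanity check (dates chosen so that both endpoints are week days):

declare @start datetime, @end datetime
select @start = '20080728', @end = '20080801' -- Monday through Friday

select DateDiff(dd, @start, @end) - DateDiff(ww, @start, @end)*2 as WeekDays
-- returns 4 (Tuesday through Friday: counts the end date, but not the start)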

If you have a table of holidays, then you can simply subtract them out as well:

DateDiff(dd, @start, @end) -
DateDiff(ww, @start, @end)*2 - 
(select count(*) from holidays where holiday_date between @start and @end)

Now, what if the start day or the end day is on a weekend?  In that case, you need to define what to do in those situations in your requirements.

For example, if the start date is Sunday, Nov 20th, and the end day is Monday, Nov 21st -- how many week days are between those dates? There's no universal correct answer; it could be 0, or 1, or perhaps even "undefined" (null) depending on your needs.

posted @ Thursday, July 31, 2008 12:39 PM | Feedback (13) | Filed Under [ T-SQL ]

Thursday, July 24, 2008

Convert input explicitly at your client; don't rely on the database to "figure it out"

A common mistake beginners make when working with SQL is trying to format their output at the database layer, as opposed to simply doing this at the presentation layer (i.e., client application, reporting tool, web page, etc).  I've covered that quite a bit in various blog posts, but I've only touched upon another similar issue which I feel is equally important and also commonly mishandled.

In the SqlTeam forums, I often see code that accepts DateTime input in the form of a string value (say, from a TextBox on a web form) and uploads that value to the database written like this:

SqlCommand c = new SqlCommand();
c.CommandText = "insert into SomeTable (DateCol) values ('" + txtDate.Text + "')";
c.ExecuteNonQuery();

Now, hopefully even most beginners will agree that this is bad code.   The primary issue, of course, is SQL Injection.  Avoiding SQL Injection is very easy to do using Parameters.  So, let's say that you rewrite this code using parameters like this:

SqlCommand c = new SqlCommand();
c.CommandText = "insert into SomeTable (DateCol) values (@DateVal)";
c.Parameters.AddWithValue("@DateVal",txtDate.Text);
c.ExecuteNonQuery();

Looking at that, it seems we have done quite a bit better and should be happy with the code.  It works well, and no injection is possible.  But there is still an issue!  Why?  The txtDate.Text property returns a string, not a DateTime!   And, since we are not setting the data type of the parameter explicitly, the parameter being passed is a string (i.e., VARCHAR or NVARCHAR) value, not a true DateTime value.  This means that SQL Server must implicitly cast your string to a DateTime to store it in your table, and this may or may not work successfully, or as expected, depending on how the string is formatted. 

I've said it over and over and I'll say it again:  The concept of formatting dates should never be something that your database code should ever worry about.  The database layer should be accepting DateTime data from clients, and returning DateTime data to your clients. Where and how the client got the data before passing it to the database, or what the client does with the data in terms of formatting after receiving it from the database is of no concern to the database itself. 

So, we might decide that to fix this, we can simply declare the data type of the parameter explicitly:

SqlCommand c = new SqlCommand();
c.CommandText = "insert into SomeTable (DateCol) values (@DateVal)";
c.Parameters.Add("@DateVal", SqlDbType.DateTime).Value = txtDate.Text;
c.ExecuteNonQuery();

It appears that now we are in good shape, right?  Actually -- no! There is still an implicit conversion happening, because we are still passing a string value -- the txtDate.Text property -- to the parameter, not a true DateTime!

Let's try one more time.  How can we avoid these implicit conversions?  The answer to that question is always the same: convert explicitly!  Your client application is fully capable of handling the parsing, validation, and conversion of that string to a true DateTime value, so go ahead and do it:

DateTime dateval = DateTime.Parse(txtDate.Text); // plus more code to validate, of course

SqlCommand c = new SqlCommand();
c.CommandText = "insert into SomeTable (DateCol) values (@DateVal)";
c.Parameters.Add("@DateVal", SqlDbType.DateTime).Value = dateval;
c.ExecuteNonQuery();

Now we are in business!  Before we even create the SqlCommand object, we have a true DateTime value that we are ready to pass along to SQL Server.  Our SQL code doesn't need to worry about formatting, parsing, converting, or anything -- it is being passed a completely valid piece of data with the correct type.   In short, we can now be sure that whatever value we came up with for the date in our client code is exactly the value that will be stored in our database.  That's the idea, right?

So, please, don't rely on your database code to validate your input.  Don't just pass along generic string data and "hope" that at the end of the day the database can "handle it".  Eliminate the chance of anything going wrong and write your code to explicitly convert and validate any and all input before the database even comes into the picture.

posted @ Thursday, July 24, 2008 9:12 AM | Feedback (11) | Filed Under [ T-SQL .NET (C# / VB) Techniques DateTime Data ]

Wednesday, July 16, 2008

The MailBag --- Super-Sized Edition! String Parsing, Crosstabs, SQL Injection, and more.

OK, boys and girls, it's time for the mailbag!  There's lots of stuff to cover, so let's get to it!

---

Greg E writes:

Hello Jeff,

I just found your blog and wanted to know if you could point me in the right direction or possibly toss me a solution.

I am looking at a badly formed telephone number column in a MS SQL Server db. Entries contain '(555) 555-1212' or '555.555.1212', etc. Do you know how I would go about stripping out unwanted characters from the telephone numbers?

Thanks for the brain cycles.

Greg -- A simple UDF should do the trick for you.  For example, something like this:

create function NumbersOnly(@txt varchar(1000))
returns varchar(100)
as
begin
    declare @i int
    declare @ret varchar(100)
  
    select @i = 1, @ret = ''

    while (@i <= len(@txt))
        select @ret = @ret + case when substring(@txt,@i,1) like '[0-9]'
                                  then substring(@txt,@i,1) else ''
                             end,
               @i = @i + 1

    return @ret
end

With that, you can write something like this:

select ID, dbo.NumbersOnly(PhoneColumn) as PhoneNumbersOnly
from YourTable
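
Or, as a quick sanity check of the function itself:

select dbo.NumbersOnly('(555) 555-1212') as Cleaned
-- returns: 5555551212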

Over at SQLTeam's script library forum, there is a thread with a bunch of parsing functions that you may find useful if your needs are more complex.

And, in case you missed it, be sure to read this post.

---

In response to my blog post on passing arrays to stored procedures, Juan writes:

I know it is not the right solution, but I have to say it for the sake of completeness of the discussion: if the number of items in your "array parameter" is limited (say, for example, 5 or 10 items), you can always use optional parameters (i.e., assign them to null when declaring them in the SP), then insert them into a temp table or do whatever you want with them, without using dynamic SQL, XML, or string manipulation.

Great point, something I missed in my article entirely.  Sometimes, it may make sense to declare @Val1, @Val2, ... @ValN parameters if there aren't too many and there's a clearly defined limit.  Thanks for bringing that up, Juan.  The simplest solution is usually the best, and in some cases that's probably all you need.  You still have clearly defined parameters with strong typing and no parsing, and those are the main issues with CSV parameters that I wanted to avoid.

---

Marc writes:

We have three tables.  They all share the same "type" of primary key: let's say ActivityCode.  I need to pull data using an ActivityCode, but there is a catch.  If Table 1 has the data, I want to use it.  If Table 2 has the data and Table 1 does not, I want to use Table 2.  If Table 3 has the data and Table 1 and Table 2 do not, I want to use that.  The ActivityCode can be found in both Table 1 and Table 2.  Once I determine which table I am using, I will need to do several other inner and/or outer joins with other tables.  I am using JDBC.  I want to be able to do this using a single SQL statement, but I am willing to use multiple statements if it makes more sense.  I just need to keep it to a single transaction under JDBC.

Marc -- I think what you are looking for is described here.  The key is to OUTER JOIN to all of your tables, and then use a CASE expression to determine which of those joined tables has the data you need.

---

Mary writes:

I struggled with a thorny SQL problem all day yesterday and found your post on set-based thinking very helpful.  I needed to write an update query that updated a table with many records sharing the same key, from a table with the key and the corresponding new value.  The table with new values didn't exist -- I had to derive it from a different table showing the key, new value, and date (the new value changed over time).

Your observations that one needs to break the problem down into its simplest components helped me realize something else.

I made the classic rookie error of grabbing some code that did a similar type of update and trying to hack it into my solution.  When I finally realized I was going in the wrong direction (because my solution was getting messier and messier), I went back to the beginning.

I defined the problem in its simplest terms and learned I could do a simple "update A set A.value = B.value from A join B on B.key = A.key".

I didn't realize I could UPDATE ... FROM, but once the problem was simply defined, a quick question to one of our senior engineers resulted in a quick answer leading to an elegant solution.  The whole thing was completed in less than half an hour.

The moral of the story:  Define the problem first!  Don't even think about syntax until you have written a clear, concise spec for the problem just defined.  Then, if you find yourself spending an inordinate amount of time, and/or the solution seems too messy or runs too long -- google it or talk to your colleagues.

Thanks for a great blog; your post made me realize it's more about how we think than throwing code at the problem - the code should be the last thing!

Thanks, Mary!  I'm glad I could be of assistance.  The "moral" that you wrote says it all: 90% of programming isn't writing code at all, it is simply defining what your code will do -- and that's always the hardest part!

---

In response to my post on grouping by month, Mark writes:

I'm so close! I've tried all the things in this article, but can't seem to do what I want to do. I've been tearing my hair out for days! Here's what I'm trying to do.

Basically I need a SQL procedure that looks at an invoicing table and totals amounts by month/year and by quarter at the same time. Here's how my table looks:

Project ID   Date      Amount
1            3/11/08   10.00
1            4/18/08   10.00
1            6/22/08   10.00
2            3/01/08   10.00
2            9/15/08   10.00

I would like the output to have dynamic columns, so an output may look like:
Project ID  Jan'08  Feb'08  Mar'08  Q1'08  Apr'08  May'08  Jun'08  Q2'08  Jul'08  Aug'08  Sep'08  Q3'08
1           0.00    0.00    10.00   10.00  10.00   0.00    10.00   20.00  0.00    0.00    0.00    0.00
2           0.00    0.00    10.00   10.00  0.00    0.00    0.00    0.00   0.00    0.00    10.00   10.00

I would like the query to know if there was no value in Jan & Feb '08, but still list all the months in Q1.

I'm not opposed to using a calendar table, but would like to try to avoid it if possible.

Any help would be greatly appreciated!

Hi Mark -- First off, never be afraid to use a calendar table!  There is nothing hacky or unusual or tricky about them; they can make your life much easier, your code much shorter, and everything much more efficient.  If grouping by month or some other time period is important to your reporting, then defining those periods in a permanent, nicely indexed table makes perfect sense.

In this case, though, since you are outputting one column per month for a single year, I recommend simply using CASE expressions to "cross tab" your data.  You can alias your columns as M1, M2, M3 ... M12 and Q1-Q4 so that no matter what year you run the report for, your columns will be consistently named, and you can let your presentation layer handle outputting nice column headers with the current year/month for each one.

So, all you really need is something like this:

select projectID, Y as [Year],
  sum(case when m=1 then amount else 0 end) as M1,
  sum(case when m=2 then amount else 0 end) as M2,
  sum(case when m=3 then amount else 0 end) as M3,
  sum(case when m in (1,2,3) then amount else 0 end) as Q1,
  ...
  sum(case when m=12 then amount else 0 end) as M12,
  sum(amount) as Total
from
(
  select projectID, Amount, DatePart(Month, [Date]) as M, DatePart(Year, [Date]) as Y
  from YourTable
  where [Date] >= @StartDate and [Date] < @EndDate
) x
group by projectID, Y

Of course, you'd define @StartDate and @EndDate as '01-01-2008' and '01-01-2009', respectively.

---

In this comment, Stewy writes:

I have an issue with both DISTINCT and GROUP BY.

The issue is that using either one, the results comes back ordered as if using order by.

I need the unique results in the order they are in the database. How can I do this? Thanks

Stewy -- Relational databases have no obligation to store data in any specific order, or to keep track of the order that things were entered, or to return things "as they are in the database."  There is no such thing as getting data out "the way it is stored" because a relational database may move or re-order data temporarily to efficiently execute a query depending on indexes available.   You must always explicitly specify how you want your results using an ORDER BY clause.  If you want to keep track of the order that you added data to a table, you should have a "timestamp" column that records the exact moment each row was added via a DEFAULT value or a trigger.  Or, at the very least, you can use an IDENTITY.  Then, you can simply order by that column.  This is a very important concept to understand when working with relational databases.  Things are returned based on the data itself, not based on physical storage characteristics.  I hope this helps.
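
For example, if the table had an identity column, something like this sketch would return each unique value in the order it first appeared (RowID and Col are placeholder names):

select Col
from YourTable
group by Col
order by min(RowID)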

---

In response to Always Use Parameters, Karuna writes:

Hi Jeff,
Just wondering if I build the Sql in Stored Procedure (Dynamic Sql) based on the parameters passed to stored proc, will it still be a possible candidate for Sql Injection? Basically I want to build the Sql in the stored procedure instead of doing it in .Net code as displayed in the article.

Dim cm As New SqlCommand("", YourConnection)
cm.CommandText = "DELETE FROM YourTable WHERE ID=@ID "
cm.Parameters.Add("@ID", SqlDbType.Int).Value = ID

If Name <> "" Then
cm.CommandText &= " And Name=@name"
cm.Parameters.Add("@Name", SqlDbType.VarChar).Value = Name
End If
If TranDate <> DateTime.MinValue Then
cm.CommandText &= " And TranDate = @TranDate"
cm.Parameters.Add("@TranDate", SqlDbType.DateTime).Value = TranDate
End If

Hi Karuna -- you are absolutely 100% safe from SQL Injection by doing this.  Remember, SQL Injection is not about general SQL concatenation or about building a SQL statement dynamically.  It can only happen when you concatenate user input into a SQL string and execute it.  If you put together a big SQL statement via concatenation but you only incorporate user input via parameters, there's no need for scrubbing data or worrying in any way about SQL Injection -- it will never happen, under any circumstance.

Avoiding SQL Injection is the easiest thing in the world -- simply do things the easy and correct way and you'll never need to worry about it.   It's like if there was a big controversy in the news about thousands of people crashing their cars because they are driving them with decorative tin foil covering their windshields, and asking the experts "how can we solve this crisis?"   Should we cut holes in the tin foil, or add mirrors, or incorporate a camera and a tv monitor?  Uh .. no.  You should just take the tin foil off of your windshield and do things the easy, simple and correct way and don't make things over complicated.   That's basically what this whole SQL Injection thing is about -- bad programmers doing stupid things when all they need to do is write decent code the easy way -- simply by using parameters.

---

Gocs writes:

I have tried to compute the number of hours between datetime values in MS SQL 2005.  However, I am not sure the result is correct.  Do you have any idea how to do it correctly?


Gocs -- I think you really need to read this very carefully.  I'll be waiting!

posted @ Wednesday, July 16, 2008 4:32 PM | Feedback (0) | Filed Under [ Miscellaneous ]

Wednesday, June 11, 2008

The Golden Rule of Data Manipulation

Introduction

There is a very simple rule when it comes to storing (and returning) data, which I see violated all the time, making life so much more complicated for everyone involved.  In case you haven't noticed, that's a common theme I discuss here on this blog -- different ways programmers make life more difficult for themselves, instead of simply following good practices and doing things the easy way.  This is yet another example of that situation.

The "Golden Rule of Data Manipulation" is a simple, but important rule that you should always follow when designing a database,  writing database code, or really writing any application code at all for that matter:

"It is always easier and more flexible to combine data elements rather than to break them apart"

In other words: Concatenation is easy. Parsing is hard.  Often, very hard -- or even impossible depending on the data.

Problems with Parsing

It is amazing how often I see people struggling with "difficult SQL problems" such as:
  • Working with CSV lists of values in a single column, such as "1,3,56,2"
  • Breaking out a FirstName/MiddleName/LastName/Suffix from a single "Name" column
  • Parsing address strings into City/State/ZIP, or Number/Street/Unit
  • Parsing Phone Numbers to get just an area code, or to take different phone formats and present them all uniformly
  • Figuring out how to calculate the Day, Month, and/or Year from different string values such as "23-Jan-08", "2008-02", "20070303", "03032007"
And on and on it goes....

Now, sometimes you inherit or import data that needs to be parsed -- that's a fact of life.  You've got to figure out how to do it, and the key in those cases is to accept that, because the data itself is essentially random, nothing you write will work perfectly 100% of the time on all of it.  Often, the best you can do is handle most of the data, and then do some manual clean-up.

Parsing strings can be a very difficult task for any programmer, and the challenge isn't writing the code, it's coming up with the algorithm (another common theme on this blog).  Consider my new favorite example of why parsing a single Name column into a First/Middle/Last is not as easy as it seems:

    Oscar De La Hoya

How would your algorithm parse that one?  Never mind prefixes such as "Dr." and suffixes such as "Jr."! 

Please don't interpret what I am saying as a programming challenge -- I understand that it is possible to write long code with a list of exceptions or rules and have that algorithm work pretty well in most cases.  The point is that writing that algorithm is a lot of work, running it will be inefficient, and it will never be exact because the data itself that is being processed is essentially random.  It's just like the old saying: "garbage in, garbage out".  Still one of my favorites, after all these years, and it still applies!

A Data Model that requires Parsing = A Poor Data Model

So, we need to accept that sometimes you've got to parse data like this.  And that's OK; it happens, it can be done, even if some manual work is often involved.

However, there's no excuse for designing your database, your SQL code, or your applications so that free-form data must be parsed, when you could simply design things correctly in the first place and store your data already broken out into the smallest possible units, with the correct data types.

If breaking out a contact's name into First, Last, Middle, etc. is important to your application, then you should force the point of data entry to accept input broken out into those columns.  The same goes for phone numbers, addresses, and so on.  Any time you have the option of accepting input as clean, short, raw segments of data, you should take it.   Once you have data at that smaller resolution, it is trivial to combine it any way you want for presentation, formatting, filtering, and so on.

It may seem like overkill to break out a phone number into 4 columns:

AreaCode
Exchange
Number
Extension
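
For what it's worth, here is a sketch of what that might look like in T-SQL (the lengths, CHECK constraints, and table name are assumptions, and US-centric):

create table ContactPhones
(
    ContactID int not null,
    AreaCode  char(3)    not null check (AreaCode  not like '%[^0-9]%'),
    Exchange  char(3)    not null check (Exchange  not like '%[^0-9]%'),
    Number    char(4)    not null check (Number    not like '%[^0-9]%'),
    Extension varchar(6) null     check (Extension not like '%[^0-9]%')
)

-- presenting it any way you like is then just concatenation:
select '(' + AreaCode + ') ' + Exchange + '-' + Number +
       coalesce(' x' + Extension, '') as FormattedPhone
from ContactPhones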
   
And, in fact, it might be more complicated than that if you need to deal with international phone numbers.  You may look at your tables, and your code, and even the UI that accepts these fields and think "that is way too precise and unnecessary, breaking out phone numbers like this sure makes things complicated!"

But by doing this, and only accepting user input that follows precise rules of what is allowed in these fields, and storing each of them in their own column, you can now easily and efficiently:
  1. Sort these numbers any way you want, without worrying about extra characters like parentheses, dashes, or leading 1s messing things up
  2. Filter quickly on an area code without needing LIKE, and again without extra characters getting in the way
  3. Present the phone number quickly and easily any way you want without any parsing, be it as 123.123.1345 x123, "(123) 123-1345 extension 123", or anything else
  4. Validate your phone numbers, ensuring you have all the necessary parts at the proper lengths, without parsing strings
Consider trying to do any of those things when your data is stored in random strings like:

1-123-124-1234
(123) 124-1234
123-124-1234, ex. 123
123.124.1234 x123
(123)124.1234 ext. 123
1 123 124 1234 123
   
and so on ...  Not so easy in that case, just as parsing simple "Name" columns into First/Last, or addresses into Number/Street/Unit, is not so easy either.

Again, this is not a programming challenge -- I am sure it can be done. (In fact, phone numbers are generally the easiest, because you can usually just ignore anything other than digits.)  Most of us have done it before.  But designing something in such a way that parsing is required to do simple filtering, sorting, or formatting is a bad design.

It's Not About the UI

As I wrote here, you should never think "I want to display phone numbers like 123.123.1234, so I should store them and return them that way."  You should always think "How can I break this down into small, concrete parts that are easily validated and easy to combine any way I want at any time?"

So, what if you need fine detail when storing addresses, but you don't want your UI to present Street Number, Street Name, Unit Type, and Unit Number as separate data entry fields for usability or aesthetic reasons?  That's fine, but it doesn't mean you should not set up your database properly.  Your UI can certainly still present one single "Address" text box for the user to fill out, parse that text at data entry, show the user the parsed result in multiple fields, and ask "Please verify your address" or something along those lines.   If the parse got something wrong, the user can tweak the results and save them.  If you do things along those lines, and focus on getting the data parsed and stored correctly at the earliest point possible, every other part of your code will be that much more efficient.

All of this applies not only to data storage, but to how data is returned and passed between tiers as well.  Again, if you just return separate columns to your client application, instead of focusing on making them "look nice" in your database code by returning nothing but long, "pre-formatted" strings, your client can simply concatenate and format those columns any way it needs.  And, different clients can format that same database output in different ways -- all without ever altering any database code! 

Conclusion

In short, remember that concatenation is easy, efficient, and exact.  Writing a parsing routine, on the other hand, is often none of those things.   You may not always be able to control the design of the data you are working with, but when you can, be sure to do it right.  If you find yourself using lots of LIKE expressions or string parsing for simple data retrieval operations, something is wrong.   Time to fix up your database and your code, store the parsed and validated data permanently, and make things easier and cleaner for everyone.

Whether you are designing a schema, writing a SELECT, or writing code in any other programming language, remember that the Golden Rule of Data Manipulation always applies.  Accept this rule, learn from it, and practice it, and you might be surprised to find that programming isn't quite as hard as you thought it was.

posted @ Wednesday, June 11, 2008 11:48 AM | Feedback (9) | Filed Under [ Database Design ]
