Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

Thinking Set-Based …. or not?

Thinking "Set-Based"

So, I hear you're a "set-based SQL master"! 

As Yoda once said, you've "unlearned what you have learned". You've trained yourself to attack your database code not from a procedural, step-by-step angle, but rather from the set-based "do it all at once" approach.   It may have taken weeks, months or even years to finally obtain this enlightened state of "database zen", but it was worth it.  Your SQL code is short, fast, and efficient.  There is not a cursor in sight.   You have reached the point where you can write a single SELECT that replaces hundreds of lines cursors, temp tables and client-side processing.  Life is good.

As I read somewhere once, you don't tell SQL how to do it, you tell SQL what you want, and that's a great way of thinking about it.  A procedural programmer gets bogged down with the details, and has to concentrate on breaking things down into small pieces, explicitly reading and processing one row of data at a time, and figuring out how to combine those results together at the end to make it all work.  A set-based SQL programmer worries about none of those things:  In the set-based world, you state your relations and join the tables together, add some grouping and criteria, and it is the database engine that worries about the specifics. 

Well, maybe not … You might not want to abandon all of the things that you learned from your procedural background.  There's a danger in misunderstanding that set-based programming means "doing it all at once", and thinking that it forbids processing things "one at a time" or "in steps".  Sometimes, when you get too comfortable in the set-based way of thinking, you abandon the good things that you learned as a procedural programmer.  The two mindsets aren't as different as you might think!

Approaching a Problem

What if I ask you to write a somewhat complicated SELECT, something like this:

"Write a SELECT that returns, for a given @Year, the total sales by office, and also the office's top salesperson (highest total sales for the year) with their salary (as of the last day of that year), their total bonuses for that year, and their hire date."

While this isn't rocket science, what makes this request slightly complicated is that it appears there are at least 3 different transactional queries (sales by employee, sales by office, bonus totals by employee) that we need to put all together, as well some point-in-time reporting off of a history table (employee salaries) which can be difficult depending on how the table is structured.

Now, how does a "set-based" programmer attack this?  The schema and the specifics are not important, it is really just the general approach that I am commenting on.

Do you start by immediately finding all of the necessary tables and put them all into 1 big SELECT by joining everything that matches?  Then, from there, you may start adding columns and expressions to your GROUP BY clause, adding in criteria and CASE expressions, maybe a DISTINCT before it all?  And then, if that doesn't work, maybe you add some correlated subqueries to your SELECT list, or move things in and out of derived tables?  Then more GROUPING, more criteria, more JOINs, more moving things and shifting parts of the SELECT around until it "looks right" and it "seems to work"?

Well, that does seem to be the set-based approach  for many, since you get so trained and so used to thinking of the "big picture", and not worrying about details, that you just assume that you can dive right in and start joining and selecting and eventually you'll get there.  We've all done it.  That's what you want to do, after all. We don't want to think that we need to break things down into smaller, discrete steps, or that things should be "processed" on step at a time.  It goes against everything that we've been trying to train ourselves to do ever since we embraced the concept of relational database programming, right?

Wrong!

Thinking in Sets = Thinking in Steps

It is so important to understand that "thinking set-based" does not conflict with "thinking in steps" !!  In fact, it is more important than ever in some ways, especially as your data and your schemas and your requirements become more complex.

In the above example, if you "dive right in" and start joining and selecting and grouping and seeing how things work, that is exactly the wrong way to do it!  You need to remember that the skill you learned from your procedural world -- breaking larger problems down into smaller parts -- still applies even in when writing SQL.

Looking at the above statement, a really good SQL developer will immediately break the problem down into smaller, completely separate parts:
  •   a SELECT that returns 1 row per Office, with each Office's total sales for a @Year
  •   a SELECT that returns 1 row per employee, with their salary as of the last day of a given @Year
  •   a SELECT that returns 1 row per employee, with their total bonus amount for a given @Year
  •   a SELECT that returns 1 row per Office, with the top salesperson (Employee) and their sales amount,  for a @Year
Starting with those 4 basic pieces, all of which are completely isolated from the others, is the way to begin to approach the problem.  You don't focus on returning employee names, or sorting, or formatting dates -- you focus on the data, and returning it in small parts that will eventually all fit together.   For each SELECT, you can test it and optimize it and verify the data, and only at the very end, when all the individual parts are working, do you put them together.   This sounds familiar, doesn't it?  Much like a procedural programmer who breaks their application down into smaller parts via functions or classes or whatever tools their language provides, I am suggesting that the overall approach is still valid and in fact a great idea even when writing SQL!

In fact, when writing a SELECT that requires multiple non-related transactional tables this is really the only way to go about solving this problem, since each one must be fully grouped and summarized and ready to join on matching key columns before we can begin to even think about combining the results.  In this case, it is only at the very end, when all of our individual SELECTs are grouped by Office or Employee, that we join them together as derived tables.

In addition, the "step-based" approach involves understanding that things like formatting dates, deciding on how to output a name (first/last or last/first, etc), or sorting is irrelevant to the larger problem.  In a complicated SQL statement with lots of calculations or point in time reporting, if you can write a SELECT that returns one row per employee (determined by the employee's primary key column, let's say EmployeeID), that is all you need to return;  if you know that the Employee table has first name, last name, hire date, and a simple relations to their Department, then don't worry about any of that until the very last step!  Just focus on returning a reference to the entity (EmployeeID) and calculating the results or values that you are trying to return per entity (total sales, salary, bonus), and only when everything is accurate and correct should you dress things up with the other attributes of the entity which are trivial to obtain (employee name, hire date) through simple joins.

Putting it all Together

In the end, it really does resemble procedural programming quite a bit in that each of these little, self-contained parts, all of which are responsible for doing their job accurately and efficiently, are much like functions or classes.  And our primary SELECT is like the main program that calls each of them and in the end puts them all together:

select OfficeSales.OfficeID,
       OfficeSales.TotalSales as OfficeSales,
       Offices.OfficeName,
       TopSalesPerson.EmployeeID,
       TopSalesPerson.TotalSales as EmployeeSales,
       Employees.EmployeeName,
       Employees.HireDate,
       EmpSalaries.Salary,
       EmpBonus.Bonus,
from
    ( .... ) OfficeSales
inner join
    ( .... ) TopSalesPerson on OfficeSales.OfficeID = TopSalesPerson.OfficeID
inner join
    ( .... ) EmpSalaries on TopSalesPerson.EmployeeID = EmpSalaries.EmployeeID
inner join
    ( ... ) EmpBonus on TopSalesPerson.EmployeeID = EmpBonus.EmployeeID
inner join
    Employees on TopSalesPerson.EmployeeID = Employees.EmployeeID
inner join
    Offices on OfficeSales.OfficeID = Offices.OfficeID

When all the code is in place, this will probably be a very large, complicated SELECT.  But looking at this way, doesn't it look pretty simple?   And each of those derived tables, on their own, will also be quite simple.  That's the approach we want to take!

(note: In addition to using derived tables, you can use Common Table Expressions to facilitate this approach, since they work essentially the same way but are often easier to read and incorporate into your complicated SELECT statements.  Views and parameterized User Defined Functions can be useful as well.  The same concepts still apply -- divide and conquer!)

Only now, at the very end, do we worry if some of those joins should become LEFT OUTER JOINs, since maybe some employees might not have a bonus for a given year, and so on.  Getting the employee's Name and HireDate and the name of each Office is done here, at the very end, where it is very easy and clear since we have just focused on returning the key columns for both of those entities in our derived table results.

Think again!

So, the next time you dive right into and start joining and selecting because you know that a "set-based master" doesn't worry about breaking down the details, consider instead becoming a "step-based set-based programmer", and break down your large problem into smaller, easily solvable steps.  Even in T-SQL, this is the way to go and it will make your life easier, your code simpler, and often more efficient as well.  Don't completely disregard your past experience as you become a relational database programmer, learn how to combine the best of both worlds.

see also:

Legacy Comments


Jon
2007-05-01
re: Thinking Set-Based .... or not?
Very good post. Nice job jeff.

jt
2007-05-07
re: Thinking Set-Based .... or not?
very helpful! Sending this link to my SQL and non-SQL Developer team mates!.
thanks

Henri
2007-05-10
re: Thinking Set-Based .... or not?
Nice read, though set based is "da bomb", set-based can come in a lot of ways.

Working with temp tables is still set-based. I am no a fan of complicated query's because when somethings changed (after some time writing the initial query) it's hard to understand everything that happens).

Create a temp table and fill some fields step-by-step gives also the change to comment some code in a clear manner.

However, it was just an example.

Keep up the good work!

Damien Guard
2007-05-21
re: Thinking Set-Based .... or not?
Nice start to an article but "putting it all together" leaves me feeling cheated - the section doesn't put it all together as the problem doesn't get a solution but rather the start of one as the derived tables are ignored.

I assume these derived tables are views? If so, be prepared for disappointing performance on SQL Server.

[)amien

Jeff
2007-05-21
re: Thinking Set-Based .... or not?
Damien -- those are derived tables, as indicated quite a few times in the article, though of course you CAN use a view, or a common table expression (CTE) (as also indicated). The point of the article wasn't to give you a solution to my made up problem (I didn't give any table structures or any specific information, it was just an abstract example) but to explain the advantage of breaking things down into separate parts to make solving these problems easier.

As for disappointing performance, using views or otherwise, can you please clarify? The fact of life is that for complex SQL statements you *must* use derived tables, temp tables, views or CTE -- you cannot directly join two or three non-related transactional tables together without first summarizing them by common keys. I have never seen a view perform any worse than derived tables or CTE's, if you can show an example that would be great.


Richard
2007-06-10
re: Thinking Set-Based .... or not?
Hi Jeff, nice article. Very affirming. I do that every day with the monster queries I have to write for SSAS cubes and RS reports.

Jack V
2007-06-15
re: Thinking Set-Based .... or not?
Useful to go back to basics , when you get caught up in complicated queries




Jack
Need an IT job? http://www.itjobfeed.com

dmaclay
2007-08-14
re: Thinking Set-Based .... or not?
A big advantage that isn't mentioned is that for very large data sets (i.e. half a years cell phone traffic) selecting into tempory tables allows you to discard surplus data early, while giving you the option of building a index on some temp tables. I've seen this cut the run time of a query from hours (really) to about ten minutes.

Paul Brown
2007-08-14
re: Thinking Set-Based .... or not?
I would agree with you on almost all points, in fact this is the only way that I have ever been able to do complex queries; the idea of building it all in one go and tweaking until it works gives me the shivers. My only point of complaint would be this line, "For each SELECT, you can test it and optimize it and verify the data". Test it, yes. Verify it, well that sort of comes under testing, but anyway, yes. Optimise it, no. SQL queries are funny little beasts and you can never quite tell where the bottleneck is going to be. My rule with all things SQL (and, indeed, with all programming generally) is that optimisation is always the very last step. Make it work first, then make it work fast, because once you join all of those parts together you might find a bottleneck in a surprising place. As someone already mentioned earlier, the ability to dump unwanted data early can make quite a difference so, to use your example, there could be millions of unwanted employee records that you could count out in the first join and make the rest much quicker.

That said, the best way to optimise is not to second guess your query / server, but to use an actual query analiser; that way you will be working from fact instead of guesses and hunches.

Jeff
2007-08-14
re: Thinking Set-Based .... or not?
Hi Paul --

Great comments, thank you. I absolutely agree with what you are saying, but you can and should definitely still do basic optimization of your individual SQL "parts" before putting them together. Not optimizing as "using compiler tricks" or hints or little tweaks or hacks like that, but basic things like ensuring that your statement is using available indexes by not filtering on expressions if possible, and so on. Just basic stuff -- make sure each part is short, clean, concise and efficient and then put it all together. You are correct in that ultimately, how the optimizer chooses to put the whole thing together may render your optimizations useless -- or worse, counter-productive -- but in general what is nice about SQL is that optimizing almost always equals writing things in a short, efficient and clear manner.

Regina Obe
2007-08-15
re: Thinking Set-Based .... or not?
Interesting article. Although what you are doing I wouldn't quite call Procedural thinking. The one thing I noticed changed for me when I changed from a procedural thinker to a set based thinker is that in certain cases I realized timing does not matter.

I think naive procedural thinking thinks in sequential steps which puts you into the trap of thinking you must do A before B. What I got when I switched roles was this problem is problem a,b,c,d -> hey b and c are the same problem with argument i switched for argument j.

So in the end I think of myself not as a set based procedural thinker, but rather a set-based parallel thinker.

--Jeff Moden
2007-10-26
re: Thinking Set-Based .... or not?
I started reading the article with a highly skeptical eye because it's been demonstrated to me that most folks don't think correctly when it comes to SQL. That is, until I saw the familiar name of "Jeff Smith" from many another post I've seen. So far as this article goes, all I can say is "WOW!" Not only do you have the same name, but this article is pretty much spot on with everything I try to teach to my developers and DBA's as well as the spirit of my posts on the forums I participate in.

For the skeptics in the crowd, I'll vouch for the method of thinking that Jeff Smith has portrayed in this article... I've seen, time and again, where employing these methods to rewrite 8 to 24 hour runs has brought them down to mere shadows of their previous selves and typical operating in 10 to 30 minutes instead. To wit, I recently rewrote a job that was taking 24 hours and more to sometimes fail... I rewrote the job using the very same techniques Jeff Smith portrays in this article... the job now takes only 22 minutes and does 50% more work than it did before (3 months of post processing instead of just 2).

Huge side benefits of this type of thinking... no one will have to "optimize" your code... it's already optimized if you use methods like this. Troubleshooting code and making modifications becomes a lot easier because, instead of trying to unknot a huge hairball of twisted code, you can troubleshoot/modify one piece at a time.

Something else everyone needs to be made aware of... the sum of the parts is greater than the whole... the indivdual derived tables may take much more time to execute individually because they don't enjoy the added filtering caused by the joins. In other words, if you have, say, 5 derived tables that take 10 minutes each to resolve when individually executed, don't think that the entire query is going to take 50 minutes to run... the implicit criteria cause by the correct joins may cause the overall query to run in sub-second times.

Jeff Smith... very nice job on this article. I was looking for a good article on how to think "set based" to teach some of my Developers and DBA's with... this article will figure prominently in those teachings. Thank you, Sir.


Jared Hunt
2007-11-09
re: Thinking Set-Based .... or not?
I began studying SQL in January. Since then I have learned much from many books, co-workers, and the web. Lately, I have been struggling with some updates and (unfortunately) have been attacking it as described in the first paragraph. While scripting the update this week, I knew there must be a more efficient way. This article precisely identifies the way in which a SQL problem should be analyzed. THANK YOU!!!! I believe I can effectively re-evaluate my task, making significant improvements to my code.

During all my studies in IT, I've found the simple solutions are always the most effective.

Jack
2007-12-11
re: Thinking Set-Based .... or not?
This article gives me some good clarity. Quite often this is the key problem in solving complex sql based problems i.e the inability to streamline the sql statement , so it's all returning the relevant records .

dennis
2008-03-14
re: Thinking Set-Based .... or not?
I really liked the article, but I am not sure that there is a big distinction between sets and steps based on this approach. Mathematically, as set can be defined either by enumerating the elements or giving a rule for set membership. What I thought I saw you do was provide the rule or definition for four separate sets which you then brought back together as part of your answer.

Is that a fair statement about your process?

Jeff
2008-03-15
re: Thinking Set-Based .... or not?
dennis -- yes, that is correct. The point is simply that processing data in sets does not mean you cannot break the problem down into smaller parts.

--Jeff Moden
2008-04-10
Nicely done!
Outstanding blog, Jeff! Although the whole post is well written and makes several great points, I particularly liked the introduction. "Set-based", indeed, does not mean "doing it all in one query". Even the MS provided procs will frequently use some form of temporary storage, be it a derived table (as is prevalent in your example code), a CTE, or even a temp table to hold some interim result. Along with "set based" thinking, people must remember the "Divide and Conquer" nature of approaching any programming problem. This type of thinking is so very important in both the single row processing necessary in GUI interfaces and in batch programming… troubleshooting/maintenance is easier… its easier to read/follow… allows better opportunities for embedded documentation… and, for numerous reasons, frequently allows better performance tuning opportunities.

Like I said, very well done… This blog entry should be required reading for everyone even if they think they're a set-based "Ninja" already. It's certainly a "keeper" for me. Thanks for taking the time to write it.

--Jeff Moden
2008-04-10
re: Thinking Set-Based .... or not?
... heh.. and, Yes, I know I already posted once expressing similar kudos. I need to revisit this blog and, even though I'd previously responded, was impressed yet again and had to say so. Thanks again, Jeff.

Russ
2008-07-23
re: Thinking Set-Based .... or not?
Excellent! I just stumbled across this as I was googling to try and find a succinct description of the difference between set-based and row-based programming. Very well expressed and funnily enough, its pretty much the way I do things myself. Mind you, given the non-related nature of the varying data-sets I work on, I think this is the ONLY way that a sensible query can constructed and more importantly, by having the simple sub-queries, it gives me a lot more confidence that I am getting the data that I ask for rather than data that looks like what I asked for ;-) I think I shall bookmark this.....Nice One.

Colin
2008-11-19
re: Thinking Set-Based .... or not?
Excellent article. Thanks especially for the quote, "You tell SQL what you want". Hahaaa. The devil is in the nasty little details that MSSQL's engine can usually do on its own.

Josh
2009-01-17
re: Thinking Set-Based .... or not?
* a SELECT that returns 1 row per Office, with each Office's total sales for a @Year
* a SELECT that returns 1 row per employee, with their salary as of the last day of a given @Year
* a SELECT that returns 1 row per employee, with their total bonus amount for a given @Year
* a SELECT that returns 1 row per Office, with the top salesperson (Employee) and their sales amount, for a @Year

What I would do in this case is to create a scalar valued function for each of these tasks, and then create a stored procedure that applies those functions to a dataset. Would look something like this:

declare @tmpTable (
fldOfficeId int,
fldOfficeName varchar(150),
fldTotalSales decimal(25, 8),
fldTopSalesperson int,
fldSalaryEndOfYear decimal(25, 8),
fldTotalBonus decimal(25, 8)
)

insert into @tmpTable (
fldOfficeId,
fldOfficeName,
fldTotalSales,
fldTopSalesperson
select
fldOfficeId,
fldOfficeName,
dbo.fnOfficeTotalSales(fldOfficeId, @Year),
dbo.fnTopSalesperson(fldOfficeId, @Year),
from
tblOffice

update @tmpTable
set
fldSalary = dbo.fnEmployeeSalaryEOY(fldTopSalesperson, @Year),
fldTotalBonus = dbo.fnEnployeeTotalBonus(fldTopSalesperson, @Year)

select * from @tmpTable -- to get the final output

What do you think of this? To me this code is very easy to follow, and if you need to know details you step into the functions.

I see it as a set based approach, with the addition of modular thinking and function calls common in procedural programming.