Jeff Smith Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server

Rewriting correlated sub-queries with CASE expressions

Here's a very common situation that is very easy to optimize and simplify, submitted via the mailbag:

Nate writes:

Hey, I have a read a bunch of your stuff on your blog and you seem to

 be right on the money. I thought maybe you would be able to point me in
 the right direction and possibly address this issue on your blog so
 others could benefit from your understanding.

I have been searching for the best way to do what I think should be a
 simple task in SQL. I have a table full of call history events and I
 want to get a summary of some events for each calling party that matches
 the form "username(extension)". Using correlated subqueries I do the
 following.

SELECT calling_party,

(SELECT SUM(end_time-start_time) AS total_time FROM `event` ei WHERE
 event_type=4 AND ei.calling_party = eo.calling_party) AS
 total_talking_time,

(SELECT SUM(end_time-start_time) AS total_time FROM `event` ei WHERE
 event_type=7 AND ei.calling_party = eo.calling_party) AS
 total_ringing_time,

(SELECT MAX(end_time-start_time) AS total_time FROM `event` ei WHERE
 event_type=4 AND ei.calling_party = eo.calling_party) AS max_talking_time

FROM `event` eo WHERE calling_party LIKE '%(%)'

GROUP BY calling_party

This works, but ends up taking a really long time to run. I figured
 that each subquery was getting executed multiple times. I made a quick php
 script that returns the same information but uses multiple queries. I
 just query all of the "groups" (SELECT DISTINCT calling_party FROM
 `event` WHERE calling_party LIKE '%(%)') and then iterate over that
 recordset plugging the actual calling_party value into the where clause of the
 aggregate query. The output is identical, but this script runs in well
 under a second.

So the question is what is the correct way to do this using SQL? Any
 help would be really appreciated.
 
Hi Nate --

The script as written is making 4 different calls to the calling_party table, and we can reduce this to one call to the table by using CASE expressions to only aggregate the data we need for each column.  This is pretty much the standard "static cross-tab" technique to summarize data into multiple columns in a single SELECT.

So, all we need to do is re-write your SELECT like this:

select
  calling_party,
  sum(case when event_type=4 then end_time-start_time else 0 end) as total_talking_time,
  sum(case when event_type=7 then end_time-start_time else 0 end) as total_ringing_time,
  max(case when event_type=4 then end_time-start_time else 0 end) as max_talking_time
from
  event
where
    calling_party LIKE '%(%)'
group by
  calling_party

That will return the same results as what you had written, only it should be much more efficient.  It is also much shorter.  Notice that the CASE expressions ensure that only the data with the specified event_type is including in each aggregate calculation, otherwise we aggregate a value of 0 which has no affect on the results.  You can also use a default value of NULL instead of 0 if you'd like, depending on your desired results.

If you have more data in that table for event_types other than 4 and 7, and you only want to return calling_party rows that contain data with those event_types, then be sure to add

and event_type in (4,7)

to your WHERE clause, which will increase the efficiency even further.   However, adding that will return different results from your original SELECT (it returns all calling_groups regardless of event_type) so be sure that what you do meets the specifications you require.

As always, be sure that you have proper indexes on all of your tables, and in this case it seems that event_type should certainly be indexed.

see also:

Legacy Comments


Perfect!
2008-01-09
re: Rewriting correlated sub-queries with CASE expressions
Thanks! That is EXACTLY what I was looking for... thanks for the explanation, it was spot on as always.

Joe
2008-02-29
re: Rewriting correlated sub-queries with CASE expressions
That just saved me hours of work ... THANKS!!!!!

otis
2010-06-26
re: Rewriting correlated sub-queries with CASE expressions
Hi,

Is it possible to apply the same to this sample query?

select distinct a.varchar_Field1
,a.varchar_Field2
,a.varchar_Field3
,a.varchar_Field4
,a.date_field1
,(select sum(b.number_Field1)
from TABLE1 b
where b.varchar_Field1 = a.varchar_Field1
and b.varchar_Field2 = a.varchar_Field2
and b.varchar_Field3 = a.varchar_Field3
and b.varchar_Field4 = a.varchar_Field4
and b.date_field1 <= a.date_field1
) as Total
from TABLE1 a


The query should give me a running total for a given "varchar_Field4"