Rewriting correlated sub-queries with CASE expressions

Wed Jan 9, 2008 by Jeff Smith in crosstabs-pivoting-data, efficiency, joins-relations

Here's a very common situation that is very easy to optimize and simplify, submitted via the mailbag:

Nate writes:

Hey, I have a read a bunch of your stuff on your blog and you seem to
be right on the money. I thought maybe you would be able to point me in
the right direction and possibly address this issue on your blog so
others could benefit from your understanding.

I have been searching for the best way to do what I think should be a
simple task in SQL. I have a table full of call history events and I
want to get a summary of some events for each calling party that matches
the form "username(extension)". Using correlated subqueries I do the
following.

SELECT calling_party,

(SELECT SUM(end_time-start_time) AS total_time FROM `event` ei WHERE
event_type=4 AND ei.calling_party = eo.calling_party) AS
total_talking_time,

(SELECT SUM(end_time-start_time) AS total_time FROM `event` ei WHERE
event_type=7 AND ei.calling_party = eo.calling_party) AS
total_ringing_time,

(SELECT MAX(end_time-start_time) AS total_time FROM `event` ei WHERE
event_type=4 AND ei.calling_party = eo.calling_party) AS max_talking_time

FROM `event` eo WHERE calling_party LIKE '%(%)'

GROUP BY calling_party

This works, but ends up taking a really long time to run. I figured
that each subquery was getting executed multiple times. I made a quick php
script that returns the same information but uses multiple queries. I
just query all of the "groups" (SELECT DISTINCT calling_party FROM
`event` WHERE calling_party LIKE '%(%)') and then iterate over that
recordset plugging the actual calling_party value into the where clause of the
aggregate query. The output is identical, but this script runs in well
under a second.

So the question is what is the correct way to do this using SQL? Any
help would be really appreciated.

Hi Nate --

The script as written is making 4 different calls to the calling_party table, and we can reduce this to one call to the table by using CASE expressions to only aggregate the data we need for each column. This is pretty much the standard "static cross-tab" technique to summarize data into multiple columns in a single SELECT.

So, all we need to do is re-write your SELECT like this:

select
calling_party,
sum(case when event_type=4 then end_time-start_time else 0 end) as total_talking_time,
sum(case when event_type=7 then end_time-start_time else 0 end) as total_ringing_time,
max(case when event_type=4 then end_time-start_time else 0 end) as max_talking_time
from
event
where
calling_party LIKE '%(%)'
group by
calling_party

That will return the same results as what you had written, only it should be much more efficient. It is also much shorter. Notice that the CASE expressions ensure that only the data with the specified event_type is including in each aggregate calculation, otherwise we aggregate a value of 0 which has no affect on the results. You can also use a default value of NULL instead of 0 if you'd like, depending on your desired results.

If you have more data in that table for event_types other than 4 and 7, and you only want to return calling_party rows that contain data with those event_types, then be sure to add

and event_type in (4,7)

to your WHERE clause, which will increase the efficiency even further. However, adding that will return different results from your original SELECT (it returns all calling_groups regardless of event_type) so be sure that what you do meets the specifications you require.

As always, be sure that you have proper indexes on all of your tables, and in this case it seems that event_type should certainly be indexed.

see also:

Legacy Comments


Perfect! 2008-01-09	re: Rewriting correlated sub-queries with CASE expressions Thanks! That is EXACTLY what I was looking for... thanks for the explanation, it was spot on as always.

Joe 2008-02-29	re: Rewriting correlated sub-queries with CASE expressions That just saved me hours of work ... THANKS!!!!!

otis 2010-06-26	re: Rewriting correlated sub-queries with CASE expressions Hi, Is it possible to apply the same to this sample query? select distinct a.varchar_Field1 ,a.varchar_Field2 ,a.varchar_Field3 ,a.varchar_Field4 ,a.date_field1 ,(select sum(b.number_Field1) from TABLE1 b where b.varchar_Field1 = a.varchar_Field1 and b.varchar_Field2 = a.varchar_Field2 and b.varchar_Field3 = a.varchar_Field3 and b.varchar_Field4 = a.varchar_Field4 and b.date_field1 <= a.date_field1 ) as Total from TABLE1 a The query should give me a running total for a given "varchar_Field4"