Jeff's SQL Server Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server
posts - 157, comments - 2686, trackbacks - 64

My Links

Advertisement

News

Welcome to my weblog. My name is Jeff Smith, I am software developer in Boston, MA and I was recently named a 2009 SQL Server MVP. Check in frequently for tips, tricks, commentary and ideas on SQL Server and .NET programming.


Subscribe





Archives

Post Categories

Programming

Sports

SQL

SQL GROUP BY techniques

One aspect of the versatile SELECT statement that seems to confuse many people is the GROUP BY clause.  It is very important to group your rows in the proper place.  Always push GROUP BY aggregations as far into your nested SELECT statements as possible – if you have a situation in which you are grouping by long lists of columns that are not part of primary keys, you are probably have not structured your query correctly.

Here's a classic example that returns total sales per customer, in addition to returning the customer's name and address:

SELECT
  C.CustomerID, C.CustomerName, 
  C.CustomerType, C.Address1, C.City,
  C.State, SUM(S.Sales) as TotalSales
FROM
  Customers C
INNER JOIN Sales S
  ON C.CustomerID = S.CustomerID
GROUP BY
  C.CustomerID, C.CustomerName,
  C.CustomerType, C.Address1, C.City, C.State

I can’t say how many times I see SELECT's written this way and it is simply wrong. You should only be grouping on CustomerID, and not on all those other columns.  Push the grouping down a level, into a derived table:

SELECT
  C.CustomerID, C.CustomerName,
  C.CustomerType, C.Address1, C.City,
  C.State, S.TotalSales
FROM
  Customers C
INNER JOIN
  (SELECT
     CustomerID, SUM(Sales) as TotalSales
   FROM
     Sales
   GROUP BY
     CustomerID) S
ON
  C.CustomerID = S.CustomerID

Hopefully, you will agree that it is a much cleaner SELECT statement now, it is more efficient and it logically makes more sense when you look at it. 

One more common mistake is that people just mimic the expressions in their SELECT list in the GROUP BY clause, without thinking logically about what grouping is really necessary.  For example:

SELECT LastName + ', ' + FirstName, ... etc ...
FROM Names
GROUP BY LastName + ', ' + FirstName

Again, that is logically wrong and also less efficient.  You should not be grouping by the expression itself; you should be grouping by what is needed to evaluate that expression. The correct grouping is:

GROUP BY LastName, FirstName

Too many people just keep stuffing column names and expressions into the GROUP BY clause until the errors go away without stepping back and thinking logically about how grouping works.  Take a minute to really consider what you need to return and how it should be grouped, and try using derived tables more frequently when writing aggregate queries to help keep them structured and efficient.

see also:

Print | posted on Wednesday, December 14, 2005 11:53 AM | Filed Under [ T-SQL GROUP BY ]

Feedback

Gravatar

# Amen brother!

Amen brother! I too see a lot of code where people have just stuffed whatever it takes into the GROUP BY to make the errors stop. Another suspicious place is the ORDER BY clause. And yet another common no-no I see is code using DISTINCT where it's not needed, or where GROUP BY is needed instead or vice versa.
12/14/2005 9:04 PM | Xaprb
Gravatar

# re: Proper GROUP BY usage

Of course if there is a need for totals at customer type/State/city then a different approach is needed. :)
12/15/2005 3:55 PM | Joker
Gravatar

# re: Proper GROUP BY usage

>>Of course if there is a need for totals at customer type/State/city then a different approach is needed.

Close ... A different SELECT is needed, but not a different approach. The approach stays the same.

Especially CustomerType is a foreign key to the table that contains the type description or any other info you might want to include in the results.
12/16/2005 10:50 AM | Jeff S
Gravatar

# re: Proper GROUP BY usage

Watch out with that advice about expressions. The following two queries are _NOT_ logically equivalent:


SELECT FirstName + ',' + LastName
FROM
(
SELECT 'F1', 'F2'
UNION ALL
SELECT NULL, 'F3'
UNION ALL
SELECT NULL, NULL
UNION ALL
SELECT 'F4', 'F5'
) x (FirstName, LastName)
GROUP BY FirstName, LastName

SELECT FirstName + ',' + LastName
FROM
(
SELECT 'F1', 'F2'
UNION ALL
SELECT NULL, 'F3'
UNION ALL
SELECT NULL, NULL
UNION ALL
SELECT 'F4', 'F5'
) x (FirstName, LastName)
GROUP BY FirstName + ',' + LastName

The right answer will depend on your business requirements.
12/18/2005 10:45 PM | Adam Machanic
Gravatar

# Proper use of SQL Group By

12/19/2005 10:27 PM | Derek White's Blog
Gravatar

# re: Proper GROUP BY usage

Thanks, Adam. you're right in this particular case if either of those columns allows for Nulls and the SELECT isn't written to handle them. The advice remains the same, however -- I still very much stand by my statements and it still is very important to group by what you need to calculate your expression, not the expression itself, where appropriate.

If you consider the NULL possibility, and if we are assuming that we are writing "good" SELECT's, then the original might be:

SELECT COALESCE(LastName, 'N/A') + ', ' + COALESCE(FirstName,'N/A')
FROM...
GROUP BY COALESCE(LastName, 'N/A') + ', ' + COALESCE(FirstName,'N/A')

or something like that; the key is if nulls are there, you would handle them. And, ironically, this is an even *better* example than the one I posted, because of course the proper grouping remains the same:

GROUP BY lastname, firstname

12/20/2005 1:19 PM | Jeff
Gravatar

# re: GROUP BY techniques

>> I can’t say how many times I see SELECT's written this way and it is simply wrong.

I can't agree. If the other fields are attributes of a customer, then as long as the CustomerID is listed (preferably first) then there isn't a problem. Now, I am not saying its better or worse or faster or slower or whatever. But from a relational data standpoint, one thing that it isn't is 'wrong'.

>> Hopefully, you will agree that it is a much cleaner SELECT statement now, it is more efficient and it logically makes more sense when you look at it.

Cleaner - debatable. Logical - debatable. But lets focus on performance. How sure are you that the new structure is more efficient? Have you tested on large sets of data?
12/28/2005 5:52 PM | Matthew
Gravatar

# re: GROUP BY techniques

Thanks, Mathew, for the feeback.

When writing any SELECT, you have to ask yourself "What am I returning?" In this case, it to return one row per Customer, along with their total sales. What defines a customer? In this case, the PK of the customer table -- CustomerID. Therefore, the ONLY grouping you should be doing ANYWHERE is grouping by customerID. Why would you join every row in the entire transaction table to the customer table and then group those rows by customer name, city, state, etc, when you can simply group the transaction table by one of its own (indexed) columns? Which do you feel is more efficient?

As for more logically, convert each into an English sentence:

Query 1: Join the customer table with the transaction table, and then return total sales for each combination of cusomterID, customer name, address, city, and state.

Query 2: Return customer ID, name, address, city and state from the customer table, and also join to a subquery that returns total sales for each customer.

Which statement makes more logical sense to you?

12/28/2005 6:37 PM | Jeff
Gravatar

# re: GROUP BY techniques

>> Which statement makes more logical sense to you?

You must understand I was not saying which is better or worse. I was simply saying that possibly 'wrong' is an inappropriate term to use.

You stated:

>> I can’t say how many times I see SELECT's written this way and it is simply wrong. You should only be grouping on CustomerID, and not on all those other columns.

Your second sentence is possibly true. But I don't think the former is. From a relational standpoint, it is perfectly valid to GROUP BY fields that you know are attributes of an entity whose identifier is also included in the GROUP BY. The output of the query will be stable and known - it will work, and you can rely on the fact that it will work. Thus the word 'wrong' seems an odd choice.

Maybe I am just nitpicking.
12/28/2005 7:55 PM | Matthew
Gravatar

# re: GROUP BY techniques

>> Which do you feel is more efficient?

I'd be interested in whether one was quicker or more efficient than the other, but what I 'feel' is completely irrelevant to the issue at hand.

In performance optimisation you sometimes find that what you think is a bottleneck often is not, and doing 'obvious' optimisations may slows things down.

My quick testing at my end shows the two queries are performing very similar. But I'd be interested to see an instance where that wasn't the case.
12/28/2005 7:59 PM | Matthew
Gravatar

# re: GROUP BY techniques

Thanks, Matthew. My use of the word "wrong" is definitely referring to style, not that the results do not come out correctly; i.e., it is also "wrong" to ever use RIGHT OUTER JOINs, but technically using them returns valid results, since it is (unfortunately) a valid join type and equivalent to a backwards LEFT OUTER JOIN.

As for your note about optimizations, my post has nothing to do with opmizations at all -- it has to do with writing clearly structured SELECT statements that will perform better or (the worst case scenerio) the same as a sloppy one. Depends on the optimizer. The SQL Server 2005 optimizer essentially rewrites the first ("wrong") SELECT as the second ("correct") one anyway when evaluating (look very closely at the execution plan -- it is smart enough to know that it should not be grouping on all those columns) so they come out exactly the same. But to rely on the optimizer to clean up your code for you is not something I would recommend when you can easily write it "correctly" in the first place without any effort other than really understanding how to write a well structured SELECT.
12/28/2005 9:03 PM | Jeff
Gravatar

# re: GROUP BY techniques

>> As for your note about optimizations, my post has nothing to do with opmizations at all

Apologies for the misunderstanding - I read your comment about it being more efficient as being related to performance.
12/29/2005 6:43 PM | Matthew
Gravatar

# re: GROUP BY techniques

Hey Matthew -- no need to apologize at all. I also did not word my post too well, there was lots of room left for interpretation. You raise some great points and thanks again for your feedback!

- Jeff
1/12/2006 12:53 PM | Jeff S
Gravatar

# Some Simple SQL Rules to Live By

3/14/2006 12:57 PM | Welcome to Jeff's Blog
Gravatar

# re: GROUP BY techniques

Well, this post is a bit old, but I figured I would reply with my findings...

Jeff, the solution works! I had a query that was developed by another developer where they were grouping by the entire select criteria. The query was joining between two tables and only two columns from the second table were used in aggragate functions, however, they were still grouping by the entire select criteria.

anyway, I used to you logic to move the group by on the two columns to the join and it WORKED.!! Rather than trying to group by the entire select statement, I was only grouping by the two columns involved in the aggragate function.

With regards to performance, the initial query took about 14 mins to return results (8035 rows), because it had to re-order the results based on the group by on the entire select. When I used your query, the same amount of results were returned, however, this time it only took 30 seconds..

I was pleasantly surprised of the performance increase... Thank you..
5/5/2006 1:17 PM | Michael
Gravatar

# re: GROUP BY techniques

I have tried the same thing on a large chunk of data between tables, where grouping by select list took 2:49 to return data and grouping in th join on key field took 2:48, no great saving?

Any ideas?
5/17/2006 8:45 AM | Baz
Gravatar

# re: GROUP BY techniques

This is a very useful post.

I have run some tests on some sizable datasets and found some slight performance improvement, but perhaps more importantly the improvement in code readability and justifies the technique.
6/8/2006 3:02 AM | Trevor
Gravatar

# re: SQL GROUP BY techniques

After reading your article I just used this approach on a query and my performance declined. Can you offer some suggestions as to when each approach is better or worse?

I am using SQL Server 2000 in this case and I had to join to the sub-query with the aggregate on two fields, not one as in your example. Could that be the difference?
4/20/2007 6:25 PM | Zims
Gravatar

# re: SQL GROUP BY techniques

Interesting, but it is impossible to determine for sure what is happening without an example. Performance should never decline. Can you post your SQL and the relevant tables' DDL ?
4/20/2007 10:14 PM | Jeff
Gravatar

# re: SQL GROUP BY techniques

Nevermind, I think the problem is that there was no index that met the needs of my newest query. Once I added it, sure enough, the one with the derived table with aggregate like you demonstrated above is faster. When I went to check the indexes there were 11 besides the PK index, and I had to add yet another. As I understand it, there is a trade off between the work required to maintain indexes and the gains from quicker data access. I clearly need to look more into this.

Thanks for the great article. I will definitely be using that technique. I hated putting all those fields in GROUP BY clauses and it always seemed clunky to me. Now I know why!
4/22/2007 11:38 PM | Zims
Gravatar

# re: SQL GROUP BY techniques

Hi Jeff, I've used your SQL structure to pull out a similar set of results with two tables. I have the aggregate function inside the sub query but ms access returns the error SYNTAX ERROR IN JOIN OPERATION. What have I done wrong here?

SELECT *
FROM tblPurchaseOrderHeader
INNER JOIN
(SELECT tblPurchaseOrderDetail.HeaderID, SUM(tblPurchaseOrderDetail.TotalAmount + tblPurchaseOrderDetail.TaxAmount) as TotalPO
FROM
tblPurchaseOrderDetail
GROUP BY
tblPurchaseOrderDetail.HeaderID)
ON
tblPurchaseOrderHeader.HeaderID = tblPurchaseOrderDetail.HeaderID;
4/30/2007 1:25 AM | Samuel
Gravatar

# re: SQL GROUP BY techniques

Samuel -- keep in mind this is a SQL Server weblog, many things written here may not work exactly for MS Access. Anyway, for this example, you did not give the derived table an alias. Also, you cannot reference any of the tables within the derived table, you can only reference the output via the alias as well.

Note the changes below to your code where the alias X is used (twice, towards the end):

SELECT *
FROM tblPurchaseOrderHeader
INNER JOIN
(SELECT tblPurchaseOrderDetail.HeaderID, SUM(tblPurchaseOrderDetail.TotalAmount + tblPurchaseOrderDetail.TaxAmount) as TotalPO
FROM
tblPurchaseOrderDetail
GROUP BY
tblPurchaseOrderDetail.HeaderID) X
ON
tblPurchaseOrderHeader.HeaderID = X.HeaderID;
4/30/2007 8:34 AM | Jeff
Gravatar

# re: SQL GROUP BY techniques

hi Jeff

This is such a Nice article that i have falling short of words to appreciate you. The Example you have provided can give
the reader a clear insight of group by caluse. Keep writing such articles and serving the world .

Add few more examples and also write some thing on Distinct Clause as well


Zubair
5/3/2007 6:25 AM | zubair Masoodi
Gravatar

# re: SQL GROUP BY techniques

Hi Jeff

Just some musings on the above...

Another compelling reason for your view point - not repeating the columns in two clauses makes code more maintainable and less likely to introduce a bug (for example if a column is removed from the select clause but the developer forgets to adjust the group by).

Having said that I don't totally buy it. You are making two passes at the customer table instead of one. I quickly adapted your code for AdventureWorks. Scan count and logical reads are up for your version. The difference may even be artificially low as the second scan is of a nonclustered index for the example I used.

I am happy to concede that your method is more easily maintained and less likely to catch people out if there is a deliberate grouping on a column that is not selected. I guess based on that I have to concede that it is more readable.

Based on a very elementary example though the execution plan is less efficient. Obviously there would have to be more tests to firm that up to anything more conclusive. Do you have figures to the contrary?

USE AdventureWorks
go

SET STATISTICS IO ON
GO
--SET SHOWPLAN_TEXT ON
--GO

SELECT
C.CustomerID, C.CustomerType, SUM(S.SubTotal) as TotalSales
FROM
AdventureWorks.Sales.Customer AS C
INNER JOIN AdventureWorks.Sales.SalesOrderHeader AS S
ON C.CustomerID = S.CustomerID
GROUP BY
C.CustomerID, C.CustomerType

SELECT
S.CustomerID, C.CustomerType, S.TotalSales
FROM
AdventureWorks.Sales.Customer AS C
INNER JOIN
(SELECT C.CustomerID
, SUM(S.SubTotal) as TotalSales
FROM AdventureWorks.Sales.Customer AS C
INNER JOIN
AdventureWorks.Sales.SalesOrderHeader AS S
ON C.CustomerID = S.CustomerID
GROUP BY C.CustomerID) AS S
ON S.CustomerID = C.CustomerID
GO

--SET SHOWPLAN_TEXT OFF
--GO
SET STATISTICS IO OFF
5/4/2007 8:50 AM | pootle_flump
Gravatar

# re: SQL GROUP BY techniques

pootle_flump -- great feedback, thanks for testing, but your second SELECT is incorrect; there is no need to include the customer table in the derived table, you just group it by customerID. that's kind of the main point of the article -- group the transaction tables by the key columns w/o joining to the entity tables until the last step (at the outer select).

Your second SELECT should read:

SELECT
S.CustomerID, C.CustomerType, S.TotalSales
FROM
AdventureWorks.Sales.Customer AS C
INNER JOIN
(SELECT CustomerID
, SUM(SubTotal) as TotalSales
FROM
AdventureWorks.Sales.SalesOrderHeader
GROUP BY CustomerID) AS S
ON S.CustomerID = C.CustomerID

Also, you are only returning CustomerType, not a whole bunch of columns related to the Customer, so it's not a great example.

From what I've tested, it appears that SQL 2005 uses the same execution plan for both techniques, since if you look at it carefully, it essentially re-writes your SQL to work the same way as pushing the grouping into a derived table. (this has been discussed in the comments).

I believe that with SQL 2000, however, it is indeed a performance improvement to do your grouping "properly".
5/4/2007 9:07 AM | Jeff
Gravatar

# re: SQL GROUP BY techniques

Using your basic example, let's do a join to the Address table and include address and city in our results, and compare the two techniques:

USE AdventureWorks
go

SET STATISTICS IO ON
GO
--SET SHOWPLAN_TEXT ON
--GO

SELECT
C.CustomerID, C.CustomerType, AD.AddressLine1, AD.AddressLine2, AD.City,
SUM(S.SubTotal) as TotalSales
FROM
AdventureWorks.Sales.Customer AS C
INNER JOIN AdventureWorks.Sales.SalesOrderHeader AS S
ON C.CustomerID = S.CustomerID
INNER JOIN Sales.CustomerAddress AS CA
ON C.CustomerID = CA.CustomerID and CA.AddressTypeID = 3
INNER JOIN Person.Address AD
ON CA.AddressID = AD.AddressID
GROUP BY
C.CustomerID, C.CustomerType, AD.AddressLine1, AD.AddressLine2, AD.City


--


SELECT
S.CustomerID, C.CustomerType, S.TotalSales, AD.AddressLine1, AD.AddressLine2, AD.City
FROM
AdventureWorks.Sales.Customer AS C
INNER JOIN
(SELECT CustomerID
, SUM(SubTotal) as TotalSales
FROM
AdventureWorks.Sales.SalesOrderHeader
GROUP BY CustomerID) AS S
ON S.CustomerID = C.CustomerID
INNER JOIN Sales.CustomerAddress AS CA
ON C.CustomerID = CA.CustomerID and CA.AddressTypeID = 3
INNER JOIN Person.Address AD
ON CA.AddressID = AD.AddressID
GO

--SET SHOWPLAN_TEXT OFF
--GO
SET STATISTICS IO OFF


In this case, the second is more efficient. Not by a ton, but it is, and it also has the benefit of being more maintainable, shorter, and more logical.
5/4/2007 9:14 AM | Jeff
Gravatar

# re: SQL GROUP BY techniques

>> your second SELECT is incorrect; there is no need to include the customer table in the derived table,
Dear me Jeff - and there I was determined not to post a dumb comment :-/ Thanks for going easy on me ;-)

Yeah - I knew the query didn't pull in much in the way of other attributes (first time I have delved into AdventureWorks - very odd design that will take a bit of looking around in).

I will now bear the advice in mind for the future.

Thanks
5/4/2007 9:51 AM | pootle_flump
Gravatar

# re: SQL GROUP BY techniques

Thanks Jeff, I'm still a newbie to SQL.
6/5/2007 10:04 PM | Samuel
Gravatar

# re: SQL GROUP BY techniques

Hello Jefj,
I am reading this trace on Group By, and it is very revealing.
In my case I have a table map_medical_claims with claims identified by claim_id.
Claims may have a number of items identified by:
cardholder_id,first_dt_of_service,last_dt_of_service, diagnosis_cd_1,diagnosis_cd_2,diagnosis_cd_3,diagnosis_cd_4,
procedure_cd_1,procedure_cd_2,procedure_cd_3,procedure_cd_4.
Requirements demand identifier for each item in a separate column. This table has about 550,000 rows. I tried to create sql filling an item identifier, but failed. So I went with cursor, but this takes a lots of time, since I had to use two nested cursors.
Is there any way to do it in sql?

The answer willl be greatly appreciated.
Vladimir
6/20/2007 10:12 AM | Vladimir
Gravatar

# re: SQL GROUP BY techniques

Thanks for the examples. I tried them both, and for my data, the first query performed better than the second query: 70ms vs 7000ms
7/16/2007 10:04 AM | Phil
Gravatar

# re: SQL GROUP BY techniques

Phil -- Wow, that seems like quite a difference ... can you please post the two SQL statements that you executed, along with the relevant DDL? I'm very curious about your results. No offense, but I suspect that you did something wrong or not optimally when you re-wrote your query .... For example, look at poodle_flump's posts, above -- he initially had the same result as you, but that was because he did not re-write the query efficiently.

Thanks in advance.

- jeff
7/16/2007 10:24 AM | Jeff
Gravatar

# re: SQL GROUP BY techniques

Hi , Jeff

Thanks very useful
8/6/2007 3:01 PM | Genady
Gravatar

# re: SQL GROUP BY techniques

Um... I would never introduce a subquery if it weren't absolutely required, or unless performance really didn't matter.
I'd test the performance both ways, then choose the faster way.

I am always right.
8/6/2007 9:03 PM | Gandy Dancer
Gravatar

# re: SQL GROUP BY techniques

It's *always* dangerous to predict which query is faster. Let's look at your first example:

SELECT
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City,
C.State, S.TotalSales
FROM
Customers C
INNER JOIN
(SELECT
CustomerID, SUM(Sales) as TotalSales
FROM
Sales
GROUP BY
CustomerID) S
ON
C.CustomerID = S.CustomerID


This one is clearly faster than


SELECT
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City,
C.State, SUM(S.Sales) as TotalSales
FROM
Customers C
INNER JOIN Sales S
ON C.CustomerID = S.CustomerID /* Bug fixed */
GROUP BY
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City, C.State


or isn't it?

What happens if Sales.CustomerID is NULL for 99.99% of all records (sold to anonymous customers in the shop), so only a few Sales records can actually be joined to Customers? The "correct" statements aggregates first, then does the join. Doing that is arguably slower than the other way round.

BTW: my prefered way to avoid long group by clauses, arguably just as ugly, is using "max" or "min", as in

SELECT
C.CustomerID, max(C.CustomerName),
max(C.CustomerType), max(C.Address1), max(C.City),
max(C.State), SUM(S.Sales) as TotalSales
FROM
Customers C
INNER JOIN Sales S
ON C.CustomerID = S.CustomerID /* Bug fixed */
GROUP BY
C.CustomerID


8/7/2007 3:41 PM | ammoQ
Gravatar

# re: SQL GROUP BY techniques

Hi AmmoQ -- absolutely correct, always go with the most efficient method unless the efficiency gained is minimal and comes as the expense of clean code.

Of course, if you have NULL customer values in your Order table, and only want to deal with orders that have customers, you could simply exclude NULL values from your aggregate query.

Also -- thanks for pointing out that bug! This post has been up for *years* and no one ever noticed it! Well done!
8/8/2007 8:32 AM | Jeff
Gravatar

# re: SQL GROUP BY techniques

I have a little problem here, I hope somebody could help me out. I have a table say 'sample' with two columns, say 'A'(varchar2) and 'B'(date).


eg:

A B
1 12 july 2007
1 12 aug 2007
1 13 aug 2007
1 14 aug 2007
2 17 aug 2007
2 18 aug 2007

Here the max date for 1 is 14 aug 2007. 14 aug 2007 - 6 = 8 aug 2007.

I need to pick up the record which has the minimum date AFTER 8 aug 2007.
A B
1 12 aug 2007

(not 12 July 2007 which is the actual minimum date)

Can anybody help me with the query?

Any suggestions would be appreciated.

Thanks.
mkc

9/10/2007 5:14 AM | mkc
Gravatar

# re: SQL GROUP BY techniques

That was englighting! And the comments above as well. Thanks.

I have just one more comment to the second example. It is not only the NULL that can make the two queries have different meaning. Consider this:

create table a( a varchar(1), b varchar(1));
insert into a values ('', 'a');
insert into a values ('a', '');

ymogen2=# select a || b from a group by a, b;
?column?
----------
a
a
(2 rows)

ymogen2=# select a || b from a group by a || b;
?column?
----------
a
(1 row)
9/21/2007 12:06 PM | Zbigniew Lukasiak
Gravatar

# re: SQL GROUP BY techniques

Ah - yeah that was on PostgreSQL - '||' means concatenation.
9/21/2007 12:09 PM | Zbigniew Lukasiak
Gravatar

# re: SQL GROUP BY techniques

Hi Zbigniew -- thanks for your comment, that was actually raised and addressed in one of the very first comments (from Adam), just scroll up a bit to see it.

If you have nulls in A or B, and you need A+B to return NULL, then absolutely, don't group on A,B.
9/21/2007 1:07 PM | Jeff
Gravatar

# re: SQL GROUP BY techniques

Not exactly - the columns don't contain NULL - they contain '' - and if you think that '' is similar to NULL - then I could make it ('a', 'ab') and ('aa', 'b'). My point is that you don't need NULL's to have the behaviour that Adam described.
9/22/2007 8:31 AM | Zbigniew Lukasiak
Gravatar

# re: SQL GROUP BY techniques

Hi Zbigniew -- ah, sorry, read to quickly. Yes, you are correct. Again, the fact remains: group on what you need to evaluate the expression, not the expression itself, exactly as I wrote .... If you are just concatenating two columns, or it doesn't logically work for the expression you have, then you cannot group on them separately for the reason you mentioned, absolutely.

9/22/2007 8:35 AM | Jeff
Gravatar

# re: SQL GROUP BY techniques

Hi,

Its helped me a lot for my project success.
11/4/2007 1:23 PM | Chandra Mohan Naidu Yerrabothula
Gravatar

# re: SQL GROUP BY techniques

I found your post very interesting - and despite it is old, I still will comment it.

When considering simple queries, I like this approach, but when it comes to some more complex queries - it makes only extra work.
For example: View "viewCars" with several columns including "expDate" column, for which I should make restriction and then GROUP BY two columns "CarID" and "ExpType" and display other columns too.

This would be query with "bad" style. But when I tried to rewrite it - it become more complex and still there was big problems to construct query with "good" style at all.
Probably I am wrong... But up to date it is my opinion.
SELECT
CarID,
CarName,
CompanyID,
CompanyName,
ExpType,
SUM(Expenses) AS Summa
FROM
(SELECT
CarID
CarName
CompanyID
CompanyName
ExpType
Expenses
FROM viewCars
WHERE
expDate BETWEEN '2008-07-1' AND '2008-07-31'
) subViewCars
GROUP BY CarID, CarName, CompanyID, CompanyName, ExpType
9/11/2008 9:46 AM | Juris
Gravatar

# re: SQL GROUP BY techniques

Juris -- no, you did not rewrite that correctly. And you shouldn't query a view like that -- it will not be as efficient as querying the underlying tables of that view properly.

You should be writing something like:

select c.* (all columns), e.totalExpenses
from cars c
left outer join
(select carID, sum(expenses) as TotalExpenses
from expenses
where expdate between .... group by carID
) e on c.carID = e.carID

Never query a View that does joins that aren't what you need for your SELECT. Or,
9/11/2008 10:48 AM | jeff
Gravatar

# re: SQL GROUP BY techniques

Thanks for your answer!
Your answer is completely right for my question :). I have constructed analogue of my real query (which has fields not in English), but I have not described my problem correctly.

My idea in general was: I have already views on which are simpler to create bad style queries and get what I need instead of to create new ones (views) with correct approach. To add GROUP BY clause to existing view using correct approach particular is impossible (as far as I understand) because I do not need all rows from existing view, I need already grouped rows.

When creating non-existing view - of course I would use your correct approach.
The view which I described (viewCars) is already built on several tables:
Company
Car (should be grouped)
Car specification
Car-Company-Usage
Car user
Expenses (should be grouped)
Expenses detailed
Suppliers
And I have two choices - construct bad GROUP BY query with already given view or create new view, which is based on the same tables and with correct GROUP BY clause.
Probably the answer, which choice is better is not so simple.

But thank you one more time - your post is very usefull.
9/16/2008 11:15 AM | Juris
Gravatar

# re: SQL GROUP BY techniques

Does t-sql have something similar to inline groupings that oracle provides, similar to partition by clause?
9/18/2008 7:51 AM | TampaDev
Gravatar

# re: SQL GROUP BY techniques

Thanks Jeff. I have to do a bit of sql-maintanance and my sql-skills are rusty. I found myself stuffing extra fields in the group-by expression, and I felt I should probably gain a better understanding of group-by before proceeding. Your article was exactly what I needed (and luckily it was the 2nd result of googling "group by").
11/12/2008 8:27 AM | Sander
Gravatar

# re: SQL GROUP BY techniques

Hi;
I have a quick question for you.
following is my Query:
"
SELECT Sum(Amount) As vSum, Count(Amount) As vCount, ZoneCode
Into TmpRec
FROM GReciept INNER JOIN TmpReciept ON GReciept.RowNo = TmpReciept.RowNo
Group By left(ZoneCode,1) with rollup
"
"ZoneCode" is a two chars. (for example '00', '01', '02', ..., '10', '11', '12', ..., '20', '21'....), and I need group them by first char. I tried above Query, but it doesn't work.

I appreciate for any help.
N.Zargari
11/18/2008 7:01 AM | N.Zargari
Gravatar

# re: SQL GROUP BY techniques

Hi Jeff;
Sorry my email address is nzargari@yahoo.com.
Thanks.
11/19/2008 12:23 AM | N.Zargari
Gravatar

# re: SQL GROUP BY techniques

N.Zargari --

You cannot select the entire ZoneCode and group only on the first character.

see:

http://weblogs.sqlteam.com/jeffs/jeffs/archive/2007/07/20/60261.aspx

11/19/2008 3:08 PM | jeff
Gravatar

# re: SQL GROUP BY techniques

Thank you Jeff;
I fixed it.
Thanks again for your help.
N.Zargari
11/22/2008 4:11 AM | N.Zargari
Gravatar

# re: SQL GROUP BY techniques

Hi Jeff;
I got another Question for you.
I use sql server 2000 and have a table with 61085 Recs. and 68 chars. in every Rec.
I use " Select * from TABLENAME " query and it takes 33 secs. to show the result.
How can I reduce this time?(I need all fields).

Tanks.
N.Zargari.
11/22/2008 7:15 AM | N.Zargari
Gravatar

# re: SQL GROUP BY techniques

Hi sir,
I have a question in sql. Is it possible to have a 1 sql using group by to produce this result?
Col1 Col2 Col3
AAAA 1111 1111
AAAA 1111 1112
AAAA 1111 1113
BBBB 1111 1111
BBBB 1111 1112

to produce like this:
Col1 Col2 Col3 CountofCol3
AAAA 1111 1111 3
AAAA 1111 1112 3
AAAA 1111 1113 3
BBBB 1111 1111 2
BBBB 1111 1112 2

I am using this sql statement to produce this result below:
select a.col1, a.col2, a.col3, b.CountofCol3
from table a
left join (select col1, col2, count(*) CountofCol3 from table group by col1, col2) b
on a.col1=b.col1 and a.col2=b.col2

is there any sql statement that would be shortened my sql statement?

thank
Cyril
12/14/2008 11:04 PM | cyril
Gravatar

# re: SQL GROUP BY techniques

Hi Jeff. I have to second the comment made earlier that including more than absolutly necessary in the GROUP BY clause is not wrong. For the example you gave, setting out the SQL as you suggest is easy and makes nice, logical code. And as has been pointed out, mindlessly stuffing everything into the GROUP BY can be problematic and is sloppy programming technique.

However, if you are joining to multiple tables and calling out aggregate functions on most of the fields your approach could quickly produce rather unwieldy code. In these cases, a few extra fields in the grouping may be the best approach. For example, consider rewriting this fairly simple statement into your approach:

SELECT
ttbSFHistoryTEMP.rundate, tGoal.SMRevenueGoal,
ttbSFHistoryTEMP.OfficeID, sum(ttbSFHistoryTEMP.revenue) as TotRevenue,
sum(ttbSFHistoryTEMP.GP) as TotGP, sum(ttbSFHistoryTEMP.booked) as TotBooked
FROM
ttbSFHistoryTEMP
LEFT JOIN
(SELECT
tblOffice.OfficeID, tblSalesGoals.DivisionID, tblSalesGoals.SMRevenueGoal
FROM
tblSalesGoals
JOIN
tblOffice
ON
tblSalesGoals.DivisionID = tblOffice.OfficeLocationID
WHERE tblSalesGoals.SGType = 1 and tblSalesGoals.SGName = DATEPART(year,getdate())-1) tGoal
ON
ttbSFHistoryTEMP.OfficeID = tGoal.OfficeID
GROUP BY
ttbSFHistoryTEMP.OfficeID, ttbSFHistoryTEMP.rundate, tGoal.SMRevenueGoal


If I understand your approach, in order to get rid of my two extra GROUP BY fields I would need to add another temp query for my groupings and include this with another join. OK, so not too bad for this particular example...but consider as the base statement becomes more complex. Say I need to bring in aggregated results for 12 fields in 4 tables along with 2 or 3 additional identification fields that really don't need to be in the GROUP BY. I don't believe that the resulting code would be more structured or in any way more easily understood by adding the temp tables and extra joins required to GROUP BY only the single attributable field.
2/11/2009 8:27 PM | Michael
Gravatar

# re: SQL GROUP BY techniques

Thanks! This is exactly what I was looking for.
4/10/2009 3:03 PM | sqlfun
Gravatar

# re: SQL GROUP BY techniques

Hello,

I need a query to solve the below problem. The table rows are given below

SerialNo Col2Group
--------- ------------
1 123
2 123
3 123
4 456
5 456
6 456
7 123
8 123
9 789
10 789
11 456


I want the output as

---------- --------------
1 to 3 123
4 to 6 456
7 to 8 123
9 to 10 789
11 to 11 456


Can anybody send me the query to solve the above problem. My mail ID is saravana.kumar@rsystems.com or skumarv@hotmail.com. Thanks in advance.

Regards,
Saravanakumar
5/7/2009 12:51 AM | Saravanakumar V
Gravatar

# re: SQL GROUP BY techniques

Many thanks for this, very useful,

However I am confused about how this works with regard to joining more than two tables. I have simple code:

Select recipes.REC_name, SUM(ingredients.sup_ID*ingandrec.quantity)
FROM ingredients
INNER JOIN ingandrec
ON ingredients.I_id = ingandrec.i_id JOIN recipes
ON recipes.REC_ID = ingandrec.rec_id
WHERE recipes.REC_ID=1
GROUP BY recipes.Rec_ID,recipes.REC_name

which Pulls the name of the recipe from the recipes table then works out the total cost based on the ingredients in the recipe (data stored on the Ingredients table), by way of the ingandrec table which tells me each ingredient and how much of each ingredient goes in each recipe.

Putting aside my horrendously named tables and my lack of aliases I cannot figure out how to use your approach to remove the extra group by columns (only one at the minute, but it is the principle). Every time I try to reduce it down I get error messages. Could you explain?
5/14/2009 3:08 PM | Graeme
Gravatar

# re: SQL GROUP BY techniques

I am a teacher writing a program for my school to keep track of absences and latenesses. The SQL below finds the students who have been absent at least n times between specified dates. I have four questions.

1) WHY am I required to GROUP BY so many fields when logically I only want to group by tblAbsence.StudentID???

2) Some people say that it is valid to use an alias in an ORDER BY clause but it always gives me a "parameter needed" error. Can an alias be used in an ORDER BY clause???

3) Could you show me the correct way to write my query so that I only need to GROUP BY tblAbsence.StudentID???

4) Should I put "AbsenceDate BETWEEN StartDate AND EndDate" in parentheses???

I read your information about "grouping by everything until the errors go away" and I'm sure that I'm doing that but I'm still unsure as to how to write my query correctly.

Thank you very much.

John Jaros

SELECT tblAbsence.StudentID AS StudentID,
tblStudentInfo.StudentLastName AS LastName,
tblStudentInfo.StudentFirstName AS FirstName,
tblStudentInfo.StudentMiddleName AS MiddleName,
tblStudentInfo.StudentClass AS Class,
COUNT (tblAbsence.StudentID) AS NumTimesAbsent
FROM tblStudentInfo, tblAbsence
WHERE tblStudentInfo.StudentID = tblAbsence.StudentID AND
AbsenceDate BETWEEN StartDate AND EndDate AND
AbsenceType = 'UNEXCUSED'
GROUP BY tblAbsence.StudentID,
tblStudentInfo.StudentLastName,
tblStudentInfo.StudentFirstName,
tblStudentInfo.StudentMiddleName,
tblStudentInfo.StudentClass
HAVING COUNT(tblAbsence.StudentID) >= <n-times>
ORDER BY tblStudentInfo.StudentClass ASC,
tblStudentInfo.StudentLastName ASC
7/13/2009 2:40 AM | John Jaros
Gravatar

# re: SQL GROUP BY techniques

The ways to avoid this error is simple

just look into my post


http://sqlerrormessages.blogspot.com/2009/08/sql-server-error-message-msg-130-cannot.html
8/21/2009 6:20 AM | Syni
Gravatar

# re: SQL GROUP BY techniques

If you want to use a where clause with grouping here is an example

select Position, count(Position) as Total from SpecimenPositions sp inner join
(select BoxId from SpecimenBoxes where BoxName='HUN008B') sb
on sp.Boxid=sb.Boxid group by Position having count(Position)>1
9/7/2009 9:50 PM | Nick
Gravatar

# re: SQL GROUP BY techniques

an example to bind inner join and having clause

select StorageName, ContainerName, sb.BoxName from SpecimenBoxes sb
inner join SpecimenContainers sc on sc.ContainerId= sb.ContainerId
inner join SpecimenStorages ss on sc.StorageId = ss.StorageId
where sb.boxid in
(select boxid from specimens group by boxid having count(boxid)>100 )
order by Storagename, ContainerName, BoxName
9/7/2009 10:00 PM | Sydney Web Designer
Gravatar

# re: SQL GROUP BY techniques

i have one table its having
CUSTOMER_NAME,CUSTOMER_PRODUCT,QUANTITY,DATE
PATEL MILK 2000 2/12/2005
SHAH YOUGURT 20 3/22/2005
PATEL MILK 200 2/12/2004


I WANT OUT PUT LIKE
CUSTOMER_NAME,CUTOMER_PRODUCT,MAX-QUANTITY,MAX-DATE,MIN-QUANTITY,MIM-DATE ,AVERAGE ONLY ONE SINGLE SCAN.
NOT A INDIVIDUAL SCAN
ANY ONE HELP ME??

10/29/2009 11:17 PM | shruti
Gravatar

# re: SQL GROUP BY techniques

Hey Jeff;

Your weary post is still out there doing good! Many thanks for helping me understand how to write better SQL statements. Seeing all those column names in a Group By to appease the PARSER had a bad smell. There had to be a better way.

Bill
11/22/2009 4:04 PM | billb
Gravatar

# re: SQL GROUP BY techniques

Hi ,

Your post helped me overcome a big headache and was very helping in writing cleaner and logical sql group by statements.

keep them coming


cathy
12/2/2009 9:41 AM | cathy
Gravatar

# re: SQL GROUP BY techniques

Sorry for spamming this!! that one had the wrong date info on the where line



SELECT
SUM(Value) AS Value, SUM(Unit) AS Units, DATEPART(month, someDate) AS Month, 'All' as Manufacturer,
DATEPART(year, someDate) AS Year
FROM sometable
inner join sometableloclink on sometableloclink.sometableid = sometable.sometableid
inner join sometable2 on sometable2.sometableid = sometable.sometableid
WHERE (IsNew = 'N') and
AND (DATEPART(year, someDate)=@year AND DATEPART(month, someDate)<=@month or DATEPART(year, someDate)+1=@year AND DATEPART(month, someDate)>@month)
AND sometable2.DealerShipId=@dealershipId
and sometableLocLink.districtId = @districtId
and sometableLocLink.industryId = @industryid
GROUP BY DATEPART(year, someDate), DATEPART(month, someDate) ORDER BY Year ASC, Month ASC
12/3/2009 4:10 PM | angela
Gravatar

# re: SQL GROUP BY techniques

Grrr. I always put the group by in the wrong spot too.
12/20/2009 12:52 PM | Colby Africa
Gravatar

# re: SQL GROUP BY techniques

Thanks Jeff, great post.

clear and nice.

regards
12/29/2009 6:39 AM | Rami Nsouli
Gravatar

# re: SQL GROUP BY techniques

I much prefer this, clean.. easy and can be compiled via sql server however sql likes it to be:

SELECT C.CustomerID, C.CustomerName, C.CustomerType, C.Address1, C.City, C.State, Totals.TotalSales
FROM Customers C
OUTER APPLY
(
SELECT SUM(S.Sales) AS TotalSales
FROM Sales S
WHERE S.CustomerID = C.CustomerID
) AS Totals
1/12/2010 12:18 AM | David Mc
Gravatar

# re: SQL GROUP BY techniques

David Mc -- check your performance on that one; applies and correlated subqueries are almost always less efficient than the equivalent JOIN, assuming that SQL doesn't convert it to a JOIN internally.
1/15/2010 6:50 PM | Jeff S
Gravatar

# re: SQL GROUP BY techniques

Hey Jeff S, Its a LEFT OUTER JOIN according to the execution plan.

SELECT
Cost: 0%
|
Nested Loops
(Left outer Join)
Cost 0%
| |
| Clustered Index Scan (Clustered)
| [Customers].[PK_Customers] [C]
| Cost: 48%
|
Compute Scalar
Cost: 0%
|
Compute Scalar
Cost: 0%
|
Stream Aggregate (Aggregate)
Cost 0%
|
Clustered Index Scan (Clustered)
[Sales].[PK_Sales] [S]
Cost: 51%

I checked the plans for the other types in the main article, both have differing execution plans again and all 3 use joins and all 3 need to scan each table once using the clustered index... so in "theory" they would be completely comparable right?

Interestingly... the second query main article the "better one" performs both clustered scans but also a Sort that takes some of the major execution time (Sort Cost: 63%) not sure how much effect this would have

If i have time i'll perform a *real* test.. .one of actual time difference in query execution of large data sets... unless someone wants to do that :p
1/19/2010 10:01 PM | David Mc
Gravatar

# re: SQL GROUP BY techniques

Results of "Estimated Operator Cost" in Actual Execution Plan
(Showing only NON-ZERO results)

Using inner join group by on sub query
0.0032919
0.0114058 <- this is the sort operator
0.0000059
0.0032831
0.0000042
=
0.0179909

using outer apply
0.0034687
0.0000163
0.0000003
0.0032853
0.0000125
=
0.0067831

using all fields in a group by
0.0032853
0.0034687
0.0000121
0.0000059
=
0.0067720

I don't have alot of data in my tables so these are very small readings heh

it seems the best performing query is the one with all the group by items
second is to use outer apply, but its so close (each run varies slightly anyway) that they are almost identical
third by being more than twice as slow is the inner join by sub table

however with alot of data in the table i do wonder what the results would be, i might be tempted to see soon :) perhaps 500,000 customers with random sets of sales between 10 and 50k each?

1/20/2010 3:35 AM | David Mc
Gravatar

# re: SQL GROUP BY techniques

Amazing really.. I saw lots of these smelly code before. You just hit the target perfectly!
1/22/2010 9:04 PM | Ahmed IG
Gravatar

# re: SQL GROUP BY techniques


hi could any one of you correct me where i am wrong .
thanks in advance .

select b.TARGET_HOST_NAME from PROFILE_BAC_01.SM_DEF_TARGET b INNER JOIN (select a.RMD_PASS_COUNT, a.RMD_FAIL_COUNT, a.RMD_ABNORMAL_COUNT, (a.RMD_PASS_COUNT + NVL(a.RMD_FAIL_COUNT,0) + a.RMD_ABNORMAL_COUNT) as Totsample, (a.RMD_PASS_COUNT / (a.RMD_PASS_COUNT + NVL(a.RMD_FAIL_COUNT,0) + a.RMD_ABNORMAL_COUNT)*100) as Avail from PROFILE_BAC_01.SM_RAWDATA_MEAS_DAY a group by a.RMD_TARGET_ID) test ON b.TARGET_ID = a.RMD_TARGET_ID









2/1/2010 12:25 AM | Gokul
Gravatar

# re: SQL GROUP BY techniques

I am trying to learn SQL and i am doing some self learn things on the internet and from some books but i have hit a question that i dont understand what it means by using the group by function. any help is appreciated...

2. From the items_ordered table, select the item, maximum price, and minimum price for each specific item in the table. Hint: The items will need to be broken up into separate groups.

There is no actual table it is just simply writing commands to answer the questions

Thanks!!!
2/13/2010 7:42 PM | SQL inquirer
Gravatar

# re: SQL GROUP BY techniques

This was incredibly helpful, truly you made my day!
Cheers!
3/11/2010 11:29 AM | Mattias
Gravatar

# re: SQL GROUP BY techniques

Hi, I am trying to group spread(Expression) based on range of balance size using access. In my query, spread is calculated using fields from all 3 tables. My query looks like this


Select IIf(C.LifeFloor = 0, Diff(T1.Intrate,F.Ftp),Diffloor(I.Rate,C.Sprd,F.Ftp))As Spread, C.BalTrier From I Inner Join C On I.Index = C.IndexRt Inner Join F on C .acno= F.acno
Where C.bal > 0
Group By C.BalTier;

Here BalTier is also an calculated expression based on C table ( I left its calculation to simplify).
So when I run this query it as usual gives Spread(Expression) is not a part of aggregate function.

My question is, how do i use a nested select on C table (to groupby Baltier) when I am calling C table in spread ?
If this is not possible as Groupby needs an aggreagate function, Can anyone suggest me another way of achiveing my goal of grouping spread based on the balance tier.

Need help badly,
Thanks
A


4/2/2010 12:50 PM | a
Gravatar

# re: SQL GROUP BY techniques

if two tables customer having customerid and order having orderid and customer id i would like to how to find total orders placed by each customer using group by
5/5/2010 2:40 PM | priya
Gravatar

# re: SQL GROUP BY techniques

"Push the grouping down a level, into a derived table: "...awesome ! nicer and cleaner!
5/17/2010 4:21 AM | Karachiite
Gravatar

# re: SQL Count and MAx functions


How can i fix this error? Msg 130, Level 15, State 1, Line 1
Cannot perform an aggregate function on an expression containing an aggregate or a subquery.


SELECT f.FilmTitle AS Film, p.FirstName AS Actor
FROM Person p, Role r, Film f
WHERE f.FilmID IN
(SELECT FilmID FROM Award a JOIN Role r ON a.RoleID = r.RoleID)
AND f.FilmID = r.FilmID
AND r.PersonID = p.PersonID
AND r.PersonID IN
(SELECT PersonID

FROM Role r JOIN Award a ON r.RoleID = a.RoleID
GROUP by PersonID



HAVING COUNT(*) =
(

SELECT MAX(COUNT(*) )
FROM Role r JOIN Award a ON r.RoleID = a.RoleID)

)
5/18/2010 2:31 PM | Tom
Gravatar

# re: SQL GROUP BY techniques

<<Need Help>>
I read your above blog, and found that I suffer from what you are explaining - However, I do not know how to fix it.

The below statement will return a list of poker venue leaders for a particular season (with no player idnumber). I need to get the idnumber. When I put it in the select statement I get an error that its not in the GROUP BY statement. When I put it in the GROUP BY statement - The query (of course) returns all the players in all the venues, not the ones with the MAX SUM TOTAL... Please help.


SELECT MAX(Total) AS Venue_Leader_Points, ltid, season
FROM (SELECT SUM(results1.trnpoints) AS Total, results1.idnumber, lttable.ltid, results1.season
FROM results1 INNER JOIN
lttable ON results1.ltid = lttable.ltid
GROUP BY results1.idnumber, lttable.ltid, results1.season
HAVING (results1.season = @season)) AS intermediate
GROUP BY ltid, season
9/12/2010 7:13 PM | Chuck
Comments have been closed on this topic.

Powered by:
Powered By Subtext Powered By ASP.NET