SQL GROUP BY techniques
One aspect of the versatile SELECT statement that seems to confuse many people is the GROUP BY clause. It is very important to group your rows in the proper place. Always push GROUP BY aggregations as far into your nested SELECT statements as possible – if you have a situation in which you are grouping by long lists of columns that are not part of primary keys, you are probably have not structured your query correctly.
Here's a classic example that returns total sales per customer, in addition to returning the customer's name and address:
SELECT
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City,
C.State, SUM(S.Sales) as TotalSales
FROM
Customers C
INNER JOIN Sales S
ON C.CustomerID = S.CustomerID
GROUP BY
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City, C.State
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City,
C.State, SUM(S.Sales) as TotalSales
FROM
Customers C
INNER JOIN Sales S
ON C.CustomerID = S.CustomerID
GROUP BY
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City, C.State
I can’t say how many times I see SELECT's written this way and it is simply wrong. You should only be grouping on CustomerID, and not on all those other columns. Push the grouping down a level, into a derived table:
SELECT
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City,
C.State, S.TotalSales
FROM
Customers C
INNER JOIN
(SELECT
CustomerID, SUM(Sales) as TotalSales
FROM
Sales
GROUP BY
CustomerID) S
ON
C.CustomerID = S.CustomerID
C.CustomerID, C.CustomerName,
C.CustomerType, C.Address1, C.City,
C.State, S.TotalSales
FROM
Customers C
INNER JOIN
(SELECT
CustomerID, SUM(Sales) as TotalSales
FROM
Sales
GROUP BY
CustomerID) S
ON
C.CustomerID = S.CustomerID
Hopefully, you will agree that it is a much cleaner SELECT statement now, it is more efficient and it logically makes more sense when you look at it.
One more common mistake is that people just mimic the expressions in their SELECT list in the GROUP BY clause, without thinking logically about what grouping is really necessary. For example:
SELECT LastName + ', ' + FirstName, ... etc ...
FROM Names
GROUP BY LastName + ', ' + FirstName
FROM Names
GROUP BY LastName + ', ' + FirstName
Again, that is logically wrong and also less efficient. You should not be grouping by the expression itself; you should be grouping by what is needed to evaluate that expression. The correct grouping is:
GROUP BY LastName, FirstName
Too many people just keep stuffing column names and expressions into the GROUP BY clause until the errors go away without stepping back and thinking logically about how grouping works. Take a minute to really consider what you need to return and how it should be grouped, and try using derived tables more frequently when writing aggregate queries to help keep them structured and efficient.
see also:
- Group by Month (and other time periods)
- More on GROUP BY; Examining SUM(Distinct)
- But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?
- Using GROUP BY to avoid self-joins
- SQL Server 2005: Specifying Partitions for Aggregate Functions
Legacy Comments
Xaprb
2005-12-14 |
Amen brother! Amen brother! I too see a lot of code where people have just stuffed whatever it takes into the GROUP BY to make the errors stop. Another suspicious place is the ORDER BY clause. And yet another common no-no I see is code using DISTINCT where it's not needed, or where GROUP BY is needed instead or vice versa. |
Joker
2005-12-15 |
re: Proper GROUP BY usage Of course if there is a need for totals at customer type/State/city then a different approach is needed. :) |
Jeff S
2005-12-16 |
re: Proper GROUP BY usage >>Of course if there is a need for totals at customer type/State/city then a different approach is needed. Close ... A different SELECT is needed, but not a different approach. The approach stays the same. Especially CustomerType is a foreign key to the table that contains the type description or any other info you might want to include in the results. |
Adam Machanic
2005-12-18 |
re: Proper GROUP BY usage Watch out with that advice about expressions. The following two queries are _NOT_ logically equivalent: SELECT FirstName + ',' + LastName FROM ( SELECT 'F1', 'F2' UNION ALL SELECT NULL, 'F3' UNION ALL SELECT NULL, NULL UNION ALL SELECT 'F4', 'F5' ) x (FirstName, LastName) GROUP BY FirstName, LastName SELECT FirstName + ',' + LastName FROM ( SELECT 'F1', 'F2' UNION ALL SELECT NULL, 'F3' UNION ALL SELECT NULL, NULL UNION ALL SELECT 'F4', 'F5' ) x (FirstName, LastName) GROUP BY FirstName + ',' + LastName The right answer will depend on your business requirements. |
Jeff
2005-12-20 |
re: Proper GROUP BY usage Thanks, Adam. you're right in this particular case if either of those columns allows for Nulls and the SELECT isn't written to handle them. The advice remains the same, however -- I still very much stand by my statements and it still is very important to group by what you need to calculate your expression, not the expression itself, where appropriate. If you consider the NULL possibility, and if we are assuming that we are writing "good" SELECT's, then the original might be: SELECT COALESCE(LastName, 'N/A') + ', ' + COALESCE(FirstName,'N/A') FROM... GROUP BY COALESCE(LastName, 'N/A') + ', ' + COALESCE(FirstName,'N/A') or something like that; the key is if nulls are there, you would handle them. And, ironically, this is an even *better* example than the one I posted, because of course the proper grouping remains the same: GROUP BY lastname, firstname |
Matthew
2005-12-28 |
re: GROUP BY techniques >> I can’t say how many times I see SELECT's written this way and it is simply wrong. I can't agree. If the other fields are attributes of a customer, then as long as the CustomerID is listed (preferably first) then there isn't a problem. Now, I am not saying its better or worse or faster or slower or whatever. But from a relational data standpoint, one thing that it isn't is 'wrong'. >> Hopefully, you will agree that it is a much cleaner SELECT statement now, it is more efficient and it logically makes more sense when you look at it. Cleaner - debatable. Logical - debatable. But lets focus on performance. How sure are you that the new structure is more efficient? Have you tested on large sets of data? |
Jeff
2005-12-28 |
re: GROUP BY techniques Thanks, Mathew, for the feeback. When writing any SELECT, you have to ask yourself "What am I returning?" In this case, it to return one row per Customer, along with their total sales. What defines a customer? In this case, the PK of the customer table -- CustomerID. Therefore, the ONLY grouping you should be doing ANYWHERE is grouping by customerID. Why would you join every row in the entire transaction table to the customer table and then group those rows by customer name, city, state, etc, when you can simply group the transaction table by one of its own (indexed) columns? Which do you feel is more efficient? As for more logically, convert each into an English sentence: Query 1: Join the customer table with the transaction table, and then return total sales for each combination of cusomterID, customer name, address, city, and state. Query 2: Return customer ID, name, address, city and state from the customer table, and also join to a subquery that returns total sales for each customer. Which statement makes more logical sense to you? |
Matthew
2005-12-28 |
re: GROUP BY techniques >> Which statement makes more logical sense to you? You must understand I was not saying which is better or worse. I was simply saying that possibly 'wrong' is an inappropriate term to use. You stated: >> I can’t say how many times I see SELECT's written this way and it is simply wrong. You should only be grouping on CustomerID, and not on all those other columns. Your second sentence is possibly true. But I don't think the former is. From a relational standpoint, it is perfectly valid to GROUP BY fields that you know are attributes of an entity whose identifier is also included in the GROUP BY. The output of the query will be stable and known - it will work, and you can rely on the fact that it will work. Thus the word 'wrong' seems an odd choice. Maybe I am just nitpicking. |
Matthew
2005-12-28 |
re: GROUP BY techniques >> Which do you feel is more efficient? I'd be interested in whether one was quicker or more efficient than the other, but what I 'feel' is completely irrelevant to the issue at hand. In performance optimisation you sometimes find that what you think is a bottleneck often is not, and doing 'obvious' optimisations may slows things down. My quick testing at my end shows the two queries are performing very similar. But I'd be interested to see an instance where that wasn't the case. |
Jeff
2005-12-28 |
re: GROUP BY techniques Thanks, Matthew. My use of the word "wrong" is definitely referring to style, not that the results do not come out correctly; i.e., it is also "wrong" to ever use RIGHT OUTER JOINs, but technically using them returns valid results, since it is (unfortunately) a valid join type and equivalent to a backwards LEFT OUTER JOIN. As for your note about optimizations, my post has nothing to do with opmizations at all -- it has to do with writing clearly structured SELECT statements that will perform better or (the worst case scenerio) the same as a sloppy one. Depends on the optimizer. The SQL Server 2005 optimizer essentially rewrites the first ("wrong") SELECT as the second ("correct") one anyway when evaluating (look very closely at the execution plan -- it is smart enough to know that it should not be grouping on all those columns) so they come out exactly the same. But to rely on the optimizer to clean up your code for you is not something I would recommend when you can easily write it "correctly" in the first place without any effort other than really understanding how to write a well structured SELECT. |
Matthew
2005-12-29 |
re: GROUP BY techniques >> As for your note about optimizations, my post has nothing to do with opmizations at all Apologies for the misunderstanding - I read your comment about it being more efficient as being related to performance. |
Jeff S
2006-01-12 |
re: GROUP BY techniques Hey Matthew -- no need to apologize at all. I also did not word my post too well, there was lots of room left for interpretation. You raise some great points and thanks again for your feedback! - Jeff |
Michael
2006-05-05 |
re: GROUP BY techniques Well, this post is a bit old, but I figured I would reply with my findings... Jeff, the solution works! I had a query that was developed by another developer where they were grouping by the entire select criteria. The query was joining between two tables and only two columns from the second table were used in aggragate functions, however, they were still grouping by the entire select criteria. anyway, I used to you logic to move the group by on the two columns to the join and it WORKED.!! Rather than trying to group by the entire select statement, I was only grouping by the two columns involved in the aggragate function. With regards to performance, the initial query took about 14 mins to return results (8035 rows), because it had to re-order the results based on the group by on the entire select. When I used your query, the same amount of results were returned, however, this time it only took 30 seconds.. I was pleasantly surprised of the performance increase... Thank you.. |
Baz
2006-05-17 |
re: GROUP BY techniques I have tried the same thing on a large chunk of data between tables, where grouping by select list took 2:49 to return data and grouping in th join on key field took 2:48, no great saving? Any ideas? |
Trevor
2006-06-08 |
re: GROUP BY techniques This is a very useful post. I have run some tests on some sizable datasets and found some slight performance improvement, but perhaps more importantly the improvement in code readability and justifies the technique. |
Zims
2007-04-20 |
re: SQL GROUP BY techniques After reading your article I just used this approach on a query and my performance declined. Can you offer some suggestions as to when each approach is better or worse? I am using SQL Server 2000 in this case and I had to join to the sub-query with the aggregate on two fields, not one as in your example. Could that be the difference? |
Jeff
2007-04-20 |
re: SQL GROUP BY techniques Interesting, but it is impossible to determine for sure what is happening without an example. Performance should never decline. Can you post your SQL and the relevant tables' DDL ? |
Zims
2007-04-22 |
re: SQL GROUP BY techniques Nevermind, I think the problem is that there was no index that met the needs of my newest query. Once I added it, sure enough, the one with the derived table with aggregate like you demonstrated above is faster. When I went to check the indexes there were 11 besides the PK index, and I had to add yet another. As I understand it, there is a trade off between the work required to maintain indexes and the gains from quicker data access. I clearly need to look more into this. Thanks for the great article. I will definitely be using that technique. I hated putting all those fields in GROUP BY clauses and it always seemed clunky to me. Now I know why! |
Samuel
2007-04-30 |
re: SQL GROUP BY techniques Hi Jeff, I've used your SQL structure to pull out a similar set of results with two tables. I have the aggregate function inside the sub query but ms access returns the error SYNTAX ERROR IN JOIN OPERATION. What have I done wrong here? SELECT * FROM tblPurchaseOrderHeader INNER JOIN (SELECT tblPurchaseOrderDetail.HeaderID, SUM(tblPurchaseOrderDetail.TotalAmount + tblPurchaseOrderDetail.TaxAmount) as TotalPO FROM tblPurchaseOrderDetail GROUP BY tblPurchaseOrderDetail.HeaderID) ON tblPurchaseOrderHeader.HeaderID = tblPurchaseOrderDetail.HeaderID; |
Jeff
2007-04-30 |
re: SQL GROUP BY techniques Samuel -- keep in mind this is a SQL Server weblog, many things written here may not work exactly for MS Access. Anyway, for this example, you did not give the derived table an alias. Also, you cannot reference any of the tables within the derived table, you can only reference the output via the alias as well. Note the changes below to your code where the alias X is used (twice, towards the end): SELECT * FROM tblPurchaseOrderHeader INNER JOIN (SELECT tblPurchaseOrderDetail.HeaderID, SUM(tblPurchaseOrderDetail.TotalAmount + tblPurchaseOrderDetail.TaxAmount) as TotalPO FROM tblPurchaseOrderDetail GROUP BY tblPurchaseOrderDetail.HeaderID) X ON tblPurchaseOrderHeader.HeaderID = X.HeaderID; |
zubair Masoodi
2007-05-03 |
re: SQL GROUP BY techniques hi Jeff This is such a Nice article that i have falling short of words to appreciate you. The Example you have provided can give the reader a clear insight of group by caluse. Keep writing such articles and serving the world . Add few more examples and also write some thing on Distinct Clause as well Zubair |
pootle_flump
2007-05-04 |
re: SQL GROUP BY techniques Hi Jeff Just some musings on the above... Another compelling reason for your view point - not repeating the columns in two clauses makes code more maintainable and less likely to introduce a bug (for example if a column is removed from the select clause but the developer forgets to adjust the group by). Having said that I don't totally buy it. You are making two passes at the customer table instead of one. I quickly adapted your code for AdventureWorks. Scan count and logical reads are up for your version. The difference may even be artificially low as the second scan is of a nonclustered index for the example I used. I am happy to concede that your method is more easily maintained and less likely to catch people out if there is a deliberate grouping on a column that is not selected. I guess based on that I have to concede that it is more readable. Based on a very elementary example though the execution plan is less efficient. Obviously there would have to be more tests to firm that up to anything more conclusive. Do you have figures to the contrary? USE AdventureWorks go SET STATISTICS IO ON GO --SET SHOWPLAN_TEXT ON --GO SELECT C.CustomerID, C.CustomerType, SUM(S.SubTotal) as TotalSales FROM AdventureWorks.Sales.Customer AS C INNER JOIN AdventureWorks.Sales.SalesOrderHeader AS S ON C.CustomerID = S.CustomerID GROUP BY C.CustomerID, C.CustomerType SELECT S.CustomerID, C.CustomerType, S.TotalSales FROM AdventureWorks.Sales.Customer AS C INNER JOIN (SELECT C.CustomerID , SUM(S.SubTotal) as TotalSales FROM AdventureWorks.Sales.Customer AS C INNER JOIN AdventureWorks.Sales.SalesOrderHeader AS S ON C.CustomerID = S.CustomerID GROUP BY C.CustomerID) AS S ON S.CustomerID = C.CustomerID GO --SET SHOWPLAN_TEXT OFF --GO SET STATISTICS IO OFF |
Jeff
2007-05-04 |
re: SQL GROUP BY techniques pootle_flump -- great feedback, thanks for testing, but your second SELECT is incorrect; there is no need to include the customer table in the derived table, you just group it by customerID. that's kind of the main point of the article -- group the transaction tables by the key columns w/o joining to the entity tables until the last step (at the outer select). Your second SELECT should read: SELECT S.CustomerID, C.CustomerType, S.TotalSales FROM AdventureWorks.Sales.Customer AS C INNER JOIN (SELECT CustomerID , SUM(SubTotal) as TotalSales FROM AdventureWorks.Sales.SalesOrderHeader GROUP BY CustomerID) AS S ON S.CustomerID = C.CustomerID Also, you are only returning CustomerType, not a whole bunch of columns related to the Customer, so it's not a great example. From what I've tested, it appears that SQL 2005 uses the same execution plan for both techniques, since if you look at it carefully, it essentially re-writes your SQL to work the same way as pushing the grouping into a derived table. (this has been discussed in the comments). I believe that with SQL 2000, however, it is indeed a performance improvement to do your grouping "properly". |
Jeff
2007-05-04 |
re: SQL GROUP BY techniques Using your basic example, let's do a join to the Address table and include address and city in our results, and compare the two techniques: USE AdventureWorks go SET STATISTICS IO ON GO --SET SHOWPLAN_TEXT ON --GO SELECT C.CustomerID, C.CustomerType, AD.AddressLine1, AD.AddressLine2, AD.City, SUM(S.SubTotal) as TotalSales FROM AdventureWorks.Sales.Customer AS C INNER JOIN AdventureWorks.Sales.SalesOrderHeader AS S ON C.CustomerID = S.CustomerID INNER JOIN Sales.CustomerAddress AS CA ON C.CustomerID = CA.CustomerID and CA.AddressTypeID = 3 INNER JOIN Person.Address AD ON CA.AddressID = AD.AddressID GROUP BY C.CustomerID, C.CustomerType, AD.AddressLine1, AD.AddressLine2, AD.City -- SELECT S.CustomerID, C.CustomerType, S.TotalSales, AD.AddressLine1, AD.AddressLine2, AD.City FROM AdventureWorks.Sales.Customer AS C INNER JOIN (SELECT CustomerID , SUM(SubTotal) as TotalSales FROM AdventureWorks.Sales.SalesOrderHeader GROUP BY CustomerID) AS S ON S.CustomerID = C.CustomerID INNER JOIN Sales.CustomerAddress AS CA ON C.CustomerID = CA.CustomerID and CA.AddressTypeID = 3 INNER JOIN Person.Address AD ON CA.AddressID = AD.AddressID GO --SET SHOWPLAN_TEXT OFF --GO SET STATISTICS IO OFF In this case, the second is more efficient. Not by a ton, but it is, and it also has the benefit of being more maintainable, shorter, and more logical. |
pootle_flump
2007-05-04 |
re: SQL GROUP BY techniques >> your second SELECT is incorrect; there is no need to include the customer table in the derived table, Dear me Jeff - and there I was determined not to post a dumb comment :-/ Thanks for going easy on me ;-) Yeah - I knew the query didn't pull in much in the way of other attributes (first time I have delved into AdventureWorks - very odd design that will take a bit of looking around in). I will now bear the advice in mind for the future. Thanks |
Samuel
2007-06-05 |
re: SQL GROUP BY techniques Thanks Jeff, I'm still a newbie to SQL. |
Vladimir
2007-06-20 |
re: SQL GROUP BY techniques Hello Jefj, I am reading this trace on Group By, and it is very revealing. In my case I have a table map_medical_claims with claims identified by claim_id. Claims may have a number of items identified by: cardholder_id,first_dt_of_service,last_dt_of_service, diagnosis_cd_1,diagnosis_cd_2,diagnosis_cd_3,diagnosis_cd_4, procedure_cd_1,procedure_cd_2,procedure_cd_3,procedure_cd_4. Requirements demand identifier for each item in a separate column. This table has about 550,000 rows. I tried to create sql filling an item identifier, but failed. So I went with cursor, but this takes a lots of time, since I had to use two nested cursors. Is there any way to do it in sql? The answer willl be greatly appreciated. Vladimir |
Phil
2007-07-16 |
re: SQL GROUP BY techniques Thanks for the examples. I tried them both, and for my data, the first query performed better than the second query: 70ms vs 7000ms |
Jeff
2007-07-16 |
re: SQL GROUP BY techniques Phil -- Wow, that seems like quite a difference ... can you please post the two SQL statements that you executed, along with the relevant DDL? I'm very curious about your results. No offense, but I suspect that you did something wrong or not optimally when you re-wrote your query .... For example, look at poodle_flump's posts, above -- he initially had the same result as you, but that was because he did not re-write the query efficiently. Thanks in advance. - jeff |
Genady
2007-08-06 |
re: SQL GROUP BY techniques Hi , Jeff Thanks very useful |
Gandy Dancer
2007-08-06 |
re: SQL GROUP BY techniques Um... I would never introduce a subquery if it weren't absolutely required, or unless performance really didn't matter. I'd test the performance both ways, then choose the faster way. I am always right. |
ammoQ
2007-08-07 |
re: SQL GROUP BY techniques It's *always* dangerous to predict which query is faster. Let's look at your first example: SELECT C.CustomerID, C.CustomerName, C.CustomerType, C.Address1, C.City, C.State, S.TotalSales FROM Customers C INNER JOIN (SELECT CustomerID, SUM(Sales) as TotalSales FROM Sales GROUP BY CustomerID) S ON C.CustomerID = S.CustomerID This one is clearly faster than SELECT C.CustomerID, C.CustomerName, C.CustomerType, C.Address1, C.City, C.State, SUM(S.Sales) as TotalSales FROM Customers C INNER JOIN Sales S ON C.CustomerID = S.CustomerID /* Bug fixed */ GROUP BY C.CustomerID, C.CustomerName, C.CustomerType, C.Address1, C.City, C.State or isn't it? What happens if Sales.CustomerID is NULL for 99.99% of all records (sold to anonymous customers in the shop), so only a few Sales records can actually be joined to Customers? The "correct" statements aggregates first, then does the join. Doing that is arguably slower than the other way round. BTW: my prefered way to avoid long group by clauses, arguably just as ugly, is using "max" or "min", as in SELECT C.CustomerID, max(C.CustomerName), max(C.CustomerType), max(C.Address1), max(C.City), max(C.State), SUM(S.Sales) as TotalSales FROM Customers C INNER JOIN Sales S ON C.CustomerID = S.CustomerID /* Bug fixed */ GROUP BY C.CustomerID |
Jeff
2007-08-08 |
re: SQL GROUP BY techniques Hi AmmoQ -- absolutely correct, always go with the most efficient method unless the efficiency gained is minimal and comes as the expense of clean code. Of course, if you have NULL customer values in your Order table, and only want to deal with orders that have customers, you could simply exclude NULL values from your aggregate query. Also -- thanks for pointing out that bug! This post has been up for *years* and no one ever noticed it! Well done! |
mkc
2007-09-10 |
re: SQL GROUP BY techniques I have a little problem here, I hope somebody could help me out. I have a table say 'sample' with two columns, say 'A'(varchar2) and 'B'(date). eg: A B 1 12 july 2007 1 12 aug 2007 1 13 aug 2007 1 14 aug 2007 2 17 aug 2007 2 18 aug 2007 Here the max date for 1 is 14 aug 2007. 14 aug 2007 - 6 = 8 aug 2007. I need to pick up the record which has the minimum date AFTER 8 aug 2007. A B 1 12 aug 2007 (not 12 July 2007 which is the actual minimum date) Can anybody help me with the query? Any suggestions would be appreciated. Thanks. mkc |
Zbigniew Lukasiak
2007-09-21 |
re: SQL GROUP BY techniques That was englighting! And the comments above as well. Thanks. I have just one more comment to the second example. It is not only the NULL that can make the two queries have different meaning. Consider this: create table a( a varchar(1), b varchar(1)); insert into a values ('', 'a'); insert into a values ('a', ''); ymogen2=# select a || b from a group by a, b; ?column? ---------- a a (2 rows) ymogen2=# select a || b from a group by a || b; ?column? ---------- a (1 row) |
Zbigniew Lukasiak
2007-09-21 |
re: SQL GROUP BY techniques Ah - yeah that was on PostgreSQL - '||' means concatenation. |
Jeff
2007-09-21 |
re: SQL GROUP BY techniques Hi Zbigniew -- thanks for your comment, that was actually raised and addressed in one of the very first comments (from Adam), just scroll up a bit to see it. If you have nulls in A or B, and you need A+B to return NULL, then absolutely, don't group on A,B. |
Zbigniew Lukasiak
2007-09-22 |
re: SQL GROUP BY techniques Not exactly - the columns don't contain NULL - they contain '' - and if you think that '' is similar to NULL - then I could make it ('a', 'ab') and ('aa', 'b'). My point is that you don't need NULL's to have the behaviour that Adam described. |
Jeff
2007-09-22 |
re: SQL GROUP BY techniques Hi Zbigniew -- ah, sorry, read to quickly. Yes, you are correct. Again, the fact remains: group on what you need to evaluate the expression, not the expression itself, exactly as I wrote .... If you are just concatenating two columns, or it doesn't logically work for the expression you have, then you cannot group on them separately for the reason you mentioned, absolutely. |
Chandra Mohan Naidu Yerrabothula
2007-11-04 |
re: SQL GROUP BY techniques Hi, Its helped me a lot for my project success. |
Juris
2008-09-11 |
re: SQL GROUP BY techniques I found your post very interesting - and despite it is old, I still will comment it. When considering simple queries, I like this approach, but when it comes to some more complex queries - it makes only extra work. For example: View "viewCars" with several columns including "expDate" column, for which I should make restriction and then GROUP BY two columns "CarID" and "ExpType" and display other columns too. This would be query with "bad" style. But when I tried to rewrite it - it become more complex and still there was big problems to construct query with "good" style at all. Probably I am wrong... But up to date it is my opinion. SELECT CarID, CarName, CompanyID, CompanyName, ExpType, SUM(Expenses) AS Summa FROM (SELECT CarID CarName CompanyID CompanyName ExpType Expenses FROM viewCars WHERE expDate BETWEEN '2008-07-1' AND '2008-07-31' ) subViewCars GROUP BY CarID, CarName, CompanyID, CompanyName, ExpType |
jeff
2008-09-11 |
re: SQL GROUP BY techniques Juris -- no, you did not rewrite that correctly. And you shouldn't query a view like that -- it will not be as efficient as querying the underlying tables of that view properly. You should be writing something like: select c.* (all columns), e.totalExpenses from cars c left outer join (select carID, sum(expenses) as TotalExpenses from expenses where expdate between .... group by carID ) e on c.carID = e.carID Never query a View that does joins that aren't what you need for your SELECT. Or, |
Juris
2008-09-16 |
re: SQL GROUP BY techniques Thanks for your answer! Your answer is completely right for my question :). I have constructed analogue of my real query (which has fields not in English), but I have not described my problem correctly. My idea in general was: I have already views on which are simpler to create bad style queries and get what I need instead of to create new ones (views) with correct approach. To add GROUP BY clause to existing view using correct approach particular is impossible (as far as I understand) because I do not need all rows from existing view, I need already grouped rows. When creating non-existing view - of course I would use your correct approach. The view which I described (viewCars) is already built on several tables: Company Car (should be grouped) Car specification Car-Company-Usage Car user Expenses (should be grouped) Expenses detailed Suppliers And I have two choices - construct bad GROUP BY query with already given view or create new view, which is based on the same tables and with correct GROUP BY clause. Probably the answer, which choice is better is not so simple. But thank you one more time - your post is very usefull. |
TampaDev
2008-09-18 |
re: SQL GROUP BY techniques Does t-sql have something similar to inline groupings that oracle provides, similar to partition by clause? |
Sander
2008-11-12 |
re: SQL GROUP BY techniques Thanks Jeff. I have to do a bit of sql-maintanance and my sql-skills are rusty. I found myself stuffing extra fields in the group-by expression, and I felt I should probably gain a better understanding of group-by before proceeding. Your article was exactly what I needed (and luckily it was the 2nd result of googling "group by"). |
N.Zargari
2008-11-18 |
re: SQL GROUP BY techniques Hi; I have a quick question for you. following is my Query: " SELECT Sum(Amount) As vSum, Count(Amount) As vCount, ZoneCode Into TmpRec FROM GReciept INNER JOIN TmpReciept ON GReciept.RowNo = TmpReciept.RowNo Group By left(ZoneCode,1) with rollup " "ZoneCode" is a two chars. (for example '00', '01', '02', ..., '10', '11', '12', ..., '20', '21'....), and I need group them by first char. I tried above Query, but it doesn't work. I appreciate for any help. N.Zargari |
N.Zargari
2008-11-19 |
re: SQL GROUP BY techniques Hi Jeff; Sorry my email address is nzargari@yahoo.com. Thanks. |
jeff
2008-11-19 |
re: SQL GROUP BY techniques N.Zargari -- You cannot select the entire ZoneCode and group only on the first character. see: http://weblogs.sqlteam.com/jeffs/jeffs/archive/2007/07/20/60261.aspx |
N.Zargari
2008-11-22 |
re: SQL GROUP BY techniques Thank you Jeff; I fixed it. Thanks again for your help. N.Zargari |
N.Zargari
2008-11-22 |
re: SQL GROUP BY techniques Hi Jeff; I got another Question for you. I use sql server 2000 and have a table with 61085 Recs. and 68 chars. in every Rec. I use " Select * from TABLENAME " query and it takes 33 secs. to show the result. How can I reduce this time?(I need all fields). Tanks. N.Zargari. |
cyril
2008-12-14 |
re: SQL GROUP BY techniques Hi sir, I have a question in sql. Is it possible to have a 1 sql using group by to produce this result? Col1 Col2 Col3 AAAA 1111 1111 AAAA 1111 1112 AAAA 1111 1113 BBBB 1111 1111 BBBB 1111 1112 to produce like this: Col1 Col2 Col3 CountofCol3 AAAA 1111 1111 3 AAAA 1111 1112 3 AAAA 1111 1113 3 BBBB 1111 1111 2 BBBB 1111 1112 2 I am using this sql statement to produce this result below: select a.col1, a.col2, a.col3, b.CountofCol3 from table a left join (select col1, col2, count(*) CountofCol3 from table group by col1, col2) b on a.col1=b.col1 and a.col2=b.col2 is there any sql statement that would be shortened my sql statement? thank Cyril |
Michael
2009-02-11 |
re: SQL GROUP BY techniques Hi Jeff. I have to second the comment made earlier that including more than absolutly necessary in the GROUP BY clause is not wrong. For the example you gave, setting out the SQL as you suggest is easy and makes nice, logical code. And as has been pointed out, mindlessly stuffing everything into the GROUP BY can be problematic and is sloppy programming technique. However, if you are joining to multiple tables and calling out aggregate functions on most of the fields your approach could quickly produce rather unwieldy code. In these cases, a few extra fields in the grouping may be the best approach. For example, consider rewriting this fairly simple statement into your approach: SELECT ttbSFHistoryTEMP.rundate, tGoal.SMRevenueGoal, ttbSFHistoryTEMP.OfficeID, sum(ttbSFHistoryTEMP.revenue) as TotRevenue, sum(ttbSFHistoryTEMP.GP) as TotGP, sum(ttbSFHistoryTEMP.booked) as TotBooked FROM ttbSFHistoryTEMP LEFT JOIN (SELECT tblOffice.OfficeID, tblSalesGoals.DivisionID, tblSalesGoals.SMRevenueGoal FROM tblSalesGoals JOIN tblOffice ON tblSalesGoals.DivisionID = tblOffice.OfficeLocationID WHERE tblSalesGoals.SGType = 1 and tblSalesGoals.SGName = DATEPART(year,getdate())-1) tGoal ON ttbSFHistoryTEMP.OfficeID = tGoal.OfficeID GROUP BY ttbSFHistoryTEMP.OfficeID, ttbSFHistoryTEMP.rundate, tGoal.SMRevenueGoal If I understand your approach, in order to get rid of my two extra GROUP BY fields I would need to add another temp query for my groupings and include this with another join. OK, so not too bad for this particular example...but consider as the base statement becomes more complex. Say I need to bring in aggregated results for 12 fields in 4 tables along with 2 or 3 additional identification fields that really don't need to be in the GROUP BY. I don't believe that the resulting code would be more structured or in any way more easily understood by adding the temp tables and extra joins required to GROUP BY only the single attributable field. |
sqlfun
2009-04-10 |
re: SQL GROUP BY techniques Thanks! This is exactly what I was looking for. |
Saravanakumar V
2009-05-07 |
re: SQL GROUP BY techniques Hello, I need a query to solve the below problem. The table rows are given below SerialNo Col2Group --------- ------------ 1 123 2 123 3 123 4 456 5 456 6 456 7 123 8 123 9 789 10 789 11 456 I want the output as ---------- -------------- 1 to 3 123 4 to 6 456 7 to 8 123 9 to 10 789 11 to 11 456 Can anybody send me the query to solve the above problem. My mail ID is saravana.kumar@rsystems.com or skumarv@hotmail.com. Thanks in advance. Regards, Saravanakumar |
Graeme
2009-05-14 |
re: SQL GROUP BY techniques Many thanks for this, very useful, However I am confused about how this works with regard to joining more than two tables. I have simple code: Select recipes.REC_name, SUM(ingredients.sup_ID*ingandrec.quantity) FROM ingredients INNER JOIN ingandrec ON ingredients.I_id = ingandrec.i_id JOIN recipes ON recipes.REC_ID = ingandrec.rec_id WHERE recipes.REC_ID=1 GROUP BY recipes.Rec_ID,recipes.REC_name which Pulls the name of the recipe from the recipes table then works out the total cost based on the ingredients in the recipe (data stored on the Ingredients table), by way of the ingandrec table which tells me each ingredient and how much of each ingredient goes in each recipe. Putting aside my horrendously named tables and my lack of aliases I cannot figure out how to use your approach to remove the extra group by columns (only one at the minute, but it is the principle). Every time I try to reduce it down I get error messages. Could you explain? |
John Jaros
2009-07-13 |
re: SQL GROUP BY techniques I am a teacher writing a program for my school to keep track of absences and latenesses. The SQL below finds the students who have been absent at least n times between specified dates. I have four questions. 1) WHY am I required to GROUP BY so many fields when logically I only want to group by tblAbsence.StudentID??? 2) Some people say that it is valid to use an alias in an ORDER BY clause but it always gives me a "parameter needed" error. Can an alias be used in an ORDER BY clause??? 3) Could you show me the correct way to write my query so that I only need to GROUP BY tblAbsence.StudentID??? 4) Should I put "AbsenceDate BETWEEN StartDate AND EndDate" in parentheses??? I read your information about "grouping by everything until the errors go away" and I'm sure that I'm doing that but I'm still unsure as to how to write my query correctly. Thank you very much. John Jaros SELECT tblAbsence.StudentID AS StudentID, tblStudentInfo.StudentLastName AS LastName, tblStudentInfo.StudentFirstName AS FirstName, tblStudentInfo.StudentMiddleName AS MiddleName, tblStudentInfo.StudentClass AS Class, COUNT (tblAbsence.StudentID) AS NumTimesAbsent FROM tblStudentInfo, tblAbsence WHERE tblStudentInfo.StudentID = tblAbsence.StudentID AND AbsenceDate BETWEEN StartDate AND EndDate AND AbsenceType = 'UNEXCUSED' GROUP BY tblAbsence.StudentID, tblStudentInfo.StudentLastName, tblStudentInfo.StudentFirstName, tblStudentInfo.StudentMiddleName, tblStudentInfo.StudentClass HAVING COUNT(tblAbsence.StudentID) >= <n-times> ORDER BY tblStudentInfo.StudentClass ASC, tblStudentInfo.StudentLastName ASC |
Syni
2009-08-21 |
re: SQL GROUP BY techniques The ways to avoid this error is simple just look into my post http://sqlerrormessages.blogspot.com/2009/08/sql-server-error-message-msg-130-cannot.html |
Nick
2009-09-07 |
re: SQL GROUP BY techniques If you want to use a where clause with grouping here is an example select Position, count(Position) as Total from SpecimenPositions sp inner join (select BoxId from SpecimenBoxes where BoxName='HUN008B') sb on sp.Boxid=sb.Boxid group by Position having count(Position)>1 |
Sydney Web Designer
2009-09-07 |
re: SQL GROUP BY techniques an example to bind inner join and having clause select StorageName, ContainerName, sb.BoxName from SpecimenBoxes sb inner join SpecimenContainers sc on sc.ContainerId= sb.ContainerId inner join SpecimenStorages ss on sc.StorageId = ss.StorageId where sb.boxid in (select boxid from specimens group by boxid having count(boxid)>100 ) order by Storagename, ContainerName, BoxName |
shruti
2009-10-29 |
re: SQL GROUP BY techniques i have one table its having CUSTOMER_NAME,CUSTOMER_PRODUCT,QUANTITY,DATE PATEL MILK 2000 2/12/2005 SHAH YOUGURT 20 3/22/2005 PATEL MILK 200 2/12/2004 I WANT OUT PUT LIKE CUSTOMER_NAME,CUTOMER_PRODUCT,MAX-QUANTITY,MAX-DATE,MIN-QUANTITY,MIM-DATE ,AVERAGE ONLY ONE SINGLE SCAN. NOT A INDIVIDUAL SCAN ANY ONE HELP ME?? |
billb
2009-11-22 |
re: SQL GROUP BY techniques Hey Jeff; Your weary post is still out there doing good! Many thanks for helping me understand how to write better SQL statements. Seeing all those column names in a Group By to appease the PARSER had a bad smell. There had to be a better way. Bill |
cathy
2009-12-02 |
re: SQL GROUP BY techniques Hi , Your post helped me overcome a big headache and was very helping in writing cleaner and logical sql group by statements. keep them coming cathy |
angela
2009-12-03 |
re: SQL GROUP BY techniques Sorry for spamming this!! that one had the wrong date info on the where line SELECT SUM(Value) AS Value, SUM(Unit) AS Units, DATEPART(month, someDate) AS Month, 'All' as Manufacturer, DATEPART(year, someDate) AS Year FROM sometable inner join sometableloclink on sometableloclink.sometableid = sometable.sometableid inner join sometable2 on sometable2.sometableid = sometable.sometableid WHERE (IsNew = 'N') and AND (DATEPART(year, someDate)=@year AND DATEPART(month, someDate)<=@month or DATEPART(year, someDate)+1=@year AND DATEPART(month, someDate)>@month) AND sometable2.DealerShipId=@dealershipId and sometableLocLink.districtId = @districtId and sometableLocLink.industryId = @industryid GROUP BY DATEPART(year, someDate), DATEPART(month, someDate) ORDER BY Year ASC, Month ASC |
Colby Africa
2009-12-20 |
re: SQL GROUP BY techniques Grrr. I always put the group by in the wrong spot too. |
Rami Nsouli
2009-12-29 |
re: SQL GROUP BY techniques Thanks Jeff, great post. clear and nice. regards |
David Mc
2010-01-12 |
re: SQL GROUP BY techniques I much prefer this, clean.. easy and can be compiled via sql server however sql likes it to be: SELECT C.CustomerID, C.CustomerName, C.CustomerType, C.Address1, C.City, C.State, Totals.TotalSales FROM Customers C OUTER APPLY ( SELECT SUM(S.Sales) AS TotalSales FROM Sales S WHERE S.CustomerID = C.CustomerID ) AS Totals |
Jeff S
2010-01-15 |
re: SQL GROUP BY techniques David Mc -- check your performance on that one; applies and correlated subqueries are almost always less efficient than the equivalent JOIN, assuming that SQL doesn't convert it to a JOIN internally. |
David Mc
2010-01-19 |
re: SQL GROUP BY techniques Hey Jeff S, Its a LEFT OUTER JOIN according to the execution plan. SELECT Cost: 0% | Nested Loops (Left outer Join) Cost 0% | | | Clustered Index Scan (Clustered) | [Customers].[PK_Customers] [C] | Cost: 48% | Compute Scalar Cost: 0% | Compute Scalar Cost: 0% | Stream Aggregate (Aggregate) Cost 0% | Clustered Index Scan (Clustered) [Sales].[PK_Sales] [S] Cost: 51% I checked the plans for the other types in the main article, both have differing execution plans again and all 3 use joins and all 3 need to scan each table once using the clustered index... so in "theory" they would be completely comparable right? Interestingly... the second query main article the "better one" performs both clustered scans but also a Sort that takes some of the major execution time (Sort Cost: 63%) not sure how much effect this would have If i have time i'll perform a *real* test.. .one of actual time difference in query execution of large data sets... unless someone wants to do that :p |
David Mc
2010-01-20 |
re: SQL GROUP BY techniques Results of "Estimated Operator Cost" in Actual Execution Plan (Showing only NON-ZERO results) Using inner join group by on sub query 0.0032919 0.0114058 <- this is the sort operator 0.0000059 0.0032831 0.0000042 = 0.0179909 using outer apply 0.0034687 0.0000163 0.0000003 0.0032853 0.0000125 = 0.0067831 using all fields in a group by 0.0032853 0.0034687 0.0000121 0.0000059 = 0.0067720 I don't have alot of data in my tables so these are very small readings heh it seems the best performing query is the one with all the group by items second is to use outer apply, but its so close (each run varies slightly anyway) that they are almost identical third by being more than twice as slow is the inner join by sub table however with alot of data in the table i do wonder what the results would be, i might be tempted to see soon :) perhaps 500,000 customers with random sets of sales between 10 and 50k each? |
Ahmed IG
2010-01-22 |
re: SQL GROUP BY techniques Amazing really.. I saw lots of these smelly code before. You just hit the target perfectly! |
Gokul
2010-02-01 |
re: SQL GROUP BY techniques hi could any one of you correct me where i am wrong . thanks in advance . select b.TARGET_HOST_NAME from PROFILE_BAC_01.SM_DEF_TARGET b INNER JOIN (select a.RMD_PASS_COUNT, a.RMD_FAIL_COUNT, a.RMD_ABNORMAL_COUNT, (a.RMD_PASS_COUNT + NVL(a.RMD_FAIL_COUNT,0) + a.RMD_ABNORMAL_COUNT) as Totsample, (a.RMD_PASS_COUNT / (a.RMD_PASS_COUNT + NVL(a.RMD_FAIL_COUNT,0) + a.RMD_ABNORMAL_COUNT)*100) as Avail from PROFILE_BAC_01.SM_RAWDATA_MEAS_DAY a group by a.RMD_TARGET_ID) test ON b.TARGET_ID = a.RMD_TARGET_ID |
SQL inquirer
2010-02-13 |
re: SQL GROUP BY techniques I am trying to learn SQL and i am doing some self learn things on the internet and from some books but i have hit a question that i dont understand what it means by using the group by function. any help is appreciated... 2. From the items_ordered table, select the item, maximum price, and minimum price for each specific item in the table. Hint: The items will need to be broken up into separate groups. There is no actual table it is just simply writing commands to answer the questions Thanks!!! |
Mattias
2010-03-11 |
re: SQL GROUP BY techniques This was incredibly helpful, truly you made my day! Cheers! |
a
2010-04-02 |
re: SQL GROUP BY techniques Hi, I am trying to group spread(Expression) based on range of balance size using access. In my query, spread is calculated using fields from all 3 tables. My query looks like this Select IIf(C.LifeFloor = 0, Diff(T1.Intrate,F.Ftp),Diffloor(I.Rate,C.Sprd,F.Ftp))As Spread, C.BalTrier From I Inner Join C On I.Index = C.IndexRt Inner Join F on C .acno= F.acno Where C.bal > 0 Group By C.BalTier; Here BalTier is also an calculated expression based on C table ( I left its calculation to simplify). So when I run this query it as usual gives Spread(Expression) is not a part of aggregate function. My question is, how do i use a nested select on C table (to groupby Baltier) when I am calling C table in spread ? If this is not possible as Groupby needs an aggreagate function, Can anyone suggest me another way of achiveing my goal of grouping spread based on the balance tier. Need help badly, Thanks A |
priya
2010-05-05 |
re: SQL GROUP BY techniques if two tables customer having customerid and order having orderid and customer id i would like to how to find total orders placed by each customer using group by |
Karachiite
2010-05-17 |
re: SQL GROUP BY techniques "Push the grouping down a level, into a derived table: "...awesome ! nicer and cleaner! |
Tom
2010-05-18 |
re: SQL Count and MAx functions How can i fix this error? Msg 130, Level 15, State 1, Line 1 Cannot perform an aggregate function on an expression containing an aggregate or a subquery. SELECT f.FilmTitle AS Film, p.FirstName AS Actor FROM Person p, Role r, Film f WHERE f.FilmID IN (SELECT FilmID FROM Award a JOIN Role r ON a.RoleID = r.RoleID) AND f.FilmID = r.FilmID AND r.PersonID = p.PersonID AND r.PersonID IN (SELECT PersonID FROM Role r JOIN Award a ON r.RoleID = a.RoleID GROUP by PersonID HAVING COUNT(*) = ( SELECT MAX(COUNT(*) ) FROM Role r JOIN Award a ON r.RoleID = a.RoleID) ) |
Chuck
2010-09-12 |
re: SQL GROUP BY techniques <<Need Help>> I read your above blog, and found that I suffer from what you are explaining - However, I do not know how to fix it. The below statement will return a list of poker venue leaders for a particular season (with no player idnumber). I need to get the idnumber. When I put it in the select statement I get an error that its not in the GROUP BY statement. When I put it in the GROUP BY statement - The query (of course) returns all the players in all the venues, not the ones with the MAX SUM TOTAL... Please help. SELECT MAX(Total) AS Venue_Leader_Points, ltid, season FROM (SELECT SUM(results1.trnpoints) AS Total, results1.idnumber, lttable.ltid, results1.season FROM results1 INNER JOIN lttable ON results1.ltid = lttable.ltid GROUP BY results1.idnumber, lttable.ltid, results1.season HAVING (results1.season = @season)) AS intermediate GROUP BY ltid, season |