Jeff's SQL Server Blog

Random Thoughts & Cartesian Products with Microsoft SQL Server
posts - 157, comments - 2686, trackbacks - 64

My Links

Advertisement

News

Welcome to my weblog. My name is Jeff Smith, I am software developer in Boston, MA and I was recently named a 2009 SQL Server MVP. Check in frequently for tips, tricks, commentary and ideas on SQL Server and .NET programming.


Subscribe





Archives

Post Categories

Programming

Sports

SQL

But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Server: Msg 8120, Level 16, State 1, Line 1
Column 'xyz' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Arrghh !!! There it is, yet again .. that annoying error message.   Why is SQL so picky about this?  What's the deal!?

A very important concept for anyone who ever works with SQL to fully understand is this: the aggregate function requirement for non-grouped columns is not some arbitrary syntax restriction of SQL in general or of SQL Server's implementation!  It is simple logic, like dividing by zero -- it must be handled and dealt with explicitly, whether we are writing SQL or working it out on paper by hand.

Suppose I ask you this:  "For each city in Massachusetts, what is the total number of residents and what is their age?"  

Now, think logically and carefully about what I am asking you.  Does it make sense?  The total number of people does -- it is just a single value, the count of how many people live in the city.  But how can I return "their age"?  What is really being asked for?  A long list of every age of every person, all delimited by commas (e.g., a list of thousands of repeating numbers)?  A distinct list of all ages, also separated by commas (e.g., 0,1,2,3 ....) ?  Perhaps the average age?  The median age?  The minimum and/or the maximum age? The result of adding up the ages of everyone in the city?

As you can see, we cannot answer the question, since it does not make logical sense without providing more specific information about what you need. 

Now, if we relate this to SQL, the original question directly translates to something like this:

select city, count(*) as residents, age
from people
where state='ma'
group by city

If we run that, the result is that "annoying" error:

Server: Msg 8120, Level 16, State 1, Line 1
Column 'people.age' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

In other words, the computer has the same restriction we do -- it cannot answer that question unless you are more specific!  As you hopefully can see now, this is not some arbitrary syntax error or some SQL Server implementation quirk -- it is a logic error.  We must qualify what we mean by "their age" when asking the question in English; what exactly do we want returned? 

If it turns out that we want the average age of the residents, now we can ask a complete and logical question in English: "For each city, what is the total number of residents and what is their average age?"  And that directly translates to a more complete and valid SQL statement:

select city, count(*) as residents, avg(age) as AverageAge
from people
where state='ma'
group by city

... which will now compile and execute without errors.

If this concept still doesn't make sense, please let me know in the comments and I can try to clarify further; it is so crucial to understand this when writing SQL and thinking logically about the data you are trying to return in a SELECT statement or a report.

(More on GROUP BY here.)

see also:

Print | posted on Friday, July 20, 2007 10:33 AM | Filed Under [ Miscellaneous GROUP BY ]

Feedback

Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Thats awesome dude you rock !!!!!!!!
7/20/2007 10:57 AM | fauxDBA
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

While it is part to the standard to include any non-aggregate object in the group by clause, it can be tedious. Surely the engine can indentify the aggregate and non-aggregate objects. I often have to use nested functions in the select clause for reporting purposes, and in turn have to copy it down verbatim to the group by clause. Only to receive an error message because I overlooked all the AS aliasing.
7/20/2007 12:16 PM | Scott Frigard
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Scott -- derived tables are your friend!

Instead of repeating the expressions twice, just use a derived table. It's cleaner and simpler.

For example,

select a,b,c
from
( select <complicated expression 1> as a,
<complicated expression 2> as b,
<complicated expression 3> as c
from
yourtable
) tmp
group by a,b,c

7/20/2007 2:00 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

One thing I have always wondered, though, is why I can't refer to other columns of a table I have grouped by its primary key. I know the answer is "because you can't", but including them cannot change the grouping, so it feels silly to have to list them in the group by clause (and the clutter certainly makes it harder to see what the real grouping rules are).

I find myself wanting to do this fairly frequently when the foriegn reference specifies some sort of category that I'd like to aggregate statistics about the members of.

SELECT foreign_val1, foreign_val2, AVERAGE(the_table.attribute)
FROM the_table, foreign_table
WHERE the_table.ref = foreign_table.id
GROUP BY foreign_table.id /* but I have to list foreign_val* too, even though they can't possibly change the grouping).

I suppose one could GROUP BY foreign_table.*, to hold the clutter down; DB optimizers are probably smart enough to realize they can group by the key in that case.
7/20/2007 4:40 PM | Kevin Puetz
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

I'm not sure you've quite hit the point here. The reason behind the necessity of the GROUP BY clause, even when its contents could be deduced from the SELECT statement, is the GROUP BY clause is operated on *before* the SELECT clause. SQL syntax obscures this somewhat (QUEL and Tutorial-D make it a little more explicit), but if we visualise a query as something akin to a pipeline, its order is something like this:

. the required tables are joined
. the composite dataset is filtered through the WHERE clause
. the remaining rows are chopped into groups by the GROUP BY clause, and aggregated
. they are then filtered again, through the HAVING clause
. finally operated on, by SELECT / ORDER BY, UPDATE or DELETE.

So whilst the minimum possible GROUP BY clause can be deduced from the SELECT clause, it cannot be deduced from the UPDATE or DELETE clauses; and even for a SELECT statement, the GROUP BY clause can be more extensive. For example, this query:

select count(*) as residents, avg(final_age) as lifetime
from people
where date_of_death is not null
group by city

is perfectly valid, and pretty useful for determining whether there's a correlation between population size and longevity; the actual city names aren't too relevant.

The point is that it's not that the GROUP BY has to name all the columns in the SELECT, but quite the opposite - that the SELECT cannot include any columns not already in the GROUP BY. Because by the point at which the SELECT is done, the GROUP BY will have already aggregated those columns, and the SELECT can only choose from the ones that remain.
7/20/2007 5:22 PM | gwenhwyfaer
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

I'm with Kevin. If I'm joining to a table based on its primary key, then I shouldn't have to include all its columns in the GROUP BY clause. If I have defined the joining column as a primary key, SQL should be able to recognize that, end of story. I have had to include a dozen meaningless columns at the end of a GROUP BY clause because of this worthless check.
7/20/2007 5:50 PM | Impatient
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

does anyone here know what an aggregate is? that might be a good place to start.
7/20/2007 6:12 PM | anon flamer
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Kevin/Impatient -- if you are listing meaningless columns in your GROUP BY, then it means one thing: you have not written it correctly. It is very easy to write good sql to avoid this. See: http://weblogs.sqlteam.com/jeffs/archive/2005/12/14/8546.aspx

There is never a need to list unncessary columns in a GROUP BY.

- Jeff
7/20/2007 6:12 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

gwenhwyfaer --

you wrote:

>>The point is that it's not that the GROUP BY has to name all the columns in the SELECT, but quite the opposite - that the SELECT cannot include any columns not already in the GROUP BY.

I don't recall ever writing that the "group by has to name all the columns in the SELECT" ! That is not my point at all and I have no idea why you think I am saying or implying that.

I appreciate your feedback, I certainly don't disagree with any of it, but it unfortunately
has nothing to do with what I have written.
7/20/2007 6:22 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Your point-to solution works and to a certain extent makes good logical sense. But I can't agree that that is *cleaner* than what I proposed above, which is that SQL simply recognizes the join to a key and simply allows those columns. That would make for a very concise SQL statement that *anybody* could understand at a glance.
7/20/2007 6:29 PM | Impatient
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Impatient -- the problem is, SQL would need to anayze your tables, primary keys and relations and then determine that some selects must explicitly state all grouping columns from certain tables yet others can ommit them. It would be very inconsistent, and it could lead to mistakes and further confusion, especially when examining a SQL statement without the benefit of a schema in front of you.

SQL is a declarative language - you tell it what you want, and the computer takes care of it. The good news is, we don't have to worry about hash tables and page faults and data retrieval; the bad news is, it needs us to specifically tell it what we want it to do logically and completely.

I agree there might be a better or shorter or more efficient way to implement the syntax, such as SELECT GROUPED(ColumnName) or something like that, but I feel that a programming language that forces you to be explicit in what you need and disallows any ambigouity is a good thing.

For example, we could argue in C# that this should be legal:

int a = 0;
b = a; // b is not declared anywhere

since it is clear that b is meant to be an int, and that having to explicitly declare b is annoying and less concise. BUt we all know the drawbacks of that approach and why explicitly declaring variables and data types is a good thing.

Thanks for your feedback and bringing up some good points.
7/20/2007 6:40 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Yes, I could have written that as

SELECT foreign_val1, foreign_val2, avg_attribute
FROM (SELECT ref, AVERAGE(attribute) AS avg_attribute
FROM the_table
GROUP BY ref) AS agg, foreign_table
WHERE agg.ref = foreign_table.id

and this is indeed the solution I wanted the planner to find, where it performs the aggregation earlier in the query since it was on the table's key.

But I find actually writing it myself in this manner far messier; I've had to name the columns involved multiple times, and I have also had to plan the join order myself. If I want to swap around the grouping, I have to rewrite far more fo the query instead of just the GROUP BY. It's a particular nuisance when I want to aggregate across a view that contains joins like those in my example; there I don't have the easy option of reshaping the underlying joins, nor can I say table.*, so I end up with the ugly group by clause full of a long list of redundant columns that we both dislike...

One thing I do use sometimes is the fact that postgresql can define custom aggregation functions; I've defined one called uniq() that just blows up if it gets non-matching values, or returns it if they all match. Quite handy for ad-hoc avoidance of this problem in badly normalized data... but definitely not a clean solution.
7/21/2007 11:21 AM | Kevin Puetz
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Kevin -- I really don't understand what your point is. Yes, sometimes you need to start over and rewrite things; you cannot always just add GROUP BY clauses to a SELECT and expect it to come out right. That's the way things work in programming; sometimes, a block of code is written to work one way specifically and to return results for one situation, and if you decide to change that specification, often you need to rewrite it.

If I write a view that returns table A joined to table B, and then you decide you need to GROUP table A first and THEN join it to table B, you should not use my view; it does you no good. You should write what you need from scratch, or create an alternate view that does the trick.

Here's an analogy for that situation: what if you already have a function that accepts a string and returns an array of each character in that string. Now suppose you need to write another function that takes a string and reverses it. You COULD have your new function call the first function that returns the array, process it backwards, concatenate it together, and then return the results. Or, you could just write a more efficient and shorter function specifically designed to do just what you want. Not a perfect analogy perhaps, but a simple one that hopefully illustrates my point.
7/22/2007 7:59 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

I have one small question :

assume that I have an application that would imitate Excel tables

I have a simple schema to store tables information in one table and and the second table to store columns
for each table in a one to many relationship. there is also a third table to store values for each column
in a one to many relationship.

The question is : How I can display the columns and rows for a specific table in a nice matrix format ???

I have tried many times and the same error come to me again

sorry if I declared a very specific situation ...

thanks,

John,
7/23/2007 11:45 AM | John
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

I used to work on Sybase (MSSQL's Grandpappy) and I don't know if it was a server option that was turned on or if it was just a feature in Sybase but when you submitted:

select, city, count(1)
from A_BIG_TABLE
--without a group by . . .

Sybase just said "OK" and returned every row. This seriously sucked at times . . .

BUT, you could pull off some useful tricks with "having" statements: by selectively grouping my data, my "having" clause could evaluate the results at a different level of granularity than the result set. Too foggy right now to give a good example of this but you get the idea (I hope). I sometimes wish MSSQL would allow this same ability . . .
7/24/2007 4:43 PM | Bob
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

>>BUT, you could pull off some useful tricks with "having" statements: by selectively grouping my data, my "having" clause could evaluate the results at a different level of granularity than the result set. Too foggy right now to give a good example of this but you get the idea (I hope). I sometimes wish MSSQL would allow this same ability . . .

If you are using SQL 2005, look into specifying an OVER() clause for an aggregate function; this may let you do what you are indicating.
7/25/2007 1:30 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Obviously,
SELECT City, COUNT(*) as Residents, Age
FROM People
WHERE State = 'MA'
GROUP BY City

should be exactly the same as
SELECT City, COUNT(*) as Residents, Age
FROM People
WHERE State = 'MA'
GROUP BY City, Age

which could then be further simplified to
SELECT City, COUNT(*) as Residents, Age
FROM People
WHERE State = 'MA'

If I wanted an aggregate function over Age, I would've said so. If I didn't say so, then you should group by that column. No fuss, no muss.
7/26/2007 11:11 AM | Mark Brackett
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Mark -- thanks for you feedback. I think SQL does it this way is to avoid errors or ambiguity and to force the programmer to explicitly declare what he or she wants. In some cases specifying AGE in the select w/o a group by might mean that you want to group on it, but in other cases it might be a mistake and they really want a min or a max or different column or expression or something else altogether. The explicit GROUP BY clause helps with this.

Even worse when you have complex expressions that involve a non-grouped column buried somewhere -- often sql programmers forget or lose track of a grouping and reference something that they logically cannot, and the error is great because it forces them to think about what they need and be explicit.

This is just like declaring variables or keywords like "overrides" or "shadows" in other languages. Technically, you don't *need* it, but requiring a programmer to state it helps to assert exactly what is desired and to help avoid programming and logic errors in code. And that's a good thing.
7/26/2007 11:30 AM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Hi Jeff Smith,

Your article is good but pls appreciate that *gwenhwyfaer* explanation is straight forward & makes perfect sense.

cheers
7/30/2007 8:42 AM | Reader
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Reader -- thanks, it sure does make perfect sense, as I indicated in the comments. It just doesn't apply to what I wrote, that's all.
7/30/2007 9:01 AM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Accepting all the explanations, I still wish I had an "autogroup" keyword that derives the group-by clause from the select list...
8/7/2007 3:48 PM | ammoQ
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

I'm trying to make an sql statement for mssql server.

SELECT name, sum(value), 'employee' as type
from <whatever>
group by name, type

I've also tried:

SELECT name, sum(value), 'employee' as type
from <whatever>
group by 1,3

and

SELECT name, sum(value), 'employee'
from <whatever>
group by name, 'employee'


Non of these work.....
Any idea????

8/8/2007 8:19 AM | gia
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

gia -- if you need help, ask your question in the SQLTeam forums.
8/8/2007 8:29 AM | Jeff
Gravatar

# How to write a group by statement in this case?

I am the new user of SQL. Please help me with the following question. Thank you in advance.

How do I do a group by state in the following case?

select a, case when b=0 then 'y' else 'n',c,count(d)
from my_table
group by a,???,c

what should I put in in the above ???
11/8/2007 11:08 AM | Sam Huang
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Is b a bit, and is it always either 1 or 0? then just group by "B". Otherwise, GROUP BY the entire CASE expression.
11/8/2007 11:47 AM | Jeff
Gravatar

# RE: completely new to SQL

I am not a programer at all, i work in finance mainly financial reporting. A file we use is becoming to large to work with due to the amount of data we are pulling into it. I am trying to sum the data before it is dropped into excel, i know i need to use the Group By function but am getting syntax errors.


Currently this is how the program does this:

SELECT ACTIVITY.ACCOUNTNUMBER, ACTIVITY.BALANCE, ACTIVITY.MONTH FROM `database’ ORDER BY ACTIVITY.ACCOUNTNUMBER

I am trying to sum the activity balace before dropping it into excel. I have tried to follow the models on this website and others but am still getting errors. If anyone could help me out with this that would be great.

Thanks
11/27/2007 12:07 PM | jim
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Fine, but there should be a simple count(*) without having to use group by
4/29/2008 4:41 PM | Brandon Duncan
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Hai Jeff,

Great to see your blogs solving much of the confusion to beginners and intermediate programmers. I just love the work you do in these web pages . The explanation you provided is okay here with regards to Age column not in the GROUP By Clause.
But more technically speaking I would say in group by clause we can only put those fields which are not in used in the aggregate function in the select query. Thats a better explanation . I think you know it . Well, I am actually talking about the other side of the coin.The Other side of the coin is that what to do with those fields in the SELECT clause which are used by aggregate function.Well, these fields can be used in the HAVING Clause.So,using this technical understanding we can frame any kind of report.

Thanks .. and great work jeff!!!
5/8/2008 2:02 AM | Joseph Thapa
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Hi Jeff,
Thanks first for i got solution for my problem but i still confused that how its working.
When we group by a column &quot;abc&quot;, we can use all the functions for that column. But how come, we are able to take up an aggregate function for some other column &quot;def&quot; when we are grouping by some other columm &quot;abc&quot;.

Please help me on this

Thanks in advance Jeff,
Sam
5/26/2008 9:26 AM | Sam
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Thanks Jeff for your help and thanks to gwenhwyfaer too. This made a lot of sense:

&amp;quot;The point is that it's not that the GROUP BY has to name all the columns in the SELECT, but quite the opposite - that the SELECT cannot include any columns not already in the GROUP BY. Because by the point at which the SELECT is done, the GROUP BY will have already aggregated those columns, and the SELECT can only choose from the ones that remain.&amp;quot;

I'm starting to understand this, but my select needs more columns then I have in the group by. A join maybe?
6/10/2008 2:19 PM | Gary
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Gary --

You GROUP BY some columns, and for those you aren't grouping, you aggregate -- i.e., you must calc the SUM or MIN or MAX. It's as simple as that.
6/10/2008 2:24 PM | Jeff
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

That nice...but how do we use that statement is an inner join....

Here is my problem:



SELECT Distinct e.EmployeeNo, COUNT(j.JobOrderNo) AS 'Number of Workloads', e.empfirstname, e.empposition
FROM tblEmployees e INNER JOIN
tblJobOrder j ON e.EmployeeNo = j.EmployeeNo
WHERE (e.EmpPosition = 'Senior Technician') AND (e.EmpStatus = 'Active') OR
(e.EmpPosition = 'Junior Technician') AND (e.EmpStatus = 'Active')
GROUP BY e.EmployeeNo, e.emplastname



im having the same problem but its just using an inner join stament...can you guys try to correct this
thanks
6/29/2008 1:02 AM | Ravi
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Jeff, thanks for your explanation. My problem is, I don't want to Group By and I don't want to aggregate a column, I simply want to display the value of that column without it grouping and throwing everything else off. I realize I can use a derived table with grouping/aggregates and then point back to it and add in this additional field, but that's seems like a bunch of extra work. All i want is to display the field value in the table just like I would with any other info in a normal SELECT query, yet I see no way to do that.
6/1/2009 4:12 PM | Mike
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

Finally! Thankyou, have never really understood this.
1/11/2010 4:54 AM | Jon
Gravatar

# re: But *WHY* Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

great explanation !
5/17/2010 4:19 AM | Karachiite
Comments have been closed on this topic.

Powered by:
Powered By Subtext Powered By ASP.NET