top keyword with group by

fikir · Oct 30, 2007

One thing is troubling me

I have a table

delcare @tbl table (depid int, empid int)

insert into @tbl
select 1, 11
union all select 1, 12
union all select 1, 13
union all select 1, 14
union all select 2, 20
union all select 2, 23
union all select 3, 30

I want to select any top max 2 empid for for each depid
i.e two record for each depid if it has two or more records otherwise one

in this case the recordset looks like

depid empid
1 11
1 12
2 20
2 23
3 30

can we use top keyword with group by

Thanks,

ESquared · Oct 30, 2007

You cannot select "top per group" with any native syntax. There are many threads in this forum dealing with that exact problem, though.

In SQL 2005 it is fairly easy with the new ranking functions such as ROW_NUMBER(), although for extremely large data sets these functions can be poor performers.

In SQL 2000 to calculate which rows belong in your result, you'll need to do some kind of self-join, or correlated subquery, or insert to an identity-column bearing temp table.

Code:

SELECT
   depid,
   empid
FROM
   emp e
WHERE
   empid <= ANY (SELECT TOP 2 x.empid FROM emp x WHERE e.depid = x.depid ORDER BY x.empid)

For large datasets, the temp table with an identity-column will perform MUCH better. For small ones, this should be fine.

[COLOR=black #e0e0e0]For SQL and technical ideas, visit my blog, Squared Thoughts.

_{The best part about anything that has cheese is the cheese.}[/color]

r937 · Oct 30, 2007

yes, you can select "top X per Y" easily

Code:

select t1.deptid
     , t1.empid
  from daTable as t1
inner
  join daTable as t2
    on t1.deptid = t2.deptid
   and t1.empid  <= t2.empid
group
    by t1.deptid
     , t1.empid
having count(*) <= 2
order
    by t1.deptid
     , t1.empid desc

r937.com | rudy.ca

ESquared · Oct 31, 2007

Rudy, that self-join does work, but with recent experimentation I've found that for even medium-sized result sets, performance-wise it's far superior to use a temp table with an identity column for such calculations.

I realize this uses tempdb and has space usage considerations. But at a certain point, with a not very large resultset, the I/O and CPU usage of the self-join becomes so great that this method just blows that out of the water. It would go something like this:

Code:

SELECT
   seq = identity(int, 1, 1),
   depid,
   empid
INTO #t
FROM emp e
ORDER BY
   depid,
   empid

SELECT
   t.depid,
   t.empid
FROM
   #t t
   INNER JOIN (
      SELECT depid, sstart = Min(seq) FROM #t GROUP BY depid
   ) x on t.depid = x.depid
WHERE
   seq - sstart < 2

This is untested.

All these methods including my original one with the TOP 2 correlated subquery should be tested for which actually is better. It will make a difference what other columns are in the emp table. Fewer columns will magnify the difference between solutions that have ~ one read per row and solutions that have ~ one read per page. Many many columns will reduce the difference between the queries.

LNBruno · Oct 31, 2007

Or, for sufficiently smaller result sets, you could use a table variable. ;-)

< M!ke >
[small]"Oops" is never a good thing.[/small]

ESquared · Oct 31, 2007

True, a table variable works fine. In both cases a primary key on the work table might help (but this is not always true).

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

top keyword with group by

fikir

Programmer

ESquared

Programmer

r937

Technical User

ESquared

Programmer

LNBruno

Programmer

ESquared

Programmer

Similar threads

Part and Inventory Search

Sponsor