Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Get the Top 100 of each "Category" - a Distribution Query perhaps? 4

Status
Not open for further replies.

kjv1611

New member
Jul 9, 2003
10,758
US
I'm working on a project at the moment, and what I'm trying to do seems like it should be so simple, but I just can't think of how to do it.

I need to pull somewhat of a distribution of records based on date and one other criteria.

For the date differences, I'm just planning on (for now) pulling a few different date ranges, and labeling them according to the date range.

Within that, I want to pull some sort of distribution from across each date range. So, for instance, if I've got 5,000 records that have dates between 1/1/2011 and 1/10/2011, I want to pull 1,000 of those, but I want to try to get as even a distribution as possible amongst the other criteria. The other criteria is a text field (actually, I can query on the ID of said text field, and pull that in later).

So, here is what I'd like the data to look like once the query is run:

[TT]RecordID Fruit Date
11111 Apple 1/1/2011
11112 Pear 1/2/2011
11113 Orange 1/1/2011
11114 Grape 1/10/2011

22222 Apple 2/5/2011
22221 Pear 2/12/2011
22223 Orange 2/7/2011
22224 Grape 2/3/2011[/TT]

I hope that's enough fake data to get the idea across. I'm not concerned with ordering the results for the purpose, I just need to get a distribution across the text category within each date range.

Let me know whether or not I'm making any sense.

Thanks.

 
I've got 5,000 records that have dates between 1/1/2011 and 1/10/2011, I want to pull 1,000 of those

Sounds like you want to pull 1,000 random rows, which should give an even distribution based on the other columns, right?

It's not often, but when I need random rows, I usually do something like this:

Code:
Select Top 1000 *
From   YourTable
Where  DateColumn >= '20110101' 
       and DateColumn < '20110111'
[!]Order By NewId()[/!]

NewId() returns a guid which happens to be random.


-George
Microsoft SQL Server MVP
My Blogs
SQLCop
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Yes, I just need a random selection as you're saying. However, I did not think that NewID() pulled an even distribution based on the other columns in the query automatically. Is that the case, or would I need to add some other piece?
 
If you already know that you want 1,000 out of 5,000 records for a range, plus a representative amount of the rest, try
Code:
 Select TOP 20 PERCENT...



Greg
People demand freedom of speech as a compensation for the freedom of thought which they seldom use. Kierkegaard
 
I must be missing something. I am not seeing how TOP 1,000 or TOP 20% would work without some additional criteria. Does the SQL Server engine automatically distribute the records according to the other fields?

I would think the TOP # or TOP % would just do that - TOP whatever, sorted by whatever ORDER BY clause you use.

Maybe I'm just over thinking the thing. I'll do some testing.
 
Is there not some way to say something like "SELECT THE TOP 10 of EACH Category"? Category being the text field?
 
Yes. There is. Take a look at the Row_Number function (if you are using sql2005 or newer).

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Yeah, this one is on 2005, and I imagine like everything else, it'll be converted to '08 in the next year or two.

Thanks. I did see the Row_Number function, but didn't really understand it all too well. I'll try to find some other resources.

Thanks for the mention.
 
There are several ways to solve this problem.

One of them:
Code:
;with Fruits as (select Fruit from [YourTable] where Your criteria
group by Fruit)

select R.*
from Fruits F 
CROSS APPLY (select top (N) from YourTable T where 
your criteria and F.Fruit = T.Fruit order by NewID() -- to get random distribution per group)
Probably better solution is

Code:
;with cte as (select *, row_number() over (partition by Fruit order by NewID()) as Row from YourTable where Your criteria)

select * from cte where Row <=1000 -- to get N random items per group

Also, I think it will be helpful for you to check this blog post
Optimizing TOP N per Group Queries


PluralSight Learning Library
 
[bomb]

Okay, consider my mind blown. I think I sorta see this one, but it's going to take a while to sink in. I REALLY want to understand it, but I think it's going to take a little while.

Thanks for putting the blog post as well. I'll read over that, and see if it helps clear the mud out o' my head.
 
Okay, I think I'm starting to get this (didn't read the BLOG post yet):
Code:
;with Fruits as (select Fruit from [YourTable] where Your criteria
group by Fruit)

select R.*
from Fruits F
CROSS APPLY (select top (N) from YourTable T where
your criteria and F.Fruit = T.Fruit order by NewID() -- to get random distribution per group)

The top portion in the code window (WITH) similar to using a SUBQUERY, but you're using a WITH statement. I've just never used that in SQL, and had no idea that it was even a possibility.


However, did you mean that to be:

SELECT F.*

Rather than

SELECT R.*

Thanks again.



 
markros,

Just gotta take baby steps, I guess.... I had no idea about CROSS APPLY, WITH, nor PARTITION. I had at least seen RANK, and it appears RANK and PARTITION go hand-in-hand, so I probably just forgot about it since I hadn't used it.

I think, as you stated yourself, your 2nd example will work the best.

I'll keep digging a little more, and post back with whatever I end up doing.

Thanks again for all the suggestions, references, etc.
 
ME said:
However, did you mean that to be:
Never mind that question.. I now realize that you clarified in your next post b/c of the missing parenthesis.
 
Just a quick update... will give final update tomorrow in the morning... I think I now have what I need..

Basically, I'll use the 2nd suggestion from markros, and basically run it 4 times to get 4 different time periods in the dates, but it looks like it's going to give me what I was looking for - a better distribution of the textual [blue]criterion[/blue](I think it'd be proper to refer to it as singular, since one field... but right now, it's highly possible that I'm wrong. [blush])....

I'll post the whole thing tomorrow just to show what I mean... it's looking great so far!

Thanks!
 
Well, I'm test-running the full script now, which includes the above.

What I ended up doing was running the same code 4 times with different date creteria for each, giving me the 4 date periods, and I'm inserting all that into one table.

So, one of such looks like:
Code:
;WITH Stuff AS
(
   SELECT   AllMyStuff ,'2 Years Ago' AS [Period]
           ,Row_Number() OVER (PARTITION BY c.bknCaseStatusID ORDER BY NewID()) AS Row
   FROM     MyTables
   WHERE    MyDate < GetDate() - 730
            AND MyDate > GetDate() - 1095
)
INSERT INTO #MyTempTable
SELECT	*
FROM	Stuff
WHERE Row <=1000

It works beautifully. Thanks for all the suggestions, particularly markros on the used suggestion. gmmastros mentioned the Row_Number as well which was used as part of the solution.
 
Just spent an hour doing a Google search on this for myself. I should have looked here first.

djj
The Lord is my shepherd (Psalm 23) - I need someone to lead me!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top