efficient select random from large table 1

thedaver · Jan 17, 2007

I have a table with over 4 million rows.
I want to randomly select one row.

The following works:

Code:

select * from items order by rand() limit 1;

but unfortunately it appears to "select" the entire table first, then returns the random row. Consequently this query takes an unacceptably long time.

I need to return a random row more quickly.

I'd like some advice on alternative methods to randomly select from this table.

D.E.R. Management - IT Project Management Consulting

http://www.dermanagement.com/

r937 · Jan 17, 2007

find some way to generate a primary key value randomly

then use

Code:

select * from items
 where item_primary_key =
   ( select min(item_primary_key)
       from items
      where item_primary_key >= [i]value[/i] )

r937.com | rudy.ca

thedaver · Jan 17, 2007

That ROCKS! Thanks!

D.E.R. Management - IT Project Management Consulting

http://www.dermanagement.com/

hvass · Jan 17, 2007

Alternatively

Code:

select * from items 
where rand() <.001
limit 1;

where rand()< is relatively fast
limit 1 gives you just as random a selection

thedaver · Jan 17, 2007

hvass, can you expand upon that approach? I've already plugged in the first option, but i'd like to understand what your code is doing.
Thanks!

D.E.R. Management - IT Project Management Consulting

http://www.dermanagement.com/

hvass · Jan 17, 2007

rand() function returns a pseudo random set of numbers between 0 and 1 so where rand() <.001 will take a pseudo random .1% sample. .0001 is arbitrary this can be any level that you are sure will return some records - (if your table has 4m records and the random sample was perfect a value of 0.00000025 would on average return 1 record but it might not and your file may not be as large as 4m etc.. Something like .0001 will return not too large a set of records in this case .0001*4000000=4000 records from which you use the limit 1 statement to return the first in the series of 4000 records.

So it is doing the same as order by rand() except it does not need to sort the file it uses the where statement which should be faster.

thedaver · Jan 17, 2007

Hmmm;
OK so I tried all three methods:

My original code (twice):
1 row in set (31.02 sec)
1 row in set (41.43 sec)

Code from r937 (twice):
1 row in set (0.02 sec)
1 row in set (0.02 sec)

Code from hvass (twice) [using .001]:
1 row in set (0.02 sec)
1 row in set (0.00 sec)

Problem from hvass was that the record set seemed to be limited to the first 4000 rows in the table whereas the other two solutions drew from a truly random selection of the entire table;

Hvass, did I miss something? Why is the return set so limited?

D.E.R. Management - IT Project Management Consulting

http://www.dermanagement.com/

hvass · Jan 18, 2007

where rand()<.001 gives you a random set of 4000.
limit 1 gives you first record from random set.

So the record returned is a random sample of the 4m original records.

The only reason you use a number like .001 rather than 0.00000025 is to avoid the possibility that it returns zero records.

thedaver · Jan 18, 2007

Hvass;
But it seemed to me, by running about 15 consecutive searches that the records were coming from a range in the first 5-10,000 records almost exclusively - whereas the other approaches returned records randomly throughout the entire 4mm records.

So, I understand that the sample set is a window of ~4K records - but I infer from my sampling that it was taking the sample from the same narrow, contiguous range of records.

Appreciate your continued thoughts.

D.E.R. Management - IT Project Management Consulting

http://www.dermanagement.com/

hvass · Jan 18, 2007

thedaver

wow - appologies you are right it does not work how I thought it did.

>>>where rand()<.001 gives you a random set of 4000.
is correct it does but it returns them in order so
>>>limit 1 gives you first record from random set.
but this limit 1 gives you first recrod from an 'ordered list' of the random set of 4000.

The problem is that if you know you have 4m records you should be able to say

select * from items
where rand() <(1/4000000);

and this should return you exactly 1 record - but only on average so you have to return a few records and sort the resulting records

select * from items
where rand() < .0001 #returns 4k random sample records from 4m
order by rand() #orders the 4k into random order
limit 1;

again this is not very general as it assumes you know the size of your file so writing this generally..

select @sample:=1/count(*)*10 from items; #makes sure on average you will get 10 rows returned
select
*
from items
where rand()<@sample
order by rand()
limit 1;

but then this will take longer to process...

thedaver · Jan 19, 2007

no problem, I appreciate the expanded logic.

New results from hvass' logic:
1 row in set (10.59 sec)
1 row in set (3.90 sec)
1 row in set (3.69 sec)

But NOW the selection IS yielding random results from the entire data set.

D.E.R. Management - IT Project Management Consulting

http://www.dermanagement.com/

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

efficient select random from large table 1

thedaver

IS-IT--Management

r937

Technical User

thedaver

IS-IT--Management

hvass

Programmer

thedaver

IS-IT--Management

hvass

Programmer

thedaver

IS-IT--Management

hvass

Programmer

thedaver

IS-IT--Management

hvass

Programmer

thedaver

IS-IT--Management

Similar threads

Part and Inventory Search

Sponsor