Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

pagination of 1.2 million records takes too long 2

Status
Not open for further replies.

POSAPAY

IS-IT--Management
Jul 27, 2001
192
HU
Hi all,

I've got a total of 1.5 million records, with daily 100K new and 100K deleted that have expired.

I'm trying to display about 1.2 million of these records with pagination of a 25 at a time.
If I do that, it times out. I've added TOP 100, to at least have some result, but still takes too long to load 4 pages of 25 records each to display.

Here is how I open the record set:
Code:
	SQL = "SELECT TOP 100 QuarID, EmailFrom, EmailTo, Subject, MsgDate, MsgID "
	SQL = SQL & "FROM tblQuarantine "
	SQL = SQL & "WHERE " & EmailToAddresses & " "
	SQL = SQL & "ORDER BY " & order_by
	  set rs = Server.CreateObject("ADODB.Recordset")
	  rs.CursorLocation = 3
      rs.Open SQL , conn,adOpenKeyset,adLockOptimistic


   if Request("nItemsPerPage") = "" then
     If SummaryShowMaxPerPage <> "" then
       nItemsPerPage = SummaryShowMaxPerPage
     else
       nItemsPerPage = 25
     end if
   else
     nItemsPerPage = Request("nItemsPerPage")
   end if
   ' Set the page size of the recordset
   rs.PageSize = nItemsPerPage
   ' Get the number of pages
   nPageCount = rs.PageCount
   nPage = CLng(Request("nPage"))
   If nPage < 1 Or nPage > nPageCount Then
  	nPage = 1
   End If

   If NOT rs.EOF and not rs.BOF then
      rs.MoveFirst
      rs.AbsolutePage = nPage

		While Not (rs.EOF or rs.AbsolutePage <> nPage)
'.... displayed records go here
			rs.MoveNext
		Wend

Anyone any ideas to what could help speed this up a lot?
It is for display only, no editing needed at all.
 
have you tried putting the query into a stored procedure?

1.2 Million records is a lot and you should expect performance issues..calling a stored procedure may speed it up.

The other thing you can do is adding an auto increment unique key and indexing it. Your query would then select all where unique_key < 26 (returns the first 25 records) and select all where unique_key > 25 and unique_key < 51 returns the next 25 (and so on)




--------

GOOGLE is a great resource to find answers to questions like "how do i..."


--------
 
stored procedure vs. ad hoc query will probably not make a noticeable difference. My guess is.... you have an index problem. More specifically, this query probably cannot use any existing indexes on the table.

To get better help, I would encourage you to post the query after the variables are replaced in the SQL string. This will make it easier to advise you on missing indexes.

While you are at it, run this and post the results here:

Code:
sp_helpindex 'tblQuarantine'


-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Hi,Sorry I was sidetracked on a project, I still need to work on this issue.

Here is the results of that index request:
Code:
index_name	index_description				index_keys
EmailFrom	nonclustered located on PRIMARY			EmailFrom
EmailTo		nonclustered located on PRIMARY			EmailTo
MsgDate		nonclustered located on PRIMARY			MsgDate
MsgID		nonclustered located on PRIMARY			MsgID
QuarID		clustered, unique, primary key located on PRIMARY	QuarID
RejectID	nonclustered located on PRIMARY			RejectID
ServerID	nonclustered located on PRIMARY			ServerID
Subject		nonclustered located on PRIMARY			Subject
 
Don't quite understand. Do you mean someone is going to sit there going next, next, next 25 records at a time to look at 1.2 million records or is it an automated display that shows them 25 at a time for a few seconds and then goes on to the next one.

Basically I'm confused by what display means in this context.
 
Actually the database has grown to 3.3 million records, even with daily old data deletion by a stored procedure.

A subset of it would be called up to a user, that is a huge chunk. They'd probably never flip through all of it 25 at a time, but I'd like to allow them to do so if they wanted to.

More typical usage would be someone looking at the top most recent couple of hundred maybe, and also using a search through the millions to find some record.

Both of these tend to be so slow that they time-out.
 
Your SQL is:
Code:
  SQL = "SELECT TOP 100 QuarID, EmailFrom, EmailTo, Subject, MsgDate, MsgID "
    SQL = SQL & "FROM tblQuarantine "
    SQL = SQL & "WHERE " & EmailToAddresses & " "
    SQL = SQL & "ORDER BY " & order_by

There is an index on the EmailFrom and EmailTo address. If your variable "EmailToAddresses" has an "=" comparison then you are oke. But if you are doing eg a LIKE ... then SQL is probably doing a sequential search.

What is the performance when you enter the same SQL statement directly in an MS SQL query?
 
The default display for a single user's quarantine works okay, but takes a while to load. it paginates about 100-400 quarantined records.


But when I try to search the entire database for a word in the EmailFrom or Subject fields, it just times out.

Here is my last attempt's select statement:
SELECT TOP 5 QuarID, EmailFrom, EmailTo, Subject, MsgDate, MsgID, RejectDesc, RejectDetails FROM tblQuarantine, tblRejectCodes WHERE ((Deliver = 0) AND (Expire = 0) AND (tblQuarantine.RejectID = tblRejectCodes.RejectID)) AND (EmailFrom = '%Peter%' OR Subject = '%Peter%') ORDER BY MsgDate DESC, EmailFrom, Subject

now when I ran it in SQL analyzer in the SQL Server Entrprise Manager, the above statement returned nothing.

Changing the = sign to LIKE in front of the two keywords for Email From and Subject, and rerunning it on the SQL server, it have me a
[Microsoft][ODBC SQL Server Driver]Timeout expired.
 
I'm providing the users a mean to search within their quarantine items, or within their domain name, or thirdly to search across the entire database.
Not all users enjoy this level of access, just domain managers and some system managers.

But I'd like a domain manager to be able to search for an incoming e-mail that ended up in the quarantine to find it in their domain quarantine.

So the Select statement for them ends up being:
Code:
SELECT TOP 5 QuarID, EmailFrom, EmailTo, Subject, MsgDate, MsgID, RejectDesc, RejectDetails FROM tblQuarantine, tblRejectCodes WHERE ((EmailTo LIKE '%domain.com') AND (Deliver = 0) AND (Expire = 0) AND (tblQuarantine.RejectID = tblRejectCodes.RejectID)) AND (EmailFrom LIKE '%peter%' OR Subject LIKE '%peter%') ORDER BY MsgDate DESC, EmailFrom, Subject

domain.com - is their domain. I just replaced it here to protect them.

and %peter% is how they search for a keyword in the sender's e-mail address and in the subject line.

this particular search took:
The task completed in 28.15625 s (28156.25 milliseconds).

Luckily this one completed without a timeout. But previously this usually timed out.
I'd like to be able to not put "TOP 5" in it, and rather paginate the results if it is that many.
 
You showed a query with 2 tables. Can you tell me which table the following columns are in: Deliver, Expire, MsgDate, EmailFrom and Subject.

Also, how many rows are in the tblQuarantine table, and how many rows are in the tblRejectCodes table? Finally, how many rows do you estimate will be returned by the query you just showed?

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Hi,

tblQuarantine is the table that has the following columns: Deliver, Expire, MsgDate, EmailFrom and Subject.

The table tblQuarantine currently has around 3.3 million records

The table tblRejectCodes is just a quick lookup from the code, and it only has 27 records in it in two fileds: RejectID, RejectDesc.

The quarantine table gets new records through out the day by the thousands from a windows application.
So the search will produce a flexing number of records based on how many records are in the database, and the search terms.

A more specific search term, like "userxyz@subdomain.domain.com" would probably have 0 - 5 records returned.
But for a search term like: "peter" I expect domain wide to have hudreds if not thousands returned. And in the global setting, it would return around 10 times more, as this serves around 1- active domains and about a 100 non-active domains currently.

So with 25/page and say in a given search 45000 returned records, I expect my per page counter at the bottom of my GUI to show 1800 pages of results, of which 25 is displayed at a time.
Given that this is running on a DELL 1650 with two 1.26GHz processors and 4GB of RAM, usually utilizing it under 50% on average through out a day, such results to be displayed should not be pushing the limits I would hope.

-Peter
 
I asked the questions because my advice to you depended on it. I assume the tblRejectCodes table has the RejectId as primary key for the table and is therefore indexed.

I would recommend that you use ANSI style joins for your queries. Something like this:

Code:
SELECT TOP 5 QuarID, EmailFrom, EmailTo, 
       Subject, MsgDate, MsgID, RejectDesc, 
       RejectDetails 
FROM   tblQuarantine, 
       [!]Inner Join [/!]tblRejectCodes 
          [!]On tblQuarantine.RejectID = tblRejectCodes.RejectID[/!]
WHERE  EmailTo LIKE '%domain.com'
       AND Deliver = 0 
       AND Expire = 0 
       AND (EmailFrom LIKE '%peter%' OR Subject LIKE '%peter%') 
ORDER BY MsgDate DESC, EmailFrom, Subject

That being said, there may be some index changes that you can make to your table to speed things up. Looking at your current indexes, I'm kinda surprised that all of them are single column indexes. Having single column indexes isn't necessarily bad, but it's not uncommon for multi-column indexes to work better than single column indexes. In this case, a multicolumn index on MsgDate DESC and RejectId may give better performance than a single index on MsgDate. It's hard to say without testing it though.

Ultimately, I suspect your results will benefit the most if you enable full text indexing on your database. I have no experience with full text indexing but I would encourage you to do some research on it, and play with it a little (on a backup copy of your database).

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks, I'll test out the JOIN change.

As for full text, with 3.3 million records that change daily by about half a million records, I'm a bit scared of trying to index that much on a full index basis.

What do you think?

Also doing it on a backup DB is a bit hard to move 3.3 million records around.
-Peter
 
the INNER JOIN syntax helped a bit, down by 2 seconds:
The task completed in 26.1875 s (26187.5 milliseconds).

Second time attempt timed out.

Third time:
The task completed in 13.57813 s (13578.13 milliseconds).

Fourth time:
The task completed in 12.625 s (12625 milliseconds).



I'm a bit scared of the full text index. Could someone put me at ease? What does it do when I set that up to a 3.3 million record table?
 
You're making regular backup anyway, right? It should be 'relatively' simple to install an express instance on a computer, and then restore your full DB to the new instance. Enable full text indexing and then spend an hour playing around with it.

Of course, the computer you test this one will not likely be "as good as" the actual server, but that's not a bad thing either. If you can make the query acceptably fast on a slower computer, it'll be even better on the server computer. You don't want to use a super old computer (think paper weight here), but something a little older is good to test on.

Just curious here. If you have .5 million records change every day, with only 3.3 million total, does this mean you have one week's worth of data at any given point? Also, can you explain how you archive old data (done daily, weekly, hourly, etc...) and how new data is added... is this done continuously throughout the day, or are the emails loaded during the night?

You see, the real problem here is with the way you are doing your searches. Let's take the subject line of the email as an example. Suppose the subject line for the email is "Peter is on vacation next week". When you index this column, SQL creates an index that has a pointer to the row in the table where the data is found and another column with the contents of the data (the full text). But, because it is an index, the column is sorted. Now, if you do a search starting at the beginning of the data, SQL can quickly jump to the data you are looking for. Imagine if you were to look in a dictionary for words that start with ST. You would probably be able to quickly skip to part of the book where the ST words begin, and then skip to the last entry where ST words begin. SQL does something similar, and it's called an index seek. Now suppose you wanted to find all the words in a dictionary where the letters ST appears ANYWHERE within the word. It would take you much longer because you would have to start with the first word in the dictionary and examine every word, up through the last one. This would take a lot longer to do. In SQL, this would be considered and index scan (or table scan). Scans are MANY times slower than seeks.

Let's take a closer look at your query.

[tt][blue]
WHERE [!]EmailTo LIKE '%domain.com'[/!]
AND Deliver = 0
AND Expire = 0
AND (EmailFrom LIKE '%peter%' OR Subject LIKE '%peter%')
[/blue][/tt]

Your EmailTo search is a problem because you are looking at the END of the data. Again, imagine you wanted to find all the words in the dictionary that end in ST. You would need to scan all the words to pick out the ones you want, and this would be considerably slower.

[tt][blue]
WHERE EmailTo LIKE '%domain.com'
AND Deliver = 0
AND Expire = 0
AND [!](EmailFrom LIKE '%peter%' OR Subject LIKE '%peter%')[/!]
[/blue][/tt]

You also have 2 columns where you are searching ANYWHERE within the string. Again, you are causing an index/table scan with that search.

Full Text indexing was designed to make these types of searches faster, which is exactly why I recommend that you research this a bit.

Without full text indexing, there are other ways that you MAY be able to speed up this query, but it will take considerable work to do so.

For example, when you load the data in to the table, you could store a "ReverseEmailTo" column. At the time you load the data, you use the Reverse function to store the reverse of the EmailTo column so "email@domain.com" also gets stored (in another column) as "moc.niamod@liame". You could then index this column and change your query to:

[tt][blue]
WHERE [!]ReverseEmailTo LIKE Reverse('%domain.com')[/!]
AND Deliver = 0
AND Expire = 0
AND (EmailFrom LIKE '%peter%' OR Subject LIKE '%peter%')
[/blue][/tt]

Now, instead of an index scan, you'll get an index seek on that column.

Another thing you could do is to create another table that stores individual words for your subject... something like this...

[tt]
tblEmailSubject
QuarID SubjectWord
------- -----------
9999 Peter
9999 is
9999 on
9999 vacation
9999 next
9999 week
[/tt]

Of course, you would want to have a multi-column index on this table where the Subject word appears first and the QuarID appears second.

[tt]Create Index tblEmailSubject_SubjectWord_QuarID On tblEmailSubject(SubjectWord, QuarID)[/tt]

Then you could search through this table and you would get really fast index seeks. Ex:

[tt][blue]
Select Distinct QuarID
From EmailSubject
Where SubjectWord = 'Peter'
[/blue][/tt]

Now, I recognize that each email probably has a subject with 10 words (or more) so this table would have 30 million rows instead of just 3 million. And I know this sounds like a lot, but... this is just for one column. Even though there will be more rows, each row will be smaller and the total size of the table will be smaller. Plus, you'll get index seeks on this table so the performance will be MANY times better than your current query.

Similarly... create another table for the EmailTo column. To store peter@domain.com, the data would look like this:

[tt]
tblEmailTo
QuarID EmailPart
------- ---------
9999 Peter
9999 domain
9999 com
[/tt]

Again, since there will be several parts to your email, there will be more rows in this table than your original, but the table size will be smaller and the performance will be much better. With these two additional tables and your reverse email from column, your query would probably look something like this:

Code:
SELECT TOP 5 QuarID, EmailFrom, EmailTo,
       Subject, MsgDate, MsgID, RejectDesc,
       RejectDetails
FROM   tblQuarantine
       Inner Join tblRejectCodes
          On tblQuarantine.RejectID = tblRejectCodes.RejectID
       Inner Join (
          Select QuarId
          From   tblEmailSubject
          Where  SubjectWord = 'peter'

          Union

          Select QuarId
          From   tblEmailFrom
          Where  EmailPart = 'peter'
          ) as SubjectEmailSearch
       On tblQuarantine.QuarId = SubjectEmailSearch.QuarId
WHERE  ReverseEmailTo LIKE Reverse('%domain.com')
       AND Deliver = 0
       AND Expire = 0
ORDER BY MsgDate DESC, EmailFrom, Subject

I know this sounds like a lot of work, but if you are unable or unwilling to use full text indexing, this is probably the best way to get acceptable performance. All things considered, I would expect your performance to drop from 28 seconds to less than 1 second.

Sad but true... often times code needs to get more complicated in order to get better performance.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Kinda quickly answer your initial questions:

This is a spam quarantine. No need for backup, so no backup is performed on it.
This server serves the MX records of about a 100 domains, and forwards good e-mails to the e-mail server, and adds a record in the SQL db for filtered ones.

A Job triggers a StoredProcedure to delete anything older than 7 days currently. We may allow up to 30 days of quarantine storage later.

So no archiving, no backup, and at this point nothing older than 7 days.

Nightly a report is e-mailed to each user with a list of their quarantined items. Due to a limitation we bumped into, we limited it to their most recent 10 quarantined items currently. To be fixed later.

I don't really have alternate hardware currently to test on, so I'll be doing my testing on the production box. (usually I wouldn't do it this way, but I'm short on resources.)
ok.. cont. to read your full of infos/ideas post. Just wanted to answer the initial questions.
-peter
 
Ok, got through your entire post, and I understand your points.

Question: How does the full index work when daily 500,000 records are added and older 500,000 records deleted?

Does the Full Index update after each INSERT and each DELETE?
How resource intensive is that?
-Peter
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top