
Is there an alternative to expensive correlated subqueries?


bawtry (Programmer)
Jul 17, 2001
Is there an alternative to the correlated subquery? Running the subquery for each row returned by the outer query must be expensive.
Any help greatly appreciated.
Graham Robbins

The following is an example of a correlated subquery:

/*
Ensure that only the most recently added patient contact of each type is current per patient
*/

UPDATE tblAddressPatient
SET tblAddressPatient.Status = 'Obsolete'
WHERE tblAddressPatient.Id <> (
SELECT TOP 1 tblAP.Id
FROM tblAddressPatient AS tblAP
WHERE tblAP.CIN = tblAddressPatient.CIN
AND tblAP.AddressTypeID = tblAddressPatient.AddressTypeID
ORDER BY tblAP.Id DESC
)
 
You can try this:

UPDATE tblAddressPatient
SET tblAddressPatient.Status = 'Obsolete'
FROM (SELECT CIN, AddressTypeID, MAX(Id) AS MaxId
      FROM tblAddressPatient
      GROUP BY CIN, AddressTypeID) AS tblAP
WHERE tblAP.CIN = tblAddressPatient.CIN
AND tblAP.AddressTypeID = tblAddressPatient.AddressTypeID
AND tblAddressPatient.Id <> tblAP.MaxId


You can further filter the data source in the FROM clause by adding a join with the tblAddressPatient table, so that only the relevant records are touched; see the sketch below.
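A sketch of one reading of that suggestion (the Status <> 'Obsolete' filter is an assumption here about which records count as relevant):

-- Join form of the same update; rows already marked
-- obsolete never reach the update at all.
UPDATE tblAddressPatient
SET Status = 'Obsolete'
FROM tblAddressPatient
INNER JOIN (SELECT CIN, AddressTypeID, MAX(Id) AS MaxId
            FROM tblAddressPatient
            GROUP BY CIN, AddressTypeID) AS tblAP
    ON tblAP.CIN = tblAddressPatient.CIN
    AND tblAP.AddressTypeID = tblAddressPatient.AddressTypeID
WHERE tblAddressPatient.Id <> tblAP.MaxId
AND tblAddressPatient.Status <> 'Obsolete'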

RT
 
Generally, correlated sub-queries are inexpensive if the table is properly indexed. SQL Server doesn't have to scan the entire table, just the correlated records matching the criteria used in the sub-query; see the index sketch below.
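For example, a supporting index like the following (the index name is hypothetical; the column order matches the correlation columns plus Id) lets the subquery in the query below run as an index seek rather than a table scan:

-- Hypothetical supporting index: CIN and AddressTypeID are
-- the correlation columns, Id is the column being compared.
CREATE INDEX IX_tblAddressPatient_CIN_AddressTypeID_Id
ON tblAddressPatient (CIN, AddressTypeID, Id)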

I think the following will be more efficient than the query you posted.

UPDATE tblAddressPatient
SET Status = 'Obsolete'
WHERE Status <> 'Obsolete'
AND EXISTS
(SELECT *
FROM tblAddressPatient AS tblAP
WHERE tblAP.CIN = tblAddressPatient.CIN
AND tblAP.AddressTypeID = tblAddressPatient.AddressTypeID
AND tblAP.Id > tblAddressPatient.Id)

The EXISTS subquery stops processing as soon as the first row matching the condition is found. The added condition - Status <> 'Obsolete' - prevents updating unnecessarily.

Terry L. Broadbent - DBA
Computing Links:
faq183-874 contains "Suggestions for Getting Quick and Appropriate Answers" to your questions.
 
Hi Terry,

Your comment triggered some thought, so I set up a sample query comparison on my database as follows.

1. Query along the lines of what I had written:

update cntr_audit_stts
set stts = 'Obsolete'
from (select cntr_cd , max(prcss_tm) as imprt_dt from rplctn_imprt_hstry
where dscrptn like 'End%' group by cntr_cd) as b
where cntr_audit_stts.cntr_cd = b.cntr_cd
and b.imprt_dt >= isnull(lst_ordr_rcpt_dt,'1960/01/01')

2. Query using the EXISTS check you suggested:

update cntr_audit_stts
set stts = 'Obsolete'
where stts <> 'Obsolete'
and exists (select * from rplctn_imprt_hstry b
where b.cntr_cd = cntr_audit_stts.cntr_cd
and dscrptn like 'End%'
and b.prcss_tm >= isnull(cntr_audit_stts.lst_ordr_rcpt_dt,'1960/01/01'))


- cntr_audit_stts has a clustered primary key on cntr_cd (1113 records)
- rplctn_imprt_hstry has a clustered primary key on (cntr_cd, prcss_tm) (40776 records - avg. 36 records per center, median 18)
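For context, the structures would look roughly like this (hypothetical DDL; only the keys come from the description above, the column types are assumptions):

-- Audit status per center; cntr_cd is the clustered primary key
CREATE TABLE cntr_audit_stts (
    cntr_cd int NOT NULL PRIMARY KEY CLUSTERED,
    stts varchar(20) NULL,          -- type is an assumption
    lst_ordr_rcpt_dt datetime NULL  -- type is an assumption
)

-- Import history; clustered primary key on (cntr_cd, prcss_tm)
CREATE TABLE rplctn_imprt_hstry (
    cntr_cd int NOT NULL,
    prcss_tm datetime NOT NULL,
    dscrptn varchar(100) NULL,      -- type is an assumption
    CONSTRAINT PK_rplctn_imprt_hstry
        PRIMARY KEY CLUSTERED (cntr_cd, prcss_tm)
)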

So you can see there are no indexing issues, and the setup is quite similar to what was posted in the original problem. I discovered the following:

1. If the records likely to be affected in the table being updated are less than 50% of the total, the second query performs significantly better.
2. If the records likely to be affected are more than 80%, the first query performs better (it finishes in half the time of the second when all records are affected).
3. In the remaining cases, the timings are comparable.
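For anyone reproducing the comparison, SQL Server's statistics output reports elapsed time and logical reads per statement; a minimal sketch:

SET STATISTICS TIME ON
SET STATISTICS IO ON

-- run query 1, then query 2, and compare the elapsed times
-- and logical reads reported for each in the Messages output

SET STATISTICS TIME OFF
SET STATISTICS IO OFF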

I wondered about the case 2 results and thought the reason may be as follows:
- In option 1, I will be doing 40776 ops (the number of records to be scanned).
- In option 2, each lookup on the center index costs approx. 11 binary seeks (2^10 = 1024 and 2^11 = 2048). Assuming the EXISTS query stops halfway through a center's rows on average, that is approx. 9 (median/2) such lookups per center, so each correlated subquery costs approx. 11 x 9 = 99 ops. Across all 1113 centers that comes to roughly 110K ops, more than twice as many as option 1.
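Spelled out as plain arithmetic (just a sanity check of the estimate, runnable as a plain SELECT):

-- option 2 estimate: seeks per lookup x rows checked per center x centers
SELECT 11 * 9        AS OpsPerCenter,  -- approx. 99 ops per center
       1113 * 11 * 9 AS TotalOps       -- 110187, i.e. roughly 110K ops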

So what I could figure out is that performance depends on the actual index statistics of the table and on how frequently the query runs. For example, if the patient records are to be updated every hour, then the correlated query will definitely be better; but if it is run once a month or quarter, one can expect most of the records to be affected, and the pre-aggregated scan-and-join will be better.

I would appreciate it if you could tell me whether I am right in my understanding and assumptions, or whether I am missing some important aspect and a correlated query will always be faster than a full-table-scan query.

Thanks

RT
 