Hi Terry,
Your comment triggered some thinking, and I set up a sample test on my database as follows.
1. Query along the lines of what I had written:
update cntr_audit_stts
set stts = 'Obsolete'
from (select cntr_cd, max(prcss_tm) as imprt_dt
      from rplctn_imprt_hstry
      where dscrptn like 'End%'
      group by cntr_cd) as b
where cntr_audit_stts.cntr_cd = b.cntr_cd
and b.imprt_dt >= isnull(lst_ordr_rcpt_dt, '1960/01/01')
2. Query with the exists check you had suggested:
update cntr_audit_stts
set stts = 'Obsolete'
where stts <> 'Obsolete'
and exists (select * from rplctn_imprt_hstry b
            where b.cntr_cd = cntr_audit_stts.cntr_cd
            and b.dscrptn like 'End%'
            and b.prcss_tm >= isnull(cntr_audit_stts.lst_ordr_rcpt_dt, '1960/01/01'))
- cntr_audit_stts has a clustered primary key on cntr_cd (1113 records)
- rplctn_imprt_hstry has a clustered primary key on cntr_cd, prcss_tm (40776 records; avg 36 records per center, median 18)
So you can see there are no indexing issues, and the setup is quite similar to what was posted in the original problem. I discovered the following:
1. If the records likely to be affected are fewer than 50% of the total records in the table to be updated, the second query performs significantly better.
2. If the records likely to be affected are more than 80%, the first query performs better (completing in half the time of the second when all records are affected).
3. In the remaining cases, the timings are comparable.
I wondered about the case 2 results, and I think the reason may be as follows:
- In option 1, I will be doing 40776 ops (the number of records to be scanned).
- In option 2, each correlated subquery first seeks the center index, which takes approx. 11 binary comparisons (since 2^10 = 1024 < 1113 <= 2^11 = 2048). Assuming the exists probe can stop halfway through a center's history on average (median/2, approx. 9 rows), the correlated subquery ends up doing approx. 11 * 9 = 99 ops per center. Across all 1113 centers that comes to roughly 110K ops, well over twice as many as in option 1.
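To make the arithmetic above easier to check, here is a small back-of-envelope calculation. This is only a sketch of the simplified cost model described above (where every probe row is charged the full seek cost); all variable names are illustrative, and a real query optimizer will of course behave differently:

```python
import math

# Figures from the test setup described above
centers = 1113            # rows in cntr_audit_stts
history_rows = 40776      # rows in rplctn_imprt_hstry
median_per_center = 18    # median history rows per center

# Option 1: the preprocessed scan/join reads the whole history table once
scan_ops = history_rows

# Option 2: per-center correlated exists probe
seek_depth = math.ceil(math.log2(centers))   # ~11 binary comparisons
probe_rows = median_per_center // 2          # stop halfway on average: 9
ops_per_center = seek_depth * probe_rows     # ~99 ops per center
exists_ops = ops_per_center * centers        # ~110K ops in total

print(scan_ops, exists_ops, exists_ops / scan_ops)
```

Under this simplified model the correlated approach costs around 110K ops against 40776 for the scan, which is why the full scan can win when most rows need updating.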
So what I could figure out is that the performance depends on the actual index statistics of the table and on how frequently the query runs. For example, if the records are to be updated every hour, the correlated query will definitely be better; but if the update runs once a month or quarter, one can expect most of the records to be affected, and the preprocessed scan/join will be better.
I would appreciate it if you could tell me whether my understanding and assumptions are right, or whether I am missing some important aspect and a correlated query will always be faster than a full-table-scan query.
Thanks
RT