Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Left Outer Join tune up...

Status
Not open for further replies.

m0nkey

MIS
Mar 20, 2003
31
0
0
Can someone please look over this code and tell me if there is anything that they can recommend to speed it up...I have all the qualified fields indexed or clustered and the thing is, it runs fast (<1 sec) if i run it without the last line: "AND A.zipcode is null", otherwise it runs for 3.5 minutes... :(

I am not sure why this is slowing this query down so much. The source table [20054_dup] is large:90 million entries but only 6 columns. acct_no is clustered and zipcode is indexed. the avoid_areas table is only 10K records and has 2 columns. The zipcode is clustered. i have tried this with a hint and without. the execution plan shows the additional hash match/right ourter joins, parallelism and clustered index scans when i add this final paft of the left outer join.

Am i missing something with this query? Please let me know if you see something awry or can sugesst a tune somewhere...

Select distinct top 1000 D.acct_no, D.zipcode
from [20054_dup] D With (tablockx, holdlock)
left outer join avoid_areas A WITH(INDEX(IX_avoid_areas)) on A.zipcode = D.zipcode
left outer join repo_35 R on R.acct_no= D.acct_no
WHERE D.timezone = '6'
AND R.acct_no is null
AND D.problem = '0'
AND A.zipcode is null
 
Perhaps I don't understand what the select is trying to do, but... is it possible that you have the tables joining backwards?

Change on A.zipcode = D.zipcode
To on D.zipcode = A.zipcode


-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Thanks for the reply...
I am trying to filter out all the records in the avoid_areas table by zipcode...
i was using a 'where zipcode not in(select zipcode from avoid_areas)' before but i was trying to speed it up by forcing the index and use a left outer join.
it will work and return the corect result set if i use either the left outer before or the not in subquery...

the query will not work if i reverse the join...

thanks
 
Change on A.zipcode = D.zipcode
To on D.zipcode = A.zipcode"

Does this matter George (or anyone)? I've often wondered if it mattered if the ON statement was ordered the same as the tables.
 
Nope. Base/auxiliary tables are determined with order of tables in FROM clause and JOIN type. In blah1 left join blah2, blah1 will always be base table no matter how join expression looks like.

------
"There's a man... He's bald and wears a short-sleeved shirt, and somehow he's very important to me. I think his name is Homer."
(Jack O'Neill, Stargate)
[banghead]
 
Now I know, and knowing is half the battle. I always put them in order because, just like ducks, I like to have things all in a row.
 
The left outer join depends on which table is specified on the left side of the 'LEFT OUTER JOIN" clause.

The order of fields in the ON clause should not matter.

Regards,
AA
 
Here I am, thinking I'm all that. Like Travis, I always put them in a certain order. I thought it made a difference, but I now realize it doesn't. I played around with this a little to verify.

I suspect my confusion came about because I learned about left and right joins back when the syntax looked like...

From A, B
Where A.Id *= B.Id

or

From A, B
Where A.Id =* B.Id




-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
I also put them in order but "reversed" one (A join B on B.foo=A.bar). Why exactly - dunno, probably something mental.

Back to original question (to: m0nkey)... How many rows query returns without last line? Can you post exec plans for both cases? Use SET SHOWPLAN_TEXT ON/GO before query to get plan as text output.

------
"There's a man... He's bald and wears a short-sleeved shirt, and somehow he's very important to me. I think his name is Homer."
(Jack O'Neill, Stargate)
[banghead]
 
After all that, m0nkey, I have no idea why the NULL filter would slow things down so much. If I was taking a wild guess, I'd say the DISTINCT clause has that many more comparisons to make when you invoke the NULL filter. If there is a way to group or otherwise find distinct records, it might be a bit faster. Even a subquery, maybe? I'd try a group first, but just for fun, try something like this:

Code:
SELECT ACCT_NO, D_ZIPCODE 
FROM 
(Select distinct top 1000 D.acct_no AS ACCT_NO, D.zipcode AS  D_ZIPCODE, A.zipcode AS A_ZIPCODE
    from [20054_dup] D With (tablockx, holdlock)
    left outer join avoid_areas A WITH(INDEX(IX_avoid_areas)) on A.zipcode = D.zipcode
    left outer join repo_35 R on R.acct_no= D.acct_no
    WHERE D.timezone = '6'
    AND R.acct_no is null
    AND D.problem = '0') e
WHERE A_ZIPCODE is null
 
here it is without the null:
--Top(1000)
|--Stream Aggregate(GROUP BY:([D].[ACCT_NO]) DEFINE:([D].[ZIPCODE]=ANY([D].[ZIPCODE])))
|--Filter(WHERE:([R].[ACCT_NO]=NULL))
|--Merge Join(Left Outer Join, MERGE:([D].[ACCT_NO])=([R].[ACCT_NO]), RESIDUAL:([R].[ACCT_NO]=[D].[ACCT_NO]))
|--Clustered Index Scan(OBJECT:([ACCT_NOstore].[dbo].[20054_dup].[IX_20054_dup_3] AS [D]), WHERE:([D].[timezone]='6' AND [D].[problem]='0') ORDERED FORWARD)
|--Clustered Index Scan(OBJECT:([ACCT_NOstore].[dbo].[REPO_35].[CIX_REPO_35] AS [R]), ORDERED FORWARD)

HERE IT IS WITH:
Select distinct top 1000 D.ACCT_NO, D.zipcode from [20054_dup] D With (tablockx, holdlock) left outer join avoid_areas A WITH(INDEX(IX_avoid_areas)) on
D.zipcode = A.zipcode left outer join repo_35 R on R.ACCT_NO = D.ACCT_NO WHERE D.timezone = '6

|--Top(1000)
|--Stream Aggregate(GROUP BY:([D].[ACCT_NO]) DEFINE:([D].[ZIPCODE]=ANY([D].[ZIPCODE])))
|--Filter(WHERE:([R].[ACCT_NO]=NULL))
|--Parallelism(Gather Streams, ORDER BY:([D].[ACCT_NO] ASC))
|--Nested Loops(Left Outer Join, OUTER REFERENCES:([D].[ACCT_NO]))
|--Sort(ORDER BY:([D].[ACCT_NO] ASC))
| |--Filter(WHERE:([A].[zipcode]=NULL))
| |--Hash Match(Right Outer Join, HASH:([A].[zipcode])=([D].[ZIPCODE]), RESIDUAL:([D].[ZIPCODE]=[A].[zipcode]))
| |--Parallelism(Repartition Streams, PARTITION COLUMNS:([A].[zipcode]))
| | |--Clustered Index Scan(OBJECT:([ACCT_NOstore].[dbo].[avoid_areas].[IX_avoid_areas] AS [A]))
| |--Parallelism(Repartition Streams, PARTITION COLUMNS:([D].[ZIPCODE]))
| |--Clustered Index Scan(OBJECT:([ACCT_NOstore].[dbo].[20054_dup].[IX_20054_dup_3] AS [D]), WHERE:([D].[timezone]='6' AND [D].[problem]='0'))
|--Clustered Index Seek(OBJECT:([ACCT_NOstore].[dbo].[REPO_35].[CIX_REPO_35] AS [R]), SEEK:([R].[ACCT_NO]=[D].[ACCT_NO]) ORDERED FORWARD)




WHy does it have 2 statements on the second one?


I will test Travis Suggestion now...Thanks!!!
 
well Travis, that did not shave any time off...still running at 3.5+ minutes...
thanks...

Beuller,Beuller,Beuller...I'm just panhandling here i guess...

thanks for the suggestions in advance..
 


m0nkey,

The plan tells clearly why the extra filter badly affect the performance. The first plan shows that the table 20054_dup left outer join to table REPO_35 and then apply the filter (R.acct_no is null). Here SQL Server is smart enough to know that table 20054_dup left outer join to avoid_areas is not useful because there is no filter for table avoid_areas therefore this outer join won't affect the result given the select list is "distinct top 1000 D.acct_no, D.zipcode ", so the first plan saves a left outer join and that's why it's faster than the second. The second plan involves 2 outer join because the extra filter apply on the third table and in turn slow down the query.
 

Hi Guys!

Have anyone noticed following

Code:
Select distinct top 1000 D.acct_no, D.zipcode
    from [20054_dup] D With (tablockx, holdlock)
    left outer join avoid_areas A WITH(INDEX(IX_avoid_areas)) on [B]A.zipcode = D.zipcode [/B]
    left outer join repo_35 R on R.acct_no= D.acct_no
    WHERE D.timezone = '6'
    AND R.acct_no is null
    AND D.problem = '0'
    [B]AND A.zipcode is null[/B]

M0nkey, When you are joining A.ZipCode = D.ZipCode then why you need to check for Null Zipcode?

Will you please clearfy, It will clear your requirement and could help us to Help you.



Regards,


"There are no secrets to success. It is the result of preparation, hard work, and learning from failure." -- Colin Powell
 

Am I wrong? Why you guys have no comments anymore?
 
SajidAttar said:
When you are joining A.ZipCode = D.ZipCode then why you need to check for Null Zipcode?
because that's what you do when you use a LEFT OUTER JOIN to find unmatched rows in the left table


monkey, if you do indeed want [20054_dup] rows which have no matching avoid_areas rows, then presumably {acct_no,zipcode} will be unique, so i'm betting the DISTINCT is superfluous

however, DISTINCT is very expensive, since it involves a sort of the entire result set on all columns, so your query will run a lot faster if you remove it





r937.com | rudy.ca
 
maswien said:
Am I wrong? Why you guys have no comments anymore?
Yes, you are right.

Without last line left join becomes "dummy" - no columns from A were used and query semantics remains intact - and server simplifies exec plan.

With last line (AND A.zipcode is null) table A cannot be ignored and exec time increases to 3.5 minutes.

Depending on strength/configuration of hardware, this response time is reasonable for 90M outer join filtering.

------
"There's a man... He's bald and wears a short-sleeved shirt, and somehow he's very important to me. I think his name is Homer."
(Jack O'Neill, Stargate)
[banghead]
 
Hi m0nkey,

can you post there more info about table 'repo_35' ?
I was trying simulate your problem, having 87 milion rows in '[20054_dup]' table, 12K rows in 'avoid_areas' table and 0 rows in 'repo_35' table and query runs for a 1 second,
but then I insert 800K rows in 'repo_35' table and quwry was too slow.

So I want to know, what indexes you has in that table, how may rows ( aproximately ) it has, and if the values in colums 'acct_no' of that table is unique or not.

Zhavic


---------------------------------------------------------------
In the 1960s you needed the power of two Comodore64s to get a rocket to the moon. Now you need a machine which is a vast number of times more powerful just to run the most popular GUI.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top