Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

[DB2- AS400] S F WHERE col1 IN (SELECT ...) very very very slow !

Status
Not open for further replies.

JFDELGES

Programmer
Nov 24, 2005
8
BE
Hi,

I have a problem with a simple querry:

select * from
tmissdta.prr000f
WHERE NRPOL in
(
SELECT NRPOL
from TMISSDTA.PRR130F
where nrprs = 01545454
)


I think that the sub-querry is executed for each row ... !
:-S


It take approximatively 150 secondes to be executed!

(And this is not a CORRELATIVE querry! like EXISTS...)


PS: i don't want to use a JOIN , because it will be an UPDATE querry ...


Best regards,
JF Delges, Belgium
 
Did you do any explains ?
You have indexes on nrprs?
optimal index probably on nrprs and nrpol in table TMISSDTA.PRR130F, then the subselect would be an index-only access. second index useful on nrpol on table missdta.prr000f

Whether the index creation is generally useful depends also a lot on the rest of your application

Juliane
 
Sometimes I've found that a correlated sub-query using an exists clause is faster than an in clause if indexes are being used. If you have an index on TMISSDTA.nrpol, then the following query might be faster:
Code:
select * from
tmissdta.prr000f 
WHERE exists (
  SELECT NRPOL
  from TMISSDTA.PRR130F 
  where nrprs = 01545454
    and PRR130F.NRPOL = prr000f.NRPOL
)



 
Thank you for your responses.

I agree with you, ddiamonde, the querry with EXISTS is faster...

This is the prove that DB2 is executing my "IN querry" like if it was a correlative sub-querry...
(EXISTS operator is more powerfull for correlative sub-querries)

But this is not a correlative sub-querry, it could have been run once, and the result could have been used for all rows in the main querry...


I conclue that IBM has to improve a lot of things for SQL. DB2 is really bad for this kind of querry (not like Oracle or SqlServer...)


Thanks for your interests.
 
JFDELGES,

I think that the first query is an uncorrelated sub-query. There is no link (correlation) between the top part of the query and the bottom part.

I think that DB2 will create an interim results table for every row in the top part.

Ddiamonde's query is correlated, as there is a link between the two parts of the query.

I think that DB2 will create one interim results table in this 2nd query, hence the improvement in speed.

Marc
 
I think that DB2 will create an interim results table for every row in the top part.

I guess the question is, why isn't DB2 smart enough to create just one temp table for the none correlated sub-query instead of regenerated the same temp table for every row. Both Oracle and SQL-Server are usually smart enough to create just one temp table. I can only conclude that DB2 iSeries just isn't there yet, although I've only played with V5R2. I've heard that the optimizer for V5R3 has many enhancements.

FDELGES, if the version with the correlated sub-query is not fast enough for you, you could always write a stored proc which explicitly creates a temp table.
 
All,
after sending my post yesterday and reading ddiamonde's reply, I realised that what I was saying was nonsense, for which I apologise. I think I was blinded by the results that JFDelges presented to us, without looking at the facts.

I've done a little bit of digging and these quotes are from the DB2 UDB for iSeries SQL Programming Concepts V5R2 manual:
The purpose of a subquery is to supply information needed to evaluate a predicate for a row (WHERE clause) or a group of rows (HAVING clause). This is done through the result table that the subquery produces. Conceptually, the subquery is evaluated whenever a new row or group of rows must be processed. In fact, if the subquery is the same for every row or group, it is evaluated only once. Subqueries like this are said to be uncorrelated.

and

You can also write a subquery that SQL may have to re-evaluate as it examines each new row (WHERE clause) or group of rows (HAVING clause) in the outer-level SELECT. This is called a correlated subquery.

There is also a page here: that illustrates this with a step by step diagram etc.

Reading this leads me to believe that the uncorrelated subquery in JFDelges problem should produce one interim results table, whereas the code supplied by ddiamonde should cause DB2 to create numerous IRTs, as the value changes with each row.

On the surface of it, ddiamonde's query should be slower, but as it is not, this must be down to indexes and table size. JFDelges, does the IRT from the sub select contain a large amount of data? Can you EXPLAIN the query to see what indexes are being hit by the two different versions of the query?

I'd be interested in the results, and once again, apologies for my erroneous post of yesterday.

Marc
 
MarcLodge,

- The interim result table contains 2 rows;
- The non-corralated sub querry is executed
for each row in the main querry! (Visual Explains helps me...)

Thanks for your help, but i can't find any solution, the problem seems to come from DB2 :-( (and i'm on V5R3)


PS: don't apologize, my english is not perfect and my problem was not perfectly explained... ;-)



 
ddiamond,

I also think that writing a stored procedure is the only solution... :)


PS: IBM, WAKE UP, what's that !?
 
JFDelges,

What happens in Visual Explain if you run the EXISTS query?

Marc
 
EXISTS is faster than the "correlated-IN-sub-querry"
because EXISTS stops as soon as possible:
It don't read all the records of the subquerry
(in oposite of corralated IN, that do it...)


The problem is that the sub-querry is NON CORRALATED, and then, should be run once and not for each row.
--> DB2 interpret the "non correlated-IN-sub-querry"
like a "correlated-IN-sub-querry"...


 
My guess would have been that it is an index issue. But since the interim table only contains 2 rows, that cannot be the case. It really does sound like it is treating the non-correlated query as a correlated query (as JFDELGES suggested), although this contradicts the quotes above from DB2 UDB for iSeries SQL Programming Concepts V5R2.
 
Is somebody running on AS400 V5R2 / V5R3 ?

If that's the case, could you test the same kind of querry?

I think that's a strange probleme...
 
I could try on a Z/OS mainframe if you like. What are the indexes on the two tables.

Also, when was RUNSTATS last run against the tables. I've seen DB2 do some odd things with tables when this hasn't been run for a while. If it is a while ago, try running RUNSTATS and see if this makes any difference.

Marc
 
I'm running on V5R2, but I don't know how to duplicate your table structures, indexes, and data.
 
We can do a simple test...

Just take a simple JOIN querry like this:

select table1.*
from table1, table2
where table1.id = table2.id
and table2.col3 = 'TEST';


And transform it like this:

select *
from table1
where table1.id IN
( SELECT id
FROM table2
WHERE col3 = 'TEST');

The result should be the same... I think that the time of execution should be quite the same... (If there is a big difference, like me, i think there is a problem in DB2)

NB: There is only indexes on the ID columns.
(No primary key... in fact, where i work,
this is not SQL Tables but OS400 old files...)
 
Well, I have access to both a V5R2 and a V5R3 machine, so I could test whether there is a difference...

AFAIK there is no option to perform runstats on AS400, I would be glad to be corrected on this, cause complex queries are a nightmare on AS400 (compared to my tiny DB2 UDB box on NT)

Ties Blom
Information analyst
 
Well, I ran the subquery vs the join in UDB AIX, first in test (small) and then production (real big). In test there was no difference. In production, the subquery took about 3 times as long. I remember reading that a subquery can have varioius effects on the path that the Optimizer picks. In other words, a subquery could cause the Optimizer to pick a less efficient path vs a joined query.

As for correlated queries, apparently you can't assume that because they run the subquery multiple times, it will be a drag on processing. I took this off R Limeback's site:

It may indeed be convenient to imagine a correlated subquery as being "evaluated" or "executed" once for each row of the outer query; do not, however, be misled into thinking that the database optimizer actually executes it in that fashion, because this will lead you to believe that it's not very efficient, when in fact most optimizers will process a correlated subquery very efficiently, as a join.


Although JF's original query seems to contradict this contention.


 
PruSQLer,

JF was actually comparing a correlated sub-query to a non-correlated sub-query. He wasn't comparing a correlated sub-query to a join. So I don't know if JF's original non-correlated sub-query really contradicts R Limeback's quote.

Now in the spirit of R Limeback, it may be convenient to imagine a non-correlated sub-query only excutes once, but will the optimizer actually execute it in this fashion?

Does R Limeback say anything about non-correlated sub-queries?
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top