Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Access qry with subquery exp(sum(log(..)) to SQL qry 1

Status
Not open for further replies.

blackduck

Programmer
Jun 11, 2002
119
AU
I have a qry that gets the correct output result in Access, but it is way too big for it to process. As suggested I have been able to get it put on MS SQL Server 2008, but it wasn't successful.

In summary, the subqry matches heaps of records, works out the product in groups, then the qry sums groups of these product results.

The Access qry is:
SELECT subqryProduct.TimeID, Sum(subqryProduct.ProductOfProbability) AS SumOfProductOfProbability INTO [Test sum product for sql]
FROM subqryProduct
GROUP BY subqryProduct.TimeID;

The Access sub qry is:
SELECT tblProbablility.TimeID, Exp(Sum(Log([tblProbablility].[Probablility]))) AS ProductOfProbability
FROM tblOverall INNER JOIN (tblPossible LEFT JOIN tblProbablility ON (tblPossible.WinnerID = tblProbablility.WinnerID) AND (tblPossible.MatchID = tblProbablility.MatchID)) ON tblOverall.PossibleID = tblPossible.PossibleID
GROUP BY tblProbablility.TimeID, tblOverall.OverallResult, tblPossible.PossibleID
HAVING (((tblProbablility.TimeID)=1) AND ((tblOverall.OverallResult)=2));

The sql I have tried is:
SELECT tblProbablility.TimeID, EXP(SUM(LOG(tblProbablility.Probablility))) AS Expr1, tblOverall.OverallResult, tblPossible.PossibleID
FROM tblOverall INNER JOIN
tblPossible ON tblOverall.PossibleID = tblPossible.PossibleID INNER JOIN
tblProbablility ON tblPossible.MatchID = tblProbablility.MatchID AND tblPossible.WinnerID = tblProbablility.WinnerID
GROUP BY tblProbablility.TimeID, tblOverall.OverallResult, tblPossible.PossibleID
HAVING (tblProbablility.TimeID = 1) AND (tblOverall.OverallResult = 2)

If anyone has ideas of where I have gone wrong, it would be greatly appreciated.
 
The first thing I would do is write this as a derived table, like this:

Code:
SELECT subqryProduct.TimeID, 
       Sum(subqryProduct.ProductOfProbability) AS SumOfProductOfProbability 
FROM   [!]([/!][green]
       SELECT tblProbablility.TimeID, 
              Exp(Sum(Log([tblProbablility].[Probablility]))) AS ProductOfProbability
       FROM   tblOverall 
              INNER JOIN tblPossible 
                ON tblOverall.PossibleID = tblPossible.PossibleID
              LEFT JOIN tblProbablility 
               ON tblPossible.WinnerID = tblProbablility.WinnerID
               AND tblPossible.MatchID = tblProbablility.MatchID
       GROUP BY tblProbablility.TimeID, 
              tblOverall.OverallResult, 
              tblPossible.PossibleID
       HAVING tblProbablility.TimeID=1 
              AND tblOverall.OverallResult=2[/green]
       [!]) As subqryProduct[/!]
GROUP BY subqryProduct.TimeID

Notice that the "subquery" part is essentially the same as the access subquery. the only changes I made were formatting.

More importantly though, notice the parts in red. This is how you make a derived table.

Lastly.... the Exp(Sum(Log stuff. I checked, and this should work in SQL Server, but it's not necessarily how I would do it. I suspect this was done to prevent overflow errors in Access. Summing big numbers creates bigger numbers. Eventually this can lead to overflows. Now that it's in SQL server, You have other options that will probably execute faster.

If probability is an integer column, the you could convert to big int. If it's a decimal data type, then you probably don't need to do anything (sql should automatically expand the size of the decimal as needed). Worst case... convert to float.

Anyway... I hope this helps.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks George. Probability is a double eg 0.326086956521739 and there are no negative or zero numbers.

The data is massive so efficiency is a big issue. I am going to end up with a total of 82,893 records (27,631 records for each of the 3 available OverallResults).

I couldn't output the subquery first because it ends up with 531,442 results of ProductOfProbability multiplied by the 27,631 TimeIds I have.

Now given how big it actually is do you think it will work, because you mentioned it wasn't necessarily how you would do it.

Thanks
Kerri
 
There is no 'double' data type in SQL Server. There's float, real, and decimals. If the data type of the column is float, I would just remove the exp and log part. I really don't think it's needed.

Change:

Exp(Sum(Log([tblProbablility].[Probablility]))) AS ProductOfProbability

to:

Sum([tblProbablility].[Probablility]) AS ProductOfProbability

If I'm not mistaken, that should produce the same results, but with less calculations so it will execute faster.

Have you tried running the query? Does it produce the results you expect? Is it fast enough? With the amount of data you mentioned, I would expect the results to return in less than a second. If it's significantly slower than this, then you probably don't have the necessary indexes to make it fast.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
The result for ProductOfProbability is eg. 5.32829322812585E-05 so I suppose it would be a real?

I won't be able to remove the exp and log because then it just adds up these numbers, I need them to be multiplied by each other first. (The results of the multiplications are then grouped and added at the end).

Regards
Kerri
 
I know its been awhile since posts, but I just want to thank you George (gmmastros) because I went back and used the first query you sent and it worked successfully.

I have now finished running the query over all the data, finally. (It took approx 2hrs per 1000 results).

Another point I went wrong was that from TimeId 21480 onward there are in fact probabilities of 0. So I then had to run a slightly different query for just those remaining TimeIds.

Sent you a star and a big thank you.
Regards
Kerri
 
2 hours per 1,000 results? ouch.

If you just recently upgraded from Access, I strongly encourage you learn a bit about indexes. Access can be a little more forgiving in regards to indexes.

If you are interested in making this query run faster, let me know and I can help guide you through the process. Basically, you want indexes on columns you join to other tables with.

For example:

[tt][blue]
LEFT JOIN tblProbablility
ON tblPossible.WinnerID = tblProbablility.WinnerID
AND tblPossible.MatchID = tblProbablility.MatchID
[/blue][/tt]

The tblProbability table should have a composite index on WinnerId and MatchId.

There are other indexes that would be useful for this query. When indexing your tables, you need to be a little careful that you don't put too many indexes on it because indexes can slow down your inserts, updates and deletes.

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Hi gmmastros.

Cant believe I didn't check and just assumed relationships and keys would copy across to sql from access. Anyway have done some research and worked out how to add the primary keys and relationships, just accepted clustered indexes i think.

My tables and keys are as follows, I put primary keys in upper case. Also shown you the number of records in each table.

tblTime (TIMEID, TimeObservation) 27,631
tblWinner (WINNERID, WinnerName) 3
tblProbability (TIMEID, MATCHID, WINNERID, Probability) 994,716(27,631x12x3)
tblMatch (MATCHID) 12
tblPossible (MATCHID, POSSIBLEID, WinnerId) 6,377,292(12x531,441)
tblOverall (POSSIBLEID, OverallResult) 531,441
 
ps. I won't be needing to insert, update or delete records in this database. The results from the query will be analysed. This is going to be done for a few more sets of different source data.
 
Is the performance acceptable now. Based on the numbers you provided, I would expect this query to take less than 1 minute.... probably less than 10 seconds.

Since you won't be needing to insert, update, or delete data, having lots of indexes will not be a problem. As such, you should probably run the Database Engine Tuning Advisor. Do this:

Open SQL Server Management Studio.
Load your query in to a query window.
On the top menu bar, click Query -> Analyze query in Database Engine Tuning Advisor
Click the analyze button.

You will probably see multiple recommendations to improve the performance.



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
I have run the Tuning Advisor and the results were 0% Estimated Improvement - which I assume is good.

SQLServer has been installed on a pc I log into remotely now
Time it takes to run just 100 records = 36 minutes

I also timed it running the same query on a company sql server
Time taken for same 100 records = 5 mins

Both have pk and fk and indexes set from what I can see and the analysis for both is 0% Estimated Improvement.

Any ideas on how I could speed this up? please.


-------------------
My query is:
SELECT subqryProduct.OverallResult, subqryProduct.TimeID,
Sum(subqryProduct.ProductOfProbability) AS SumOfProductOfProbability into TimedTestProbabilityQry100
FROM (
SELECT tblOverall.OverallResult, tblProbability.TimeID,
Exp(Sum(Log([tblProbability].[Probablility]))) AS ProductOfProbability
FROM tblOverall
INNER JOIN tblPossible
ON tblOverall.PossibleID = tblPossible.PossibleID
LEFT JOIN tblProbability
ON tblPossible.WinnerID = tblProbability.WinnerID
AND tblPossible.MatchID = tblProbability.MatchID
GROUP BY tblProbability.TimeID,
tblOverall.OverallResult,
tblPossible.PossibleID
HAVING tblProbability.TimeID>0 AND tblProbability.TimeID<101
AND tblOverall.OverallResult=2
) As subqryProduct
GROUP BY subqryProduct.TimeID, subqryProduct.OverallResult
ORDER BY subqryProduct.TimeID asc
----------------
 
I have run the Tuning Advisor and the results were 0% Estimated Improvement - which I assume is good.

No. Not good. This query can be sped up, but the tuning adviser doesn't know how to do it.

First, let's see what indexes already exist. Run this.

Code:
sp_helpindex 'tblTime'
sp_helpindex 'tblWinner'
sp_helpindex 'tblProbability'
sp_helpindex 'tblMatch'
sp_helpindex 'tblPossible'
sp_helpindex 'tblOverall'



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
sp_helpindex 'tblTime'
PK_tblTime clustered, unique, primary key located on PRIMARY TimeID
sp_helpindex 'tblWinner'
PK_tblWinner clustered, unique, primary key located on PRIMARY WinnerID
sp_helpindex 'tblProbability'
PK_tblProbability clustered, unique, primary key located on PRIMARY TimeID, MatchID, WinnerID
sp_helpindex 'tblMatch'
PK_tblMatch clustered, unique, primary key located on PRIMARY MatchID
sp_helpindex 'tblPossible'
PK_tblPossible clustered, unique, primary key located on PRIMARY MatchID, PossibleID
sp_helpindex 'tblOverall'
PK_tblOverall clustered, unique, primary key located on PRIMARY PossibleID

It gets worse, on this new server I cant even run 1000 records, got following error and tempdb size went to 16,063,424.
Msg 1101, Level 17, State 10, Line 1
Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
 
Hmmm..... the only indexes you have are primary key indexes. This isn't necessarily bad, but it is a little surprising.

From a performance perspective, we usually look at the biggest tables first. In this case, it would be tblPossible and tblProbability. This means we need to be very careful to get the best possible performance when joining these two tables, so let's examine that first.

[tt][blue]
FROM tblOverall
INNER JOIN tblPossible
ON tblOverall.PossibleID = tblPossible.PossibleID
LEFT JOIN tblProbability
ON tblPossible.WinnerID = tblProbability.WinnerID
AND tblPossible.MatchID = tblProbability.MatchID
[/blue][/tt]

Left joins are slower than inner joins, so if it's possible to change this to an inner join, I would recommend you do that.

But... I think the biggest performance gain to be had here is to examine the indexes.

sp_helpindex 'tblPossible'
PK_tblPossible MatchID, PossibleID

sp_helpindex 'tblProbability'
PK_tblProbability TimeID, MatchID, WinnerID

A primary key is used to uniquely identify data. In your case, you have multiple columns participating in the primary key. This is not at all unusual. It simply means that the combination of those multiple columns must be unique. But here's the kicker...

The order in which columns appear in the primary key is important. Your query is joining those 2 tables on WinnerId and MatchId, but the probability table has TimeId as the first column in the index. This means SQL will have to work extra hard to match up the data in the multiple columns. So, in a best case scenario, the 2 largest tables would have similar primary keys.

I suggest the following.

1. Change the order of the primary keys columns in tblProbability. To do this, open SQL Server Management Studio. Drill down to tblProbability. Right click -> design. Click Table Designer -> Indexes/Keys. Select PK_tblProbability. On the right, click Columns, and then click the ... button to the right of the column list. Change the list of columns so they are in this order: MatchId, WinnerId, TimeId. Click OK, Click Close. Close the table. On 'Save Changes', click yes. You can expect this to take a little while, perhaps even several minutes.

2. Change the primary key for tblPossible. I suggest you add WinnerId to the primary key. Obviously you don't need it (because things are working without it). By adding it to the primary key, it will be included in the clustered index created for the primary key. To do this, open the index designer for tblPossible and set the column order to MatchId, WinnerId, PossibleId

3. There is a join between tblPossible and tblOverall. We should create an index on tblPossible so that it can be used for the join.

Code:
Create Index idx_tblPossible_PossibleId On tblPossible(PossibleId)

4. I notice you have a HAVING clause that doesn't use aggregate data. As such, I would encourage you to change the query slightly.

Code:
SELECT subqryProduct.OverallResult, subqryProduct.TimeID,
       Sum(subqryProduct.ProductOfProbability) AS SumOfProductOfProbability into TimedTestProbabilityQry100
FROM   (
       SELECT tblOverall.OverallResult, tblProbability.TimeID,
              Exp(Sum(Log([tblProbability].[Probablility]))) AS ProductOfProbability
       FROM   tblOverall
              INNER JOIN tblPossible
                ON  tblOverall.PossibleID = tblPossible.PossibleID
                [!]AND tblOverall.OverallResult=2[/!]
              LEFT JOIN tblProbability
               ON tblPossible.WinnerID = tblProbability.WinnerID
               AND tblPossible.MatchID = tblProbability.MatchID
               [!]And tblProbability.TimeID>0 
               AND tblProbability.TimeID<101[/!]
       GROUP BY tblProbability.TimeID,
              tblOverall.OverallResult,
              tblPossible.PossibleID
       ) As subqryProduct
GROUP BY subqryProduct.TimeID, subqryProduct.OverallResult
ORDER BY subqryProduct.TimeID asc

I think that if you do the 4 things listed above, this query will be MANY times faster. Please let me know how this works out for you.



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thank you heaps George.

I made all 4 changes you suggested and it works much faster.
My last test of 100 records took 36mins, this test took just 9mins to run after all your improvements.

I am running a test of 1000 records now (which failed with my previous setup) and I can see the tempdb file growing huge again now. Its just over 1 hour now and still running.

Still something a little wrong with new server setup I think
Old query 100 records 36 mins on new server
Old query 100 records 5 mins on old server
New query 100 records 9 mins on new server
New query (cant test on old server just yet)

Again, thank you for the solution you provided, I have learnt so much from you.
 
I'd like to see the query plan for this. Here's how...

Open a new query window.
Right click in the window -> "Results To" -> "Results To Text"
Type this:
[tt][blue]Set ShowPlan_Text On[/blue][/tt]
Press F5 to run it.
Clear the window and copy/paste your query.
Press F5 to run it.
Copy/Paste the contents of the Results window here.


-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT subqryProduct.OverallResult, subqryProduct.TimeID,
Sum(subqryProduct.ProductOfProbability) AS SumOfProductOfProbability
into TestProbabilityRecords5
FROM (
SELECT tblOverall.OverallResult, tblProbability.TimeID,


(1 row(s) affected)

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|--Table Insert(OBJECT:([TestProbabilityRecords5]), SET:([TestProbabilityRecords5].[OverallResult] = [Ryders].[dbo].[tblOverall].[OverallResult],[TestProbabilityRecords5].[TimeID] = [Ryders].[dbo].[tblProbability].[TimeID],[TestProbabilityRecords5].[SumO
|--Top(ROWCOUNT est 0)
|--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1027]=(0) THEN NULL ELSE [Expr1028] END))
|--Stream Aggregate(GROUP BY:([Ryders].[dbo].[tblProbability].[TimeID]) DEFINE:([Expr1027]=COUNT_BIG(exp([Expr1012])), [Expr1028]=SUM(exp([Expr1012])), [Ryders].[dbo].[tblOverall].[OverallResult]=ANY([Ryders].[dbo].[tblOverall].[OverallRes
|--Sort(ORDER BY:([Ryders].[dbo].[tblProbability].[TimeID] ASC))
|--Compute Scalar(DEFINE:([Expr1012]=CASE WHEN [Expr1025]=(0) THEN NULL ELSE [Expr1026] END))
|--Hash Match(Aggregate, HASH:([Ryders].[dbo].[tblProbability].[TimeID], [Ryders].[dbo].[tblPossible].[PossibleID]), RESIDUAL:([Ryders].[dbo].[tblProbability].[TimeID] = [Ryders].[dbo].[tblProbability].[TimeID] AND [Ryders].
|--Compute Scalar(DEFINE:([Expr1015]=log([Ryders].[dbo].[tblProbability].[Probability])))
|--Hash Match(Right Outer Join, HASH:([Ryders].[dbo].[tblProbability].[WinnerID], [Ryders].[dbo].[tblProbability].[MatchID])=([Ryders].[dbo].[tblPossible].[WinnerID], [Ryders].[dbo].[tblPossible].[MatchID]))
|--Clustered Index Scan(OBJECT:([Ryders].[dbo].[tblProbability].[PK_tblProbability]), WHERE:([Ryders].[dbo].[tblProbability].[TimeID]>(0) AND [Ryders].[dbo].[tblProbability].[TimeID]<(6)))
|--Merge Join(Inner Join, MERGE:([Ryders].[dbo].[tblOverall].[PossibleID])=([Ryders].[dbo].[tblPossible].[PossibleID]), RESIDUAL:([Ryders].[dbo].[tblOverall].[PossibleID]=[Ryders].[dbo].[tblPossible].[Possible
|--Clustered Index Scan(OBJECT:([Ryders].[dbo].[tblOverall].[PK_tblOverall]), WHERE:([Ryders].[dbo].[tblOverall].[OverallResult]=(2)) ORDERED FORWARD)
|--Index Scan(OBJECT:([Ryders].[dbo].[tblPossible].[idx_tblPossible_PossibleId]), ORDERED FORWARD)

(13 row(s) affected)

 
Try this:

Code:
Create Index idx_tblProbability_TimeId On tblProbability(TimeId)



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top