Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

COUNT() without GROUPing / Optimization

Status
Not open for further replies.

graefalexander

Programmer
Nov 8, 2004
7
DE
Hello,

I do have two tables:

CREATE TABLE People
(
PeopleID INTEGER PRIMARY KEY,
LocationID INTEGER,
Name VARCHAR
);

PeopleID LocationID Name
---------------------------------------
1 1 Andreas
2 1 Alexander
3 1 Anette
4 1 Aaron
5 2 Babette
6 2 Balthasar
7 2 Barbara
8 3 Christian
9 3 Calvin


CREATE TABLE Location
(
LocationID INTEGER PRIMARY KEY,
LocationName VARCHAR
);

LocationID Name
---------------------------------------
1 Amsterdam
2 Bitburg
3 Chainsville

I want to know where everyone lives:

SELECT * FROM People INNER JOIN Location
ON People.LocationID = Location.LocationID

So I'll get a list with all people and their locations.
Now I want to know how many people live in each location:

SELECT Location.LocationName, COUNT(*)
FROM People INNER JOIN Location
ON People.LocationID = Location.LocationID
GROUP BY Location.LocationID

Now I'll get a list with all locations and a number telling how many
people live there.
However, I want a list with all people, their locations and a number
how many people live there. The only way I have found out was this
(actually doing both queries via subselect and joining them together):

SELECT * FROM People INNER JOIN
(SELECT Location.*, COUNT(*) AS LocationCount
FROM People INNER JOIN Location
ON People.LocationID = Location.LocationID
GROUP BY Location.LocationID) AS Location
ON People.LocationID = Location.LocationID

This works, I'll get a list with all people, their Locations and the number
how many people totaly live in this location. But if I want to filter, for
instance only on people with their name containing an "R", the thing gets
complicated because the WHERE-clause has to be repeated (I only want to know
how many people fullfilling the WHERE-clause live in the location):

SELECT * FROM People INNER JOIN
(SELECT Location.*, COUNT(*) AS LocationCount
FROM People INNER JOIN Location
ON People.LocationID = Location.LocationID
WHERE People.Name LIKE "%r%"
GROUP BY Location.LocationID) AS Location
ON People.LocationID = Location.LocationID
WHERE People.Name LIKE "%r%"

This also yields the right result set.
In the real-world application, the tables are much more complicated, data are several MBs,
the query filters on various rows, and the whole query takes about 800msecs, which is too
much for my application. So my question is:
Is there a combination of the first and the second query, that gives me a list containing
all people with their corresponding locations, and the number of people who live there, without
the overhead of my solution? I tried so many things, but couldnt find a solution.

Thank you in advance, Alex
 
Hmmm.

The summary by location would not run in SQL Server because the column Location.LocationName is not in the GROUP BY list. I dont know about other RDBMS. But that is easy to fix.
Code:
SELECT Location.LocationID,
       COUNT(*) AS nWarrenInhabitants
FROM People
INNER JOIN Location ON
     People.LocationID = Location.LocationID
GROUP BY Location.LocationID
I added a column alias there as well. This query is an intermediate step, I dont need the name for my purpose. However, if this were a result we wanted to display we could use LocationName in the SELECT list and in the GROUP BY list, or use both ID and Name in both lists, or at least in the GROUP BY list.


To get a list of the people and how many others live in their warren I simply JOIN that subquery to People.
Code:
SELECT People.*, Warrens.nWarrenInhabitants
FROM People
INNER JOIN (
             SELECT Location.LocationID,
             COUNT(*) AS nWarrenInhabitants
             FROM People
             INNER JOIN Location ON
                 People.LocationID = Location.LocationID
             GROUP BY Location.LocationID
           ) AS Warrens ON
     Warrens.LocationID = People.LocationID
This is not an especially intense query, resource-wise.


And if I want a similar result for r-type folks
Code:
SELECT People.*, Warrens.nWarrenInhabitants
FROM People
INNER JOIN (
             SELECT Location.LocationID,
             COUNT(*) AS nWarrenInhabitants
             FROM People
             INNER JOIN Location ON
                 People.LocationID = Location.LocationID
             GROUP BY Location.LocationID
           ) AS Warrens ON
     Warrens.LocationID = People.LocationID
WHERE People.Name LIKE '%r%'


Or if I want to know how many r-type people are living with other r-type people.
Code:
SELECT People.*, Warrens.nWarrenInhabitants
FROM People
INNER JOIN (
             SELECT Location.LocationID,
             COUNT(*) AS nWarrenInhabitants
             FROM People
             INNER JOIN Location ON
                 People.LocationID = Location.LocationID
             WHERE People.Name LIKE '%r%'
             GROUP BY Location.LocationID
           ) AS Warrens ON
     Warrens.LocationID = People.LocationID
WHERE People.Name LIKE '%r%'



Which is rather similar to your final query-
Code:
SELECT *
FROM People
INNER JOIN (
             SELECT Location.*,
             COUNT(*) AS LocationCount
             FROM People
             INNER JOIN Location ON
                   People.LocationID = Location.LocationID
             WHERE People.Name LIKE "%r%"
             GROUP BY Location.LocationID
           ) AS Location
         ON People.LocationID = Location.LocationID
WHERE People.Name LIKE "%r%"
So why the performance issue? Hmmm. It is a bit scary that this query has an alias, Location, which is the same as a table name. Dont know what effect that might have; maybe none. And that possibly mal-formed SELECT list. Are you getting syntax errors? What RDBMS are you using?
 
Instead of this:
SELECT * FROM People INNER JOIN
(SELECT Location.*, COUNT(*) AS LocationCount
FROM People INNER JOIN Location
ON People.LocationID = Location.LocationID
GROUP BY Location.LocationID) AS Location
ON People.LocationID = Location.LocationID
You may try this:
SELECT * FROM (People INNER JOIN
(SELECT LocationID, COUNT(*) AS LocationCount
FROM People GROUP BY LocationID) As PLoc
ON People.LocationID=PLoc.LocationID)
INNER JOIN Location
ON People.LocationID = Location.LocationID

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
So why the performance issue? Hmmm. It is a bit scary that this query has an alias, Location, which is the same as a table name. Dont know what effect that might have; maybe none.

No Effect, the alias "AS Location" could be any other name.

And that possibly mal-formed SELECT list. Are you getting syntax errors?

No, not! All quoted queries are tested against a real database, and all values I used were in the database this way, only to be sure the result can be reproduced. The real world application is a bit more complicated, too complicated to let another person understand it quickly, so I set up this simple database, ran the queries and wrote this post. (However you may notice performance wont be an issue for this test database)

What RDBMS are you using?

Its SQLite for an embedded project, which runs quite fast. I have set up indices on the rows, and beside the final query, execution is fast enough. The real world application consists of 3 tables, and the filter runs on various cols from all three tables, thats because I'm afraid repeating the WHERE-Clause is such a performance penalty. However, using only the first or the second query is fast enough (~150ms).

Thank you for your time,
Alex
 
Instead of this:
...
You may try this:

After condensing your query, it seems there is only one additional join to Location wihtout any advantage. If there will be any effect on performance, it wont be positive. And the old problem remains, the WHERE-clause needs to be inserted double, if the result should only include and count for people fulfilling the requirement.

The main problem with my query is, that the table with the people is getting filtered two times, one time for executing the aggregate COUNT(), and a second time to join with the now-filtered Location/COUNT(Location) table. This makes the whole query slow, because the whole table content needs to be compared two times with the LIKE-statement, without the chance of using an index.

I'm afraid the only solution is using a temporary table, in which the first query with filtered results (Name LIKE "%r%") gets stored. But I think there must be a way to accomplish the same thing, without doing nasty things in the database.

The count of People fulfilling the WHERE-clause is necessary for the ranking. Basically, people of one Location are more important, if the ratio of R/Non-R-People in the Location is high. For this, there is also a column Location.TotalPeople which gets updates by TRIGGERs. At the end, I simply sort by the quote of total people in Location vs. people of R-Type in Location. But thats not much of importance, I simply want to get the count of people in the Location matching the criteria without repeating the criteria.

Thanks for your time,
Alex
 
it seems there is only one additional join to Location wihtout any advantage
the subquery has no join, speeding the aggregate stuff.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
the subquery has no join, speeding the aggregate stuff.

I'm sorry, you're right. I didnt understand the query fully. Adding the WHERE-clause would yield the following:

Code:
SELECT * FROM
       ( People INNER JOIN
               (SELECT LocationID, COUNT(*) AS LocationCount
                       FROM People
                       WHERE People.Name LIKE "%r%"
                       GROUP BY LocationID) As PLoc
               ON People.LocationID=PLoc.LocationID ) AS People
INNER JOIN Location
      ON People.LocationID = Location.LocationID
WHERE People.Name LIKE "%r%"

However, there are two problems:

1) The WHERE-clause is still repeated, making the query as fast as before (filtering is the CPU-intensive part because no indices can be used). To test the real performance gain/loss, I will need to have to port the query to my own, more complicated table layout.

2) I'm no longer able to extend the query to fields other than of the table People. So what about not only filtering on People.Name, but also on Location.LocationName? Unfortunately, in my application I'm filtering on 3 tables, which all need to be joined in the inner aggregate statement so that the count is correctly computed.

Thank you for your time!
Alex
 
Hello,

I did some tweaking on the database itself, and now get about 500msec for my query. However
I tried to do it in steps, to see where much time goes in. These are the actual queries, which correspond
to the sample queries above:

Code:
SELECT Albums.Artist, Albums.Title, Albums.TrackCount, COUNT(*) AS Rank
FROM Tracks
       INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
WHERE
       (Albums.Artist LIKE "%Aphex Twin%" AND Albums.Artist != "") OR
       (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")
GROUP BY
        Albums.AlbumID

This gives me a summary of all albums with matching artist or tracks with matching artist,
with a total count of tracks in it, and the num of matched tracks. This runs in 60msecs and
returns 12 rows.

Code:
SELECT AlbumsHit.Artist, AlbumsHit.Title, Tracks.Artist,
       Tracks.Title, AlbumsHit.TrackCount, AlbumsHit.Rank
       FROM Tracks
       INNER JOIN (
             SELECT Albums.Artist, Albums.Title, Albums.AlbumID, Albums.TrackCount, COUNT(*) AS Rank
             FROM Tracks
             INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
             WHERE
                  (Albums.Artist LIKE "%Aphex Twin%" AND Albums.Artist != "") OR
                  (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")
             GROUP BY
                   Albums.AlbumID ) AS AlbumsHit
             ON Tracks.AlbumID = AlbumsHit.AlbumID

Now thats the list with all tracks in the albums that match the query, with the count
of matched tracks (for the relevance ranking) and the count of total tracks in it. Of course
the list includes tracks which do not match the criteria. This list requires about 180msecs and
returns 195 rows. Now match the whole thing with the discs:

Code:
SELECT AlbumsHit.Artist, AlbumsHit.Title, Tracks.Artist,
       Tracks.Title, Discs.Name, AlbumsHit.TrackCount, AlbumsHit.Rank
       FROM Tracks
       INNER JOIN (
             SELECT Albums.Artist, Albums.Title, Albums.AlbumID, Albums.TrackCount, COUNT(*) AS Rank
             FROM Tracks
             INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
             WHERE
                  (Albums.Artist LIKE "%Aphex Twin%" AND Albums.Artist != "") OR
                  (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")
             GROUP BY
                   Albums.AlbumID ) AS AlbumsHit
             ON Tracks.AlbumID = AlbumsHit.AlbumID
        LEFT OUTER JOIN Discs
             ON Tracks.DiscID = Discs.DiscID

Still about 190 msecs / 195 rows. Now we try to filter out all tracks that do not match the criteria:

Code:
SELECT AlbumsHit.Artist, AlbumsHit.Title, Tracks.Artist,
       Tracks.Title, Discs.Name, AlbumsHit.TrackCount, AlbumsHit.Rank
       FROM Tracks
       INNER JOIN (
             SELECT Albums.Artist, Albums.Title, Albums.AlbumID, Albums.TrackCount, COUNT(*) AS Rank
             FROM Tracks
             INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
             WHERE
                  (Albums.Artist LIKE "%Aphex Twin%" AND Albums.Artist != "") OR
                  (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")
             GROUP BY
                   Albums.AlbumID ) AS AlbumsHit
             ON Tracks.AlbumID = AlbumsHit.AlbumID
        LEFT OUTER JOIN Discs
             ON Tracks.DiscID = Discs.DiscID
        WHERE
             (AlbumsHit.Artist LIKE "%Aphex Twin%" AND AlbumsHit.Artist != "") OR
             (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")

Now we have about 440 msec / 130 rows (twice the time of the previous query). Of course
the output list should include much more information:

Code:
SELECT AlbumsHit.*, Tracks.*, Discs.*, AlbumsHit.Rank
       FROM Tracks
       INNER JOIN (
             SELECT Albums.*, COUNT(*) AS Rank
             FROM Tracks
             INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
             WHERE
                  (Albums.Artist LIKE "%Aphex Twin%" AND Albums.Artist != "") OR
                  (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")
             GROUP BY
                   Albums.AlbumID ) AS AlbumsHit
             ON Tracks.AlbumID = AlbumsHit.AlbumID
        LEFT OUTER JOIN Discs
             ON Tracks.DiscID = Discs.DiscID
        WHERE
             (AlbumsHit.Artist LIKE "%Aphex Twin%" AND AlbumsHit.Artist != "") OR
             (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")

So thats about 480 msec / 130 rows, 8 times (!) the time of the first query, and too much
to be called realtime :). The last filtering doubles the time required for the query. If only
I could re-use the filtered list:

Code:
SELECT Albums.*, Tracks.*
       FROM Tracks
       INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
       WHERE
            (Albums.Artist LIKE "%Aphex Twin%" AND Albums.Artist != "") OR
            (Tracks.Artist LIKE "%Aphex Twin%" AND Tracks.Artist != "")
       AS TEMPTABLE FilteredTracks;

SELECT AlbumsHit.*, Tracks.*, Discs.*, AlbumsHit.Rank
       FROM FilteredTracks AS Tracks
       INNER JOIN (
             SELECT *, COUNT(*) AS Rank
             FROM FilteredTracks
             GROUP BY
                   AlbumID ) AS AlbumsHit
             ON Tracks.AlbumID = AlbumsHit.AlbumID
        LEFT OUTER JOIN Discs
             ON Tracks.DiscID = Discs.DiscID;

Of course, this is not valid SQL syntax, and doesnt work. However I tried to create
a view with the first statement, and this speeds up execution to about 134msecs/130rows,
and perhaps selecting the data into a temporary table would do another speed up. What
can I do to get the query fast enough?

Regards, Alex
 
I'n my dreams, I had the solution:

Code:
SELECT Albums.Artist, Albums.Title, Tracks.Artist, Tracks.Title, Tracks.Extended, Discs.Ordinal, Tracks.Ordinal,
       Albums.TrackCount, Albums.TrackCount AS TracksMatched,
       ((Albums.TrackCount-Albums.TrackCount)*100)/(Albums.TrackCount) AS Rank FROM Tracks
    INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
    LEFT OUTER JOIN Discs ON Tracks.DiscID = Discs.DiscID
    WHERE
         (Albums.Artist LIKE "%Aphex%" AND Albums.Artist != "")
         AND
         (Albums.Title LIKE "%Digeridoo%" AND Albums.Title != "")

UNION ALL

SELECT Albums.Artist, Albums.Title, Tracks.Artist, Tracks.Title, Tracks.Extended, Discs.Ordinal, Tracks.Ordinal,
       Albums.TrackCount, TracksMatched,
       ((Albums.TrackCount-TracksMatched)*100)/(Albums.TrackCount) AS Rank FROM Tracks
    INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
    LEFT OUTER JOIN Discs ON Tracks.DiscID = Discs.DiscID
    INNER JOIN (
          SELECT Albums.AlbumID, COUNT(*) AS TracksMatched FROM Tracks
                 INNER JOIN Albums ON Tracks.AlbumID = Albums.AlbumID
                       WHERE
                          ((Tracks.Artist LIKE "%Aphex%" AND Tracks.Artist != "") OR
                          (Albums.Artist LIKE "%Aphex%" AND Albums.Artist != ""))
                          AND
                          (Tracks.Title LIKE "%Digeridoo%" AND Tracks.Title != "")
                       GROUP BY
                           Albums.AlbumID) AS AlbumsTracksMatched ON
          AlbumsTracksMatched.AlbumID = Tracks.AlbumID
    WHERE
         (((Tracks.Artist LIKE "%Aphex%" AND Tracks.Artist != "") OR
         (Albums.Artist LIKE "%Aphex%" AND Albums.Artist != ""))
         AND
         (Tracks.Title LIKE "%Digeridoo%" AND Tracks.Title != ""))
ORDER BY
        Rank ASC, Albums.Artist, Albums.Title, Discs.Ordinal, Tracks.Ordinal

Not exactly similar, but gives the same result with ~90msec...

Thanks for your time,
regards Alex
 
((Albums.TrackCount-Albums.TrackCount)*100)/(Albums.TrackCount) AS Rank FROM Tracks
This is really what you want (always 0)?

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
No its not, youre right. However, its not very important, because i could simply write "0" :). Albums which did fully match dont have a special ranking... i simply changed the count of matched tracks to the total of matched tracks, thats why it now reads

Code:
Albums.TrackCount-Albums.TrackCount

Simply read it as "Tracks in Album" minus "Tracks in Album that matched" (which is actually the whole album)...

Regards
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top