Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

These results are mathematically impossible!

Status
Not open for further replies.

dmg2206

MIS
Feb 15, 2002
54
US
A little background: I work for a small not-for-profit organization. We have a PostgreSQL server containing all of our client information dating back to 2001. I've been asked to do some data mining. The problem is that I'm getting weird results.

Note: "tbclients" is the table containing basic information on all clients. "tbvisits" is a table recording when a client comes in. The "clientid" field in the "tbvisits" table is linked to the "id" field in the "tbclients" table.

Here are some queries I've run, and the results the system has given me:

Code:
select count(*) from tbclients
where id in (select distinct clientid from tbvisits
where date > '2004-05-29'
and date <= '2005-05-29');

Number returned: 1840


Code:
select count(*) from tbclients
where id in (select distinct clientid from tbvisits
where date > '2005-05-29'
and date <= '2005-06-29');

Number returned: 425


Code:
select count(*) from tbclients
where id in (select distinct clientid from tbvisits
where date > '2004-05-29'
and date <= '2005-05-29')
and id not in (select distinct clientid from tbvisits
where date > '2005-05-29'
and date <= '2005-06-29');

Number returned: 0

Now shouldn't the third number be at least 1415? Or am I totally missing something here?
 
simple example:
Code:
 id	date	filter match
--------------------------
1	2002-01 -
2	2002-01	-
3	2002-01	-

1	2003-01 -
2	2003-01	-
3	2003-01	-

1	2004-06	1
2	2004-06	1

2	2005-06	2
3	2005-06	2

select count(*) from tbclients
where id in (1, 2)
and id not in 
	(2, 3);
-> id in (1), 3 rows
Code:
select count(*) from tbclients
where id in (1, 2)
7 rows
Code:
select count(*) from tbclients
where id not in (2, 3)
3 rows

Your arithmetic would work, if you were looking for the number of distinct IDs, but you're counting the rows matching those IDs.

don't visit my homepage:
 
I think you missed the point of my issue.

Let's take your example, except I'll shorten the table to make things more clear.

Code:
 id    date    filter match
--------------------------
1    2002-01    -
2    2002-01    -
3    2002-01    -

1    2003-01 -
2    2003-01    -
3    2003-01    -

1    2004-06    1
2    2004-06    1

Now let's check the following queries.

Code:
select count(*) from tbclients
where id in (1, 2)
This will return 6 rows.

Code:
select count(*) from tbclients
where id in (2, 3)
This will return 5 rows. Note that in my second query, there is no "not".

For the third query, I'm asking for a count of everything that was counted in the first query but not counted in the second query. I can assume that the number of rows returned should be at least 1, and probably more than that.
Code:
select count(*) from tbclients
where id in (1, 2)
and id not in (2, 3);
In this case, we should have 3 rows. However, my system is insisting that 0 rows are returned.
 
The queries in my original post are the exact queries I'm using.
 
then what is this returing

select count(*) from (
select distinct clientid from tbvisits
where date > '2004-05-29'
and date <= '2005-05-29'
EXCEPT
select distinct clientid from tbvisits
where date > '2005-05-29'
and date <= '2005-06-29'
) AS tmp;
 
ceco, that query does work. Thank you.

It appears that there's a bug in the way PostgreSQL (ver 8.2.4) handles "not in" clauses, particularly when it contains a subquery that returns a large number of results.

The following works.
Code:
select count(*) from tbclients
where id not in (
select clientid from tbvisits
where date > '2005-05-29'
and date <= '2005-06-27');


Result:

 count 
-------
  6526

The following does not work.
Code:
select count(*) from tbclients
where id not in (
select clientid from tbvisits
where date > '2005-05-29'
and date <= '2005-06-28');


Result:

 count 
-------
     0
 
Hi

Do you have some [tt]null[/tt] values in the clientid column ? The [tt]in[/tt] operator implies equality check and checking equality on [tt]null[/tt] gives unpredictable result.

Feherke.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top