Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Droping Duplicate data 1

Status
Not open for further replies.

jhall01

Programmer
Jul 3, 2003
328
US
I am pulling several different tables and something is causing me to get duplicate data. I can't figure it out but I have an easy work around (or at least i think it is) but i don't know how to do it.

I want to know if i can limit the number of records to display based upon 2 columns. ie FirstName, LastName

In other words i only want to display 1 John Smith and not 7 John Smith's I don't care about the other data in the other columns. I just want to limit 1 record based upon those 2 columns.

Thanks
 
Alternatively you could try:

SELECT FIRSTNAME, LASTNAME
GROUP BY 1,2;

Martin
 
i use

Select Distinct already but i have other columns i have to display ie

SELECT DISTINCT
firstname, lastname, address, city, zip, state,
FROM
mytable

now i don't care about the address info but i do need it there.

This is just an example. the data i am actually pulling is revenue data and can't use it as an example.

I will try the group thing but i think it also throws an error i will try it tomorrow
 
So, are you saying that "Fred Bloggs" may have multiple addresses and you only want one of them?

e.g
FirstName, LastName, Address...
FRED, BLOGGS, 1234 RED STREET
FRED, BLOGGS, 2468 GREEN STREET
FRED, BLOGGS, 1234 BLUE STREET
etc.

This doesn't make too much business sense - each customer is unique, unless your data quality is poor.

You could, of course, check for this -

select FirstName, LastName, count(unique FirstName||LastName) from TableX
group by 1,2

You would then know how many "Fred Bloggs" there are.

If the additional information IS revenue, then the problem is much simpler -

e.g
FirstName, LastName, Revenue...
FRED, BLOGGS, $1234.56
FRED, BLOGGS, $2346.80
FRED, BLOGGS, $3692.58

then the solution is -

select FirstName, LastName, sum(revenue) from tableX
group by 1,2

Roger...
 
Okay...

so i do the group thing and it gives me an error:

"selected non-aggregate values must be part of the associated group"

i am actually dealing with phone accounts not names and addresses and i am trying to group based upon phone number to see 1 instance of any transaction for a given day on that phone number.

So my query is something like this with more detail:

SELECT DISTINCT
name,phone,blah,blah,transaction,blah,blah
FROM
table1,table2,table3,table4
WHERE
state = TX
type = bus
transaction = trantype
GROUP BY 2

so what does that error mean?
 
one more thing...

So what happens is that if there is more than one transaction in a given day on 1 phone number i get multiple records for that number. I pulled all the same type of transaction and only care about 1 single instance of that transaction type for any single phone number.

The data and table structure is so bad i cannot get rid of the duplicate records because the data comes from 3 different systems and the joins are already nasty looking. I am dealing with a system that does over 50k transactions a day.

Thanks for your help
 
If I understand your problem this might work. I have added a sequence number that breaks on phone, transaction then selects the record for phone, transaction that has sequence number 1. This should give you 1 record for each phone, transaction combination. THis assumes that you truly do not care which of the records for the phone, transaction that you choose.

SELECT name
,phone
,blah1
,blah2
,transaction
,blah3
,blah4
FROM (SELECT name
,phone
,blah1
,blah2
,transaction
,blah3
,blah4
,CSUM(1,1) AS trs_seq
FROM table1
,table2
,table3
,table4
WHERE state = 'TX'
AND type = 'bus'
AND transaction = trantype
GROUP BY phone,transaction) a
WHERE trs_seq = 1
 
I really could have misunderstood your problem here but you could try just changing the order of the columns in the select ie:

SELECT DISTINCT(PHONE),NAME, BLAH1, BLA2H,TRANSACTION,......
FROM TABLE1
TABLE2
TABLE3
TABLE4
WHERE STATE = 'TX'
AND TYPE = 'BUS'
AND TRANSACTION = 'TRANTYP';

Martin

 
Yeah I am right there with you martin i just did that before i read your post. Can,t you also make it distinct like this:

SELECT DISTINCT (phone,name),blah,blah,etc

I am just now checking my data pull to see if there is a difference between changing the column order and my original duplicate data and scimming it down in access. I am trying to avoid pulling as much dupe data as possible at this point. No matter what because the tables are so screwy i will have to clean the data up in access.
 
SELECT DISTINCT PHONE, NAME, BLAH1, BLA2H,TRANSACTION,......
FROM TABLE1
TABLE2
TABLE3
TABLE4
WHERE STATE = 'TX'
AND TYPE = 'BUS'
AND TRANSACTION = 'TRANTYP';

will work if you want distinct instances of all the columns selected, that is all the rows with a given phone number have the same values in all the other columns. You will get a row returned each time a value in any of the columns changes. If this is what you want then you can GROUP BY all columns in the select list and get the same results

phone name blah blah2 trans
-----------------------------------------------------------
1234567 fred blah blah bus
1234567 fred blah blah bus
1234567 fred blahx blah bus

SELECT DISTINCT phone
,name
,blah
,blah2
,trans
from ......

will return 2 rows for phone number 1234567
 
anytime i try to group it still gives me that non-aggregate value error. what am i doing wrong with my group by. I am trying to group by phone
 
GROUP BY works on all the values in your select list. If you have 10 columns in your select list you must either group by all 10 or perform an aggregate operation (MIN, MAX,SUM, AVG, COUNT) on the ones not included. See my previous post.
 
DUH!!!! I have been staring at the computer too much!!! Thanks alot!
 
okay jgerstb i am trying your earlier post with the subquery but i run into an error. it tells me:

All expressions in a derived table must have an explicit name

What does that mean. at the end of the From statement at the end of the Group by why do you have the 'a' there?

Thanks
 
The a names the derived table.

query should look like this

SELECT a.name
,a.phone
,a.blah1
,a.blah2
,a.transaction
,a.blah3
,a.blah4
FROM (SELECT name
,phone
,blah1
,blah2
,transaction
,blah3
,blah4
,CSUM(1,1) AS trs_seq
FROM table1
,table2
,table3
,table4
WHERE state = 'TX'
AND type = 'bus'
AND transaction = trantype
GROUP BY phone,transaction) a
WHERE a.trs_seq = 1

The "All expressions in a derived table must have an explicit name" means that you are selecting an expression within the derived table(like I did with CSUM in the above query, I named it trs_seq). Give any expression in the derived table a name like so

SELECT expression AS some_name



 
jgerstb:
Never use "csum(1,1)" on a large data set, this will result in a skewed spool with all data on a single AMP.
And don't use CSUM unless you're still running V2R3, as this is not Standard SQL ;-)

jhall01:
You just want to return any row?

Try a variation of jgerstb's query:
select ...
from table
qualify
// V2R4+
sum(1) over (partition by name, phone
order by COL) = 1
// V2R5
row_number over (partition by name, phone
order by COL) = 1

with COL = any column with a good distribution, e.g. the PI columns

This will return the same rows if you run it several times.
If you really want different/random rows try
"order by random(1,1000000)"

Dieter
 
You may not be able to use
SELECT name
,phone
,blah1
,blah2
,transaction
,blah3
,blah4
,CSUM(1,1) AS trs_seq
FROM table1
,table2
,table3
,table4
WHERE state = 'TX'
AND type = 'bus'
AND transaction = trantype
GROUP BY phone,transaction

Since you are GROUPING BY phone,transaction, all the other blah1,2,3,4 have to be aggregate functions. You cannot use Aggregate functions and OLAP function CSUM() in the query.
It will give an error :
5478: Aggregates are allowed only with Window Functions.

Even if you use a GROUP BY
SELECT DISTINCT
firstname, lastname, MAX(address), MAX(city), MAX(zip), MAX(state),
FROM
mytable
GROUP BY firstname, lastname
it will not work as preferred. Here is why .
first Last addr city zip
FRED, BLOGGS, AD1 MANCHE 45009
FRED, BLOGGS, AD10 INDIA 94556
FRED, BLOGGS, AD33 BOSTON 54345

The result will be
first Last addr city zip
FRED, BLOGGS, AD33 MANCHE 94556

which is not what you may be wanting.

The problem is to eliminate the complete duplicate records and get only one of them.

The only idea I can think of now will be to create a Global temp table with all the required fields from all the source tables, including a CSUM()

INSERT INTO GTT_sometable
SELECT CSUM(1,1) as cntr, <columns> FROM
Table1
JOIN Table2
JOIN Table3

Now do a selectfrom the GTT_somtable using either MIN or MAX

SELECT * FROM GTT_someTable t1
where cntr = (select MAX(cntr) from GTT_sometable t2 where
t1.firstname = t2.firstname and t1.lastname = t2.lastname)

This will filter out just one row. This may be a round about way. But this is one of the methods. Please suggestif anyone can think of anyother.
 
The solution I offered does not mix OLAP and aggregate functions. I tested it and it works. The GROUP BY in this case names the grouping where the CSUM should break. GROUP BY works this way for OLAP functions. It will return all rows with a sequence number which will be reset when phone, transaction group changes.

That being said Dieter has pointed out some problems with this solution that should be heeded.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top