Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

select distinct each record

Status
Not open for further replies.

peac3

Technical User
Jan 17, 2009
226
AU
Hi guys,
i have two tables would be joined using ref_n and because ref_n the same, if i joined using ref_n, it would create cartessian join.
How would I avoid it by having one auth person each records like example below.

Code:
table1
post_id ref_n  post_d
1        50    2018-10-01
2        50    2018-10-02

table2
auth_id ref_n auth_person
3        50   Nina 
4        50   Bill

If I join table1 and table2 with ref_n the result would be below


post_id ref_n  post_d        auth_person
1        50    2018-10-01    Nina
1        50    2018-10-01    Bill
2        50    2018-10-02    Nina
2        50    2018-10-02    Bill

I would like to have result like below, one auth person each for record

post_id ref_n  post_d        auth_person
1        50    2018-10-01    Nina
2        50    2018-10-02    Bill
 
Your outcome right now is this:
[pre]
post_id ref_n post_d auth_person
1 50 [blue]2018-10-01[/blue] Nina
1 50 [red]2018-10-01[/red] Bill
2 50 [red]2018-10-02[/red] Nina
2 50 [blue]2018-10-02[/blue] Bill
[/pre]
But you want it to be:
[pre]
post_id ref_n post_d auth_person
1 50 [blue]2018-10-01[/blue] Nina
2 50 [blue]2018-10-02[/blue] Bill
[/pre]
So what is the rule of getting the [blue]BLUE[/blue] dates and not the [red]RED dates[/red] in your outcome? [ponder]


---- Andy

There is a great need for a sarcasm font.
 
This is a case of bad data design and you don't make it manageable creating a join matching lowest post_id with lowest auth_id etc.

There is no such join type. So where does this data originally come from, I assume you do have a per row ID, that you simply dropped out of your lists, but that would now help make the right matching condition.

Technically you can join to orderable lists by ordering them and giving each intermediate result (subquery) a row_number to match with the "corresponding" row of the other list. But that's nonsensical in terms of database relationships and even if that solves this and similar cases, this would be a bad way of handling it.

Bye, Olaf.





Olaf Doschke Software Engineering
 
Hi Andrzejek,

I know this is not the great data design but that is the data that we deal with when we source from external party.

what we are trying to do is to avoid cartessian join and dedup them.

the result would be either one of them below by joining from ref_n

Code:
post_id ref_n  post_d        auth_person
1        50    2018-10-01    Nina
2        50    2018-10-02    Bill 

post_id ref_n  post_d        auth_person
1        50    2018-10-01    Bill
2        50    2018-10-02    Nina

The question is more like how do we dedup them from cartessian join output?

 
What I'd like to see is further data, a third table could make a normal join possible.

What you're doing is guessing which record belongs to which, ref_n isn't allowing that. I'd look for tables with post_id and auth_id in them, that would be the missing link.

Code:
SELECT COLUMN_NAME,TABLE_NAME
FROM        INFORMATION_SCHEMA.COLUMNS
WHERE       COLUMN_NAME LIKE '%post_id%' OR COLUMN_NAME LIKE '%auth_id%' OR COLUMN_NAME LIKE '%ref_n%'
ORDER BY    TableName,ColumnName;

Does this show some more tables?

Bye, Olaf.

Olaf Doschke Software Engineering
 
Both Nina and Bill have 2 records with dates of 2018-10-01 and 2018-10-02
You want your outcome to be for Nina 2018-10-01, but for Bill 2018-10-02 WHY?
What determines which date goes with which person?

I would understand if you want both people to have 2018-10-01 (Min date) or 2018-10-02 (Max date), but your example shows no logic (that I can see).

So you still did not answer my question. :-(

"the result would be either one of them below by joining from ref_n"
In my opinion - you CANNOT do it.

(Unless somebody could prove me wrong :) )

---- Andy

There is a great need for a sarcasm font.
 
Hi Andrzejek,

Here I give you more data, if you see the source data, table 2 is actually extension of table 1 but the linkage sometimes are not quite unique.
the reference number is meant to be unique for each bill but sometimes people make mistake by putting same number.

Code:
table1
post_id ref_n  post_d
1        50    2018-10-01
2        50    2018-10-02
3        11    2018-10-03
4        12    2018-10-04

table2
auth_id ref_n auth_person tran_a
3        50   Nina          25
4        50   Bill          25
5        11   Tom           20
6        12   Shan          30

At the end, we would like to get total transaction with who roughly owns the bill but we don't want double counting the number.
usual query is below

Code:
Select 
 T1.*,
 T2.*
FROM TABLE1 T1
  INNER JOIN TABLE2 T2 ON T1.REF_N = T2.REF_N

but the query above will create cartessian join for ref_n 50 with 4 records and we would like to dedup it with 1 bill for each Nina and Bill.
 
peac3 said:
the query above will create cartessian join for ref_n 50 with 4 records and we would like to dedup it with 1 bill for each Nina and Bill.

Again, with which date for Bill and Nina: 2018-10-[red]01[/red] or 2018-10-[red]02[/red]?
Or doesn't matter?



---- Andy

There is a great need for a sarcasm font.
 
Something like:

[TT]Select T1.post_id, T1.ref_n, [BLUE]MAX([/BLUE]T1.post_d[BLUE])[/BLUE] AS post_d, T2.auth_person
FROM TABLE1 T1
INNER JOIN TABLE2 T2 ON T1.REF_N = T2.REF_N [BLUE]
GROUP BY T1.post_id, T1.ref_n, T2.auth_person[/BLUE][/TT]

(Not tested)


---- Andy

There is a great need for a sarcasm font.
 
it doesn't work using either max(post_d) or min(post_d), because it will result the post_d for bill and nina the same.
and when we segregate daily, it would not be reconciled.

your codes result would be like below:
Code:
post_id ref_n  post_d        auth_person tran_a
1        50    2018-10-02    Nina         25
2        50    2018-10-02    Bill         25

we want one post_d for each which doesn't matter who so the result would be either
Code:
post_id ref_n  post_d        auth_person tran_a
1        50    2018-10-01    Nina         25
2        50    2018-10-02    Bill         25

or

post_id ref_n  post_d        auth_person tran_a
1        50    2018-10-01    Bill         25
2        50    2018-10-02    Nina         25
 
So, what you want to say to your data base is: give me any date for Bill, I don’t care which date you give me, just pick one arbitrary, random date from table1, it does not matter to me. And then, give me any date for Nina as long as it is not the same date as the date you gave me for Bill’s record.

If that’s the case, you can do it – providing you have additional fields in your tables, like a Primary Key field. But the SELECT statement will be so convoluted you will not like it, because you would have to have imbedded Select inside another Select referring to the previous record number asking for the data that does not appear in the previous record, etc. Ugly…

Unless somebody else, smarter than me (which is easy :) ) will jump in with the solution.



---- Andy

There is a great need for a sarcasm font.
 
peac3 said:
the reference number is meant to be unique for each bill but sometimes people make mistake by putting same number.
Well, there you describe the core problem. It's not impossible to fix this, is it? update one ref_n.

You can easily find double ref_n by

And technically make sure no double ref_n is allowed by making a unique constraint on that field:
A database should take care of the integrity of data, not only but also the referentiaöl integrity. You don't fix wrong data with random pick choices, that works in some examples but then you don't fix a problem, you suppress an error and seep it under a rug, that's not solving it and may introduce wrong deduping in further scenarios, eg when a ref_n is even used three times. I guess it also would be much more important if tran_a would differ.

It would be best to know what ref_n references to, what table1 and2 are, what the whole real-world meaning of this is, but one thing is clear, this case has an easy solution by changing the ref_n and ensuring no double entry in the future, you don't need complicated SQL, you just need fixed data.


Olaf Doschke Software Engineering
 
Andy,

This seems like a common table expression may be the right choice here. Commonly it is used for Hierarchy joins with multiple ancestors where you want the full chain.

My brain always hurts when I have to look up the example but this may be another choice case for one. That said, they are on the "avoid it if you can" list. - I thought you might want to look into it.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top