Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Query help

Status
Not open for further replies.

sschrupp

Programmer
Feb 23, 2007
25
US
Hi everybody!

We're upgrading from an Access 2003 database to SQL Server. I'm a complete neophyte to SQL Server but of course have a passing knowledge of SQL from Access and VBA.

Currently one step in our project is to remove duplicates. What I do right now is go through the recordset and compare fields to determine if I should delete a record.

I'm pretty sure this can be achieved strictly with SQL but I'm just not sure if I know how to go about it.

We pull in 4 flat files produced from Business Objects reports. Among the 2-4 million records we sometimes have some duplicated records. The whole records aren't duplicated but some fields are and how we decide to keep a record is based on a couple date fields.

Basically we want to keep the record that has the latest date in the first date field, or if they're both the same then the latest date in the second date field.

Here is what I currently do in VBA:
Code:
    'set up the recordset for finding duplicate accounts in the import table
    ssql = "SELECT [Import Acct Data].[AcctSched], " _
        & "[Import Acct Data].[Current Date], [Import Acct Data].[Idms Region], " _
        & "[Import Acct Data].[Status Date] " _
        & "FROM [Import Acct Data] " _
        & "ORDER BY [Import Acct Data].[AcctSched], " _
        & "[Import Acct Data].[Idms Region], [Import Acct Data].[Current Date] DESC, " _
        & "[Import Acct Data].[Status Date] DESC;"
        
    Set db = CurrentDb()
    Set rs = db.OpenRecordset(ssql)
    
    previousAS = "xxxx"
    rs.MoveLast
    rs.MoveFirst
    
    Do Until rs.EOF
        currentAS = rs.Fields(0)
        currentRegion = rs.Fields(2)
        'if the current one matches the previous one, delete the record
        If currentAS = previousAS And currentRegion = previousRegion Then
            rs.Delete
        Else
            previousAS = rs.Fields(0)
            previousRegion = rs.Fields(2)
        End If
        rs.MoveNext
    Loop

So if my data was like below I'd want to delete the 2nd record:

12345, 08/02/2010, PA, 07/21/2007
12345, 08/02/2010, PA, 06/01/2005
12345, 08/02/2010, PB, 03/05/2008

Is this possible by doing a max/min on the dates and using a "IN" type statement or something? I'm just not sure since there are two dates involved. If the first date isn't the same we want to keep the first record. If the first dates are the same then we look at the second date.

Thanks!
 
Code:
DELETE FROM YourTable
FROM YourTable
INNER JOIN (SELECT AcctSched,
                   [Current Date],
                   [Idms Region],
                   MAX([Status Date]) AS StatDate
            FROM YourTable
            GROUP BY AcctSched,
                    [Current Date],
                    [Idms Region]) Tbl1
ON YourTable.AcctSched      = Tbl1.AcctSched
   YourTable.[Current Date] = Tbl1.[Current Date]
   YourTable.[Idms Region]  = Tbl1.[Idms Region]
   YourTable.[Status Date]  < Tbl1.StatDate

NOT TESTED!!!!!!!!!!
Make sure you have e pretty good backup first!

Borislav Borissov
VFP9 SP2, SQL Server 2000,2005 & 2008.
 
Thanks, bborissov. I see where you're going with that.

But what about the current date field? I need the max of that too. I need the max(current date) first, and then within that I need the max(status date).

Otherwise it's possible an older current date could have a status date that is newer than what is in a newer current date's record. So there might be:

12345, 08/02/2010, PA, 07/21/2007
12345, 08/02/2010, PA, 06/01/2005
12345, 05/01/2010, PA, 03/05/2010

With your code it would delete everything but the 3rd record if I'm reading that right, but I want to keep the first record only.

Almost there, though. Thanks!
 
SQL Server 2005 and up:
Code:
;with MaxDates as (SELECT AcctSched,
                   Max([Current Date]) as MaxCurDate,
                   [Idms Region],
                   MAX(convert(char(10),[Current Date],112) + space(1) +  convert(char(10),[Status Date],112)) AS MaxCombinedDate
            FROM YourTable
            GROUP BY AcctSched,
                    [Idms Region]))


select YourTable.*, tbl1.* from YourTable INNER JOIN cte tbl1
ON YourTable.AcctSched      = Tbl1.AcctSched
   YourTable.[Current Date] < Tbl1.[MaxCurDate]
   YourTable.[Idms Region]  = Tbl1.[Idms Region]
   YourTable.[Status Date]  < convert(datetime, right(Tbl1.MaxCombinedDate,10))

Using packed value idea from these blogs
Including an Aggregated Column's Related Values
Including an Aggregated Column's Related Values - Part 2

PluralSight Learning Library
 
Code:
DELETE FROM YourTable
FROM YourTable
INNER JOIN (SELECT YourTable.AcctSched,
                   YourTable.[Current Date],
                   YourTable.[Idms Region],
                   MAX([Status Date]) AS StatDate
            FROM YourTable
            INNER JOIN (SELECT YourTable.AcctSched,
                               MAX(YourTable.[Current Date]) AS CurrentDate,
                   YourTable.[Idms Region]
                        FROM YourTable
            GROUP BY AcctSched,
                    [Idms Region]) Tbl2 
            ON YourTable.AcctSched      = Tbl2.AcctSched AND
               YourTable.[Idms Region]  = Tbl2.[Idms Region] AND
               YourTable.[Current Date] = Tbl2.CurrentDate
            GROUP BY YourTable.AcctSched,
                    YourTable.[Current Date],
                    YourTable.[Idms Region]) Tbl1
ON YourTable.AcctSched       = Tbl1.AcctSched      AND
   YourTable.[Current Date] <= Tbl1.[Current Date] AND
   YourTable.[Idms Region]   = Tbl1.[Idms Region]  AND
   YourTable.[Status Date]  <  Tbl1.StatDate

AGAIN NOT TESTED!

Borislav Borissov
VFP9 SP2, SQL Server 2000,2005 & 2008.
 
After a little thinking (not much just little ;-))
Code:
[COLOR=blue]DELETE[/color] [COLOR=blue]FROM[/color] YourTable
[COLOR=blue]FROM[/color] YourTable
[COLOR=blue]INNER[/color] [COLOR=blue]JOIN[/color] ([COLOR=blue]SELECT[/color] YourTable.AcctSched,
                   YourTable.[[COLOR=#FF00FF]Current[/color] [COLOR=blue]Date[/color]],
                   YourTable.[Idms Region],
                   [COLOR=#FF00FF]MAX[/color]([Status [COLOR=blue]Date[/color]]) [COLOR=blue]AS[/color] StatDate
            [COLOR=blue]FROM[/color] YourTable
            [COLOR=blue]INNER[/color] [COLOR=blue]JOIN[/color] ([COLOR=blue]SELECT[/color] YourTable.AcctSched,
                               [COLOR=#FF00FF]MAX[/color](YourTable.[[COLOR=#FF00FF]Current[/color] [COLOR=blue]Date[/color]]) [COLOR=blue]AS[/color] CurrentDate,
                   YourTable.[Idms Region]
                        [COLOR=blue]FROM[/color] YourTable
            [COLOR=blue]GROUP[/color] [COLOR=blue]BY[/color] AcctSched,
                    [Idms Region]) Tbl2
            [COLOR=blue]ON[/color] YourTable.AcctSched      = Tbl2.AcctSched AND
               YourTable.[Idms Region]  = Tbl2.[Idms Region] AND
               YourTable.[[COLOR=#FF00FF]Current[/color] [COLOR=blue]Date[/color]] = Tbl2.CurrentDate
            [COLOR=blue]GROUP[/color] [COLOR=blue]BY[/color] YourTable.AcctSched,
                    YourTable.[[COLOR=#FF00FF]Current[/color] [COLOR=blue]Date[/color]],
                    YourTable.[Idms Region]) Tbl1
[COLOR=blue]ON[/color] YourTable.AcctSched       = Tbl1.AcctSched      AND
   YourTable.[Idms Region]   = Tbl1.[Idms Region]  AND
  (YourTable.[[COLOR=#FF00FF]Current[/color] [COLOR=blue]Date[/color]]  < Tbl1.[[COLOR=#FF00FF]Current[/color] [COLOR=blue]Date[/color]] OR
   YourTable.[Status [COLOR=blue]Date[/color]]   <  Tbl1.StatDate)
I think this could be better.

Borislav Borissov
VFP9 SP2, SQL Server 2000,2005 & 2008.
 
Code:
;with MaxDates as (SELECT AcctSched,
                   Max([Current Date]) as MaxCurDate,
                   [Idms Region],
                   MAX(convert(char(10),[Current Date],112) + space(1) +  convert(char(10),[Status Date],112)) AS MaxCombinedDate
            FROM YourTable
            GROUP BY AcctSched,
                    [Idms Region]))


select YourTable.*, tbl1.* from YourTable INNER JOIN cte tbl1
ON YourTable.AcctSched      = Tbl1.AcctSched
AND YourTable.[Idms Region]  = Tbl1.[Idms Region]
  -- Dates comparison
   AND ((YourTable.[Current Date] < Tbl1.[MaxCurDate]
    OR (YourTable.[Current Date] = Tbl1.[MaxCurDate]
    AMD YourTable.[Status Date] <convert(datetime, right(Tbl1.MaxCombinedDate,10)))

PluralSight Learning Library
 
Thanks you two, I'll give these a shot when I'm able. The one by bborissov seems especially straight forward to my neophyte SQL Server skills.

Markros, I'm not entirely sure I understand yours fully.

What does ";with MaxDates" do?

And then where does your "tbl1" come from and what does "cte tbl1" mean?

Sorry, I'm totally new to SQL Server. Thanks if you can clue me in on those!
 
I moved the code to get the latest Max Current date and the Max Service Date for that current date into a separate query called Common Table Expression (cte for short). I named this query MaxDates. You can check this blog post for some introduction to CTE

or just google on "Common Table Expressions". BOL is also a good start.

So, essentially you can split your complex query into steps and put intermediate steps into CTE.

In that CTE I used the packed value technique outlined in the other blogs I referenced (look at the first blog and go down - that technique is described last).

And then I just used this cte in my query and since I was lazy, I just copied Boris query and adjusted to use mine.



PluralSight Learning Library
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top