Hi everybody!
We're upgrading from an Access 2003 database to SQL Server. I'm a complete neophyte to SQL Server but of course have a passing knowledge of SQL from Access and VBA.
Currently one step in our project is to remove duplicates. What I do right now is go through the recordset and compare fields to determine if I should delete a record.
I'm pretty sure this can be achieved strictly with SQL but I'm just not sure if I know how to go about it.
We pull in 4 flat files produced from Business Objects reports. Among the 2-4 million records we sometimes have some duplicated records. The whole records aren't duplicated but some fields are and how we decide to keep a record is based on a couple date fields.
Basically we want to keep the record that has the latest date in the first date field, or if they're both the same then the latest date in the second date field.
Here is what I currently do in VBA:
So if my data was like below I'd want to delete the 2nd record:
12345, 08/02/2010, PA, 07/21/2007
12345, 08/02/2010, PA, 06/01/2005
12345, 08/02/2010, PB, 03/05/2008
Is this possible by doing a max/min on the dates and using a "IN" type statement or something? I'm just not sure since there are two dates involved. If the first date isn't the same we want to keep the first record. If the first dates are the same then we look at the second date.
Thanks!
We're upgrading from an Access 2003 database to SQL Server. I'm a complete neophyte to SQL Server but of course have a passing knowledge of SQL from Access and VBA.
Currently one step in our project is to remove duplicates. What I do right now is go through the recordset and compare fields to determine if I should delete a record.
I'm pretty sure this can be achieved strictly with SQL but I'm just not sure if I know how to go about it.
We pull in 4 flat files produced from Business Objects reports. Among the 2-4 million records we sometimes have some duplicated records. The whole records aren't duplicated but some fields are and how we decide to keep a record is based on a couple date fields.
Basically we want to keep the record that has the latest date in the first date field, or if they're both the same then the latest date in the second date field.
Here is what I currently do in VBA:
Code:
'set up the recordset for finding duplicate accounts in the import table
ssql = "SELECT [Import Acct Data].[AcctSched], " _
& "[Import Acct Data].[Current Date], [Import Acct Data].[Idms Region], " _
& "[Import Acct Data].[Status Date] " _
& "FROM [Import Acct Data] " _
& "ORDER BY [Import Acct Data].[AcctSched], " _
& "[Import Acct Data].[Idms Region], [Import Acct Data].[Current Date] DESC, " _
& "[Import Acct Data].[Status Date] DESC;"
Set db = CurrentDb()
Set rs = db.OpenRecordset(ssql)
previousAS = "xxxx"
rs.MoveLast
rs.MoveFirst
Do Until rs.EOF
currentAS = rs.Fields(0)
currentRegion = rs.Fields(2)
'if the current one matches the previous one, delete the record
If currentAS = previousAS And currentRegion = previousRegion Then
rs.Delete
Else
previousAS = rs.Fields(0)
previousRegion = rs.Fields(2)
End If
rs.MoveNext
Loop
So if my data was like below I'd want to delete the 2nd record:
12345, 08/02/2010, PA, 07/21/2007
12345, 08/02/2010, PA, 06/01/2005
12345, 08/02/2010, PB, 03/05/2008
Is this possible by doing a max/min on the dates and using a "IN" type statement or something? I'm just not sure since there are two dates involved. If the first date isn't the same we want to keep the first record. If the first dates are the same then we look at the second date.
Thanks!