I perform a monthly insert of multiple text files into a SQL Server 2012 database.
I bulk insert the text files into the respective temporary table and then use INSERT INTO to load from the temporary table into the production table after performing various validations.
The text files have duplicate records that I want to exclude from being loaded from the temporary table into the production table.
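For reference, the monthly load itself is a plain BULK INSERT per file, along these lines (the path and options here are illustrative, not my actual values):

```sql
-- Illustrative only; the real file path, terminators, and header handling
-- vary by text file.
BULK INSERT #EquipmentData1
FROM 'C:\MonthlyLoad\EquipmentData1.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2);
```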
Duplicate records are defined as any records in the text file that share the same values for [JurisdictionCode], [AccountNo], [SequenceNumber] and [SaleDate]. Per review, there are relatively few duplicate records - approximately 3% of the total records in any given text file. (Thought - why let this relatively small number of records prevent establishing the composite primary key constraint? Load the valid data and keep moving!)
By removing the duplicate records, I will be able to create a primary key constraint on the aforementioned four fields.
The plan is to load all of the valid data, then investigate the invalid and duplicate records later.
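For context, the primary key I have in mind would be along these lines (the constraint name is just mine):

```sql
-- Hypothetical constraint name; the four key columns are the ones above.
ALTER TABLE Equipment_2014
ADD CONSTRAINT PK_Equipment_2014
    PRIMARY KEY ([JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]);
```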
Using the code below, I have not been able to load the data from the temporary table into the production table while excluding the duplicate records.
The query returns 0 records.
I am currently troubleshooting but have not been able to locate the reason why the query returns no records.
[Bold]
What modifications are needed to load the data from the temporary table while excluding the duplicate records that appear within the text file?
[/Bold]
Code:
INSERT INTO Equipment_2014
SELECT RTRIM(LTRIM([JurisdictionCode])),
       RTRIM(LTRIM([AccountNo])),
       RTRIM(LTRIM([SequenceNumber])),
       .
       .
       .
       CASE
           WHEN LEN([SaleDate]) = 8
                AND ISDATE([SaleDate]) = 1
                AND RTRIM(LTRIM([SaleDate])) LIKE '[0-9][0-9][0-9][0-9][0-1][0-2][0-3][0-9]'
           THEN CONVERT(date, [SaleDate], 112)
           ELSE CAST(NULL AS date)
       END AS [SaleDate]
FROM #EquipmentData1
WHERE [SaleDate] BETWEEN CONVERT(date, '20140101') AND CONVERT(date, '20141231')
  AND RTRIM(LTRIM([SaleAmount])) IS NOT NULL
  AND RTRIM(LTRIM([SaleDate])) IS NOT NULL
  AND RTRIM(LTRIM([JurisdictionCode])) IS NOT NULL
  AND RTRIM(LTRIM([AccountNo])) IS NOT NULL
  AND NOT EXISTS (
      SELECT t1.*
      FROM #EquipmentData1 t1
      JOIN (
          SELECT [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
          FROM #EquipmentData1
          GROUP BY [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
          HAVING COUNT(*) > 1
      ) t2 ON t1.[JurisdictionCode] = t2.[JurisdictionCode]
          AND t1.[AccountNo] = t2.[AccountNo]
          AND t1.[SequenceNumber] = t2.[SequenceNumber]
          AND t1.[SaleDate] = t2.[SaleDate]
      ORDER BY [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
  )
GO
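For what it's worth, one direction I have been considering (untested sketch) is to drop the self-join inside the NOT EXISTS and instead correlate it to the outer row, so that only keys appearing more than once are excluded:

```sql
-- Untested sketch: keep only rows whose four-column key appears exactly once
-- in the temporary table; the column list is abbreviated as in the code above.
INSERT INTO Equipment_2014
SELECT ...
FROM #EquipmentData1 e
WHERE NOT EXISTS (
    SELECT 1
    FROM #EquipmentData1 d
    WHERE d.[JurisdictionCode] = e.[JurisdictionCode]
      AND d.[AccountNo]        = e.[AccountNo]
      AND d.[SequenceNumber]   = e.[SequenceNumber]
      AND d.[SaleDate]         = e.[SaleDate]
    GROUP BY d.[JurisdictionCode], d.[AccountNo], d.[SequenceNumber], d.[SaleDate]
    HAVING COUNT(*) > 1
);
```

My thinking is that, as written in my code above, the subquery is not correlated to the outer row, so it asks "do any duplicates exist anywhere in the table?" rather than "is this row a duplicate?" - but I may be missing something.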
Another thought - Is it possible/feasible to bulk insert the text file into a staging table (rather than a temporary table) so I can track the records that were not loaded into the production table and facilitate future review of these "leftover" records? [Bold]Further, I could add a field that designates the problem with the record - e.g., "Not Valid Sale Date", "Duplicate Record", etc.[/Bold]
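Something like this is what I was picturing for the staging table (the names and types are just placeholders):

```sql
-- Hypothetical staging table; real column types would match the file layout.
CREATE TABLE EquipmentDataStaging (
    [JurisdictionCode] varchar(10),
    [AccountNo]        varchar(20),
    [SequenceNumber]   varchar(10),
    [SaleDate]         varchar(8),
    [SaleAmount]       varchar(20),
    -- NULL = loaded to production; otherwise the reason it was held back
    [LoadProblem]      varchar(50) NULL
);

-- After the production load, flag the leftovers, e.g. the duplicates:
UPDATE s
SET [LoadProblem] = 'Duplicate Record'
FROM EquipmentDataStaging s
JOIN (
    SELECT [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
    FROM EquipmentDataStaging
    GROUP BY [JurisdictionCode], [AccountNo], [SequenceNumber], [SaleDate]
    HAVING COUNT(*) > 1
) d ON d.[JurisdictionCode] = s.[JurisdictionCode]
   AND d.[AccountNo]        = s.[AccountNo]
   AND d.[SequenceNumber]   = s.[SequenceNumber]
   AND d.[SaleDate]         = s.[SaleDate];
```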
Initially, I created a "Duplicate Records Table" and loaded the duplicate records into it. I then attempted to join the temporary table, #EquipmentData1, to the DuplicateRecordsTable and delete the duplicate records from the temporary table. From there, I planned to load the validated data, excluding the duplicate records, from the temporary table into the production table.
I was not able to get the following code to work.
Code:
DELETE FROM #EquipmentData1 A
INNER JOIN DuplicateEquipmentData B
B.[JurisdictionCode] = A.[JurisdictionCode]
AND B.[AccountNo] = A.[AccountNo]
AND B.[SequenceNumber] = A.[SequenceNumber]
AND B.[SaleDate] = A.[SaleDate]
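I suspect the T-SQL form needs the alias right after DELETE and an explicit ON clause, i.e. something closer to this (my best guess, untested):

```sql
-- My best guess at the syntax I was after: alias after DELETE, explicit ON.
DELETE A
FROM #EquipmentData1 A
INNER JOIN DuplicateEquipmentData B
    ON  B.[JurisdictionCode] = A.[JurisdictionCode]
    AND B.[AccountNo]        = A.[AccountNo]
    AND B.[SequenceNumber]   = A.[SequenceNumber]
    AND B.[SaleDate]         = A.[SaleDate];
```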
I would appreciate any additional insight regarding best practices for handling duplicate records in the source text files.