Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Delete identification Process 1

Status
Not open for further replies.

daison

Programmer
Mar 15, 2004
2
0
0
US
Hi,

We are planning to use an ETL tool(most
probably Informatica) to load data from
flat files to more Generic Database.

So i have one general question. Is
there any way I can identify the
deletion of records from the database
which is not there in the flat file.

This is how it goes..I have around 14
flat files where the data is related
within these data files. Lets take an
eg of Movie.data and Award.dat file. In
this one movie is associated to n
number of awards. Now the provider has
given yesterday 4 awards are associated
to one movie, i have linked these 4
awards to the movie in the database.
Now today he has given only 3 awards
are linked to that movie. Now I have to
identify the extra award which is
linked to that movie in the database
and knock of that record. The same way
I got come complicated relation between
these 14 flat files. And the provider
doesnt provide any information like
this record has to get deleted or
something like that.

So I need to know is there any general
pattern I can use to identify the
deletion of these records.

looking for your valuable suggestions.

Thanks
Daison
 
I am inferring that you are receiving a new snapshot of the data for each update; that is, a periodic refresh. If so, the best way to handle deletes is to keep a copy of the previous snapshot and use a system utility such as Unix diff or sdiff to find records in previous snapshot not in current snapshot.

Alternatively, load new shapshot data into temporary table with same structure as target table and write a query against the existing table where primary keyfield not in temporary table (which contains new snapshot).

If this does not make sense, or if my assumption of periodice snapshot/refresh of data is incorrect, please explain your situation further.

Best Regards.

Sometimes the grass is greener on the other side because there is more manure there - original.
 
Expanding on John's assumption, if you are keeping only current snapshot can you then do the complete refresh of required tables instead of trying to figure out which records are omitted?

Anand
 
And building on Anand's statement....if you are not required to keep track of deletes, which his method assumes, you should truncate the table, drop the indexes, load the table, and rebuild the indezes.

I think Daison needs to keep the deletes; most DW's do keep deleted records and mark them as deleted or inactive. This is because historical reports may reference the deleted record. In addition, by capturing the date of the delete, it makes it possible for reports to be reproducible, so you can report the record as active for all periods prior to logically deleting the record in the warehouse. It also depends if the delete is a business activity or the correction of an error.

As a general rule, we do not physically delete records in the DW for the reasons listed above, other similar reasons, and to maintain integrity and an audit trail.

Sometimes the grass is greener on the other side because there is more manure there - original.
 
You are right John,

We never delete or even update existing record set in DW. Reading Daison's original post however, I got the impression that it is more like correction rather than new composition of existing activity. I may have assumed it wrong.

Anand
 
Hi,

We cannot delete the records permanently from the DB. We have to flag the records because business want to know abt the deleted records and also we have to carry these deletion to the downstream systems also.

Considering millions of records which is getting periodically refreshed, we cannot truncate the table and load the data daily. This will have a serious performance issue as well as higly complex for the downstram system to figure out what has changed what not.

So we are trying to figure out a general pattern in which we can identify the deletes.

As john suggested we can do a UNIX diff and find the difference of the files which has chnaged and load it into a seperate table and finally find the difference. But considering the volume of the data and the number of tables. In our case we are having around 6 providers giving these kinds of different content and together its comes around 70-80 flat files.

So I am wondering what is the general approach in DataWarehouse to handle these kind of scenarios.

Thanks john and joshi for your valuable suggestions.

Thanks
Daison
 
Well Daison, you just about have all the current methods for dealing with deletes. Another alternative is to have the source system place a delete trigger or otherwise write the deleted row out to a "deleted records" file. Another althernative is to have the source system add a DELETE or INACTIVE flag, and have the trigger (or source system) populate that field during a delete; naturally, you keep this column in your warehouse to indicate a "logically" deleted record. As far as I can tell, that's about all the alternatives you have.

Maybe I've missed one? Or someone can do some serious innovative thinking outside the box.....

Sometimes the grass is greener on the other side because there is more manure there - original.
 
well here is another idea, but i am not sure if it works,
you can add a loadingDate column to your target table and just load in the new snapshot of data into the table with new loading date when ever you load the data. but if you are going to get large amount of data, the table size may grow incrementally. if the size doesnt matter then you can go for it, as if you want to get the latest info you can query target on latest loadingDate.
if loadingdate doesnt work for somehow, you can have a coulmn which contains loading sequence number.. and you can increment this sequence number when ever the new record inserts with the same primary key and this sequence number should be part of the primary key
i really dont know it works or not, just a thought....

Kishore MCDBA
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top