From the staging tables we are planning loading the fact , but we are having duplicate entries which would essentially break the summaries for reporting.. Did anyone who has undergone the issue would throw some light would be glad?/
The process I follow is first to extract the data into stage tables that mirror the source system, only the required tables are exported. These are the tranformed into my DW stage tables where the Surrogate keys and data cleansing take place. From there I do Distinct inserts using a Select Distinct. I know it is a bit more of an overhead but the payoff i that I have yet to have duplicate records.
I think the underlying question is one of data (or process) quality. How did the bad data (duplicates) occur? Either you have:
a) bad data in the source system
b) created duplicates during your ETL process
c) missed or omitted something during data analysis and modeling, resulting in a non-unique PK.
-------------------------
The reasonable man adapts himself to the world. The unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. - George Bernard Shaw
Thanks All for quick responses, Yes it's essentially coming from the sourcesystem and not being generated in the ETL stage , Our goal was to do reporting of what is available on sourcesystem and not responsible for cleansing of the data.
Cleansing of the data has become a major part of the Data Warehousing effort. If the records are truly duplicates (exactly), then there are ways to remove them either during the ETL process directly or after loading. In addition, it can be done in a pre-process using something like the Unix uniq command, or the sed or awk utilities, or a small VB or java program.
-------------------------
The reasonable man adapts himself to the world. The unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. - George Bernard Shaw
There are different ways to handle this as suggested above. If these are true duplicates then a distinct will work however if the uniqueness is only based on few fields then in case of Oracle you could use the ROWID to remove duplicates for example:
Select * from DuplicateTable
Wehre ROWID = (Select max(ROWID) from DuplicateTable
Group by Field1, Field2)
In case of SQL server or in any other database where ROWID is not readily available you can certainly artificially create one in the staging table and then use the same sql above to remove the duplicates.
If someone is constraining you to display "all OLTP data in the DW - even dupes", then you might want to identify a row as primary or do something to give users the option of building their reports on only the "good" data. Ideally the business would tell you how to determine which is primary, but if you have to use something more arbitrary like "oldest is primary", it's better than forcing the report writers to figure it out.
This site uses cookies to help personalise content, tailor your experience and to keep you logged in if you register.
By continuing to use this site, you are consenting to our use of cookies.