
Handle Duplicate Entries in Fact


kpmINFO (IS-IT--Management)
Apr 1, 2004
We are planning to load the fact table from the staging tables, but we are getting duplicate entries, which would essentially break the summaries for reporting. Could anyone who has run into this issue shed some light on it? We would be glad for any help.

Thanks,
Info
 
The process I follow is first to extract the data into stage tables that mirror the source system; only the required tables are exported. These are then transformed into my DW stage tables, where the surrogate keys and data cleansing take place. From there I do distinct inserts using a SELECT DISTINCT. I know it is a bit more overhead, but the payoff is that I have yet to have duplicate records.
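As a minimal sketch of that distinct-insert step (the stg_sales and fact_sales tables and their columns are invented for illustration):

-- Load the fact from the cleansed DW stage, collapsing exact duplicates
INSERT INTO fact_sales (date_key, product_key, store_key, sales_amt, qty)
SELECT DISTINCT date_key, product_key, store_key, sales_amt, qty
FROM stg_sales;

Note that SELECT DISTINCT only collapses rows that are identical across every selected column; near-duplicates still get through.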

Shoot Me! Shoot Me NOW!!!
- Daffy Duck
 
I think the underlying question is one of data (or process) quality. How did the bad data (duplicates) occur? You have either:

a) bad data in the source system,
b) created duplicates during your ETL process, or
c) missed or omitted something during data analysis and modeling, resulting in a non-unique PK.
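Before picking a fix, it helps to confirm which of those cases you are in. A quick profiling query along these lines (StageTable, Field1, and Field2 are placeholders for your own table and intended key) will show whether the PK is really unique in staging:

-- List key values that occur more than once, with their counts
SELECT Field1, Field2, COUNT(*) AS dup_count
FROM StageTable
GROUP BY Field1, Field2
HAVING COUNT(*) > 1;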

-------------------------
The reasonable man adapts himself to the world. The unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. - George Bernard Shaw
 
Thanks, all, for the quick responses. Yes, the duplicates are essentially coming from the source system, not being generated in the ETL stage. Our goal was to report on what is available in the source system; we are not responsible for cleansing the data.
 
Cleansing of the data has become a major part of the Data Warehousing effort. If the records are truly duplicates (exact copies), then there are ways to remove them, either directly during the ETL process or after loading. It can also be done in a pre-process using something like the Unix uniq command, the sed or awk utilities, or a small VB or Java program.
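As one sketch of the "after loading" route, assuming the rows are exact copies and the database supports CREATE TABLE ... AS SELECT, as Oracle does (FactTable is a placeholder name):

-- Rebuild the table with one copy of each row, then swap it in
CREATE TABLE FactTable_dedup AS
SELECT DISTINCT * FROM FactTable;

DROP TABLE FactTable;
ALTER TABLE FactTable_dedup RENAME TO FactTable;

Indexes, grants, and constraints on the original table would need to be recreated after the swap.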

-------------------------
The reasonable man adapts himself to the world. The unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. - George Bernard Shaw
 
There are different ways to handle this, as suggested above. If these are true duplicates then a DISTINCT will work; however, if the uniqueness is based on only a few fields, then in Oracle you can use the ROWID to remove the duplicates, for example:

DELETE FROM DuplicateTable
WHERE ROWID NOT IN (SELECT MAX(ROWID)
                    FROM DuplicateTable
                    GROUP BY Field1, Field2);

In SQL Server, or in any other database where ROWID is not readily available, you can certainly create one artificially in the staging table and then use the same SQL above to remove the duplicates.
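In databases that support the ROW_NUMBER() window function (newer versions of SQL Server among them), a window function can stand in for that artificial row id. As a rough sketch, with Field1 and Field2 again standing for the fields that define uniqueness:

-- Number the rows within each duplicate group, then delete all but one;
-- SQL Server lets you delete through the CTE directly
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Field1, Field2
                              ORDER BY Field1) AS rn  -- order is arbitrary here; pick a column if you care which copy survives
    FROM DuplicateTable
)
DELETE FROM numbered
WHERE rn > 1;  -- rn = 1 survives in each group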
 
If someone is constraining you to display "all OLTP data in the DW - even dupes", then you might want to identify a row as primary or do something to give users the option of building their reports on only the "good" data. Ideally the business would tell you how to determine which is primary, but if you have to use something more arbitrary like "oldest is primary", it's better than forcing the report writers to figure it out.
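As a rough sketch of that "oldest is primary" fallback (Oracle-style syntax; the is_primary flag and load_ts column are invented, and the rule assumes load_ts distinguishes the copies):

-- Flag the oldest row in each duplicate group so reports can filter on it
UPDATE FactTable f
SET is_primary = 'Y'
WHERE f.load_ts = (SELECT MIN(d.load_ts)
                   FROM FactTable d
                   WHERE d.Field1 = f.Field1
                     AND d.Field2 = f.Field2);

If two copies share the same load_ts, both get flagged, so the business rule still needs a tie-breaker.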
 