Hello, experts,
I am trying to wrap my mind around the right data cleanup procedure. I have just implemented my SSIS package, that imports data and populates Slowly Changing Dimensions and Fact tables. Now I am trying to understand how would data analyst clean up data (like removing duplicates and fixing spelling errors that lead to multiple records in Dim tables instead of one).
- Would the analyst query resulting Dim/Fact tables and update the ones that do not look correct? But in this case next iteration of incoming data will overwrite the results of this cleanup.
- Would the analyst need to go into production system and fix the dirty data there? This doesn't look very sustainable as well, from security and data safety perspectives.
Also - is there any specific software that tackles this particular issue?
I am trying to wrap my mind around the right data cleanup procedure. I have just implemented my SSIS package, that imports data and populates Slowly Changing Dimensions and Fact tables. Now I am trying to understand how would data analyst clean up data (like removing duplicates and fixing spelling errors that lead to multiple records in Dim tables instead of one).
- Would the analyst query resulting Dim/Fact tables and update the ones that do not look correct? But in this case next iteration of incoming data will overwrite the results of this cleanup.
- Would the analyst need to go into production system and fix the dirty data there? This doesn't look very sustainable as well, from security and data safety perspectives.
Also - is there any specific software that tackles this particular issue?