Data integrity


maxtardiveau (Programmer), May 15, 2002
How do you check the integrity of the data in a database? Say you have a database with a bunch of data. Assume you have inherited that data, and therefore do not know how good (or bad) it is.

The question is: are there tools out there that allow you to define the logical "rules" that the data is supposed to follow (e.g. referential integrity, derivations, replications, maybe even more complex business constraints), and then check whether the data actually obeys these rules?

That would seem like something you'd want to do on a regular basis (like maybe weekly) to get a sense of whether the integrity of your data is getting better or worse. That would also allow you to catch a lot of errors in the data before they cause problems.

I'm not talking about data quality -- I know of many tools that do that. I'm really looking for something that can check the more complex constraints that the data should follow.

Extra bonus points if you can do cross-checks across several databases (like one Oracle and one DB2).

Anyone? Anyone? Bueller?

-- Max
 
What do you mean by "integrity"? Referential integrity is checked by Oracle (as well as by most other databases) automatically. If such constraints are disabled or don't exist at all, just create or enable them, probably with the EXCEPTIONS clause so you get the ROWIDs of the rows that violate the rules. More complex rules should be handled by triggers. If you receive low-quality data from external systems, you can load it into a staging database first. In any case, data integrity should be checked not once bad data has already caused problems, but before the data is exposed.
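For example, a sketch of that approach (the orders/customers tables and the constraint name here are invented for illustration):

-- create Oracle's standard EXCEPTIONS table once (script shipped with the database)
@?/rdbms/admin/utlexcpt.sql

-- try to add (or enable) the constraint; if existing rows violate it,
-- the statement fails and the offending ROWIDs land in EXCEPTIONS
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
  EXCEPTIONS INTO exceptions;

-- inspect the offending rows
SELECT o.*
FROM   orders o
WHERE  o.rowid IN (SELECT row_id FROM exceptions);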

Regards, Dima
 
Hi Dima,

the problem with referential integrity is that it's not turned on in a lot of databases, and turning it on is often not an option.

Same thing with triggers and stored procedures -- a lot of databases don't have any, and defining them will often break existing applications.

And how do you address the problem of discrepancies across heterogeneous databases (one Oracle, one DB2)?

What I have in mind is more an after-the-fact tool that would give me some notion of where I stand.

What do I mean by data integrity? A few examples from a system I worked on:

- an order marked as active can only belong to an active customer

- the sales tax charged to the customer should equal the sum of the order items times the correct tax rate for the state where the order was shipped

- the orders in the sales database (Oracle) should be reflected in the manufacturing database (DB2) and they should agree on all important points

These are things that are tedious to check in SQL, especially when you have thousands of such constraints. How do you manage the results? How do you prioritize them? There's got to be something out there that addresses this problem.
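For instance, the first rule on its own is easy enough to express (table and column names here are invented):

-- orders flagged active whose customer is not active
SELECT o.order_id, o.customer_id
FROM   orders o, customers c
WHERE  o.customer_id = c.customer_id
AND    o.status = 'ACTIVE'
AND    c.status <> 'ACTIVE';

Now multiply that by a few thousand rules, track the results over time, and prioritize the failures -- that's the part I can't find a tool for.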

Thanks,

-- Max
 
I suppose that in most cases this is a problem of the original database design: if you really need data integrity, you should normalize your data and calculate derived values each time instead of storing them (for the sake of efficiency).

1. You may create separate tables for obsolete (inactive) data. Of course, it looks much simpler at development time to add an attribute that marks deleted records, but that's only an illusion: the complexity just moves to another area.

2. In this case, the sales tax may be calculated each time you need it rather than stored.

3. This really calls for a case-specific procedure. Another idea (not suitable for your case, I suppose) is to get the data from the place where it was originally entered.

You may use snapshots or even ordinary views based on remote data, or use middleware to handle distributed transactions, but of course there's no general approach, just as there's no single technique for resolving an arbitrary issue.
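As a sketch of the remote-data idea, assuming a database link named mfg_db2 to the manufacturing system (for example through Oracle's gateway to DB2) and comparable order tables on both sides:

-- orders present in the sales database but missing or different in manufacturing
SELECT order_id, order_total
FROM   orders
MINUS
SELECT order_id, order_total
FROM   orders@mfg_db2;

Run the same query with the two sides swapped to catch rows that exist only on the manufacturing side.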


Regards, Dima
 
Hi,
There are several third-party tools for data cleansing and checking.
We use one from Ascential Software called Quality Manager, but many exist. (I will not post contact info, since this is not the place for commercials.)


 
In any case you will be left with thorny issues. If the sales tax is "wrong", it is STILL the amount the customer was quoted and paid. You can't change the PO; it is a contract with the customer.

Do you WANT to reactivate customers with active PO's? There is a good chance that someone deactivated them for a reason. Again, there is no chance you want to deactivate the PO; it is a contract.

I tried to remain child-like; all I achieved was childish.
 
Max,

I know this is a TECHNICAL forum, and we typically offer only enough information that meets the minimum TECHNICAL requirements of the request. But I hope you will indulge me while I offer some theoretical and architectural thoughts.

First an anecdote: Several years ago, a ball-bearing-manufacturing company noticed that the ball bearings they were producing were slightly out of "perfect round". One of their engineers developed a machine that gently ground the "extra" material off the out-of-round ball bearings, resulting in an acceptable product. That machining process represented a capital investment of $27,000. (Sounds like a third-party, after-the-fact cleansing tool to me.) Had the company done proper root-cause analysis, they could have corrected their original manufacturing problem with some simple adjustments that would have been a $450 expense.

This brings me to my concerns with some of your specifications: 1) "...the problem with referential integrity is that it's not turned on in a lot of databases, and turning it on is often not an option..." 2) "...Same thing with triggers and stored procedures -- a lot of databases don't have any, and defining them will often break existing applications."

One of the first principles of data quality and integrity is, "Let bad data live as short a life as possible." This means, don't even let bad data make it to the database. Reject bad data at the data-entry level.
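At the database level, declarative constraints are the simplest way to apply that principle: the bad insert or update simply fails. A sketch (the table and status values are illustrative only):

-- reject bad rows at write time instead of cleaning them up afterwards
ALTER TABLE orders MODIFY (customer_id NOT NULL);

ALTER TABLE orders
  ADD CONSTRAINT ck_orders_status
  CHECK (status IN ('ACTIVE', 'SHIPPED', 'CANCELLED'));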

I infer from the two issues above that your organization is willing to live with "bad data" because enforcing "goodness" will break the application. By implication, this means that the application is either broken already or it is unable to handle normal business rules, or both. What is the financial risk to the company if their data is wrong and the application is unable to enforce goodness?

You also mentioned, "These are things that are tedious to check in SQL, especially when you have thousands of such constraints. How do you manage the results? How do you prioritize them?..." If constraints and triggers are "not an option", then queries against the database looking for bad data are your only option, no matter how tedious. The only choice is whether you "home grow" the queries or buy/acquire a third-party scrubbing tool.
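If you do home-grow them, a rule-driven harness keeps thousands of checks manageable. A minimal sketch, with every table and column name invented for illustration (each check_sql is a query, without a trailing semicolon, that returns the violating rows):

-- one row per business rule
CREATE TABLE dq_rules (
  rule_id    NUMBER        PRIMARY KEY,
  rule_desc  VARCHAR2(200),
  severity   NUMBER,               -- used later for prioritizing
  check_sql  VARCHAR2(4000)
);

-- one row per rule per run
CREATE TABLE dq_results (
  rule_id    NUMBER,
  run_date   DATE,
  violations NUMBER
);

-- run every rule and log how many rows violate it
DECLARE
  v_cnt NUMBER;
BEGIN
  FOR r IN (SELECT rule_id, check_sql FROM dq_rules) LOOP
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM (' || r.check_sql || ')'
      INTO v_cnt;
    INSERT INTO dq_results (rule_id, run_date, violations)
    VALUES (r.rule_id, SYSDATE, v_cnt);
  END LOOP;
  COMMIT;
END;
/

Schedule that weekly and the violation counts give you the "getting better or worse" trend Max asked about.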

As far as managing and prioritizing: Once you find bad data,

1) Determine if there are business risks if you "fix" the data. (As JimboPalmer wisely mentioned, you may have a CONTRACT that prevents you from "fixing" bad data.)

2) Identify which fixable data will produce the biggest return on investment by fixing it. (This prioritizes the fixes.)

3) Identify the method to implement the fix: Do you implement the fixes via application software or via scripted SQL*Plus updates? In either case, you must also establish an appropriate audit trail.

4) Implement a prevention method to ensure bad data no longer enters the database. This may take the form of updating your application to prevent bad data, implementing database declarative constraints, implementing database triggers, or all three. But to say that you cannot do any of the above is to say we do not have the resources to "do it right", but we have the resources to "do it over". (Remember the "ball-bearing" example.)

Once you have scrubbed the data, there should be no reason that you cannot implement database constraints, triggers, or both. If your application "breaks" when you enforce proper business rules, then FIX THE APPLICATION so it doesn't break. What good is an application that does not accommodate business-rule enforcement?
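For example, Max's first rule could be enforced at write time with a trigger along these lines (a sketch only; the table, column, and status names are assumptions):

CREATE OR REPLACE TRIGGER trg_orders_active_customer
BEFORE INSERT OR UPDATE OF status, customer_id ON orders
FOR EACH ROW
WHEN (NEW.status = 'ACTIVE')
DECLARE
  v_cust_status customers.status%TYPE;
BEGIN
  -- look up the owning customer; a foreign key should guarantee it exists
  SELECT status INTO v_cust_status
  FROM   customers
  WHERE  customer_id = :NEW.customer_id;

  IF v_cust_status <> 'ACTIVE' THEN
    RAISE_APPLICATION_ERROR(-20001,
      'An active order must belong to an active customer');
  END IF;
END;
/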

Okay, I'll hop off my soap box now. Thanks for your indulgence.

Dave
 
And my patience too :). "Let bad data live as short a life as possible." Excellent slogan.

Regards, Dima
 