Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Data Cleaning !

Status
Not open for further replies.

RhythmAddict112

Programmer
Jun 17, 2004
625
US
Hi all,
So I'm working on a project where we'll need to go through and cleanse some data. Here is one scenario...

We've got a table that stores educational information. Currently, the "school" column exists as a varchar field. We're going to modify the front-end logic so that users must select from a list of schools. Naturally, we'd like to convert existing free-text values to the list ID's when possible - but this is going to be very, very painful as there, in theory could be as many as 45k records.

I did a quick google, and there seems to be some tools out there available for performing such tasks see -> for a quick sample.

I've never used any of these tools, thus my question is a 2 part-er. 1 - Is there a way/what is the best way for me to implement sql/t-sql to perform this task or at least ease this task?
2 - Has anyone used tools to perform this kind of task? Any thoughts or reccomendations are greatly appreciated.

Thanks!

 
What exactly is wrong with the data to be cleansed? Is this just a question of how should you clean up school names so that they can be used to assign an ID number?

A wise man once said
"The only thing normal about database guys is their tables".
 
Yes. At least at the moment :) I just need to clean up free-text school names and replace them with ID's once that table is created.

Eg, Convert entires like "ASU", "U of Arizona", and "University of Arizona" to ID 124.

 
Ouch!

This is going to be a difficult process.

I would recommend searching posts on here for methods used in standardizing name and address fields. You may be able to work something out based on methods used for that type of task, but this could take some doing. Basically what you want to do is first standardize the school names, then use them to assign the ID.

This is going to be difficult to do, but I am not sure if data cleansing software could do it any better.

And fyi - ASU (Arizona State University - Sun Devils) and University of Arizone (Wildcats) are not the same school.

A wise man once said
"The only thing normal about database guys is their tables".
 
Having done this with chemical names all I can say good luck.
My way was to take a distinct table and manually go through the data putting the "correct" value in a second column then update original, but I only had about 1K to go through.
Lots of Luck,
djj
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top