Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

clean up the database

Status
Not open for further replies.

subs1678

IS-IT--Management
Sep 11, 2007
13
0
0
US
i have a freight database which has a field called as consignee. This field has millions of records and has mispellings all the way since the data is typed manually by diff people. for example remou is spelt as rem, remo, rems etc. Is there a easier way to correct all the spellings at once.

I really appreciate any help.
 
it depends on ur skill level. You can:

1) use an update query to update the column (would have to do each variation in spelling singly. i.e. one run to correct REMO to REM, one to do REMS to REM).

2) use a macro to do the update.

3) Use a vba module. This has the most flexibility and could update all speeling corrections at the same time.
 
thanks for the reply-lestatdelioncourt . Could you please help me with method no 3. I am quite new to this database. I would appreciate if you could tell me step by step

thanks a lot in advance
 
I think you first need to come up with an algorithm for deciding who is the "same" person. Looking at the record, how would you determine that "Rem" really is "Remou", and not "Remy"?

You may try by comparing other fields. For example, your rule may be that if either the Address or the Phone Number are the same, it is the same person.

But I don't think you will be able to come up with a totally automatic way. I think the best you can do is present the user with a possible list of duplicates, but then it is the user's choice to say yes/no, this is the same person. If the user says "Yes", you write further code that keeps one record as the "real" one, and assign it as the proper value for all the duplicates.

BTW, once you clean it up I highly recommend you make a separate Consignee table (as discussed in your previous thread) and normalize your tables so that when editing the main record, the user is obliged to pick a Consignee from the related table.

 
I think any solution will be dependant on the software and network that his database is running on. When you are dealing with millions of records your solutions are different than dealing with thousands. Please provide some more information.
1) What application is this database? I assume that it is not Access being that there are millions of records. Seems out of the scope for Access.
2) Do you really need to worry about all of the records or just current and future records? I assume some of the data is historical or do you really have millions of current accounts?
3) I agree with Joe that there is no way to automate this, but you can build a nice interface to quickly allow a user to look at a name and compare it to possible discrepencies and then have the user choose which ones to update. If I was dealing with Access and a few thousand records, I think my design would be two listboxes. I would pick a name out of a list box and it would populate a second listbox with possible discrepencies. Maybe if the selected name was Remou the query for the second box would include
Remou*
If I find any possible matches I select them in the second multi select list box and run my update query.
then I would hit a button and it would return records
Remo*
repeat process
then
Rem*
then
Re*

Sure I could just return all records like R* but I think I would want to look at the most likely and fix them first.
Even If this was possible on such a large database, It would take a person weeks to fix the discrepencies. Maybe if you only need to worry about records from the last few months this approach is doable.

4) I think you could write code to clean up the obvious mistakes, maybe a 75 percent solution. Have the code run through the table and look for obvious possiblities.
One and Two word names: R,Re
Four consanants: Rmmmou
Four vowels: Remouou
Two names differing by one letter: Remou; Remmou
etc.
This code could run for a long time and spit out possible results to a table. The user could then come back at a later time and look at these possiblities.
 
Now that I think of it, in my proposed design when I expand my search to go from returning
Remou*
to
Remo*
I would not include in the discrepency list any records that were returned in Remou*, and not selected as a discrepency. In other words if Remou* returned
Remou
Remous
Remousch
and these are all distinct names, then the expanded search would not include these names.

I would also include other buttons to do different types of searches.
 
You may also consider an algorithm based on SOUNDEX.

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
i am using Access and believe me or not, one table which is called freight table has right now 1.4 million records ( 2006 and 2007 YTD). Each record has origin st, origin city, destn st, destn city, paid amt, transport amt, weight, paid date, ship date, consignee, shipper,client, zip codes,ocountry, destn country - there are 25 fields in one table and a million reords. Since every field is associated with ocity, ost, dcity, dst, it is not even possible to divide this table. In this table the consignee field is messed up with different spellings.
 
If I had to do this, I think I would first create a table of unique entries:

[tt]Select Consignee As ConsignKey, Consignee As ConsignChange, Count(Consignee) As CountOfConsignee Into CorrectConsignee From tblTable Group By Consignee[/tt]

You can then correct the ConsignChange column and use it to update the main table through a join to ConsignKey. If you are lucky, the Consignees with the lowest counts are the ones that are incorrect.





 
Yeah that is only key to a strategy, are there millions of unique needs with some repeated twice or are there a few thousdands names repeated many times. I was thinking this was like a moving company that moved a million different people, but as Remou pointed out if this was like a parts distributor with a few thousand different clients repeated many times your stategy would be different.
 
I personally have never worked with that many records. What is the size of your database? I am not sure about A2007, but I think 2Gig is the limit for older versions. At this size do you have to create a lot of temp tables for performance purposes? How long does it take to do a simple select query on that table?

Since every field is associated with ocity, ost, dcity, dst, it is not even possible to divide this table.
Maybe, depends on your data. If your are shipping and delivering to hundred of thousand unique destinations I would say you are right. However, if your in a shipping hub that delivers back and forth between the same couple of hundred specific locations then you could make and address table and put a key in for destination and origin.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top