Data Cleanup Issue. Is there any way to do this?

biggens · Nov 27, 2008

Hi,

I have been given the task of cleaning up a list of company data. Ive removed all the duplicates and done some tidy ups but I have an issue I cant work out.

The problem is that there are company names that are the same except for where this example occurs;

Companyname
Companynane Ltd

Both identities are the same, however I want to remove the identity without the Ltd on the end.

There are a 000's of records so i cant compare them individually.

Is there a way to do this using SQL?

Thanks in advance.

biggens · Nov 27, 2008

Typo...

Companyname
Companyname Ltd

ingresman · Nov 28, 2008

you could try something like

Code:

select
concat("delete from <tab> where <col> ='",left(col1,locate("ltd",col1)-2),"'")
from t1 where locate("ltd",col1) > 0;

replacing <tab> and <col> with your table and column names.
You can write this out to a file which you can then run against the database
Hope it helps !

biggens · Nov 28, 2008

Pretty neat solution ingresman, thanks

Only issue here is that there can be more than one occurrence of the same company.

ie - I may end up with with a result like;

delete from table where company='companyname'
delete from table where company='companyname'
delete from table where company='companyname'

So I figure I had better use the ID col instead?

What do you think?

biggens · Nov 28, 2008

Also in some cases there are two or more records where companies have the "Ltd". (they operate in different locations)

So how can adjust this so where there are two or more companies one remains?

Sorry, Im on my learners permit with SQL - I only know the basics.

ingresman · Nov 28, 2008

Yes ID might be the best way.
I'd run the query and see what you get out of it. If you get 1000's of row you might have an issue, if you get 100's you might be better off just editing the file manualy.
To get only one occurane of an output record try putting the word distinct after the select and see what comes out.
To handle more than one occurance directly in SQL will mean you having a look at the group by and having clauses (kind of a where variation) which might help you out.
Great excersie to get into SQL but quite often for one off tasks doing a bit of manual work is quicker than using technology.
One final thing, try to describe the entire problem in you first post, just saves everyones time !

biggens · Nov 28, 2008

Thanks ingresman,

I have around half million records to check so manual is just not an option for me.

I tried the SQL and it doesnt quite do the job becasue it just gets every case of LTD which means i would be deleting any record whith LTD at the end.

I need to keep the record with the ltd and remove the remaining record(s) without the ltd.

Sorry if i didnt make this clear.

biggens · Nov 28, 2008

Ok I get it.
If the title doesnt exist after the ltd is trimmed, it wont be deleted. In this case I cant really use id using this method.

ingresman · Nov 28, 2008

your gettign me confused, there will only be a delete issued for a company name which exists with "LTD" at the the. In reality there might not be a physical record to delete.
How's your coding - you might have to create a custom app to handle 500,000 rows.
I've got to go now and won't be able to get to a PC (party on !!), so if no one picks you up over the weekend I'll see you on Monday

biggens · Nov 28, 2008

Well i use MySQL server on my PC and MySQL-front to manage the SQL & data which is good (and fast) - data file is 130Mb.

Anyway what your script does is trim off the "ltd" from any company name with "ltd" and writes out a delete script;
delete from <tab> where companyname = 'company', which i then run to delete the company names i dont want.

the delete script would skip any record it doesnt match and delete any record it does (anything with ltd), leaving the company with the ltd intact - right!

Only issue is that would be one huge SQL query as around 30% of names have ltd. @ 150,000 lines long.

Solution is a good one but due to the size of the data I think I need another method.

I was thinking of setting another column and inserting value '1' where the company has ltd.

- then I can trim off the ltd from all names.
- then remove duplicates by group using the flag col '1'

Then I can write back the ltd to companies flagged '1'

Its all I can think of unless somebody else has a better idea.

ingresman · Dec 1, 2008

because each line is a free standing delete you have no issues with transaction size, it just might take a little time to run. You coud break the file up into severla smaller ones and run one after another.
Try your method and see what happens.
My advice would be don't get hung up on effiency or having the most elegant solution if your only doing this once.

biggens · Dec 1, 2008

Ok Thanks ingresman.

Ill try your method

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Data Cleanup Issue. Is there any way to do this?

biggens

Technical User

biggens

Technical User

ingresman

IS-IT--Management

biggens

Technical User

biggens

Technical User

ingresman

IS-IT--Management

biggens

Technical User

biggens

Technical User

ingresman

IS-IT--Management

biggens

Technical User

ingresman

IS-IT--Management

biggens

Technical User

Similar threads

Part and Inventory Search

Sponsor