Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Data Cleanup Issue. Is there any way to do this?

Status
Not open for further replies.

biggens

Technical User
Nov 27, 2008
8
AU
Hi,

I have been given the task of cleaning up a list of company data. Ive removed all the duplicates and done some tidy ups but I have an issue I cant work out.

The problem is that there are company names that are the same except for where this example occurs;

Companyname
Companynane Ltd

Both identities are the same, however I want to remove the identity without the Ltd on the end.

There are a 000's of records so i cant compare them individually.

Is there a way to do this using SQL?

Thanks in advance.
 
you could try something like
Code:
select
concat("delete from <tab> where <col> ='",left(col1,locate("ltd",col1)-2),"'")
from t1 where locate("ltd",col1) > 0;
replacing <tab> and <col> with your table and column names.
You can write this out to a file which you can then run against the database
Hope it helps !
 
Pretty neat solution ingresman, thanks :)

Only issue here is that there can be more than one occurrence of the same company.

ie - I may end up with with a result like;

delete from table where company='companyname'
delete from table where company='companyname'
delete from table where company='companyname'

So I figure I had better use the ID col instead?

What do you think?


 
Also in some cases there are two or more records where companies have the "Ltd". (they operate in different locations)

So how can adjust this so where there are two or more companies one remains?

Sorry, Im on my learners permit with SQL - I only know the basics.
 
Yes ID might be the best way.
I'd run the query and see what you get out of it. If you get 1000's of row you might have an issue, if you get 100's you might be better off just editing the file manualy.
To get only one occurane of an output record try putting the word distinct after the select and see what comes out.
To handle more than one occurance directly in SQL will mean you having a look at the group by and having clauses (kind of a where variation) which might help you out.
Great excersie to get into SQL but quite often for one off tasks doing a bit of manual work is quicker than using technology.
One final thing, try to describe the entire problem in you first post, just saves everyones time !
 
Thanks ingresman,

I have around half million records to check so manual is just not an option for me.

I tried the SQL and it doesnt quite do the job becasue it just gets every case of LTD which means i would be deleting any record whith LTD at the end.

I need to keep the record with the ltd and remove the remaining record(s) without the ltd.

Sorry if i didnt make this clear.
 
Ok I get it.
If the title doesnt exist after the ltd is trimmed, it wont be deleted. In this case I cant really use id using this method.
 
your gettign me confused, there will only be a delete issued for a company name which exists with "LTD" at the the. In reality there might not be a physical record to delete.
How's your coding - you might have to create a custom app to handle 500,000 rows.
I've got to go now and won't be able to get to a PC (party on !!), so if no one picks you up over the weekend I'll see you on Monday
 
Well i use MySQL server on my PC and MySQL-front to manage the SQL & data which is good (and fast) - data file is 130Mb.

Anyway what your script does is trim off the "ltd" from any company name with "ltd" and writes out a delete script;
delete from <tab> where companyname = 'company', which i then run to delete the company names i dont want.

the delete script would skip any record it doesnt match and delete any record it does (anything with ltd), leaving the company with the ltd intact - right!

Only issue is that would be one huge SQL query as around 30% of names have ltd. @ 150,000 lines long.

Solution is a good one but due to the size of the data I think I need another method.

I was thinking of setting another column and inserting value '1' where the company has ltd.

- then I can trim off the ltd from all names.
- then remove duplicates by group using the flag col '1'

Then I can write back the ltd to companies flagged '1'

Its all I can think of unless somebody else has a better idea.








 
because each line is a free standing delete you have no issues with transaction size, it just might take a little time to run. You coud break the file up into severla smaller ones and run one after another.
Try your method and see what happens.
My advice would be don't get hung up on effiency or having the most elegant solution if your only doing this once.
 
Ok Thanks ingresman.

Ill try your method :)
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top