Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Cleaning Text 1

Status
Not open for further replies.

micang

Technical User
Aug 9, 2006
626
US
SQL 2000 - QA

Hi All,

I have some data, 50 odd million rows. (samples below). I have to cleanse this data as much as possible. How would I go about:

A:) Deleting anything where the first character is not a letter or number or actually, delete anything that is not a letter or number until it get's to either a letter or number.

B:) Once the above is done I will have to look through the whole string and remove anything that is not a letter or number.


I have to do this in at least 2 steps because once A:) is done, we have to analyse the data because there could be cases where having a space or dash in the middle is indeed a valid string (which then would nean another post to get help from you guys!)

93369.0E.397665
93337-1E473959
933371.E.483277
W0LF7AC A64V34481
94369-9V-209046
94369-9V/266046
*V1N9B08DJY636619
RCD7127847
/RCD7127856

As usual, many thanks in advance for any assistance.

Michael
 
does this work

LEFT(yourfield, 1) NOT LIKE '%[A-Z][0-9][a-z]%'

-DNG
 
The below code should do both but you need two steps.....

Code:
Public Function DeleteEmbedded(CharsToDelete As String, FromString As String)
        Dim i As Integer, s As String

        For i = 1 To Len(FromString)
            If InStr(CharsToDelete, Mid$(FromString, i, 1)) = 0 Then
                s = s & Mid$(FromString, i, 1)
            End If
        Next i
        DeleteEmbedded = s
 End Function

You would use it like:

DeleteEmbedded("-.*/\[]{}", [field])

And it will strip out any character found between the double quotes and return just the remainder....still in the same order.

All that ebing said, you could probably use this with two different calls as long as the possible first characters are not characters you'd expect throughout the rest of the data, which may not be the case....

=======================================
People think it must be fun to be a super genius, but they don't realize how hard it is to put up with all the idiots in the world. (Calvin from Calvin And Hobbs)

Robert L. Johnson III
CCNA, CCDA, MCSA, CNA, Net+, A+, CHDP
VB/Access Programmer
 
Thanks. I am a newbie mstrmagel1768. Is that VB code then?
 
The algorithm I show here is likely to be very slow, but since you are cleaning data (once and done), I suppose it should be acceptable.

First, create this UDF
Code:
Create Function [dbo].[GetCleanValue]
	(
	@Data VarChar(50)
	)
Returns VarChar(50)
As
Begin
	Declare @Temp VarChar(50)
	Declare @i Integer
	Declare @Output VarChar(50)

	Set @i = 1
	Set @Output = ''

	While @i <= Len(@Data)
		Begin
			Set @Temp = SubString(@Data, @i, 1)
			If @Temp Like '[0-9A-Z]'
				Set @Output = @Output + @Temp

			Set @i = @i + 1

		End

	Return (Select @Output)

End

Then, test it on your table (with just a couple rows)

Code:
Select top 100 Field, dbo.GetCleanValue(Field)
From   table


-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Hi George,

I have no idea what that does, but it does do the trick.

I see that it does both steps in one, could I split it up? So first get rid of anything in the begining of the string that is not a number or letter. Then once we have looked at that date we can see if we have to run step 2.

Many thanks so far, this is great.

Michael

PS - I see it creates the function, but I can;t see where it is stored? It's not in FUNCTION folder (in Query Analyser), should it not be there?

 
It will be in the function 'folder'. You need to refresh the list.

To get just those values where the first letter is not a number or letter, you could use...

Code:
Select Field
From   Table
Where  Left(Field, 1) Not Like '[0-9A-Z]'

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Thanks George.

But how would I go about actually deleting those first characters?

Code:
Select Field
From   Table
Where  Left(Field, 1) Not Like '[0-9A-Z]'

I can see how this select the data, but how do I get it to actually delete the data before the 1st numeric or alpha character?

Thanks so much, you're helping me so much.

Michael
 
Code:
Update Table
Set    Field = Right(Field, Len(Field) - 1)
Where  Left(Field, 1) Not Like '[0-9A-Z]'

This should do what you want. You should make a backup copy of your database before running this. That way I don't feel guilty if you lose data. [smile]



-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Thanks George.
Yes, that did work.
One more thing though, let's say some strings have 2 or more invalid characters (space ?*/ etc), can the code be modified to update taking this into account? As far as I see currently it will only remove first character that is not a number or letter.


Now later on today or tomorrow I will probably have to remove invalid characters in the middle or any part of the string....that's not gonna be fun.

Michael
 
Code:
Update Table
Set    Field =  Right(Field, Len(Field) - PatIndex('%[0-9A-Z]%', Field) + 1)

I encourage you to lookup PatIndex and CharIndex in books on line.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
George, you're a star, that does the trick, I have tested it on sample data with various combinations.

As for the other issue, I will try my best by looking at Pat and Char index as you suggest.

You know what it's like when a boss says "Just remove any weird characters in the middle of that data, but keep ones that have spaces or dashes or maybe dots or even a slash, they could be correct...." !!!!!!

Um hello, I wish I could just type your sentence into SQL and it would do it...!

Michael
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top