Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Working With Spaces In Strings

Status
Not open for further replies.

JasonCannon

IS-IT--Management
Apr 11, 2013
30
US
I have just started working with records that have excess spaces within titles and names and I am trying to find the best way to "clean them up."

I cannot trim out all the spaces as I need them between the words.

Examples:
- "A AA Self Storage" should read "AAA Self Storage"
- "A A A Carolina" and "A AA Carolina" should read "AAA Carolina"
- "A A Al-Anon" should read "AA Al-Anon"
- "BB T Bank" should read "BBT Bank"
etc.

Trying to find a better way than to have a long CASE .. ENDCASE statement.

Any suggestions?

Thanks.

"..if you give a man a fish he is hungry again in an hour. If you teach him to catch a fish you do him a good turn."
-- Anne Isabella Thackeray Ritchie.
 
Hi,

There is no clear pattern between how "A AA Self Storage" and "A A A Carolina" should be trimmed but you might have a look at the STRTRAN() function

? STRTRAN(A AA Self, " ","",1,1) && yields "AAA Self"
? STRTRAN(A A A Carolina, " ","",1,2) && yields "AAA Carolina"

hth

MarK
 
Not quite sure what your business rule is. Are you saying that if a "word" contains a single letter, it should be concatenated with the previous (or next?) word? (Your last example [BB T Bank] is ambiguous in that respect.)

Difficult to give a final answer without more information, but your solution would look something like this:

Code:
lcIn = < your input string >
lcOut = ""
FOR lnI = 1 TO GETWORDCOUNT(lcIn)
  lcWord = GETWORDNUM(lcIn, lnI)
  lcOut = lcOut + lcWord + IIF(LEN(lcWord) = 1, " ", "")
ENDFOR 
< your output string = lcOut

This will need tweaking, but it should give you the general idea.

Mike



__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
I am trying to avoid having a long list of STRTRAN()'s for each entry and try to come up with something more general. That, or have some sort of "spell check" table.

That FOR .. ENDFOR loop looks promising. But a problem would be when a title is like "A New Little Store". I do not want to take out the space between "A" and "New".

Much thanks for the help.

"..if you give a man a fish he is hungry again in an hour. If you teach him to catch a fish you do him a good turn."
-- Anne Isabella Thackeray Ritchie.
 
Jason,

As I said before, you need to specify your rules. The examples you gave are useful but they don't cover every case. And they are not consistent. For example, you say you don't want a space between "A" and "New", but you do want one between "A" and "AA".

For the moment, forget about the code. Just set out in English what your rules are. That's an essential first step in solving this kind of problem. (If you can't express the rules in English, you are unlikely to be able to program them.)

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Jason,

You may consider using regular expressions to do the job of pattern matching, and let VFP handle the cleaning up.

This is a start that takes into account your examples and desired processed results.

Code:
LOCAL RegExpr AS VBScript.RegExp
LOCAL Matches		&& objects created by RegExpr
LOCAL Match

m.RegExpr = CREATEOBJECT("VBScript.RegExp")
m.RegExpr.Global = .T.
m.RegExpr.Ignorecase = .F.
m.RegExpr.Pattern = "^([A-Z\s]+\s)*"

CREATE CURSOR Strings (InputData Varchar(200))

INSERT INTO Strings VALUES ("A AA Self Storage")
INSERT INTO Strings VALUES ("A A A Carolina")
INSERT INTO Strings VALUES ("A AA Carolina")
INSERT INTO Strings VALUES ("A A Al-Anon")
INSERT INTO Strings VALUES ("BB T Bank")

SCAN

	m.Matches = m.RegExpr.Execute(Strings.InputData)

	IF !ISNULL(m.Matches) AND m.Matches.Count > 0

		&& a match was found, so fetch the first one
		m.Match = m.Matches.item(0)
		&& remove the spaces from the found match and replace it in the source string
		? "[" + Strings.InputData + "]", "->", "[" + STRTRAN(Strings.InputData, m.Match.Value, CHRTRAN(m.Match.Value, " ", "") + " ") + "]"

	ELSE

		? Strings.InputData

	ENDIF

ENDSCAN
 
Mike,

Yes, working on that.

Atlopes,

Very interesting. Much thanks. Will study that.

Jason...

"..if you give a man a fish he is hungry again in an hour. If you teach him to catch a fish you do him a good turn."
-- Anne Isabella Thackeray Ritchie.
 
Jason

Just a few more points.

The two main aspects of using a regular expression approach are the pattern definition and what to do with matches found in a given string. The pattern I proposed seems to satisfy your requirements, but that may need some additional tweaking in case there your requirements get more complex than you enunciated.

The STRTRAN in my code should attempt to transform only the first occurrence of the match, so the parameter that would set that is missing from the snippet.
 
You set wrong values against correct ones and the correct values have the nature to be shorter here,

So to find a list of matches with or without additional spaces you could match names by removing spaces to find matches and to put all of them in the same group, then replace them with the topmost shortest name in the group.

Code:
Create Cursor crsNames (vName v(50))

Insert into crsNames values ("A AA Self Storage")
Insert into crsNames values ("AAA Self Storage")
Insert into crsNames values ("A A A Carolina") 
Insert into crsNames values ("A AA Carolina") 
Insert into crsNames values ("AAA Carolina")
Insert into crsNames values ("A A Al-Anon")
Insert into crsNames values ("AA Al-Anon")
Insert into crsNames values ("BB T Bank")
Insert into crsNames values ("BBT Bank")

Select Padr(Chrtran(vName," ",""),240) as Grouping, Count(*) as GroupCount, Min(Len(Alltrim(vName))) as LengthOfShortestSpelling ;
from crsNames ;
group by 1;
Having Count(*)>1 ;
into Cursor crsGroups

Select Padr(Chrtran(vName," ",""),240) as Grouping, vName as CorrectSpelling ;
From crsNames ;
Inner Join crsGroups on Padr(Chrtran(vName," ",""),240) = crsGroups.Grouping ;
 And Len(Alltrim(vName)) = LengthOfShortestSpelling ;
into Cursor crsCorrectSpellings


Update crsNames set vName = CorrectSpelling From crsNames;
inner join crsCorrectSpellings on Padr(Chrtran(crsNAmes.vName," ",""),240)==Grouping

Bye, Olaf.


Olaf Doschke Software Engineering
 
This isn't tested for performance on a large dataset.

The expression Padr(Chrtran(crsNAmes.vName," ",""),240) appears repeatedly, so´it may be a good idea to add a grouping column to the names list and compute the field "Grouping" just once for all names, then "inherit" it in the further cursors for processing. To not alter the existing table that might also be done by a 1:1 related temp table or cursor.

The other thing to note is that the assumption you do find the correct spelling among the values is not necessarily true. So a manual step before making the final name updates would be to examine the crsCirrectspellings for actually having correct spellings. All names producing the same "Grouping" value be removing all spaces.

One more detail: Why 240? 240 bytes is the maximum length for values to index and that's even just 120 characters in any other collation than MACHINE, so you might shorten that to 120, too. It should be sufficient to group names without adding too many names into one group and thereby merging differing names into one. Another reason to look into the crsCorrectSpellings. Besides, the shorter this can be, the more effective an index will be. I don't index in code here, Rushmore will add temporary indexes if it can. It's actually a good idea to index the Grouping field so Rushmore optimization doesn't need to.

I leave that an open point, as you have to make some decisions anyway and may already have solved it with the other advice you got.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top