
need quick loop to remove all spaces but one


pms18066

Programmer
Jan 12, 2005
I have data in name fields like
'John Doe      MD'.

I need to remove all spaces except one where they exist. I know I can do several replace statements one after another, but I don't know how many spaces exist and can't get my WHILE loop to work right. Can someone give me a quick loop to remove all of the spaces except one?

i.e., while two spaces exist in the string, replace them with one space.

Thanks!
 
First, create this function:

Code:
Create Function dbo.RemoveDoubleSpace(@Data VarChar(8000))
Returns VarChar(8000)
As
Begin

	-- Keep replacing pairs of spaces with single spaces until
	-- no double space remains anywhere in the string.
	While CharIndex('  ', @Data) > 0
		Set @Data = Replace(@Data, '  ', ' ')

	Return @Data

End

Then you can use it like this...

Code:
Select dbo.RemoveDoubleSpace(YourColumnName)
From   YourTableName

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Thanks! I modified it a little so I wouldn't have a function to call, but the logic I was looking for is there...

Code:
While (select CharIndex('  ', name) from raw1) > 0
begin
update raw1 set name = Replace(name, '  ', ' ')
end

I see the likelihood of using this in future sources, so I'm going to create a function later and add it to our library for everyone to use as well.
 
Oops, I get an error now... hmmm!

Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
 
If you're going to do it that way, I would suggest a couple minor changes.

Code:
While Exists(select * From Raw1 Where CharIndex('  ', name) > 0)
  begin
    update raw1 
    set name = Replace(name, '  ', ' ')
    Where CharIndex('  ', name) > 0
  end

There are a couple advantages here. First is the Exists. Since you are looking at all the rows in the raw1 table, exists will "stop looking" as soon as it finds a row with 2 consecutive spaces. It does not look at all the data (until there are no more rows that have double spaces).

Second, your update statement is going to run against every row in your table. By adding a where clause, you will only be updating rows that contain a double space.

These suggestions may not make a noticeable improvement in performance if the number of rows in the table is small, but with a larger table, you should notice a speed improvement. More importantly, it's a good habit to get into.

Most of the time, I advise against over-using functions. In this case, I don't necessarily think it's bad to use it because this function is not "touching" any tables. Sure, it's using data from a table, but it's not calling a table inside the function. This means that it ought to be very fast to execute. You see... functions are called per row so it's relatively easy to write a function that performs poorly. For example, if you used the function method, your other code would look like this...

Code:
Update Raw1
Set    Name = dbo.RemoveDoubleSpace(Name)
Where  CharIndex('  ', Name) > 0

Now, suppose there are 10,000 rows in the table. Also suppose there are 1,000 rows with double spaces. This update statement will only affect 1,000 rows, but more importantly, the function will only get called 1,000 times. If the function used any tables, that would be 1,000 table hits (which would be pretty slow). Since the function is not using a table (it's only working on a scalar value and returns a scalar value), it ought to be very fast to execute.

Make sense?

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Yes, it makes perfect sense! I love your explanations!!! They are very helpful! Works great too!

thanks again!

One question... Is using name like '%  %' vs. CharIndex going to affect performance?

i.e., is using like "fuzzy" logic, whereas CharIndex is straightforward?
 
There is no real difference either way.

Sometimes using a LIKE search can be considerably faster than charindex, but this is a special case situation. If you wanted to find a string that started with a certain character (or series of characters) and that column was indexed, then a like search would be faster. Other than that, there is not much difference in performance.
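
For example, here's a sketch of that special case (using a made-up table dbo.People with an index on LastName, just for illustration):

Code:
-- The leading-literal LIKE is sargable, so the optimizer can seek the index:
Select * From dbo.People Where LastName Like 'Smi%'

-- CharIndex can't use the index, so every row gets scanned:
Select * From dbo.People Where CharIndex('Smi', LastName) = 1

For the double-space search in this thread, LIKE '%  %' starts with a wildcard, so both approaches end up scanning the table anyway.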

-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
BWAA-HAAA!!!! I go away for a couple of years and when I come back, I see people still using loops for the same ol' problem. There's no need for loops in this, folks. I'll explain...

I'll use the letter "O" to act as a "visible" space so you can see what I'm talking about... consider the following sets of spaces, please....

O
OO
OOO
OOOO
OOOOO
OOOOOO
OOOOOOO
OOOOOOOO

Remember the goal is to convert all of those sets of spaces to just a single space without the use of RBAR even if the RBAR is through the simple use of a UDF. I'll also state that the goal is to remove the extra spaces without making the original string any larger in the process because it may already be as large as it can be for the given datatype.

So... STEP 1 is to identify pairs of spaces. This is done by modifying the second space in each pair of spaces to be an "unlikely" character. In this case, I'll use the visible character of "X" (which isn't unlikely but serves this visual example) to represent a space that has been changed in a pair of spaces. When we replace all pairs of spaces "OO" with "OX", we get the following:

O
OX
OXO
OXOX
OXOXO
OXOXOX
OXOXOXO
OXOXOXOX

STEP 2 is to replace all occurrences of "XO" with NOTHING...

O
OX
O[highlight]XO[/highlight]
O[highlight]XO[/highlight]X
O[highlight]XO[/highlight][highlight]XO[/highlight]
O[highlight]XO[/highlight][highlight]XO[/highlight]X
O[highlight]XO[/highlight][highlight]XO[/highlight][highlight]XO[/highlight]
O[highlight]XO[/highlight][highlight]XO[/highlight][highlight]XO[/highlight]X

... and that leaves us with ...

O
OX
O
OX
O
OX
O
OX

STEP 3 is to replace "X" with NOTHING...

O
O[highlight]X[/highlight]
O
O[highlight]X[/highlight]
O
O[highlight]X[/highlight]
O
O[highlight]X[/highlight]

... and that leaves us with just single spaces everywhere...

O
O
O
O
O
O
O
O

Again... the "O"s represent unmodified spaces and the "X"s represent spaces changed to some unlikely character, like maybe one of the graphic characters that occur above ASCII 127. It's just a matter of 3 nested REPLACE statements to handle ANY number of spaces to accomplish the things I laid out above. It can all be done in a single set-based query without loops or even UDFs.

And, without further ado, here's a code example.

Code:
--===== Create and populate a test table.
     -- This is NOT a part of the solution.
DECLARE @Demo TABLE(OriginalString VARCHAR(8000))
 INSERT INTO @Demo (OriginalString)
 SELECT 'This      has multiple   unknown                 spaces in        it.' UNION ALL
 SELECT 'So                     does                      this!' UNION ALL
 SELECT 'As                                does                                this'

--===== Reduce each group of multiple spaces to a single space
     -- for a whole table without functions, loops, or other
     -- forms of slow RBAR.
     -- CHAR(7) is the nonprintable "bell" character that most cannot type into a string.
 SELECT OriginalString,
        REPLACE(
            REPLACE(
                REPLACE(
                    OriginalString
                ,'  ',' '+CHAR(7))
            ,CHAR(7)+' ','')
        ,CHAR(7),'') AS CleanString
   FROM @Demo

Heh... please stop suggesting loops.


--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
The problem I have with your approach is the use of an "unlikely" character.

I quickly tested the performance of your method against the performance of my method. Your method is faster (congratulations). I tested against a table I have that contains 1,000,000 rows. Your method was 2 seconds faster. That averages to approximately 2 microseconds per string. Of course, I will readily admit that the data plays a large role in this. Longer strings are likely to perform differently, and the number of occurrences of a double space in a string will affect it also. The same goes for the number of strings without any double spaces at all.

Regardless, the problem I have with your method is still the use of an "unlikely" character. In my database, I would prefer to take a 2 microsecond hit per string and be confident that I "got it right". My data is too important to me.

I encourage you to read the advice I give to many people (in other threads). If you do, you will see that I rarely advise loops. Given that the stated goal is to update the data in a table, it's clear that this is a "data cleanup" task... one that is unlikely to occur too often.

Given all that, I stand by my advice. I will admit that there is a faster method, but not a better method.



-George

"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
George,

First, thanks for the feedback, especially since you're one of the folks who believe in "million row testing".

What did you use to measure the run time, and did you return the million row result set to the screen? If you did, you measured the performance incorrectly, because the time to return results to the screen is the "great equalizer".

To get a true measure of the performance, you have to dump the output to a trash variable.
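
A sketch along these lines (reusing the @Demo table from my earlier post) keeps the display time out of the measurement:

Code:
--===== Assign the result to a "trash" variable so we measure the
     -- query itself, not the time it takes to paint the screen.
DECLARE @Bitbucket VARCHAR(8000)

    SET STATISTICS TIME ON
 SELECT @Bitbucket = REPLACE(
                         REPLACE(
                             REPLACE(
                                 OriginalString
                             ,'  ',' '+CHAR(7))
                         ,CHAR(7)+' ','')
                     ,CHAR(7),'')
   FROM @Demo
    SET STATISTICS TIME OFF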


--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Ok... I know what your objections are, George, but I have a serious objection to RBAR that someone may use for reasons other than what you intended. You've outlined some very serious limitations for use and, if someone ignores them, they'll end up with some serious performance problems on their hands. We're not talking about two seconds over a million rows either... we're talking about 3/4 of a minute of CPU time and more than a minute and 20 seconds in total duration difference on 1/10th of the rows you say you tested on.

Here's the test code I ran on an old 1.8 GHz single-CPU machine...
Code:
/**********************************************************************************
        Test #1: 3 sets of random (1 to 20) spaces to repair (100,000 rows) w/Update
        Note: This testing should be done returning the results to the grid so
              that you can easily view the returned run statistics for comparison.
**********************************************************************************/
  PRINT REPLICATE('=',80)
  PRINT 'Test #1:  3 sets of random spaces to repair (100,000 rows) w/Update'
  PRINT REPLICATE('=',80)

--=================================================================================
--      Setup the test tables
--=================================================================================
  PRINT 'Creating 2 test tables...'

--===== Create two identical tables
DECLARE @Demo1A TABLE(OriginalString VARCHAR(8000))
DECLARE @Demo1B TABLE(OriginalString VARCHAR(8000))

--===== Populate the first table with text that has 3 random sized sets of spaces
 INSERT INTO @Demo1A (OriginalString)
 SELECT TOP 100000 
        'This is a test'           + SPACE(ABS(CHECKSUM(NEWID()))%20+1)
      + 'with 3 random sized sets' + SPACE(ABS(CHECKSUM(NEWID()))%20+1)
      + 'of spaces'                + SPACE(ABS(CHECKSUM(NEWID()))%20+1)
      + 'using only 100,000 rows'
   FROM Master.dbo.SysColumns sc1
  CROSS JOIN Master.dbo.SysColumns sc2

--===== Copy table A to table B so we have two identical copies to work with.
 INSERT INTO @Demo1B
        (OriginalString)
 SELECT OriginalString
   FROM @Demo1A

--=================================================================================
--      Test the set based update method
--=================================================================================
  PRINT REPLICATE('=',80)
  PRINT '========== Jeff Moden''s method =========='
    SET STATISTICS TIME ON

--===== Reduce each group of multiple spaces to a single space
     -- for a whole table without functions, loops, or other
     -- forms of slow RBAR.  In the following example, CHAR(7)
     -- is the "unlikely" character that "X" was used for in 
     -- the explanation.
 UPDATE @Demo1A
    SET OriginalString =
        LTRIM(RTRIM(
            REPLACE(
                REPLACE(
                    REPLACE(
                        OriginalString
                    ,'  ',' '+CHAR(7))  --Changes 2 spaces to the OX model
                ,CHAR(7)+' ','')        --Changes the XO model to nothing
            ,CHAR(7),'')                --Changes the remaining X's to nothing
        ))
  WHERE CHARINDEX('  ', OriginalString) > 0
    SET STATISTICS TIME OFF

--================================================================================
--      Test the RBAR update method
--================================================================================
  PRINT REPLICATE('=',80)
  PRINT '========== George''s original UDF =========='
    SET STATISTICS TIME ON
 UPDATE @Demo1B
    SET OriginalString = dbo.RemoveDoubleSpace(OriginalString)
  WHERE CHARINDEX('  ', OriginalString) > 0
    SET STATISTICS TIME OFF
...and here are the results from that test...
Code:
================================================================================
Test #1:  3 sets of random spaces to repair (100,000 rows) w/Update
================================================================================
Creating 2 test tables...

(100000 row(s) affected)

(100000 row(s) affected)
================================================================================
========== Jeff Moden's method ==========
[b][COLOR=green]
SQL Server Execution Times:
   CPU time = 4750 ms,  elapsed time = 5724 ms.[/color][/b]

(99992 row(s) affected)
================================================================================
========== George's original UDF ==========
[b][COLOR=red]
SQL Server Execution Times:
   CPU time = 52234 ms,  elapsed time = 88882 ms.[/color][/b]

(99992 row(s) affected)

Again, keep in mind that's just for a lousy 100,000 rows and not the million rows you say you tested on. And, yes, you solved the OP's problem and you've explained the limitations you've observed, but if someone uses your While Loop code for something you hadn't intended them to use it for, they're in deep Kimchie.


--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Jeff,

I think your reasoning is not the best. I know George already covered this, but I feel it useful to restate it more strongly.

Sometimes we do things as professionals that not all beginners can understand, but when we do, it should be for a compelling reason. And I think that the tiny performance benefit your method gains is not worth the extra cost (difficulty understanding it, potential collision with existing column values).

In my opinion, inserting values into strings that "shouldn't already be in the data" is a recipe for someone, somewhere, royally screwing up his data. Weighing that against the possibility that some SQL neophyte might inappropriately use a loop on something because he once saw a loop used for an entirely different problem, I'd take the inappropriate loop (whose cure is simply a bit of experience, wit, and a recode) over the data corruption (whose penalty is high and whose cure is difficult or impossible) any day.

Let me say this another way. If your logic is valid that "someone might see you use a loop and then do something not so great later", then it applies to your method, too!

Someone might see your technique for inserting "placeholder characters" into strings and start using it inappropriately. Your only explanation was 'is the nonprintable "bell" character that most cannot type into a string.' Only most? What about those that can type this (such as I)? What about some beginner who looks up nonprintable characters, and upon learning that there are a lot of them from ascii 0 through 31, starts using other nonprintable characters? Try putting an ascii 0 in the middle of a varchar column and then selecting that column plus others in sql 2000 query analyzer. Does it do what you expected?
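
If you want to see it for yourself, here's a quick sketch (exactly what you see will depend on the client you run it in):

Code:
-- Embed an ASCII 0 (NUL) in the middle of a varchar value. Some clients
-- stop rendering the string at the NUL, so the column (and sometimes what
-- follows it) displays incorrectly even though the data is all there.
DECLARE @s VARCHAR(20)
SET @s = 'abc' + CHAR(0) + 'def'
SELECT @s AS WithEmbeddedNul, LEN(@s) AS ActualLength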

Fully analyzed: while your method is clever (and for that alone I really do like it and will remember it), I have to professionally recommend that people not use it.

Finally, while I share your utter distaste for loops, I think you are incorrect to dismiss them entirely. And please note that George's loop is not a case of RBAR, because though repeated, he uses a set-based operation, so your biggest criticism doesn't even apply! And also, please note that behind the scenes, SQL Server itself uses... wait for it... loops (gasp).

Here are other situations where loops are valuable to the professional SQL programmer:

1. Batching gigantic updates. The idea here is not a loop for every row, but repeatedly performing a set-based operation that is limited to a certain number of rows until it is completed (see the batching sketch just after this list).

So, I spent multiple days searching for a good way to update some multiple-million-row tables based on very complex joins. My findings were interesting.

First of all, it was to a production OLTP database where locking any object for more than a few fractions of a second is not OK. Measuring the cost of my query only by total CPU or reads completely ignores the availability aspect. My boss doesn't care if I used 5 million otherwise-unused clock cycles on the prod CPU box, as long as no one else notices an impact on the system during that time. But he'll be really mad if I cause anyone to stop in his work because "overall I'll save 1 million cycles."

Second, looping and doing small batches actually performed the whole set of work dozens of times faster. The reason is that I made strategic use of temp tables and carefully crafted my query to "walk the clustered index" of the biggest, widest table. So what happened is that my query execution plan was able to use repeated small loop joins instead of a massive, massive hash join.

2. I have a reporting database that sends documents to another system. Data about the reports ends up in a SQL database. I made the system "confirmation-based" instead of "activity-based", so that it doesn't consider a report completed simply because it created the report and sent it; it waits until it sees it arrive in the other system. The best way to compare data between SQL Servers is a linked server query, but again, the biggest concern here is the performance impact on other users, so I ran some test queries and traces on the remote server to see what the DB library was doing behind the scenes with many different versions of joins. To make a long story short, the best performance I could achieve was with a serially applied CROSS APPLY (which in effect forces RBAR access) instead of trying to do a distributed transaction joining multiple remote tables (see the CROSS APPLY sketch after this list). Performance of my reconciliation SP skyrocketed. Essentially, what I was doing was modeling across the linked server what is called a "bookmark lookup" in SQL 2000 execution plan parlance, or a "nonclustered index seek followed by a clustered index seek against the same table" in SQL 2005 parlance.

3. In a database where availability didn't matter, I still wanted my queries to perform better for a huge analysis project I was doing, on a database with 3.5 million pages (that I was doing OCR against). Instead of using the complex UDF that ran against each row to calculate the UNC path from a special index value, I found a set-based way to do it. But I still had to batch it to get it to work well. And batching did work well: it increased performance, again by converting hash joins back to loop joins.
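
Here's the shape of what I mean by batching in point 1 (the table and column names are invented for illustration):

Code:
-- Repeatedly run a set-based update capped at a batch size until nothing
-- is left to process. Each pass holds its locks only briefly, instead of
-- one giant, long-running transaction locking everything at once.
DECLARE @BatchSize INT
SET @BatchSize = 10000

WHILE 1 = 1
BEGIN
    UPDATE TOP (@BatchSize) dbo.BigTable
    SET    Processed = 1
    WHERE  Processed = 0

    -- The last batch comes up short, so we're done.
    IF @@ROWCOUNT < @BatchSize BREAK
END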
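
And here's a sketch of the linked-server CROSS APPLY shape from point 2 (server, database, table, and column names are invented):

Code:
-- One small, index-friendly remote probe per local row, instead of one
-- distributed join across multiple remote tables at once.
SELECT d.DocumentID,
       r.Status
FROM   dbo.Documents d
CROSS APPLY
     ( SELECT TOP 1 rl.Status
       FROM   RemoteServer.RemoteDB.dbo.ReportLog rl
       WHERE  rl.DocumentID = d.DocumentID
     ) r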


I'm reminded of Benner's phases of Clinical Competence.
• Rules given to novices are context-free and independent of specific cases; hence the rules tend to be applied universally. The rule-governed behavior typical of the novice is extremely limited and inflexible. As such, novices have no "life experience" in the application of rules.

Then skipping the intermediate steps to show the contrast I intend:

• The expert performer no longer relies on an analytic principle (rule, guideline, maxim) to connect her or his understanding of the situation to an appropriate action. The expert, with an enormous background of experience, now has an intuitive grasp of each situation and zeroes in on the accurate region of the problem without wasteful consideration of a large range of unfruitful, alternative diagnoses and solutions. The expert operates from a deep understanding of the total situation.

I am not suggesting you are a beginner, but I do think you are still applying rules too rigidly in your professional thought. Would you ever have considered using loops to solve my three examples above? Loops turned out to be, in fact, really good answers. Though I recognize you may not believe me about that.

Erik

P.S. I don't claim to be a master of SQL.
 
Thanks for taking the time for the post, Erik.

First, when have you EVER typed a BELL character into any GUI text box? Probably never, and if you did, it was for a test. Even then, I'll bet you only typed one. The real key is that it really doesn't matter, because even if you type a BELL character, it can't be seen, won't take any space whether it's replaced or not, and will have no effect on computers capable of running the code.

Second, George's method does use RBAR... there's a WHILE loop in his function. It does not simply use repeated set-based operations.

Third, I know about the (very high speed) loops SQL Server uses behind the scenes... some of us call them "Pseudo Cursors", and they are the very reason why you must write set-based code, because they are the whole substance of set-based code. Comparing that type of loop to the explicit loops of a WHILE clause (or other RBAR) is actually inappropriate, because it is the "Pseudo Cursors" I'm actually trying to use, while RBAR overrides their use.

Almost last but not least, some beginner will not look up anything about either George's or my code... they will simply use it, which is why EXPLICIT loops should not be used and the BELL character code that uses Pseudo Cursors should be.

Last but not least, this whole shootin' match isn't a personal attack on George... George usually writes some awesome code. Nah, it's a personal attack on RBAR and the severe performance problems that all forms of RBAR incur.

However, your concerns are your concerns. If you want to say not to use the method I did, you certainly have that right... it's the same right I have to say that you shouldn't use explicit loops and other forms of RBAR because of the horrible effect they have on performance. ;-)

Oh, I almost forgot. Congratulations (seriously) on the performance gain from intelligently using CROSS APPLY... CROSS APPLY doesn't always resolve out as RBAR in the execution plan, any more than a properly written correlated subquery would (in other words, not all correlated subqueries are RBAR). Like a lot of other things, "IT DEPENDS".

Again, thank you for taking the time to write your large and thoughtful post. I just happen to disagree with its content just like you disagree with the content of mine.


--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Hi Jeff,

I actually like your solution and I don't think there could be any problem with it, so I gave you that star when I saw it.
 
I saw that you added it to your archive, but I didn't know who gave me the star. I never expect stars, but I sure do appreciate them when they happen. Thanks, Markros, and thanks for the feedback on the solution. Both are much appreciated.

By the way (I hate to assume from post handles)... is your first name "Mark"?

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Nope - this is just a nickname I chose. If you read this forum long enough you'll figure out my real name.
 
If you need a hint for my real name, re-read the Dynamic variable name thread and check the link to the other forum I provided. But don't put it here - let's keep the "incognito".
 
Heh... no chance of you just coughing it up? ;-)

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
No :) BTW, I think I saw a thread somewhere today where you were greeted - was it on the MSDN T-SQL forum?
 