
need quick loop to remove all spaces but one


pms18066 (Programmer)
I have data in name fields like 'John Doe MD'.

I need to remove all spaces except one where they exist. I know I can do several REPLACE statements one after another, but I don't know how many spaces exist and I can't get my WHILE loop to work right. Can someone give me a quick loop to remove all of the spaces except one?

i.e., while two spaces exist in the string, replace with one space.

Thanks!
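A minimal sketch of the loop being asked for, run against a single variable (the variable name and sample value are illustrative):
Code:
DECLARE @s varchar(100) = 'John    Doe  MD';

-- Keep collapsing pairs of spaces until none remain.
WHILE CHARINDEX('  ', @s) > 0
    SET @s = REPLACE(@s, '  ', ' ');

SELECT @s;  -- 'John Doe MD'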
 
No, but now I understand the "challenge" and the connection. It was on Jacob's fine site. Peter Larson welcomed me aboard.

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Aha, I see - yes. I read the interview with Brad Schulz and glanced through the first T-SQL challenge. I opened an account on LinkedIn a few days ago and added the SQL Server group to my list of groups.
 
I jumped in on LinkedIn about, ummmmm... jeez, I think it was two years ago, already.

I didn't know Jacob had a site until Challenge #13 came out. I got in on 13 and 14. We'll see how that goes. I believe that Peter Larson was the designer of #14.

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Some good posts, very enjoyable read.

I would be interested to hear if anyone has tested a SQL CLR solution.

At the moment I haven't found a need for any CLRs, and I'm just wondering whether it would be viable for certain string manipulations like this one.

PowerShell code below, as I don't have the bits installed for writing a CLR.

Code:
# Marker character (BEL, char 7), assumed never to occur in the data.
$bell = [System.Char](7);
[string] $myString = "This      has multiple   unknown                 spaces in        it.";

# Option 0: marker-character trick -- turn each space pair into space+marker,
# then strip marker+space pairs and any leftover markers.
$Option0clean = $myString.Replace("  "," " + $bell).Replace("$bell ","").Replace($bell,"")
$Option0clean;

# Option 1: regex -- collapse any run of whitespace to a single space.
$Option1clean = [regex]::Replace($myString, "\s+", " ");
$Option1clean;

# Option 2: split on spaces, drop the empty entries, re-join with single spaces.
$Option2clean = [string]::join( " ", $myString.Split( " ", [StringSplitOptions]::RemoveEmptyEntries ));
$Option2clean;
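For reference, here's the same marker-character trick as it's usually written in T-SQL (a sketch; the table and column names are illustrative, and CHAR(7) is assumed never to occur in the data):
Code:
-- Collapse any run of spaces to one space in a single pass.
SELECT REPLACE(REPLACE(REPLACE(name,
          '  ', ' ' + CHAR(7)),  -- mark each space pair: space + marker
          CHAR(7) + ' ', ''),    -- strip marker+space pairs
          CHAR(7), '')           -- strip any leftover markers
FROM dbo.SomeTable;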
 
Jeff,

Thank you in turn for your thoughtful reply. I can see where you're coming from.

Now that I think about it, you are right that the first suggestion to use a function is RBAR. I was thinking of the WHILE EXISTS method, which uses set-based queries to perform an operation on every row at once. It has to do the job multiple times to ensure completion, but it is still set-based.

My own initial thoughts for this problem would have been something more like this:
Code:
-- Prime @@ROWCOUNT with one pass, then repeat the set-based pass
-- until no row still contains a double space.
UPDATE Demo1C SET OriginalString = Replace(OriginalString, '  ', ' ') WHERE CHARINDEX('  ', OriginalString) > 0
WHILE @@RowCount > 0 UPDATE Demo1C SET OriginalString = Replace(OriginalString, '  ', ' ') WHERE CHARINDEX('  ', OriginalString) > 0
I also played with some variations such as:
Code:
UPDATE Demo1E
SET OriginalString =
   Replace(Replace(Replace(Replace(Replace(OriginalString,
      '                                ', ' '),  -- 32 spaces
      '                ', ' '),                  -- 16 spaces
      '        ', ' '),                          -- 8 spaces
      '    ', ' '),                              -- 4 spaces
      '  ', ' ')                                 -- 2 spaces
WHERE CHARINDEX('  ', OriginalString) > 0
This yielded a performance improvement. (Those are runs of 32, 16, 8, 4, and 2 spaces, which can handle up to 64 repetitions per pass of the loop.)

And I think my critique still applies, that inserting extra characters in a string could lead others to use improper techniques in their own coding.

Honestly, I might use your very clever solution some day, but my passion for data integrity makes it very hard for me to introduce supposedly unused characters into data, even temporarily.

While it's not a perfect application to this case, here's an example of the kind of thinking I'm employing here, from one of my blogs:
"If you don't think you need to escape commas, please ensure that commas can't be in your data ever, not that they don't exist now or "we'll tell people not to put commas in." An utterly basic and beginner's rule of programming is to never trust user input. Bad Things happen when you do. How do you know that in 5 years, when you may be gone or on vacation, some new person who never heard about the comma rule won't enter a comma?"
I take pride in my work, and part of that pride is making it hard for people to break it later, even the very ignorant.

Can we at least agree that we want to do what's best in every area: best practice, performance, elegance, simplicity, etc.? We just have to wrangle out the details of what exactly that is in every case. :)

Erik

P.S. I use a variation of your technique in MS Word to scrub or transform data. I'm always replacing double paragraph marks with the string "&&*&" and assuming it will never be found in the text I'm working with. Then I'm free to work with single paragraph marks and restore the original double ones later. There's also a trick about replacing a character with paragraph formatting, then searching on the paragraph formatting, which allows some really cool filtering-type operations on entire paragraphs.
 
I forgot to say something.

The CROSS APPLY did not magically switch away from a loop behind the scenes. What it did was force individual calls across the linked server, one for each row (as proved by a SQL trace). This was far better performance than the standard JOIN because with the CROSS APPLY it was using proper WHERE conditions for the remote call, but with a standard JOIN it was trying to return a huge, huge chunk of rows.

In fact, looking at my code now, I wrapped this query in a WHILE loop limiting each pass to only 100 rows. See, my system had been running along just fine for months until the scheduled reconciliation was down for a few days; when it was turned back on and tried to work with tens of thousands of rows instead of a few hundred, the query kept bombing. I'd hate to think what it would have done if I hadn't been using the CROSS APPLY in the first place...
Code:
WHILE 1 = 1 BEGIN
   -- Process up to 100 rows per pass; @Seen records QueueIDs already handled.
   INSERT @Completed
   OUTPUT Inserted.QueueID INTO @Seen
   SELECT TOP 100
      Q.QueueID,
      P.[PageCount],
      P.PostDtm,
      QueueStatusID = CASE WHEN P.[PageCount] = Q.[PageCount] THEN 5 ELSE 12 END,
      VisitReportStatusID = CASE WHEN P.[PageCount] = Q.[PageCount] THEN 3 ELSE 2 END
   FROM
      dbo.[Queue] Q WITH (ROWLOCK, XLOCK, HOLDLOCK)
      INNER JOIN LinkedServer.Database.dbo.Documents D ON Q.DestDocName = D.Doc_Name AND Q.DestSubtitle = D.Subtitle AND D.Encounter = Convert(varchar(11), Q.VisitID)
      CROSS APPLY (
         SELECT
            [PageCount] = Max(P.Page),
            PostDtm = Max(P.ImageDate)
         FROM
            LinkedServer.Database.dbo.Versions V
            INNER JOIN LinkedServer.Database.dbo.Pages P ON V.Version_ID = P.Version_ID
         WHERE
            V.Doc_ID = D.Doc_ID
            AND V.VersionNumber = 0
            AND P.Deleted = 'N'
      ) P
      INNER JOIN dbo.VisitReport VR WITH (ROWLOCK, XLOCK, HOLDLOCK) ON Q.VisitID = VR.VisitID AND Q.ReportID = VR.ReportID
   WHERE
      Q.Test = 0
      AND Q.QueueStatusID = 4
      AND P.PostDtm >= Q.SentDtm
      AND NOT EXISTS (
         SELECT *
         FROM @Seen S
         WHERE Q.QueueID = S.QueueID
      )
   IF @@RowCount < 100 BREAK
END
 
No doubt the RBAR improved performance over what you say it was doing. I don't have access to your data and can't really say for sure, but I'd have attempted to do a preaggregation on the linked-server side and gone from there.
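Something along these lines, staging the remote aggregate into a local temp table first (just a sketch, using the object names from your query; I can't test it against your data):
Code:
-- Sketch: pull the remote aggregates across once, then join locally.
SELECT
   V.Doc_ID,
   [PageCount] = Max(P.Page),
   PostDtm = Max(P.ImageDate)
INTO #RemoteAgg
FROM
   LinkedServer.Database.dbo.Versions V
   INNER JOIN LinkedServer.Database.dbo.Pages P ON V.Version_ID = P.Version_ID
WHERE
   V.VersionNumber = 0
   AND P.Deleted = 'N'
GROUP BY
   V.Doc_ID;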

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
It's funny that you assume I'm doing something wrong, though you tried to soften it with "can't really say for sure". But we all know what you mean to say, which is that I must be missing something by EVER using 'RBAR'. :)

I can't preaggregate on the linked server because there are 4 million rows in the Pages table. I HAVE to look up the document first, then get the rows I need from the Versions/Pages tables. Doing it the 'RBAR' way does use one remote query per document, but this clustered index seek is far less expensive than some huge join involving all three tables. The normal "join as many tables on the remote server as possible before joining locally" strategy fails in this case, thus my use of CROSS APPLY to force it.
 
Ok... so why do you have to do it all in one query? Divide'n'Conquer in a set-based fashion.

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 
Is there something wrong with doing it all in one query? It works great.

Some of my findings also were that joins across linked servers do RBAR behind the scenes even when you think they won't. I'm pretty sure that if simply querying Documents for the Doc_IDs, putting them in a temp table, and then joining again would have really made a difference, I would have gone down that road.

I will think about your suggestion and give it a shot at some point, but I think I covered all the angles.
 
Yes... "set based" doesn't mean "all in one query" and sometimes splitting up the code will provide incredible performance gains. But, like I said, I don't have access to your data so I can't prove it in this case... they're just suggestions.

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 

Just sharing my results from testing.

I'd say it's not worth implementing a CLR unless you are already using them. In this scenario there are no real performance gains to be had from writing custom functions.

NB: regex is overkill for this scenario, and it shows in the figures.

--10'000 Row Sample
[squeezeSplitJoin] clr CPU time = 344 ms, elapsed time = 464 ms.
[squeezeReplace] clr CPU time = 406 ms, elapsed time = 476 ms.
Jeff Moden's method CPU time = 437 ms, elapsed time = 520 ms.

[squeezeRegex] clr CPU time = 875 ms, elapsed time = 1143 ms.
George's original CPU time = 953 ms, elapsed time = 1151 ms.

--100'000 Row Sample
[squeezeReplace] clr CPU time = 4172 ms, elapsed time = 5073 ms.
[squeezeSplitJoin] clr CPU time = 4156 ms, elapsed time = 5295 ms.
Jeff Moden's method CPU time = 4281 ms, elapsed time = 5375 ms.

[squeezeRegex] clr CPU time = 8734 ms, elapsed time = 10585 ms.
George's original UDF CPU time = 9516 ms, elapsed time = 17962 ms.

--1'000'000 Row Sample
[squeezeSplitJoin] clr CPU time = 40172 ms, elapsed time = 52232 ms.
Jeff Moden's method CPU time = 40671 ms, elapsed time = 53211 ms.
[squeezeReplace] clr CPU time = 39984 ms, elapsed time = 53694 ms.

[squeezeRegex] clr CPU time = 86609 ms, elapsed time = 111461 ms.
George's original CPU time = 97532 ms, elapsed time = 180209 ms.

Code:
using System;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class SqlFunction
{
    // Collapse any run of whitespace to a single space with a regex.
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString squeezeRegex(SqlString s)
    {
        return Regex.Replace(s.ToString(), @"\s+", " ");
    }

    // Split on spaces, drop the empty entries, and re-join with single spaces.
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString squeezeSplitJoin(SqlString s)
    {
        return string.Join(" ", s.ToString().Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    // The marker-character trick: mark each space pair with BEL (char 7),
    // then strip marker+space pairs and any leftover markers.
    // BEL is assumed never to occur in the data.
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString squeezeReplace(SqlString s)
    {
        string bell = "\a"; // BEL, char 7
        return s.ToString().Replace("  ", " " + bell).Replace(bell + " ", "").Replace(bell, "");
    }
}
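For anyone who wants to try these, registering the functions looks something like this (a sketch; the assembly name and path are illustrative):
Code:
-- Sketch: enable CLR, load the assembly, and bind one of the functions.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

CREATE ASSEMBLY SqueezeFunctions FROM 'C:\clr\SqueezeFunctions.dll';

CREATE FUNCTION dbo.squeezeReplace (@s nvarchar(max))
RETURNS nvarchar(max)
AS EXTERNAL NAME SqueezeFunctions.SqlFunction.squeezeReplace;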
 
Thanks for the testing, Jamfool, especially since I haven't found a reason to enable CLRs yet. Wow... I'd always heard that Regex was supposed to be really good for this type of stuff. I'm surprised.

--Jeff Moden
-------------------------------------------------------------------------------
"RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row By Agonizing Row
 