Find and Replace Regex question please

Status
Not open for further replies.

PaulSc

MIS
Aug 21, 2000
148
GB
Hi all

After a bit of help again please

Every day I have to go through a 300-400 MB XML-formatted file correcting errors. Some of these can be fixed with a standard find/replace. However,
I have a set of issues where I guess regex is better placed to help, as I need to identify and remove /'s, which of course also appear in every closing tag.
The problem is I'm very new to regex and need some pointers/advice, please.

Being XML, it contains start/end tags, which also contain the / character I need to remove. However, some of the fields will only accept A-Z and 0-9 as valid input,

i.e. <Ref>1234/567890</Ref>. The other obstacle is that the numbers are random.

I first need to identify the tags which contain this error, which I've been able to do using a find for "<Ref>[0-9]{4}/", and I then manually correct the
data. That's fine if you find 3 or 4, but finding 100+ it's a bit of a b****r.

Knowing that it's always inside the <Ref> tags, what I need to be able to do is remove the /'s between the tags, leaving just the numbers,
i.e. <Ref>1234567890</Ref>

Can anyone help or provide any pointers on how, or indeed if, this can be done?

Many thanks. (Apologies if this is in the wrong forum, but this Perl forum has provided a good deal of regex help.)

PaulSc
 
Hi

Are you sure there will always be one and only one slash?

Anyway, here are 3 regular expressions; play with commenting them out to see which one you need:
Perl:
use strict;
use warnings;

my $xml = do { local $/; <DATA> };

# assume there is at most 1 slash there
$xml =~ s:(<Ref>\d*)/:$1:g;

# remove any amount of slashes
$xml =~ s:(<Ref>)(.+?)(</Ref>):$1.$2=~y!/!!dr.$3:ge;

# remove any amount of non-digit characters
$xml =~ s:(?<=<Ref>).+?(?=</Ref>):$&=~s!\D!!gr:ge;

print $xml;

__DATA__
<foo>
    <Ref>1234/567890</Ref>
    <bar>
        <Ref>1234/567890</Ref><Ref>1234/567890</Ref>
    </bar>
    <nah>1234/567890</nah>
    <Ref>1234/5/6/7890</Ref>
    <Ref>1234:5/6-7890</Ref>
</foo>
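
The script above reads the whole input into memory, which may be uncomfortable for a 300-400 MB file. A line-by-line variant is one possible alternative. This is only a minimal sketch, assuming every <Ref> element sits entirely on one line (as in the test data); the sub name strip_ref_nondigits and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps only the digits 0-9 between <Ref> and </Ref>.
# Assumption: a <Ref> element never spans more than one line.
sub strip_ref_nondigits {
    my ($line) = @_;
    $line =~ s{(<Ref>)(.*?)(</Ref>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ tr/0-9//cd;
        $open . $body . $close;
    }ge;
    return $line;
}

# In-memory sample standing in for the real file
my $sample = <<'XML';
<foo>
    <Ref>1234/567890</Ref>
    <nah>1234/567890</nah>
    <Ref>1234:5/6-7890</Ref>
</foo>
XML

# Read one line at a time, so only the current line is held in memory
open my $in, '<', \$sample or die "cannot open in-memory sample: $!";
while (my $line = <$in>) {
    print strip_ref_nondigits($line);
}
close $in;
```

For a real run, the in-memory handle would be replaced by one opened on the XML file, with the output printed to a second file rather than over the original.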


Feherke.
feherke.github.io
 
Feherke, thank you for your detailed answer, much appreciated.

Unfortunately I've discovered we're not going to be allowed to install/run Perl.

We do have Notepad++ (in its basic install form, i.e. no xmltools plugin etc.), so can anyone suggest please how we can maybe use regex to remove any /'s (or -'s or *'s etc.) that appear between <Ref></Ref> tags, whilst still maintaining the standard XML tags and keeping the rest of the data?

i.e. <Ref>1234/567890</Ref> should become
<Ref>1234567890</Ref>
or
<Ref>000*000001234</Ref> should become
<Ref>000000001234</Ref>

We've seen that we can have 4 numbers / then 8+ numbers, or 3 numbers / numbers, so there is no fixed standard bar the (so far) single / between the <Ref> tags that we don't want to be there.

We know we can do an initial change of </ to <#, then change all / to "", then change <# back to </, but that's not really safe or feasible on what's now a 1.7 million row XML file, and it makes a massive assumption that there are no /'s elsewhere.

Cheers and Thanks again
 
Hi

I have never used Notepad++ before, but this works for my test data:
Find what : (<Ref>\d*)[/*]
Replace with : $1

However, for files of that size I would say it is better to use a dedicated tool instead of a text editor. In thread215-1789240 our fellow member spamjim recommended grepWin as such a tool. It would be worth checking whether you can get approval to install it (or maybe you already have it...).
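
The Find what / Replace with pair above removes one / or * per <Ref> element on each pass. If a value can hold several stray characters, a single pattern built on \G and \K might clear them in one pass. The sketch below checks such a pattern in Perl. Notepad++'s regex engine is, as far as I know, close enough to accept the same pattern with an empty Replace with field, but that is an assumption to verify on a copy of the file first. The sub name strip_ref_separators is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: deletes every character that is neither a digit
# nor '<' between <Ref> and the next tag.
sub strip_ref_separators {
    my ($xml) = @_;
    # <Ref>      starts a new run of matches
    # \G(?!\A)   or continues exactly where the previous match ended
    # \d*\K      skips the digits and keeps them out of the matched text
    # [^\d<]     the single stray character that gets deleted
    $xml =~ s{(?:<Ref>|\G(?!\A))\d*\K[^\d<]}{}g;
    return $xml;
}

print strip_ref_separators("<Ref>1234/5/6-7890</Ref><nah>12/34</nah>\n");
# prints: <Ref>1234567890</Ref><nah>12/34</nah>
```

Because each match must either start at <Ref> or follow directly on the previous match, the / inside <nah> and the / of every closing tag are left alone.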


Feherke.
feherke.github.io
 
feherke,
Nice example about using regular expressions in Perl.
Today I learned something from you again. You deserve the star.
 
Just curious, why don't they allow Perl to be used? I work for a multi-billion-dollar company and we use a number of scripting languages, including Perl. If they don't want it installed on the network, there are many free versions that can be installed on Windows. ActiveState is a good one. I say Windows because the POSIX systems like Unix, Solaris, and Linux (among others) come with Perl in the base install.

Bill
Lead Application Developer
New York State, USA
 
Thanks for your help...
I work for a "bank", meaning that everything's controlled/locked down etc., so no additional software whether it's free or not. So we have to make do with the tools available, and Notepad++ is the tool of choice.
 