Calculating where a string starts, where it ends, and its offset based on a regex?

Status
Not open for further replies.

dmazzini

Programmer
Jan 20, 2004
480
US
Hi guys,

I have many different kinds of reports in which each data value is located at the same position as its column header.
For example, as you can see here, WBTS-1695 is directly under the "NE-ID" header, L6070407474 is directly under the "Target Id" header, and so on.

Code:
               NE-ID      Target Id In Topology                             Feature Name            
          ------------------------------------------------------------------------------
           WBTS-1695    L6070407474         Yes                                IMA (FTM)                                             
           WBTS-1695    L6070407474          No                 Antenna Line Supervision                                 
           WBTS-1695    L6070407474         Yes                     BTS channel capacity

I want to write a generic subroutine that gets data based on column positions, using a regex on the headers as a reference together with the substr function.
The following code works fine with "HEADERFIVE", for example, but not with "HEADER ONE".

Code:
#!/usr/bin/perl

my $headers=  qq(HEADER ONE           HEADERTWO   HEADER3  HEADER FOUR HEADERFIVE);
my $data=     qq(     data1     belong to data2     data3    my data 4   datafive);


if ($headers=~ /(HEADER ONE)/){

    print "Headers:$headers\n";
    my $length_headers=length($headers);
    print "Length_Headers =>$length_headers\n";    

    print "Matched:$1\n";
    my $length_matched=length($1);
    print "Length_Matched =>$length_matched\n"; 
      
    print "Before_Matched =>$`\n";
    my $length_before_matched=length($`);
    print "Length_Before_Matched =>$length_before_matched\n";     
    
    print "After_Matched =>$'\n";
    my $length_after_matched=length($');
    print "Length_After_Matched =>$length_after_matched\n";    
    
    my $VALUETOGET = substr( $data, $length_before_matched, $length_matched );
    print "VALUE:$VALUETOGET\n"; # this gives me: " to data2" instead of "belong to data2"
    
}


I believe I still have to account for the spaces between the headers, for example between the "NE-ID" and "Target Id" headers.
Suggestions are welcome. As I said before, there are many reports with different information, but the data and headers are always aligned
from left to right.







dmazzini
GSM/UMTS System and Telecomm Consultant

 
Look into the pos() function.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
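
As a minimal sketch of the pos() idea (not part of the original reply): after a successful /g match in scalar context, pos() returns the offset just past the match, so the start offset is that value minus the header length. The helper name header_span is hypothetical, and the header string is the one from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (start, end) offsets of a literal header within a header line,
# or an empty list if the header is not present.
# header_span is a hypothetical helper name used only for illustration.
sub header_span {
    my ( $line, $header ) = @_;

    # With /g in scalar context, pos() reports where the match ended
    if ( $line =~ /\Q$header\E/g ) {
        my $end = pos($line);
        return ( $end - length($header), $end );
    }
    return;
}

my $headers = 'HEADER ONE           HEADERTWO   HEADER3  HEADER FOUR HEADERFIVE';
my ( $start, $end ) = header_span( $headers, 'HEADERTWO' );
print "HEADERTWO spans offsets $start to $end\n";
```

Note that these offsets cover only the header text itself. For right-aligned data the blanks in front of the header belong to the column too, so the previous header's end offset is the better starting point for substr.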
 
You could probably use index() as well.

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Do you know in advance what the headers actually are? If not, it'll be trickier: e.g. how do you determine that the space between 'Id' and 'In' is a boundary between headers and the one between 'Target' and 'Id' is not?

I'm assuming you don't know the lengths of the fields beforehand so you have to work them out.

Here's something that might work for you if you have the headers beforehand:
Code:
#!/usr/bin/perl -w
use strict;

my @field_lengths;
my @headers = ( 'NE-ID', 'Target Id', 'In Topology', 'Feature Name' );

my $header_line = <DATA>;

for ( @headers ) {
   if ( $header_line =~ /(\s+\Q$_\E)/ ) {
      push @field_lengths, $+[0] - $-[0];
   }
   else {
      die "Header $_ not found\n";
   }
}

<DATA>; # skip the ----- line
my $pack_string = join '', map "A$_", @field_lengths;

while(<DATA>) {
   my @fields = unpack $pack_string, $_;
   print join( '|', @fields ), "\n";   # parenthesised so "\n" is not joined in as an extra field
}

__DATA__
    NE-ID      Target Id In Topology                             Feature Name
------------------------------------------------------------------------------
WBTS-1695    L6070407474         Yes                                IMA (FTM)
WBTS-1695    L6070407474          No                 Antenna Line Supervision
WBTS-1695    L6070407474         Yes                     BTS channel capacity
 
Hi Ishnid

I know the header names in advance, but I did not want to "count" the header position and offset for every single parameter. I did that in the past using substr and I was not happy with that solution.

I have tested your script and it seems to work well! Thanks for that. Now I have to analyze what you did, hehe!

What do these do?

$+[0] - $-[0];
my $pack_string = join '', map "A$_", @field_lengths;

Cheers and many thanks again





dmazzini
GSM/UMTS System and Telecomm Consultant

 
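After a successful match, the special @- and @+ arrays hold the match offsets. $-[0] and $+[0] describe the whole match, and there is one further entry per capturing group ($-[1] and $+[1] for the first group, and so on). In my regex the single group wraps the entire pattern, so both give the same numbers. $+[0] is the offset of the end of the match, and $-[0] is the offset of the start of it. By subtracting one from the other, I get the length of the field in question, including the whitespace in front of the header.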

The $pack_string is the first argument to pass to the 'unpack' function. (If you don't know how that works, have a look at perlpacktut). Basically, @field_lengths holds the length of each field (in this case: 9, 15, 12, 41). I convert that into the string 'A9A15A12A41' so as to pass it to 'unpack'.
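
To make the explanation above concrete, here is a small sketch under stated assumptions rather than the exact script from the thread. The lines are rebuilt with sprintf so the column widths (9, 15, 12, 41) are guaranteed to match the report layout, and column_width is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Index 0 of @- and @+ is the whole match; index 1 is the first capturing group.
if ( 'abc  NE-ID' =~ /\s+(NE-ID)/ ) {
    printf "whole match %d..%d, group 1 %d..%d\n", $-[0], $+[0], $-[1], $+[1];
}

# Rebuild the report lines with known widths (assumed from the thread: 9, 15, 12, 41)
my $fmt         = '%9s%15s%12s%41s';
my $header_line = sprintf $fmt, 'NE-ID', 'Target Id', 'In Topology', 'Feature Name';
my $row         = sprintf $fmt, 'WBTS-1695', 'L6070407474', 'No', 'Antenna Line Supervision';

# Width of one right-aligned column: the header plus the blanks before it.
sub column_width {
    my ( $line, $header ) = @_;
    return unless $line =~ /\s+\Q$header\E/;
    return $+[0] - $-[0];    # end of whole match minus start of whole match
}

my @widths = map { column_width( $header_line, $_ ) }
    ( 'NE-ID', 'Target Id', 'In Topology', 'Feature Name' );

# Turn the widths into an unpack template, one "A<width>" per column
my $template = join '', map "A$_", @widths;

my @fields = unpack $template, $row;
s/^\s+// for @fields;    # "A" strips trailing blanks only, so trim leading ones here
print "$template\n";
print join( '|', @fields ), "\n";
```

The "A" format pads and strips on the right only, which is why the leading blanks of each right-aligned value are removed in a separate step.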
 
I've got it! Thanks so much!



dmazzini
GSM/UMTS System and Telecomm Consultant

 
Good solution. I don't see the @- and @+ variables used much. I would think the same could be accomplished using index().

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
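
A rough sketch of how the index() variant might look, offered as an illustration and not as anything posted in the thread. It assumes right-aligned columns and that the headers are passed in left-to-right order; template_from_headers is a hypothetical name, and the header line is rebuilt with sprintf using the widths quoted earlier (9, 15, 12, 41).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an unpack template from a header line using index() instead of a regex.
# Assumes right-aligned columns and headers given in left-to-right order.
sub template_from_headers {
    my ( $line, @headers ) = @_;
    my $prev_end = 0;    # offset just past the previous header
    my @widths;
    for my $header (@headers) {
        # Search from the end of the previous header so column order is respected
        my $start = index $line, $header, $prev_end;
        die "Header $header not found\n" if $start < 0;
        my $end = $start + length($header);
        push @widths, $end - $prev_end;    # blanks before the header plus the header
        $prev_end = $end;
    }
    return join '', map "A$_", @widths;
}

my @names       = ( 'NE-ID', 'Target Id', 'In Topology', 'Feature Name' );
my $header_line = sprintf '%9s%15s%12s%41s', @names;
print template_from_headers( $header_line, @names ), "\n";
```

Passing the previous end offset as the third argument to index() also guards against a header name that happens to appear inside an earlier one.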
 