Parsing Text File

weibs · Dec 17, 2008

I have several text files, and I need to parse certain content out of each one.

What I need is to grab the data that follows the Body tag in each Note. Each file has a different amount of data, and the notes appear in a different order from file to file. I also need the client name included in the new data set.

So for example, for all 100+ files, I need to parse out the "Web Hosting Information" and put it into a spreadsheet format.

Here is a partial example of one of the text files (I have starred out sensitive information):

Code:

---  
- Name: Client Name 
- Contact:  
  - - Addresses 
    - - ************ 
  - - Phone_numbers 
    - - ************ 
      - ************ 
  - - Email_addresses 
    - - ************ 
- -  
    Note 3060148:  
    - Author: ************ 
    - Written: ************ 
    - About: ************ 
    - Body: |- 
        <h1>Lead</h1> 
         
        ************ 
  - Note 2909448:  
    - Author: ************ 
    - Written: ************ 
    - About: ************ 
    - Body: |- 
        <h1>Pre-Paid Hours</h1> 
         
        N/A 
  - Note 2909446:  
    - Author: ************ 
    - Written: ************ 
    - About: ************ 
    - Body: |+ 
        <h1>Domain Names</h1> 
         
        ************ 
        ************ 
        ************ 
         
        ************ 
        Admin: ************ 
        Pass: ************ 
         
        Whois INFO: 
         
        Administrative Contact : 
        ************ 
        ************ 
         
        Technical Contact : 
        ************ 
        ************ 
         
         
        Record expires on ************ 
  - Note 2909443:  
    - Author: ************ 
    - Written: ************ 
    - About: ************ 
    - Body: |- 
        <h1>Web Hosting</h1> 
        ************ 
         
        ************ 
         
        FTP Access 
        ************ 
        ************ 
        ************ 
        ************ 
         
        ************ 
        ************ 
        ************ 
        ************

My end goal is to have, for example, all of the Web Hosting information from every file in one delimited text file.

This is what I have so far; it puts the data into a hash.

Code:

#!/usr/bin/perl 
 
use strict; 
use warnings; 
use Data::Dumper; 
 
my (%data, $client, $note); 
my $datafile = 'filename'; 
 
open my $DATAFILE, '<', $datafile or die "can't open '$datafile' $!"; 
while( my $line = <$DATAFILE>) { 
 
    if ( $line =~ /^\s+- Body: (.*\n)/ ) { 
        $data{$client}{$note}{body} .= $1; 
        while ( $line = <$DATAFILE> ) { 
            if ( $line =~ /\s+Note (\d+):/ ) { 
                $note = $1; 
                last; 
            } 
            $data{$client}{$note}{body} .= $line; 
        } 
        next; 
    } 
 
    $client = $1 if $line =~ /^- Name: (.+)/; 
    $note = $1 if $line =~ /\s+Note (\d+):/; 
 
    if ( $note and $line =~ /^\s+- (\w+):\s*(.+)/ ) { 
        $data{$client}{$note}{$1} = $2; 
    } 
 
} 
print Dumper \%data;

This will give me a pretty display of all the data.

What I still can't figure out is how to make it display only one kind of note at a time (for example, just the "Web Hosting Information" note).

Any help would be appreciated and I hope the above made sense.

Thank you in advance.

weibs · Dec 17, 2008

I also want to credit the above code to FishMonger from the Perl Guru site.
Thank you, FishMonger, for helping.

I also want to thank KevinR, also from the Perl Guru site, for being patient with me.

rharsh · Dec 17, 2008

Code:

        <h1>Web Hosting</h1>
        ************
         
        ************
         
        FTP Access
        ************
        ************
        ************
        ************
         
        ************
        ************
        ************
        ************

Is this all considered the 'Web Hosting' note? Or are you looking for everything from the '- Note' line down?

Also, you mentioned you wanted them in a 'spreadsheet format' - what do you mean by that? Each line within the note becomes a new column in a line of your spreadsheet?

weibs · Dec 17, 2008

Yes, that is considered the Web Hosting note.

I don't need these lines:
- Note 2909443:
- Author: ************
- Written: ************
- About: ************
- Body: |-

But I do need the client name with each note. If it helps, the About line also has the client name.

What I mean by spreadsheet format is that it needs to at least be output in a delimited fashion so that it can be imported as a CSV file. But yes, if each line within the note becomes a new column, then that is perfect.
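A minimal sketch of that delimited layout, for illustration only: the client name comes first, then each non-blank line of the note becomes one field. The sub name `note_to_row`, the tab delimiter, and the sample values are assumptions and do not come from the actual files.

```perl
use strict;
use warnings;

# Hypothetical helper: build one delimited row from a client name and
# the lines of a single note. A tab is assumed as the delimiter.
sub note_to_row {
    my ($client, @lines) = @_;
    my @fields = ($client);
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;   # trim leading/trailing whitespace
        next if $line eq '';       # skip blank lines inside the note
        $line =~ s/\t/ /g;         # keep stray tabs from splitting a field
        push @fields, $line;
    }
    return join "\t", @fields;
}

# Made-up sample data standing in for one parsed note
print note_to_row('Client Name', '  FTP Access ', '', 'ftp.example.com'), "\n";
```

If the fields can contain commas or quotes and a true CSV file is required, a CSV module such as Text::CSV from CPAN would probably be the safer route than joining by hand.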

KevinADC · Dec 17, 2008

Are there multiple Client Names in each file?

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]

weibs · Dec 17, 2008

No, each file just has one client name.
Each file is for a different client.

KevinADC · Dec 17, 2008

This seems to create a nice data set, but I think your problem now is figuring out how to parse it:

Note: replace <DATA> with your own filehandle

Code:

#!/usr/bin/perl 
 
use strict; 
use warnings; 
use Data::Dumper; 
 
my $body;
my %data = (); 
#open (IN, 'yourfile') or die "$!"; 
while( my $line = <DATA>) {
   $line = normalize($line);
   if ($line =~/^- Name: ([\s\S]+)/) {
      my $client = $1; 
      MIDDLE: while ($line = <DATA>) {
         $line = normalize($line);
         if ($line =~ /^(?:\- )?Note (\d+):/) {
            my $note = $1; 
            <DATA>,<DATA>,<DATA>;#skips: Author, Written, About 
            while ($line = <DATA>) {
               $line = normalize($line);
               if ($line =~ /^- Body:\s*\|[+-]/) {   # "|" escaped so it is not read as alternation
                  $line = <DATA>;
                  $line =~ tr/+-//d;#removes quantifier in text
                  ($body) = $line =~ /<\w+>(.*?)<\/\w+>/;
                  while ($line = <DATA>) {
                     $line = normalize($line);
                     redo MIDDLE if ($line =~ /^(?:\- )?Note \d+:/);
                     next if ($line =~ /^\s*$/); 
                     push @{$data{$client}{$note}{$body}},$line; 
                  }
               }
            } 
         } 
      }
   }
}
sub normalize {
   my $t = $_[0];
   chomp $t;
   $t =~ s/^\s*//;
   $t =~ s/\s*$//;  
   return $t;
}

KevinADC · Dec 17, 2008

Or instead of a data set you could write the data to another file:

Code:

use strict; 
use warnings; 
use Data::Dumper; 
 
my $body;
my %data = (); 
#open (IN, 'yourfile') or die "$!";
while( my $line = <DATA>) {
   $line = normalize($line);
   if ($line =~/^- Name: ([\s\S]+)/) {
      my $client = $1; 
      MIDDLE: while ($line = <DATA>) {
         $line = normalize($line);
         if ($line =~ /^(?:\- )?Note (\d+):/) {
            my $note = $1; 
            <DATA>,<DATA>,<DATA>;#skips: Author, Written, About 
            while ($line = <DATA>) {
               $line = normalize($line);
               if ($line =~ /^- Body:\s*\|[+-]/) {   # "|" escaped so it is not read as alternation
                  $line = <DATA>;
                  $line =~ tr/+-//d;#removes quantifier in text
                  ($body) = $line =~ /<\w+>(.*?)<\/\w+>/;
                  print "$body\n";
                  while ($line = <DATA>) {
                     $line = normalize($line);
                     redo MIDDLE if ($line =~ /^(?:\- )?Note \d+:/);
                     next if ($line =~ /^\s*$/); 
                     print "\t$line\n"; 
                  }
               }
            } 
         } 
      }
   }
}
sub normalize {
   my $t = $_[0];
   chomp $t;
   $t =~ s/^\s*//;
   $t =~ s/\s*$//;  
   return $t;
}

Where you see "print" above you would use:

print FILEHANDLE "stuff to print";
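A small sketch of that with a real output file, using a lexical filehandle. The file name `webhosting.txt`, the sub name `write_note`, and the sample lines are placeholders made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: append a heading and its indented lines to a file.
sub write_note {
    my ($outfile, $heading, @lines) = @_;
    open(my $out, '>>', $outfile) or die "can't open '$outfile': $!";
    print {$out} "$heading\n";
    print {$out} "\t$_\n" for @lines;
    close($out) or die "can't close '$outfile': $!";
    return;
}

# Made-up sample data; the output file is created in the current directory
write_note('webhosting.txt', 'Web Hosting', 'FTP Access', 'ftp.example.com');
```

Opening in append mode ('>>') is an assumption that fits collecting output from many input files into one file; use '>' instead to overwrite on each run.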

weibs · Dec 17, 2008

Thanks Kevin,

I now have the new code:

Code:

use strict;
use warnings;
use Data::Dumper;
 
my $body;
my %data = ();
open (DATA, 'contacts/clientname.txt') or die "$!";
while( my $line = <DATA>) {
   $line = normalize($line);
   if ($line =~/^- Name: ([\s\S]+)/) {
      my $client = $1;
      MIDDLE: while ($line = <DATA>) {
         $line = normalize($line);
         if ($line =~ /^(?:\- )?Note (\d+):/) {
            my $note = $1;
            <DATA>,<DATA>,<DATA>;#skips: Author, Written, About
            while ($line = <DATA>) {
               $line = normalize($line);
               if ($line =~ /^- Body:\s*\|[+-]/) {   # "|" escaped so it is not read as alternation
                  $line = <DATA>;
                  $line =~ tr/+-//d;#removes quantifier in text
                  ($body) = $line =~ /<\w+>(.*?)<\/\w+>/;
                  while ($line = <DATA>) {
                     $line = normalize($line);
                     redo MIDDLE if ($line =~ /^(?:\- )?Note \d+:/);
                     next if ($line =~ /^\s*$/);
                     push @{$data{$client}{$note}{$body}},$line;
                  }
               }
            }
         }
      }
   }
}
sub normalize {
   my $t = $_[0];
   chomp $t;
   $t =~ s/^\s*//;
   $t =~ s/\s*$//;  
   return $t;
}
 print %data;

This prints the client name and a hash reference:
ClientNameHASH(0x182f190)

I am at a loss as to how to display the output on my screen, or where to go next.
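For context, `print %data;` flattens the hash into a list of keys and values, and each value here is a reference, which prints as HASH(0x...). Since Data::Dumper is already loaded, `print Dumper \%data;` would show the whole structure. Walking the levels by hand is another option; the sketch below assumes the client / note number / heading / lines nesting that the code above builds, and the sub name `dump_notes` and the sample data are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten the nested structure into printable lines.
sub dump_notes {
    my ($data) = @_;
    my @out;
    for my $client (sort keys %$data) {
        for my $note (sort keys %{ $data->{$client} }) {
            for my $heading (sort keys %{ $data->{$client}{$note} }) {
                push @out, "$client / note $note / $heading";
                # each heading holds an array reference of note lines
                push @out, "\t$_" for @{ $data->{$client}{$note}{$heading} };
            }
        }
    }
    return @out;
}

# Made-up data in the same shape the parser builds
my %data = (
    'Client Name' => {
        2909443 => { 'Web Hosting' => [ 'FTP Access', 'ftp.example.com' ] },
    },
);
print "$_\n" for dump_notes(\%data);
```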

weibs · Dec 17, 2008

We must have posted at the same time.

I added the print statement as you suggested and yes, I see all the data now.

But again, I'm still at a loss as to how to separate that data if I just want to see, say, the "Web Hosting" information.

If each file had the Web Hosting information in the same spot, it wouldn't be too hard to do. But since the notes are in no particular order and vary in length, I have to "find" the Web Hosting note and show only what is in it.

Thanks again.

KevinADC · Dec 17, 2008

Here is how you can do that:

Code:

use strict; 
use warnings; 
#use Data::Dumper; 
 
my $body;
my %data = ();
my $search = 'web hosting';
#open (IN, 'yourfile') or die "$!"; 
while( my $line = <DATA>) {
   $line = normalize($line);
   if ($line =~/^- Name: ([\s\S]+)/) {
      my $client = $1; 
      MIDDLE: while ($line = <DATA>) {
         $line = normalize($line);
         if ($line =~ /^(?:\- )?Note (\d+):/) {
            my $note = $1; 
            <DATA>,<DATA>,<DATA>;#skips: Author, Written, About 
            while ($line = <DATA>) {
               $line = normalize($line);
               if ($line =~ /^- Body:\s*\|[+-]/) {   # "|" escaped so it is not read as alternation
                  $line = <DATA>;
                  $line =~ tr/+-//d;#removes quantifier in text
                  ($body) = $line =~ /<\w+>(.*?)<\/\w+>/;
                  if (lc($search) eq lc($body)) {  
                     print "$body\n";
                     while ($line = <DATA>) {
                        $line = normalize($line);
                        redo MIDDLE if ($line =~ /^(?:\- )?Note \d+:/);
                        next if ($line =~ /^\s*$/); 
                        print "\t$line\n";
                     }
                  }
               }
            } 
         } 
      }
   }
}
sub normalize {
   my $t = $_[0];
   chomp $t;
   $t =~ s/^\s*//;
   $t =~ s/\s*$//;  
   return $t;
}

$search can be user input instead of hard-coded as in the above code. I assume you know how to do that, because on the other forum you said you are not new to Perl.
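One possible way to do that, sketched under the assumption that the term is passed on the command line. The sub name `search_term` and the fallback to 'web hosting' are illustrative choices, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: take the search term from the argument list,
# falling back to a default when none is given.
sub search_term {
    my (@args) = @_;
    my $term = @args ? join(' ', @args) : 'web hosting';
    $term =~ s/^\s+|\s+$//g;   # trim stray whitespace
    return lc $term;           # lowercased, to match the lc() comparison
}

my $search = search_term(@ARGV);
print "Searching for: $search\n";
```

With a script saved as, say, parse.pl (a made-up name), running `perl parse.pl Domain Names` would search for "domain names".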

weibs · Dec 17, 2008

Hi Kevin,

Thank you ohhhhh soooo much. This has now separated the data like I wanted it to!! Now I think I can manipulate that data from here *crosses fingers*.

One day I will study what you did and how you achieved it.

Thank you and Happy Holidays!!

KevinADC · Dec 17, 2008

You too, Merry Christmas

KevinADC · Dec 17, 2008

This line in my code:

$line =~ tr/+-//d;#removes quantifier in text

might be better written like so:

$line =~ tr/+-/ /;#replaces + or - in text with a space

Then, when you search for "Pre-Paid Hours", you would enter "Pre Paid Hours" instead of "PrePaid Hours". If you never search for sections with a + or - symbol, it won't matter.
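To see the difference side by side, here is a short sketch that applies each form of tr to the heading line from the sample file and then extracts the text with the same regex as above. The sub name `heading` and its flag argument are made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the heading text out of an <h1> line,
# either deleting + and - or replacing them with a space.
sub heading {
    my ($line, $use_space) = @_;
    if ($use_space) { $line =~ tr/+-/ /; }   # "+" and "-" become spaces
    else            { $line =~ tr/+-//d; }   # "+" and "-" are deleted
    my ($text) = $line =~ /<\w+>(.*?)<\/\w+>/;
    return $text;
}

print heading('<h1>Pre-Paid Hours</h1>', 0), "\n";   # PrePaid Hours
print heading('<h1>Pre-Paid Hours</h1>', 1), "\n";   # Pre Paid Hours
```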
