Need help parsing file correctly

Status
Not open for further replies.

ke5jli

Technical User
Jan 1, 2007
I have a large file that I need to parse into a tab-delimited file for use in a database. While I can get it parsed to a certain extent, entries whose questions and answers span multiple lines are throwing me for a loop! Can anyone please show me the correct method for doing this? Here is an excerpt from the file:
____________________________________________________________

G1B Antenna structure limitations; good engineering and
good amateur practice; beacon operation; restricted
operation; retransmitting radio signals

G1B01 (C) [97.15a]
Provided it is not at or near a public-use airport, what is
the maximum height above ground an antenna structure may
rise without requiring its owner to notify the FAA and
register with the FCC?
A. 50 feet
B. 100 feet
C. 200 feet
D. 300 feet

G1B02 (B) [97.101a]
If the FCC Rules DO NOT specifically cover a situation, how
must you operate your amateur station?
A. In accordance with standard licensee operator
principles
B. In accordance with good engineering and good amateur
practice
C. In accordance with station operating practices adopted
by the VECs
D. In accordance with procedures set forth by the
International Amateur Radio Union

____________________________________________________________
Sections like the top one, "G1B Antenna structure...", should be completely ignored when parsed.

G1B01 (C) [97.15a] should be 5 separate elements: (G1,B,01,C,[97.15a]).

Multiple lines of either a question or an answer should be joined into a single element, with each answer choice being a separate element of course.

I was able to parse another similar but much simpler file (i.e. no multi-line entries) for a website, the results of which can be seen here

Any help would be sincerely and greatly appreciated!! :)
BTW... the parsing is done on my local machine then the output file is uploaded to the server.
 
see if this works

Code:
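my @files = qw(2006tech.txt el3-release-12-1-03.txt 2002_Extra_Pool3.txt);

foreach my $f (@files) {
   open(IN, $f) or die "Cannot open $f: $!";
   print "\n\n\n*** Output for $f ***\n\n\n";
   # fields are separated with | here; use \t instead for tab-delimited output
   MAIN: while (<IN>) {
      my $flag = 0;
      chomp;
      # header with answer key and rule reference, e.g. G1B01 (C) [97.15a]
      if (/^(\w\w)(\w)(\w\w)\s*\(([ABCD])\)\s*(\[.+?\])/) {
         print "$1|$2|$3|$4|$5|";
         $flag = 1;
      }
      # header with answer key but no rule reference
      elsif (/^(\w\w)(\w)(\w\w)\s*\(([ABCD])\)\s*/) {
         print "$1|$2|$3|$4||";
         $flag = 1; 
      }
      # header with rule reference but no answer key
      elsif (/^(\w\w)(\w)(\w\w)\s*(\[.+?\])/) {#<-- might not need this one
         print "$1|$2|$3||$4|";
         $flag = 1;
      }
      # anything else (section headings, blank lines) is skipped
      else {
         next;
      }
      if ($flag) {
         # read the question and its answer choices from the same filehandle
         while (<IN>) {
            chomp;
            # choice D ends the record: finish the row, go back for the next header
            if (/^D\./) {
               print "|$_\n";
               next MAIN; 
            }
            # choices A to C each start a new field
            elsif (/^(A|B|C)\./) {
               print "|$_";
            }
            # continuation line of the question or of the current choice
            else {
               print;
            } 
         }
      }
   }
   close(IN);
}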

- Kevin, perl coder unexceptional!
 
I am trying to learn Perl for my personal use. Could you please explain how this little piece
Code:
if (/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
         print "$1|$2|$3|$4|$5|";
is working? I'm not quite sure I understand the pattern matching here.

Sorry I missed your question earlier. Here is an explanation of that regexp:

Code:
(?-imsx:/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/)

matches as follows:
  
NODE                     EXPLANATION
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
----------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
----------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [ABCD]                   any character of: 'A', 'B', 'C', 'D'
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    \[                       '['
----------------------------------------------------------------------
    .+?                      any character except \n (1 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
    \]                       ']'
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------

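The above was generated using YAPE::Regex::Explain. The '/' nodes at the top and bottom are only there because the match delimiters were passed in along with the pattern; they are not part of what the regexp matches.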
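
If the table is hard to follow, the same pattern can also be written with the /x modifier, so that each piece carries its own comment. This is only a sketch: the sample header is copied from the excerpt in the first post, and the names $line and @fields are made up for illustration.

```perl
use strict;
use warnings;

# Sample header line from the excerpt in the first post.
my $line = 'G1B01 (C) [97.15a]';

# Same pattern, spread out with /x so that whitespace and comments
# inside the regexp are ignored.
my @fields = $line =~ /
    (\S\S)        # group 1: first two characters, e.g. "G1"
    (\S)          # group 2: third character, e.g. "B"
    (\S\S)        # group 3: next two characters, e.g. "01"
    \s+ \(        # whitespace, then a literal "("
    ([ABCD])      # group 4: the answer key
    \) \s+        # a literal ")", then whitespace
    (\[.+?\])     # group 5: bracketed rule reference, shortest match
/x;

# In list context a successful match returns the captured groups.
print join("\t", @fields), "\n" if @fields;
```

For the sample line this prints the five elements G1, B, 01, C and [97.15a] separated by tabs.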

- Kevin, perl coder unexceptional!
 
I want to thank everyone who helped me with this!!!!
Kevin, that last one you posted worked beautifully. I took all day yesterday analyzing the output of all 3 files just to be sure before I gave a report.

I have been to other forums in the past trying to find help with similar problems and have almost always been met with rude comments and no real assistance. NOT HERE!! You gentlemen have shown yourselves to be true professionals.

THANK YOU!!! :)

 
You're welcome [wavey]

- Kevin, perl coder unexceptional!
 
Been too busy to follow this thread, but just wanted to throw my kudos to Kevin as well for this one. Bravo.
 
Ok... Sorry to bother you guys with this one again, but a site visitor pointed this out to me. Apparently multi-line answers with a "D." prefix are only getting the first line parsed.
Code:
if (/^D\./) {
   print "|$_\n";
   next MAIN; 
}
I assume that it makes the first loop then gets sent back for the next question here and ignores the rest of the answer choice. Any suggestions for this?
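
That reading appears to be right. Both loops read from the same filehandle, so after "next MAIN" the second line of the "D." answer is handed to the outer loop, which does not recognise it as a header and skips it. Here is a minimal sketch of that behaviour, assuming a simplified header test and an in-memory filehandle; the names $in and @dropped are made up for illustration.

```perl
use strict;
use warnings;

# One header and a two-line "D." answer, both taken from the excerpt.
my $text = <<'END';
G1B02 (B) [97.101a]
D. In accordance with procedures set forth by the
International Amateur Radio Union
END

# Read from the string as if it were a file (in-memory filehandle).
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @dropped;   # lines that fall through to the outer loop's "else" case
MAIN: while (my $line = <$in>) {
    chomp $line;
    if ($line !~ /^\w{5}\s*\(/) {     # simplified stand-in for the header tests
        push @dropped, $line;         # the original script just does "next" here
        next;
    }
    while (my $inner = <$in>) {       # inner loop shares the same handle
        chomp $inner;
        next MAIN if $inner =~ /^D\./;   # jump back to the outer loop
    }
}
close($in);

print "dropped: $_\n" for @dropped;
```

The only line reported as dropped is "International Amateur Radio Union", the continuation of choice D.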
 
I missed that. Try this: replace these lines

Code:
            if (/^D\./) {
               print "|$_\n";
               next MAIN;
            }


with:

Code:
            if (/^D\./) {
               print "|$_";
               while (<IN>) {
                  chomp;  
                  if (/\w/) {
                     print;
                  }  
                  else {
                     print "\n";
                     next MAIN;
                  }
               }
            }


check the output thoroughly.
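
For comparison, here is a rough sketch of a different approach that treats every choice the same way and writes tab-delimited rows, as asked for in the first post. It is not the script from this thread: parse_pool is a hypothetical helper, and it assumes that a blank line ends every question (as in the excerpt) and that continuation lines should be joined with a single space.

```perl
use strict;
use warnings;

# Hypothetical helper: turn pool lines into tab-delimited rows.
# Assumes a blank line ends every question, as in the excerpt.
sub parse_pool {
    my @lines = @_;
    my (@rows, @fields);
    for my $raw (@lines, '') {              # the extra '' flushes the last record
        (my $line = $raw) =~ s/\s+$//;      # drop newline and trailing blanks
        if ($line =~ /^(\w\w)(\w)(\w\w)\s*\(([ABCD])\)\s*(\[.+?\])?/) {
            push @rows, join("\t", @fields) if @fields;   # no blank line seen
            # five header fields plus an empty slot for the question text
            @fields = ($1, $2, $3, $4, defined $5 ? $5 : '', '');
        }
        elsif (!@fields) {
            next;                           # outside a question: ignore the line
        }
        elsif ($line eq '') {
            push @rows, join("\t", @fields);   # blank line ends the record
            @fields = ();
        }
        elsif ($line =~ /^[ABCD]\./) {
            push @fields, $line;            # a new answer choice
        }
        else {
            # continuation of the question or of the latest choice
            $fields[-1] .= ($fields[-1] eq '' ? '' : ' ') . $line;
        }
    }
    return @rows;
}

# Sample lines copied from the excerpt in the first post.
my @sample = (
    'G1B Antenna structure limitations; good engineering and',
    '',
    'G1B02 (B) [97.101a]',
    'If the FCC Rules DO NOT specifically cover a situation, how',
    'must you operate your amateur station?',
    'A. In accordance with standard licensee operator',
    'principles',
    'B. In accordance with good engineering and good amateur',
    'practice',
    'C. In accordance with station operating practices adopted',
    'by the VECs',
    'D. In accordance with procedures set forth by the',
    'International Amateur Radio Union',
    '',
);

print "$_\n" for parse_pool(@sample);
```

Because the row is only written when the blank line (or the next header) is reached, a multi-line "D." answer is handled by the same branch as any other continuation line. As with the fix above, the output should be checked against the real files.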

- Kevin, perl coder unexceptional!
 