Need help parsing file correctly

Status
Not open for further replies.

ke5jli

Technical User
Jan 1, 2007
13
US
I have a large file that I need to parse into a tab-delimited file for use in a database. While I can get it parsed to a certain extent, entries whose questions and answers span multiple lines are throwing me for a loop! Can anyone please show me the correct method for doing this? Here is an excerpt from the file:
____________________________________________________________

G1B Antenna structure limitations; good engineering and
good amateur practice; beacon operation; restricted
operation; retransmitting radio signals

G1B01 (C) [97.15a]
Provided it is not at or near a public-use airport, what is
the maximum height above ground an antenna structure may
rise without requiring its owner to notify the FAA and
register with the FCC?
A. 50 feet
B. 100 feet
C. 200 feet
D. 300 feet

G1B02 (B) [97.101a]
If the FCC Rules DO NOT specifically cover a situation, how
must you operate your amateur station?
A. In accordance with standard licensee operator
principles
B. In accordance with good engineering and good amateur
practice
C. In accordance with station operating practices adopted
by the VECs
D. In accordance with procedures set forth by the
International Amateur Radio Union

____________________________________________________________
Sections like the top one ("G1B Antenna structure...") should be completely ignored when parsed.

G1B01 (C) [97.15a] should be 5 separate elements (G1, B, 01, C, [97.15a]).

Multiple lines of either a question or an answer should be joined into a single element, with each answer choice being a separate element, of course.

I was able to parse another similar but much simpler file (ie. no multi lines) for a website. The results of which can be seen here

Any help would be sincerely and greatly appreciated!! :)
BTW... the parsing is done on my local machine then the output file is uploaded to the server.
 
Here are links to all 3 files that I'm trying to get parsed into the flatfile db's. I post this just so that anyone trying to help me can see the differences in the files and possibly understand why I am having difficulty getting one script to work for all 3. Being that my Perl skill level is somewhere between beginner and novice, I have difficulty with pattern matching and working arrays.

Thanks Again!

 
I know I should have posted this earlier but I didn't think of it. I wish I could edit my previous posts to add it but,...

Anyway, here is a link to the one file that I did get parsed just so that the layout of the output can be seen. Please understand that there are extra fields involved that are created, but not populated by, the text file parsing. This is intentional as I plan on adding info to those fields at a later date.


Using the first line of the data.txt, this is the layout. I am using pipe as a delimiter and field tags () for this just to clarify...

(Section) T1 | (Subsection) A | (Question number) 01 | (Answer) A | (Part 97 reference) [97.3(a)(1)] | (Question) Who is an amateur operator as defined in Part 97? | (Question Image) 19375.jpg | (Choice A) A person named in an amateur operator/primary license grant in the FCC ULS database | (Choice B) A person who has passed a written license examination | (Choice C) The person named on the FCC Form 605 Application | (Choice D) A person holding a Restricted Operating Permit | (Supporting Text) Anything that would explain the answer | (Supporting Image) 9475.jpg


I hope that I'm not giving anyone information overload!! I just hope to fully explain the depth of what I'm trying to accomplish.
 
Is this school/class work? If so, you are not allowed to post it on this forum.

- Kevin, perl coder unexceptional!
 
KevinADC,
No, this is not school work. I am the president of and website administrator for our local amateur radio club. My intent is to use the databases to assist prospective hams in studying for their tests.
 
Mmm. Nasty format. Have to think about this one a bit.

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 
Thank you for anything you can provide. Here is the code that I wrote to parse the first file. Bear in mind that, with my Perl skill level being somewhere between beginner and novice, you may cringe at this ugly code! It worked really well, with the exception of a few errors due to anomalies in the question pool file. However, I was able to correct those in the flat file db rather easily since there were only a few. The next two files, with all of their multi-line entries, are my main problem.

Code:
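#!/usr/bin/perl

$delimiter="\t";

# Stop with the system error message if the file cannot be opened
open (FILE, "2006tech.txt") or die "Cannot open 2006tech.txt: $!";

@lines = ();
while(<FILE>){
	push(@lines, $_);
}
close(FILE);   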

foreach (@lines){
	$test2 = $_;	
	chomp;
		
	if ($_ ne ""){
			
		if ($_ ne "~~"){
				
			$test = $_;
				
			&GetQuestionNumber;
			&GetQuestions;
			&GetChoices;
				
			sub GetQuestionNumber{	
				
			$test =~ s/\s*$//;
				
			if ($test =~  /\w\d\w\d\d\s/){	
			($number, $answer, $part) = split (/\s/, $test);
			($aa, $bb, $cc, $dd, $ee) = split (//, $number);
			$answork = substr($answer, 1, -1);
				
			print "$aa$bb$delimiter$cc$delimiter$dd$ee$delimiter$answork$delimiter$part$delimiter";	
			};	
				
			}	
				
			sub GetQuestions{				
			if ($test =~  /\?$/){	
			$question = $test;
			print "$question$delimiter";
			};	
				
		}
		
	sub GetChoices{	
		
	if ($test =~ /^\w\.\s/){
		($trash, $choicea) = split(/\w\.\s/, $test);
		print "$delimiter$choicea";
	}


elsif ($test =~ /^\s/ ){
	$test =~ s/^\s+//;
	print " $test";
		
}

}

}

else {print "$delimiter$delimiter\n"};	 

}

}
 
I can't say I tested this much, but it worked in limited tests:

Code:
MAIN: while (<DATA>) {
   chomp;
   next if (/^\s*$/);
   if (/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
      print "$1|$2|$3|$4|$5";
      while (<DATA>) {
         chomp;
         if (/^(A|B|C|D)\./) {
            print "|$_";
         }
         elsif (/^\w+/i) {
            print " $_";
         }
         elsif (/\?$/) {
            print "$_|";
         } 
         elsif (/^\s*$/ or /^~~/) {
            print "\n";
            next MAIN; 
         }
      }
   }
}

I used a '|' instead of a tab so I could more easily tell if it was working. I'd stick with '|'; tabs are not good delimiters for this type of data.

- Kevin, perl coder unexceptional!
 
this worked for the 2006tech file:

Code:
open(IN,"2006tech.txt") or die "Cannot open 2006tech.txt: $!";
open(OUT,">2006tech_out.txt") or die "Cannot open 2006tech_out.txt: $!"; 
MAIN: while (<IN>) {
   chomp;
   if (/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
      print OUT "$1|$2|$3|$4|$5|";
      while (<IN>) {
         chomp;
         if (/^(A|B|C|D)\./) {
            print OUT "|$_";
         }
         elsif (/^\w+/i) {
            print OUT " $_";
         }
         elsif (/\?$/) {
            print OUT "$_|";
         } 
         elsif (/^~~/) {
            print OUT "\n";
            next MAIN; 
         }
      }
   }
}


- Kevin, perl coder unexceptional!
 
this appears to work now for all three files

Code:
my @files = qw(2006tech.txt el3-release-12-1-03.txt 2002_Extra_Pool3.txt);
foreach my $f (@files) {
   open(IN, $f) or die "$!";
   print "*** Output for $f ***\n";
   MAIN: while (<IN>) {
      chomp;
      if (/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
         print "$1|$2|$3|$4|$5|";
         while (<IN>) {
            chomp;
            if (/^D\./) {
               print "|$_\n";
               next MAIN; 
            }
            elsif (/^(A|B|C)\./) {
               print "|$_";
            }
            elsif (/^\w+/i) {
               print $_;  
            } 
            elsif (/\?$/) {
               print "$_";
            } 
         }
      }
   }
   close(IN);
}

- Kevin, perl coder unexceptional!
 
this appears to work too, just a little less code:

Code:
my @files = qw(2006tech.txt el3-release-12-1-03.txt 2002_Extra_Pool3.txt);
foreach my $f (@files) {
   open(IN, $f) or die "$!";
   print "*** Output for $f ***\n";
   MAIN: while (<IN>) {
      chomp;
      if (/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
         print "$1|$2|$3|$4|$5|";
         while (<IN>) {
            chomp;
            if (/^D\./) {
               print "|$_\n";
               next MAIN; 
            }
            elsif (/^(A|B|C)\./) {
               print "|$_";
            }
            else {
               print;
            } 
         }
      }
   }
   close(IN);
}

- Kevin, perl coder unexceptional!
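
One thing to watch with the loops above: they end a record as soon as a line starting with "D." is seen, so a choice D that wraps onto a second line (like G1B02 in the excerpt) would seem to lose its continuation, and wrapped lines are joined without a space between them. Here is a rough sketch of one way around that, not a tested replacement. It assumes a record ends at a blank line, a "~~" line, or the next question header, and parse_pool is just a made-up helper name.

```perl
use strict;
use warnings;

# parse_pool is a hypothetical helper: it takes the lines of a question pool
# and returns one pipe-delimited string per question.
sub parse_pool {
    my @lines = @_;
    my (@records, @fields);

    # Write out the record collected so far, if there is one.
    my $flush = sub {
        push @records, join('|', @fields) if @fields;
        @fields = ();
    };

    for my $line (@lines) {
        $line =~ s/\s+$//;    # drop the newline and any trailing whitespace
        if ($line =~ /^(\S\S)(\S)(\S\S)\s*\(([ABCD])\)\s*(\[.+?\])?/) {
            $flush->();       # a new header ends any record still open
            # five header fields plus an empty slot for the question text
            @fields = ($1, $2, $3, $4, defined $5 ? $5 : '', '');
        }
        elsif ($line eq '' or $line =~ /^~~/) {
            $flush->();       # assumed record separators
        }
        elsif (!@fields) {
            next;             # outside a record: section heading, ignore it
        }
        elsif ($line =~ /^[ABCD]\.\s/) {
            push @fields, $line;    # start of a new answer choice
        }
        else {
            # continuation line: append to whichever field is currently open
            $line =~ s/^\s+//;
            $fields[-1] .= ($fields[-1] eq '' ? '' : ' ') . $line;
        }
    }
    $flush->();
    return @records;
}

my @sample = split /\n/, <<'END';
G1B Antenna structure limitations; good engineering and
good amateur practice; beacon operation; restricted
operation; retransmitting radio signals

G1B02 (B) [97.101a]
If the FCC Rules DO NOT specifically cover a situation, how
must you operate your amateur station?
A. In accordance with standard licensee operator
principles
B. In accordance with good engineering and good amateur
practice
C. In accordance with station operating practices adopted
by the VECs
D. In accordance with procedures set forth by the
International Amateur Radio Union
END

my @records = parse_pool(@sample);
print "$_\n" for @records;
```

Because nothing is printed until the record is complete, it is also easy to tack extra empty fields onto the end of each record before joining.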
 
if you want to add the extra fields do that here:

Code:
            if (/^D\./) {
               print "|$_|extra field|extra field|\n";
               next MAIN;
            }

- Kevin, perl coder unexceptional!
 
OUTSTANDING!!! Thank you so very much Kevin! That does in fact work, and better than I hoped for. Your assistance is appreciated more than you know!!

If I may impose upon you for one more thing...
I am trying to learn perl for my personal use, could you please explain how this little piece
Code:
if (/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
         print "$1|$2|$3|$4|$5|";
is working? I'm not quite sure I understand the pattern matching here.


Craig
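
For anyone else reading along with the same question, here is the same pattern written out with the /x modifier, which lets whitespace and comments sit inside the regex. This is only an annotated restatement of the pattern from the code above, run against the header line from the excerpt. The variable names are mine.

```perl
use strict;
use warnings;

# The header pattern from the thread, spread out with /x so that each
# piece can carry its own comment.
my $header_re = qr/
    (\S\S)          # capture 1: two non-space characters, the section "G1"
    (\S)            # capture 2: one non-space character, the subsection "B"
    (\S\S)          # capture 3: two non-space characters, the number "01"
    \s+             # one or more whitespace characters
    \( ([ABCD]) \)  # capture 4: the answer letter inside literal parentheses
    \s+             # one or more whitespace characters
    (\[ .+? \])     # capture 5: the bracketed reference, matched non-greedily
/x;

my $line = 'G1B01 (C) [97.15a]';
my $out  = '';
if ($line =~ $header_re) {
    # After a successful match, $1 to $5 hold the five captured pieces.
    $out = "$1|$2|$3|$4|$5|";
}
print "$out\n";    # G1|B|01|C|[97.15a]|
```

The backslashes in front of ( ) [ ] make them literal characters; the unescaped parentheses are the capture groups that fill $1 through $5.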
 
Kevin,
I have discovered one slight error... I now believe that I may understand how
Code:
(/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])/) {
is doing its pattern matching. However, if the question does not have a Part 97 reference (the final (\[.+?\]) capture), it is skipped over and not written to the file. How would I insert a blank field at that point and still output the rest of the question?
 
You'd have to make the last capture optional
Code:
(/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s+(\[.+?\])?/)
but you'd need to check to see if $5 gets set to undef in this instance. I have a nagging feeling that it may remember the value of the previously captured part 97 reference.

I think Kev deserves a star for this one.

Steve
 
Thanks again Kevin. That does work however it also brings up one more little issue...

After making the last capture optional, it prints a correct string to the file, but only if there is what appears to be a hidden character directly following the answer. For example, in 2006tech.txt, question T3B02 has a hidden character following (C) and does get written, but T3B01 does not have one following (A) and therefore does not get written.

 
OK. Try
Code:
(/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s*(\[.+?\])?/)
instead.

Steve
 
>>Thanks again Kevin. That does work however it also brings up one more little issue...

That should be thanks to Steve. :)

Try Steve's last suggestion and see if it does what you need.

- Kevin, perl coder unexceptional!
 
stevexff said:
but you'd need to check to see if $5 gets set to undef in this instance. I have a nagging feeling that it may remember the value of the previously captured part 97 reference.
Steve is right:
perlre said:
The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See "Compound Statements" in perlsyn.)

NOTE: failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
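
A small sketch of what that means for the optional capture. As I read the perlre note, old values only linger after a failed match; when the match succeeds but the optional group takes no part in it, $5 should come back undef, so the code has to substitute an empty string itself. The two input lines are the header from the excerpt and a T3B01 header with no reference, as described earlier in the thread.

```perl
use strict;
use warnings;

# Steve's pattern with the optional last capture.
my $re = qr/(\S\S)(\S)(\S\S)\s+\(([ABCD])\)\s*(\[.+?\])?/;

my @rows;
for my $line ('G1B01 (C) [97.15a]', 'T3B01 (A)') {
    if ($line =~ $re) {
        # $5 is undef when the reference is missing, so substitute an
        # empty string to keep the field count the same on every row.
        my $ref = defined $5 ? $5 : '';
        push @rows, "$1|$2|$3|$4|$ref|";
    }
}
print "$_\n" for @rows;
```

Testing with defined also avoids the "uninitialized value" warning that interpolating an undefined $5 would raise under warnings.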
 
here is another unfortunate candidate:

T3D05(C) [97.101(d)]

no space between T3D05 and (C)

these errant patterns could be found and adjusted for with some additional regexps. Something like:

Code:
      if (/(\S\S)(\S)(\S\S)\s*\(([ABCD])\)\s+(\[.+?\])/) {
         print "$1|$2|$3|$4|$5|";
      }
      elsif (/(\S\S)(\S)(\S\S)\s*\(([ABCD])\)\s*/) {
         print "$1|$2|$3|$4||";
      }
      etc....

- Kevin, perl coder unexceptional!
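
The two branches could probably also be folded into one pattern by making both gaps \s* and the reference optional. A minimal sketch, assuming question IDs always look like letter, digit, letter, two digits (that tightening is my assumption, the thread uses \S throughout); parse_header is a made-up helper name.

```perl
use strict;
use warnings;

# parse_header is a hypothetical helper. It returns the five header fields,
# or an empty list if the line is not a question header.
sub parse_header {
    my ($line) = @_;
    return unless $line =~ /^([A-Z]\d)([A-Z])(\d\d)\s*\(([ABCD])\)\s*(\[.+?\])?/;
    # A missing reference becomes an empty field instead of undef.
    return ($1, $2, $3, $4, defined $5 ? $5 : '');
}

my @samples = (
    'G1B01 (C) [97.15a]',      # the normal layout
    'T3D05(C) [97.101(d)]',    # no space before the answer
    'T3B01 (A)',               # no Part 97 reference
);
my @rows = map { join('|', parse_header($_)) . '|' } @samples;
print "$_\n" for @rows;
```

Anchoring the pattern with ^ keeps it from matching something that merely looks like a header in the middle of a question line.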
 