Tek-Tips is the largest IT community on the Internet today!

Second look at my regex please

Status
Not open for further replies.

jez

Programmer
Apr 24, 2001
VN
Hi All,

I hope someone can help me by taking a look at my regex and telling me where I am going wrong :)

I have a load of text files in the format

Title:
blah blah blah over one or many lines

Article Body:
blah blah blah over many lines

I am trying to capture the content of the title and the article body without the "Title:" or "Article Body:" labels.
Additionally, other fields sometimes appear between the title and the body text in the file, such as

Wordcount:
this is a number

Keywords:
blah blah blah

I want to ignore these other items in the file.
As a rule, each content block to capture starts with its field name followed by a colon, and I want to capture everything up to the next field name followed by a colon, without any of the field names.


Here is what I am doing so far.
I have read each file into a single variable ($result), since the files are not very big but there are lots of them.

Code:
my $pattern = qr/^Title:\n+^(.+\n)/mx;
if($result=~/$pattern/){
    $title = $1;
}
my $pattern2 = qr/^Article Body:\n+^(.+\n)/mx;
if($result=~/$pattern2/){
    $body = $1;
}


If I could, I would rather do this in one regex, but I am not sure how to get that working.


I would really appreciate any suggestions as my regex knowledge is quite rusty (and probably wasn't that great to start with :)

Thanks,

Jez
 
The 'x' modifier does not help in your regex: it is meant for adding whitespace and comments to a pattern, and it makes Perl ignore the literal space in "Article Body", so your second pattern cannot match.
Your regex also captures only a single line, whereas the end of your title or body field is marked either by another field name or by the end of the file.
It can be done, of course, but I think a regex is the least efficient way to do this, especially if the file may be rather long.
I would do this by reading the file line by line, more or less like so (untested and using ifs for simplicity; a switch or other construct would also work):
Code:
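my($title,$content,$intitle,$incontent)=('','',0,0);
while(<FILE>){
  if(/^Title:\s*$/){              # header checks come first, so a new field ends the current one
    $intitle=1;
    $incontent=0;
  }elsif(/^Article Body:\s*$/){
    $intitle=0;
    $incontent=1;
  }elsif(/^[\w ]+:\s*$/){         # any other field name (Wordcount:, Keywords:, ...) stops capturing
    $intitle=$incontent=0;
  }elsif($incontent){
    $content.=$_;
  }elsif($intitle){
    $title.=$_;
  }
}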
A note: from your regex it seems that you expect multiple eol's after the field name: this should be clarified.
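
For completeness, the regex-only route mentioned above could look roughly like the sketch below. It rests on assumptions that are not confirmed in the thread: every field name sits on a line of its own, consists of word characters and spaces, and ends with a colon. The sample text, the $header pattern, and the field() helper are all made up for illustration. A lookahead stops the lazy capture at the next header line or at the end of the string.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for one file slurped into $result.
my $result = <<'END';
Title:
My first
article

Wordcount:
523

Keywords:
perl, regex

Article Body:
First paragraph.

Second paragraph.
END

# Assumption: a field header is a line holding only a name and a colon.
my $header = qr/^[\w ]+:[ \t]*\n/m;

# Return the text of one named field, up to the next header or end of string.
sub field {
    my ($name, $text) = @_;
    # /s lets . cross newlines; /m lets ^ match at each line start. No /x here.
    my ($value) = $text =~ /^\Q$name\E:[ \t]*\n+(.*?)(?=$header|\z)/ms;
    return unless defined $value;
    $value =~ s/\s+\z//;    # trim trailing blank lines
    return $value;
}

my $title = field('Title', $result);
my $body  = field('Article Body', $result);

print "Title: [$title]\n";
print "Body:  [$body]\n";
```

A content line that itself looks like a header (for example "Note:") would cut the capture short, so the header pattern may need tightening for real files.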

Franco
 
Hi, thanks for the suggestion; it is a good approach.
I am having some difficulty getting it to work though.

The content is indeed multiple lines.

I think I can get it working though, so thanks again.

Jez
 
Hi

Personally, I prefer generic solutions:
Code:
my (%piece, $section);
while (my $line = <FILE>) {
  chomp $line;                       # drop the newline; it is re-added as a separator below
  if ($line =~ m/^([\w\s]+):$/) {    # a header line such as "Title:" starts a new section
    $section = $1;
    next;
  }
  next unless defined $section;      # ignore anything before the first header
  $piece{$section} .= ($piece{$section} ? "\n" : '') . $line;
}
Then you will have the desired data in $piece{'Title'} and $piece{'Article Body'}.

Note that the regular expression may need some adjustments if you have some tricky content lines. In the worst case you can still enumerate all allowed section names, as in /^(Title|Article Body|Wordcount|Keywords):$/ .
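
As a rough illustration of that fallback, here is a minimal self-contained sketch using the enumerated section names. The sample text and the in-memory filehandle are assumptions made so the example runs on its own; with real files you would read from your own filehandle instead. Lines such as "Warning:" that are not in the list stay part of the content.

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be read from a file.
my $text = <<'END';
Title:
Some title

Keywords:
Note: not a header

Article Body:
Line one
Warning:
Line three
END

# Only these names are treated as section headers.
my $header = qr/^(Title|Article Body|Wordcount|Keywords):\s*$/;

my (%piece, $section);
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ $header) {
        $section = $1;              # remember which section we are in
        $piece{$section} = '';
        next;
    }
    next unless defined $section;   # skip anything before the first header
    $piece{$section} .= "$line\n";
}
close $fh;

# Trim trailing blank lines from every captured section.
s/\s+\z// for values %piece;

print "Title: [$piece{'Title'}]\n";
print "Body:  [$piece{'Article Body'}]\n";
```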

Feherke.
 