Second look at my regex please

jez · Jan 3, 2012

Hi All,

I hope someone might be able to help me by taking a look at my regex and telling me where I am going wrong.

I have a load of text files in the format

Title:
blah blah blah over one or many lines

Article Body:
blah blah blah over many lines

I am trying to capture the content of the title and the article body without the "Title:" or "Article Body:" labels.
Additionally, there are sometimes other fields between the title and the body text in the file, such as

Wordcount:
this is a number

Keywords:
blah blah blah

I want to ignore these other items in the file.
As a rule, each content block to capture has its field name followed by a colon, and I want to capture everything up to the next field name followed by a colon, but without any of the field names.

Here is what I am doing so far.
I have read each file into a single variable ($result), since the files are not very big but there are lots of them.

Code:

my $pattern = qr/^Title:\n+^(.+\n)/mx;
if($result=~/$pattern/){
    $title = $1;
}
my $pattern2 = qr/^Article Body:\n+^(.+\n)/mx;
if($result=~/$pattern2/){
    $body = $1;
}

If I could, it would be better in one regex, but I am not sure how to get that working.

I would really appreciate any suggestions, as my regex knowledge is quite rusty (and probably wasn't that great to start with).

Thanks,

Jez

prex1 · Jan 3, 2012

The 'x' modifier does not help in your regex: it lets you put whitespace and comments inside the pattern, so it also makes the literal space in 'Article Body' be ignored.
Your regex fails because the end of your title or body field is marked either by another field name or by the end of the file, while (.+\n) captures a single line only.
It can be done with a regex, of course, but I think that is the least efficient way to do this, especially if the file may be rather long.
I would do this by reading the file line by line, more or less like so (untested and using ifs for simplicity; a switch or other constructs could be used):

Code:

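my($title,$content)=('','');
my($intitle,$incontent)=(0,0);
while(<FILE>){
  # Check for field names first, otherwise a block would never end
  if(/^Title:$/){
    ($intitle,$incontent)=(1,0);
  }elsif(/^Article Body:$/){
    ($intitle,$incontent)=(0,1);
  }elsif(/^[\w ]+:$/){
    # Assumed shape of any other field name (Wordcount:, Keywords:, ...)
    $intitle=$incontent=0;
  }elsif($incontent){
    $content.=$_;
  }elsif($intitle){
    $title.=$_;
  }
}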

A note: from your regex it seems that you expect multiple eol's after the field name: this should be clarified.
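
For comparison, the regex-only route can be sketched as below. This is a minimal sketch rather than a tested solution for the real files: the sample text, the $next_field pattern (a line made of word characters and spaces followed by a colon) and the final trimming step are all assumptions made for illustration. The lazy capture is stopped by a lookahead at the next field name or at the end of the input.

```perl
use strict;
use warnings;

# Sample input, assumed to follow the layout described in the question.
my $result = <<'END';
Title:
My first title
continued on a second line

Wordcount:
123

Keywords:
foo, bar

Article Body:
First paragraph of the body.

Second paragraph of the body.
END

# Assumed shape of a field name line: word characters and spaces, then a colon.
my $next_field = qr/^[\w ]+:\n/m;

# No /x here: with /x the literal space in "Article Body" would be ignored.
# /m lets ^ match at every line start, /s lets . match newlines.
# The lazy (.*?) stops at the next field name line or at the end of the input.
my ($title) = $result =~ /^Title:\n+(.*?)(?=$next_field|\z)/ms;
my ($body)  = $result =~ /^Article Body:\n+(.*?)(?=$next_field|\z)/ms;

# Trim the blank lines left over between one block and the next.
for ($title, $body) {
    s/\s+\z// if defined $_;
}

print "TITLE: $title\n";
print "BODY: $body\n";
```

A content line that itself consists only of words and spaces followed by a colon would be mistaken for a field name here, so the pattern may need tightening for real data.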

Franco


jez · Jan 4, 2012

Hi, thanks for the suggestion, it is a good approach.
I am having some difficulty getting it to work though.

The content is indeed multiple lines.

I think I can get it working though, so thanks again.

Jez

feherke · Jan 4, 2012

Hi

Personally I prefer generic solutions:

Code:

while ($line=<FILE>) {
  if ($line=~m/^([\w\s]+):$/) {
    $section=$1;
    next;
  }
  $piece{$section}.=($piece{$section}?"\n":'').$line;
}

Then you will have the desired data in $piece{'Title'} and $piece{'Article Body'}.

Note that the regular expression may need some adjustments if you have some tricky content lines. In the worst case you can still enumerate all allowed section names as /^(Title|Article Body|Wordcount|Keywords):$/.
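
A self-contained sketch of the enumerated variant might look like the following. The sample text, the in-memory filehandle and the trailing whitespace trim are assumptions added so that it runs without a real file; the list of section names is the one given above.

```perl
use strict;
use warnings;

# Sample input, assumed to follow the layout described in the question.
my $text = <<'END';
Title:
A short title

Wordcount:
2

Article Body:
Line one.

Line two.
END

# Open the string as an in-memory file so the sketch needs no real file.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my %piece;
my $section;
while (my $line = <$fh>) {
    chomp $line;
    # Only the enumerated names start a new section.
    if ($line =~ /^(Title|Article Body|Wordcount|Keywords):$/) {
        $section = $1;
        next;
    }
    next unless defined $section;    # skip anything before the first field name
    $piece{$section} .= "$line\n";
}
close $fh;

# Drop the blank lines that separate one block from the next.
s/\s+\z// for values %piece;

print "Title: $piece{'Title'}\n";
print "Body: $piece{'Article Body'}\n";
```

Keeping every section in the hash means the unwanted fields (Wordcount, Keywords) are simply never read back, and new fields need no extra code.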

Feherke.
