Regex match continue searching

learingperl01 · Mar 2, 2009

Hello everyone, hoping someone can point me in the right direction.

I have the following code, which finds files that end in .txt and then tries to match each line against a regex. The script works fine; the only problem is how do I tell the script to stop the match, or stop printing the line, once a space is found? Here is an example which should clear things up a bit more.

Thanks for the help in advance.

Code:

CODE USED
sub edits() {
    if ( -f && /\.txt$/ ) {  # Find files ending in .txt
        open(LOG, "< $File::Find::name") or return(0);
        while ( my $LINE = <LOG> ) {
            if ( $LINE =~ m/http\:\/\/|www\:\/\//i ) {
                print $LINE;
            }
        }
    }
}

SAMPLE FILE
I AM TRYING TO PRINT URLS FOUND WITHIN THE TXT DOCUMENT THAT START WITH HTTP OR WWW

This is a test this is a test test test http://www.website.
com/getme.html this is test this is a test
This is a test this is a test test test http://www.website2.com/
getme.html this is test this is a test
This is a test this is a test test test http://www.website3.com/getme
.html this is test this is a test
This is a test this is a test test test http://www.website4.com/getme.h
tml this is test this is a test



-------------------
|   OUTPUT SEEN   |
-------------------
This is a test this is a test test test http://www.website.
This is a test this is a test test test http://www.website2.com/
This is a test this is a test test test http://www.website3.com/getme
This is a test this is a test test test http://www.website4.com/getme.h


-------------------
|  WANTED OUTPUT  |
-------------------
http://www.website.com/getme.html
http://www.website2.com/getme.html
http://www.website3.com/getme.html
http://www.website4.com/getme.html

kodr · Mar 2, 2009

This grabs everything between the http: and the first space character.

Code:

#!/usr/bin/perl

$text = 'test test test http://www.test.com test test';
$text =~ m/(http:\S*\s)/;
print $1."\n\n";

Output:

Code:

http://www.test.com
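
A small variation, offered only as a sketch: the pattern above also captures the whitespace character that ends the URL, and it stops at the first URL on the line. Assuming the URLs of interest start with http:// (the sample string and the @urls name below are made up for illustration), \S+ with the /g modifier in list context returns every URL on the line without the trailing space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line containing two URLs.
my $text = 'see http://www.test.com and http://www.example.org/page.html here';

# \S+ stops at the first whitespace character without capturing it;
# /g in list context returns every captured URL on the line.
my @urls = $text =~ m{(http://\S+)}g;

print "$_\n" for @urls;
```

This still works one line at a time, so a URL broken across a linefeed would be cut short.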

learingperl01 · Mar 2, 2009

Thanks for the reply. I have updated the script with your regex, but the results are the same. The main issues are:

*The script prints the entire line, not just the URL.
*If the URL continues onto the next line, the URL/link is cut short. I guess I could check whether a TLD exists in the current line before printing it and, if not, continue reading past the <CR> until whitespace? I am guessing that I am making this harder than it needs to be.

Thanks for the help everyone!

Code:

#!/usr/bin/perl

use File::Find;
find(\&url_find, "/tmp/sub_url_test/");

sub url_find() {
    if ( -f && /\.txt$/ ) {  # Find files ending in .txt
       open(LOG, "< $File::Find::name") or return(0);
       while ( my $LINE = <LOG> ) {
               if ( $LINE =~ m/(http:\S*\s)/ ) {
               print $LINE;
           }
        }
    }
}



CURRENT OUTPUT
This is a test this is a test test test hxxp://xxx.website.
This is a test this is a test test test hxxp://xxx.website2.com/
This is a test this is a test test test hxxp://xxx.website3.com/getme
This is a test this is a test test test hxxp://xxx.website4.com/getme.h

OUTPUT THAT I AM HOPING TO SEE
http://www.website.com/getme.html
http://www.website1.com/getme.html
http://www.website2.com/getme.html
http://www.website3.com/getme.html
http://www.website4.com/getme.html

kodr · Mar 2, 2009

Well then, you'll have to come up with a way of deciding whether or not to read in multiple lines and determining if your url spans a linefeed.

Something like:

Code:

if in LINE there is http and no space after then
  collect http to linefeed into LINE1
  read in next line
  collect from 1st character to first whitespace character (or linefeed) into LINE2

Concatenate LINE1 & LINE2
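
The outline above could be sketched in Perl roughly as follows. This is only one possible reading of it: the extract_urls name and the sample lines are made up for illustration, and it assumes that any URL running to the very end of a line continues at the start of the next one. A URL that really does end at the line break would wrongly pick up the first word of the following line, so treat it as a starting point rather than a finished solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the URLs found in a list of lines, rejoining any URL that was
# cut off by a line break. Assumption: a URL that runs to the very end
# of a line continues at the start of the next line.
sub extract_urls {
    my @lines = @_;
    my @urls;
    for ( my $i = 0; $i < @lines; $i++ ) {
        my $line = $lines[$i];
        $line =~ s/\r?\n\z//;    # strip the linefeed (LF or CRLF)
        while ( $line =~ m{(http://\S+)}g ) {
            my $url = $1;
            # pos() is the offset just past the match; if it equals the
            # line length, the URL touched the end of the line.
            if ( pos($line) == length($line) && $i < $#lines ) {
                # Take the next line's leading run of non-whitespace.
                my ($rest) = $lines[ $i + 1 ] =~ /\A(\S+)/;
                $url .= $rest if defined $rest;
            }
            push @urls, $url;
        }
    }
    return @urls;
}

# Hypothetical sample modelled on the file in the question.
my @sample = (
    "This is a test test test http://www.website.\n",
    "com/getme.html this is test this is a test\n",
    "This is a test test test http://www.website2.com/\n",
    "getme.html this is test this is a test\n",
);
print "$_\n" for extract_urls(@sample);
```

Reading the whole file into a list keeps the look-ahead simple; for very large files the same idea works by holding only the previous line.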

KevinADC · Mar 2, 2009

learingperl01,

You are printing $LINE instead of $1 like kodr showed you to do.

------------------------------------------
- Kevin, perl coder unexceptional!
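
For reference, the change described above might look like this in the File::Find callback. It is a sketch, not the poster's final script: the @found array, the lexical filehandle and the \S+ pattern (which leaves out the trailing whitespace) are choices made here, and the directory is the one from the earlier post. It does not handle URLs split across lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my @found;    # URLs collected by the callback (name chosen for this sketch)

sub url_find {
    return unless -f $_ && /\.txt$/;    # only regular files ending in .txt
    open( my $log, '<', $File::Find::name ) or return;
    while ( my $line = <$log> ) {
        # Keep the captured URL ($1), not the whole line.
        push @found, $1 if $line =~ m{(http://\S+)};
    }
    close $log;
}

# Directory taken from the earlier post; skipped if it does not exist.
my $dir = '/tmp/sub_url_test/';
find( \&url_find, $dir ) if -d $dir;
print "$_\n" for @found;
```

Because File::Find changes into each directory as it walks, opening $File::Find::name is only reliable when the starting directory is an absolute path, as it is here.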

learingperl01 · Mar 2, 2009

Good catch Kevin, that was the problem.

Thanks!
