Regex match continue searching

Status
Not open for further replies.
Jun 3, 2007
84
US
Hello everyone, hoping someone can point me in the right direction.

I have the following code, which finds files that end in .txt and then tries to match each line against a regex. The script works fine; the only problem is how to tell it to stop the match (or stop printing the line) when a space is found. Here is an example, which should clear things up a bit more.

Thanks for the help in advance
:)

Code:
CODE USED
sub edits() {
    if ( -f && /\.txt$/ ) {  #Find files ending in .txt
       open(LOG, "< $File::Find::name") or return(0);
       while ( my $LINE = <LOG> ) {
               if ( $LINE =~ m/http\:\/\/|www\:\/\//i ) {
               print $LINE;
           }
        }
    }
}

SAMPLE FILE
I AM TRYING TO PRINT URLS FOUND WITHIN THE TXT DOCUMENT THAT START WITH HTTP OR WWW

This is a test this is a test test test http://www.website.
com/getme.html this is test this is a test
This is a test this is a test test test http://www.website2.com/
getme.html this is test this is a test
This is a test this is a test test test http://www.website3.com/getme
.html this is test this is a test
This is a test this is a test test test http://www.website4.com/getme.h
tml this is test this is a test



-------------------
|   OUTPUT SEEN   |
-------------------
This is a test this is a test test test http://www.website.
This is a test this is a test test test http://www.website2.com/
This is a test this is a test test test http://www.website3.com/getme
This is a test this is a test test test http://www.website4.com/getme.h


-------------------
|  WANTED OUTPUT  |
-------------------
http://www.website.com/getme.html
http://www.website2.com/getme.html
http://www.website3.com/getme.html
http://www.website4.com/getme.html
 
This grabs everything between the http: and the first space character.

Code:
#!/usr/bin/perl

$text = 'test test test http://www.test.com test test';
$text =~ m/(http:\S*\s)/;
print $1."\n\n";

Output:
Code:
http://www.test.com
 
Thanks for the reply. I have updated the script with your regex, but the results are the same. The main issues are:

*The script prints the entire line, not just the URL.
*If the URL continues onto the next line, the URL/link is cut short. I guess I could check whether a TLD exists in the current line before printing, and if not, continue reading past the <CR> until a whitespace character? I'm guessing that I am making this harder than it needs to be.

Thanks for the help everyone!

Code:
#!/usr/bin/perl

use File::Find;
find(\&url_find, "/tmp/sub_url_test/");

sub url_find() {
    if ( -f && /\.txt$/ ) {  #Find files ending in .txt
       open(LOG, "< $File::Find::name") or return(0);
       while ( my $LINE = <LOG> ) {
               if ( $LINE =~ m/(http:\S*\s)/ ) {
               print $LINE;
           }
        }
    }
}



CURRENT OUTPUT
This is a test this is a test test test hxxp://xxx.website.
This is a test this is a test test test hxxp://xxx.website2.com/
This is a test this is a test test test hxxp://xxx.website3.com/getme
This is a test this is a test test test hxxp://xxx.website4.com/getme.h

OUTPUT THAT I AM HOPING TO SEE
http://www.website.com/getme.html
http://www.website1.com/getme.html
http://www.website2.com/getme.html
http://www.website3.com/getme.html
http://www.website4.com/getme.html
 
Well then, you'll have to come up with a way of deciding whether or not to read in multiple lines and determining if your url spans a linefeed.

Something like:

Code:
if in LINE there is http and no space after then
  collect http to linefeed into LINE1
  read in next line
  collect from 1st character to first whitespace character (or linefeed) into LINE2

Concatenate LINE1 & LINE2
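
The steps above could be sketched in Perl roughly as follows. This is only a sketch under some assumptions: extract_urls is a made-up helper name, it only looks for http:// (not www), and it assumes that a URL reaching the very end of a line always continues on the next line, which will be wrong when a URL really does end at a line break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of one file and returns the URLs
# found, rejoining a URL that is cut off by a line break.
sub extract_urls {
    my @lines = @_;
    my @urls;
    for my $i ( 0 .. $#lines ) {
        my $line = $lines[$i];
        $line =~ s/[\r\n]+\z//;    # drop the line ending
        while ( $line =~ m{(http://\S+)}gi ) {
            my $url = $1;
            # Assumption: a match that runs to the end of the line continues
            # on the next line, up to the first whitespace character.
            if (   pos($line) == length($line)
                && $i < $#lines
                && $lines[ $i + 1 ] =~ m{^(\S+)} )
            {
                $url .= $1;
            }
            push @urls, $url;
        }
    }
    return @urls;
}

# Two of the split URLs from the sample file in the first post.
my @sample = (
    "This is a test this is a test test test http://www.website.\n",
    "com/getme.html this is test this is a test\n",
    "This is a test this is a test test test http://www.website2.com/\n",
    "getme.html this is test this is a test\n",
);
print "$_\n" for extract_urls(@sample);
```

Inside url_find you would slurp the file first, e.g. my @lines = <LOG>; and then print whatever extract_urls(@lines) returns.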
 
learingperl01,

You are printing $LINE instead of $1 like kodr showed you to do.
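
A minimal sketch of the difference, assuming the URL sits on a single line (first_url is just a name made up for this example). Using \S+ without the trailing \s also keeps the space out of the capture:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the captured URL from one line, or undef
# when the line contains no http:// link.
sub first_url {
    my ($line) = @_;
    # \S+ stops at the first whitespace, so only the URL lands in $1.
    return $line =~ m{(http://\S+)}i ? $1 : undef;
}

my $line = "This is a test test test http://www.website4.com/getme.html this is test\n";
my $url  = first_url($line);
print "$url\n" if defined $url;    # the capture only, not the whole $LINE
```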

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 