Regex's

Status
Not open for further replies.

Kachoo

Programmer
Apr 27, 2008
10
US
Hey guys,

I've been doing a lot of research on regexes, yet I still haven't managed to extract a domain name, directory name, and file name from a string in this format. (I'm new to Perl.)

Can someone show me how this could be done? http:// will always be at the start of the line, although the rest of $url can vary.
Any help would be greatly appreciated!

Code:
$domain
http://123.domain.123/Dir/file.ext?&=(or invalid chars)
^^^^^^^--------------^ $domain triggers up to first /
??? $url =~ s/^(http:\/\/*?\/)/i;

$directory
http://123.domain.123/Dir/file.ext?&=(or invalid chars)
Directory triggers   ^---^ if present without ^

$file or anything after $domain.$directory
http://123.domain.123/Dir/file.ext?&=(or invalid chars)
File triggers            ^----^---^^^~>
 
Yeah, I am a BB noob too; I posted this question with a news tag!
 
As I understood it, only parts containing a dot should be considered filenames. So you could take this script as a starting point:

Code:
#!/usr/bin/perl -w
use strict;

my @urls = qw(
  http://123.domain.123
  http://123.domain.123/Dir
  http://123.domain.123/file.ext
  http://123.domain.123/Dir/file.ext
  http://123.domain.123/Dir/file.ext?&=foo
  http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche
);

print "\n";
foreach my $url (@urls) {

  print "URL: $url\n\n";
  if ($url =~ m!http://([^/]+)/?(?:([^.]+)|(?:([^.?]+)/)?(\w+\.\w+).*)?$!o) {
    my $domain = $1 || '';
    my $path = $2 || $3 || '';
    my $file = $4 || '';
    print "Domain: $domain\nDir: $path\nFile: $file";
  } else {
    print "Could not analyse URL";
  }
  print "\n\n\n";
}

Output:

Code:
URL: http://123.domain.123

Domain: 123.domain.123
Dir:
File:


URL: http://123.domain.123/Dir

Domain: 123.domain.123
Dir: Dir
File:


URL: http://123.domain.123/file.ext

Domain: 123.domain.123
Dir:
File: file.ext


URL: http://123.domain.123/Dir/file.ext

Domain: 123.domain.123
Dir: Dir
File: file.ext


URL: http://123.domain.123/Dir/file.ext?&=foo

Domain: 123.domain.123
Dir: Dir
File: file.ext


URL: http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche

Domain: news.google.com
Dir: news?hl=de&ned=de&q=perl&btnG=News-Suche
File:

But as you can see, the result for the last URL is not what you might expect, because "news?hl=de&ned=de&q=perl&btnG=News-Suche" doesn't contain any dot and is therefore considered to be a directory name.
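
One way around that last case, as a rough sketch only: cut the query string off before matching. This assumes that everything from the first ? onward is never part of a directory or file name, which may not hold for your data. The sub name split_url is made up for illustration; the pattern is the same one as in the script above, just spread out with /x so the pieces are easier to see.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_url is a hypothetical helper name, not something from a module.
# Returns (domain, dir, file), or an empty list if the URL does not match.
sub split_url {
    my ($url) = @_;

    # Assumption: everything from the first '?' on is a query string.
    (my $bare = $url) =~ s/\?.*//s;

    return unless $bare =~ m{
        ^http://
        ([^/]+)             # capture 1: domain, up to the first slash
        /?
        (?:
            ([^.]+)         # capture 2: no dot anywhere, so all of it is directory
          |
            (?:([^.?]+)/)?  # capture 3: optional directory in front of a file
            (\w+\.\w+)      # capture 4: name.ext, taken to be the file
            .*
        )?
        $
    }x;

    my $domain = $1;
    my $dir    = defined $2 ? $2 : defined $3 ? $3 : '';
    my $file   = defined $4 ? $4 : '';
    return ($domain, $dir, $file);
}

for my $url (qw(
    http://123.domain.123/Dir/file.ext?&=foo
    http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche
)) {
    my ($domain, $dir, $file) = split_url($url);
    print "Domain: $domain\nDir: $dir\nFile: $file\n\n";
}
```

With the query string gone, "news" still comes out as a directory, which is only a guess, but at least the parameters no longer end up in the directory name.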
 
One consideration to keep in mind is how far you trust your input data. In this case, are there any absolutes you can state about your addresses? Your opening post shows these possibilities:
<domain>
<domain>/<dir>
<domain>/<dir>/<file>

The reply from regex7 adds the possibility of:
<domain>/<file>

Depending on the source of your URL, <file> may or may not have a .ext, which makes the distinction between <dir> and <file> much more difficult (hence regex7's issue with the Google example).

One possibility might be to consider the following (just go with the idea and ignore the lack of error checking):

Code:
use strict;
use warnings;

my $url = 'http://123.domain.123/Dir/file.ext?&=foo';   # sample input from the opening post
my ($domain, $dir, $file) = ('', '', '');
my $app_call = 0;

# assume http:// is stripped off below
# don't care about this part
$url =~ s{^\s*https?://}{};
# toss trailing application calls (everything from the first unexpected character on)
$app_call = 1 if $url =~ s{[^A-Za-z0-9.\-_/].*}{};

my @url_split = split m{/}, $url;
$domain = shift(@url_split) || '';
if (@url_split) {
    # only the last component can be a file; everything before it is directory
    my $last = pop @url_split;
    $dir = join '/', @url_split;
    if (some_file_test($last)) {
        $file = $last;
    } else {
        $dir = length $dir ? "$dir/$last" : $last;
    }
}
print "Domain: $domain\nDir: $dir\nFile: $file\n";

sub some_file_test {
    my ($str_to_check) = @_;
    return 1 if $str_to_check =~ /\.\S\S\S/;   # guess .ext implies file (it may not)
    return 1 if $app_call;                     # guessing some sort of sub-call
    # add any other file check you can think of here, returning 1 on success
    return 0;                                  # if here, we're guessing it's a dir
}
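
To try the idea against the URLs from earlier in the thread, the same steps can be wrapped in a sub. This is only a rough sketch: parse_url and looks_like_file are made-up names, and the character class used to detect "trailing application calls" is an assumption about what counts as a normal URL character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# looks_like_file is a hypothetical stand-in for some_file_test.
sub looks_like_file {
    my ($part, $app_call) = @_;
    return 1 if $part =~ /\.\S\S\S/;  # guess: .ext implies a file
    return 1 if $app_call;            # guess: something was called with arguments
    return 0;                         # otherwise guess it is a directory
}

# parse_url is a hypothetical name; returns (domain, dir, file).
sub parse_url {
    my ($url) = @_;
    $url =~ s{^\s*https?://}{};
    # Assumption: anything outside letters, digits, . - _ / starts an application call.
    my $app_call = ($url =~ s{[^A-Za-z0-9.\-_/].*}{}s) ? 1 : 0;

    my @parts  = split m{/}, $url;
    my $domain = shift @parts;
    $domain = '' unless defined $domain;
    return ($domain, '', '') unless @parts;

    # Only the last component can be a file; the rest is directory.
    my $last = pop @parts;
    return ($domain, join('/', @parts), $last) if looks_like_file($last, $app_call);
    return ($domain, join('/', @parts, $last), '');
}

for my $url (qw(
    http://123.domain.123
    http://123.domain.123/Dir
    http://123.domain.123/file.ext
    http://123.domain.123/Dir/file.ext?&=foo
    http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche
)) {
    my ($domain, $dir, $file) = parse_url($url);
    print "URL: $url\nDomain: $domain\nDir: $dir\nFile: $file\n\n";
}
```

Note that with this guesswork the Google example reports "news" as a file rather than a directory, because a stripped application call is taken as a hint that the last component is something being called. Whether that is the right call depends on your data.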
 