Regex's

Kachoo · Apr 30, 2008

Hey guys,

I've been doing a lot of research on regexes, yet I still haven't managed to extract a domain name, directory name, and file name from a string in this format. (I'm new to Perl.)

Can someone show me how this could be done? http:// will always be at the start of the line although the format of $url can change from

http://123.domain.123/Dir/file.ext

to

http://123.domain.123

to

http://123.domain.123/Dir

Any help would be greatly appreciated!

Code:

$domain
http://123.domain.123/Dir/file.ext?&=(or invalid chars)
^^^^^^^--------------^ $domain triggers upto first /
??? $url =~ s/^(http:\/\/*?\/)/i;

$directory
http://123.domain.123/Dir/file.ext?&=(or invalid chars)
Directory triggers   ^---^ if present without ^

$file or anything after $domain.$directory
http://123.domain.123/Dir/file.ext?&=(or invalid chars)
File triggers            ^----^---^^^~>

Kachoo · Apr 30, 2008

Yeah, I'm a BB noob too: I posted this question with a news tag!

regex7 · Apr 30, 2008

As I understand it, only parts containing a dot should be considered filenames. So you could take this script as a starting point:

Code:

#!/usr/bin/perl -w
use strict;

my @urls = qw(
  http://123.domain.123
  http://123.domain.123/Dir
  http://123.domain.123/file.ext
  http://123.domain.123/Dir/file.ext
  http://123.domain.123/Dir/file.ext?&=foo
  http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche
);

print "\n";
foreach my $url (@urls) {

  print "URL: $url\n\n";
  if ($url =~ m!http://([^/]+)/?(?:([^.]+)|(?:([^.?]+)/)?(\w+\.\w+).*)?$!o) {
    my $domain = $1 || '';
    my $path = $2 || $3 || '';
    my $file = $4 || '';
    print "Domain: $domain\nDir: $path\nFile: $file";
  } else {
    print "Could not analyse URL";
  }
  print "\n\n\n";
}

Output:

Code:

URL: http://123.domain.123

Domain: 123.domain.123
Dir:
File:


URL: http://123.domain.123/Dir

Domain: 123.domain.123
Dir: Dir
File:


URL: http://123.domain.123/file.ext

Domain: 123.domain.123
Dir:
File: file.ext


URL: http://123.domain.123/Dir/file.ext

Domain: 123.domain.123
Dir: Dir
File: file.ext


URL: http://123.domain.123/Dir/file.ext?&=foo

Domain: 123.domain.123
Dir: Dir
File: file.ext


URL: http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche

Domain: news.google.com
Dir: news?hl=de&ned=de&q=perl&btnG=News-Suche
File:

But as you can see, the result for the last URL is not what you might expect: "news?hl=de&ned=de&q=perl&btnG=News-Suche" doesn't contain a dot and is therefore considered to be a directory name.
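One way around the query-string problem is to split the query off before looking at the path at all. The following is only a rough sketch under that assumption, not a replacement for the script above: `split_url` is a made-up name, and it keeps the same "a dot means file" guess for the last path segment.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): returns
# (domain, directory, file, query) or an empty list if the URL does not match.
sub split_url {
    my ($url) = @_;
    # host, then the path up to '?' or '#', then an optional query string
    my ($domain, $path, $query) =
        $url =~ m{^http://([^/?#]+)([^?#]*)(?:\?([^#]*))?}
        or return;
    # split the path into its non-empty segments
    my @parts = grep { length } split m{/}, $path;
    my $file = '';
    # same guess as before: a last segment containing a dot is a file
    $file = pop @parts if @parts && $parts[-1] =~ /\./;
    return ($domain, join('/', @parts), $file, defined $query ? $query : '');
}

my ($domain, $dir, $file, $query) =
    split_url('http://news.google.com/news?hl=de&ned=de&q=perl&btnG=News-Suche');
print "Domain: $domain\nDir: $dir\nFile: $file\nQuery: $query\n";
```

With the query separated first, "news" ends up as the directory and the rest as the query string. Whether "news" is really a directory or a script is still a guess, of course. If installing modules is an option, the URI module from CPAN handles this kind of parsing more robustly.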

PinkeyNBrain · May 2, 2008

One consideration to keep in mind: how far do you trust your input data? In this case, are there any absolutes you can state about your addresses? Your opening post shows these possibilities:
<domain>
<domain>/<dir>
<domain>/<dir>/<file>

The reply from regex7 adds the possibility of:
<domain>/<file>

Depending on the source of your url, <file> may or may not have a .ext, which will make the distinction between <dir> and <file> much more difficult (hence regex7's issue with the google example)

One possibility might be to consider the following (just go with the idea and ignore lack of error checking)

# strip the scheme (http://, https://); we don't care about this part
$url =~ s{^\s*http.*?://}{};
# toss trailing application calls: everything from the first character
# that is not a letter, digit, '.', '-', '_' or '/'
my $app_call = ($url =~ s{[^A-Za-z0-9.\-_/].*}{}) ? 1 : 0;

my @url_split = split(m{/}, $url);
my $domain = shift @url_split;
my ($dir, $file) = ("", "");
if (@url_split) {
    # the last component is either the file or the final directory
    my $last = pop @url_split;
    if (some_file_test($last)) {
        $file = $last;
    } else {
        push @url_split, $last;
    }
    $dir = join("/", @url_split);
}

sub some_file_test {
    my $str_to_check = $_[0];
    return ($str_to_check =~ /\.\S\S\S/)
        ? 1    # guess .ext implies file (it may not)
        : ($app_call)
        ? 1    # guessing some sort of sub-call
        # (add any other file check you can think of here)
        : 0;   # if here, we're guessing it's a dir.
}
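To try different file tests without touching the splitting logic, the same idea can be wrapped in a subroutine that takes the test as a code reference. This is just a sketch: `classify_url` and `$guess` are illustrative names, not anything from the post above, and there is still no real error checking.

```perl
use strict;
use warnings;

# Hypothetical wrapper. $is_file is a code reference that receives the last
# path component and a flag saying whether an application call was stripped.
sub classify_url {
    my ($url, $is_file) = @_;
    $url =~ s{^\s*https?://}{}i;                                 # drop the scheme
    my $app_call = ($url =~ s{[^A-Za-z0-9.\-_/].*}{}) ? 1 : 0;   # drop "?..." etc.
    my ($domain, @parts) = split m{/}, $url;
    my $file = '';
    $file = pop @parts if @parts && $is_file->($parts[-1], $app_call);
    return ($domain, join('/', @parts), $file);
}

# One possible test: something that looks like ".ext", or a stripped call.
my $guess = sub {
    my ($name, $app_call) = @_;
    return ($name =~ /\.\S\S\S/ || $app_call) ? 1 : 0;
};

for my $url (qw(http://123.domain.123
                http://123.domain.123/Dir
                http://123.domain.123/Dir/file.ext)) {
    my ($domain, $dir, $file) = classify_url($url, $guess);
    print "Domain: $domain\nDir: $dir\nFile: $file\n\n";
}
```

Swapping in a stricter or looser `$guess` then only changes how the last component is classified; the domain and the leading directories come out the same either way.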
