Extracting website name from URL

Status
Not open for further replies.

AndyGroom

Programmer
May 23, 2001
972
GB
Is there an easy way to get the results as laid out below, such that it would work with any URL?
Code:
http://google.com                    becomes google.com
http://www.google.co.uk              becomes google.co.uk
http://visualbasic.ittoolbox.org     becomes ittoolbox.org
http://tech.groups.yahoo.com         becomes yahoo.com

- Andy
___________________________________________________________________
If you think nobody cares you're alive, try missing a couple of mortgage payments
 
Yes, but if you use a period as the delimiter, how do you apply a rule that extracts ittoolbox.org or ittoolbox.co.uk from hello.world.ittoolbox.org or hello.world.ittoolbox.co.uk?

The only solution I have found so far is to create a list of known website domain extensions and work backwards from there. For example, skip past .co.uk or .com, and whatever label comes next, working from the right, is the domain name. I just thought there might be an easier way.

- Andy
 
I parse data all the time. You can do it. Start at the beginning of the string and look for the next slash after the "://". If there isn't one, start at the end of the string; if there is, start at that slash. Work backwards until you find a period; that marks the .org, .net, .com, .tv, etc. Then continue to work back until you reach another period or the "://". I'm not going to write the code for you, but you can use methods like that to parse out the string you are looking for.
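
A rough sketch of the approach described above, in VB6. The names HostFromUrl and LastTwoLabels are made up for illustration, and the code assumes a simple URL with no user name or IPv6 literal in it. It shows that keeping the last two labels is enough for .com hosts, but that a .co.uk host needs some extra rule.

```vb
Option Explicit

' Hypothetical helper: returns the host part of a URL, e.g.
' "http://tech.groups.yahoo.com/path" gives "tech.groups.yahoo.com".
Public Function HostFromUrl(ByVal strUrl As String) As String
    Dim p As Long
    Dim s As String

    s = strUrl
    ' Drop the scheme ("http://") if present.
    p = InStr(s, "://")
    If p > 0 Then s = Mid$(s, p + 3)
    ' Drop everything from the first slash onward (the path).
    p = InStr(s, "/")
    If p > 0 Then s = Left$(s, p - 1)
    ' Drop a port number such as ":8080" if present.
    p = InStr(s, ":")
    If p > 0 Then s = Left$(s, p - 1)
    HostFromUrl = LCase$(s)
End Function

' Naive rule: keep only the last two period-separated labels.
Public Function LastTwoLabels(ByVal strHost As String) As String
    Dim parts() As String
    Dim n As Long

    parts = Split(strHost, ".")
    n = UBound(parts)
    If n < 1 Then
        ' Zero or one label: nothing to trim.
        LastTwoLabels = strHost
    Else
        LastTwoLabels = parts(n - 1) & "." & parts(n)
    End If
End Function
```

With these, LastTwoLabels(HostFromUrl("http://tech.groups.yahoo.com")) gives yahoo.com, but the same call on http://www.google.co.uk gives co.uk, which is not a website name.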

Creator of - Movie Reviews, Movie Lists, and much more!
 
I also parse data all the time. I've been parsing data since 1982! I agree with your method but it's not quite what I want and so it won't produce the results I'm after.
Code:
http://tech.groups.yahoo.com
http://groups.yahoo.com
http://yahoo.com
From the URLs above I want a method that will return yahoo.com for all three URLs.
Code:
http://tech.groups.yahoo.co.uk
http://groups.yahoo.co.uk
http://yahoo.co.uk
From the URLs above I want a method that will return yahoo.co.uk for all three URLs.

Simply working through the URL looking for periods and slashes won't work unless you know that ".com" and ".co.uk" are valid domain extensions. I can't see a way of doing this without starting from an exhaustive list of domain extensions, unless there's some clever thing in the Microsoft Internet Controls library that will do it for me?

- Andy
 
Not very elegant but if I understood your question correctly then this does the job:

Code:
'mystring = "http://tech.groups.yahoo.co.uk"
'mystring = "http://groups.yahoo.co.uk"
'mystring = "http://yahoo.co.uk"
'mystring = "http://tech.groups.yahoo.com"
'mystring = "http://groups.yahoo.com"
mystring = "http://yahoo.com"

i = InStr(mystring, ".co.uk")
If i Then
    For a = i - 1 To 1 Step -1
        If Mid(mystring, a, 1) = "." Or Mid(mystring, a, 1) = "/" Then
            mystring = Right(mystring, Len(mystring) - a)
            MsgBox mystring
            Exit For
        End If
    Next
Else
    i = InStr(mystring, ".com")
    For a = i - 1 To 1 Step -1
        If Mid(mystring, a, 1) = "." Or Mid(mystring, a, 1) = "/" Then
            mystring = Right(mystring, Len(mystring) - a)
            MsgBox mystring
            Exit For
        End If
    Next
End If

Experience is something you don't get until just after you need it.
 
>there's some clever thing in the Microsoft Internet Controls library that will do it for me?

There's some clever stuff in the URLMON library that'll do it for you. Mind you, you have to do a little bit of work to be able to use it. Something like:
Code:
Option Explicit

Public Enum ParseAction
    PARSE_CANONICALIZE = 1
    PARSE_FRIENDLY
    PARSE_SECURITY_URL
    PARSE_ROOTDOCUMENT
    PARSE_DOCUMENT
    PARSE_ANCHOR
    PARSE_ENCODE
    PARSE_DECODE
    PARSE_PATH_FROM_URL
    PARSE_URL_FROM_PATH
    PARSE_MIME
    PARSE_SERVER
    PARSE_SCHEMA
    PARSE_SITE
    PARSE_DOMAIN
    PARSE_LOCATION
    PARSE_SECURITY_DOMAIN
    PARSE_ESCAPE
    PARSE_UNESCAPE
End Enum

Private Declare Function CreateUri Lib "URLMON.DLL" (ByVal pwzURI As String, ByVal dwFlags As Long, ByVal dwReserved As Long, ByRef iURI As Long) As Long
Private Declare Function CoInternetParseIUri Lib "URLMON.DLL" (ByVal pIUri As Long, ByVal ParseAction As ParseAction, ByVal dwFlags As Long, ByVal pwzResult As String, ByVal cchResult As Long, ByRef pcchResult As Long, ByVal dwReserved As Long) As Long
Private Const INTERNET_MAX_URL_LENGTH = 2084& ' UTF-16 char length

Public Function GetDomainFromURL(ByVal strURL As String) As String
    Dim result As Long
    Dim myIURI As Long
    Dim pcchResult As Long
    Dim strResult As String

    ' VB passes this string as a byte buffer, so allow two bytes per UTF-16 character
    ' to match the character count given to CoInternetParseIUri below
    strResult = Space(INTERNET_MAX_URL_LENGTH * 2)
    ' need an interface to a URI, because that is what we need to pass to CoInternetParseIUri
    ' (note: the interface reference held in myIURI is never released here)
    result = CreateUri(StrConv(strURL, vbUnicode), 0&, 0&, myIURI)
    If result = 0 Then
        result = CoInternetParseIUri(myIURI, PARSE_DOMAIN, 0&, strResult, INTERNET_MAX_URL_LENGTH, pcchResult, 0&)
        If result = 0 Then
            GetDomainFromURL = Left(StrConv(strResult, vbFromUnicode), pcchResult)
        End If
    End If
End Function
 
Here's a quick test using regular expressions (made quickly in VBScript). I don't guarantee it to be bulletproof, but it works for the examples you gave.

Code:
'http://msdn.microsoft.com/en-us/library/ms974570.aspx
'http://msdn.microsoft.com/en-us/library/yab2dx62.aspx
'http://www.regular-expressions.info/index.html

'http://google.com                    becomes google.com
'http://www.google.co.uk              becomes google.co.uk
'http://visualbasic.ittoolbox.org     becomes ittoolbox.org
'http://tech.groups.yahoo.com         becomes yahoo.com

set re = new regexp
dim strInput

dim Matches
dim Match
dim value1

strInput = "http://tech.groups.yahoo.com"

'note: the dots in the pattern are unescaped, so each one matches any character, not just a literal period
re.Pattern = "\w*.\w{2,3}.\w{2,3}?$"
re.IgnoreCase = true
re.Global = true
wscript.echo(strInput)

Set Matches = re.Execute(strInput)
wscript.echo("Matches.Count: " & Matches.Count)

if Matches.Count >= 1 then
    value1 = Matches.Item(0)
else
    wscript.echo("No matches found")
end if
wscript.echo(value1)

To try it out, copy and paste into a text file. When you name the file change the extension to ".vbs", then double click to run.
 
Andy, going back to looking at the periods and slashes to parse the string out: you can still do it. No valid domain name can be only 2 characters long. It has to be at least 3. Knowing that tells you the "co" part of yahoo.co.uk cannot be the domain, so you would know to grab another segment. It is still possible.
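
For what it's worth, the label-length rule above might be sketched like this in VB6. DomainByLabelLength is a made-up name, and the 3-character minimum is taken from the post as an assumption, not from any standard. The input is a bare host name with the "http://" and any path already removed.

```vb
Option Explicit

' Hypothetical sketch: walk left from the label next to the top-level domain,
' skipping any label shorter than 3 characters, and return everything from
' the first longer label to the end of the host name.
Public Function DomainByLabelLength(ByVal strHost As String) As String
    Dim parts() As String
    Dim n As Long
    Dim i As Long
    Dim j As Long
    Dim s As String

    parts = Split(LCase$(strHost), ".")
    n = UBound(parts)
    If n < 1 Then
        ' Zero or one label: nothing to trim.
        DomainByLabelLength = LCase$(strHost)
        Exit Function
    End If

    ' i never drops below 0, so parts(i) is always a valid element.
    i = n - 1
    Do While i > 0 And Len(parts(i)) < 3
        i = i - 1
    Loop

    ' Rebuild the name from label i through the last label.
    s = parts(i)
    For j = i + 1 To n
        s = s & "." & parts(j)
    Next j
    DomainByLabelLength = s
End Function
```

This gives yahoo.co.uk for tech.groups.yahoo.co.uk and yahoo.com for groups.yahoo.com. It only holds up as long as the assumption does: a host such as www.bt.com comes back unchanged, because the short label "bt" is skipped as if it were part of the extension.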


 
Thanks Error7 and jges but ideally I wanted a solution that works with any url, not just .com and .co.uk ones - I used them by way of example to explain the nature of the problem.

>Is there an easy way to get the results as laid out below, such that it would work with any URL?

Strongm I will give that solution a try, thanks.

- Andy
 
barryna, what about
- Andy
 
My solution should work for any URL; I'm just not guaranteeing that it will.

Strongm's solution will probably be the most robust (no surprises there!).
 
Interesting. That breaks the rules of what I've been taught, but it's pretty obvious that it is possible. Did they relax the 3-character minimum for domain names? Can bt.com have sub-domains? The only other option is to try jges's script to see if that works, or build a list of allowed domain extensions to compare strings against so you can grab the host name you are looking for. Any particular reason you need to have the domain in such an exact format?

 
Yeah, the approach I've taken so far is to build a list of allowed domain extensions:

Code:
  ' Create a list of known domain extensions...
  n$ = ".arpa,.com,.co,.edu,.firm,.gov,.int,.mil,.mobi,.nato,.net,.nom,.org,.store,.web"
  Tmp = Split(n$, ",")
  n$ = ".ac,.ad,.ae,.af,.ag,.ai,.al,.am,.an,.ao,.aq,.ar,.as,.at,.au,.aw,.ax,.az,.ba,.bb,.bd,.be,.bf,.bg,.bh,.bi,.bj,.bm,.bn,.bo,.br,.bs,.bt,.bu,.bv,.bw,.by,.bz,.ca,.cc,.cd,.cf,.cg,.ch,.ci,.ck,.cl,.cm,.cn,.co,.cr,.cs,.cu,.cv,.cx,.cy,.cz,.de,.dj,.dk,.dm,.do,.dz,.ec,.ee,.eg,.eh,.er,.es,.et,.eu,.fi,.fj,.fk,.fm,.fo,.fr,.ga,.gb,.gd,.ge,.gf,.gg,.gh,.gi,.gl,.gm,.gn,.gp,.gq,.gr,.gs,.gt,.gu,.gw,.gy,.hk,.hm,.hn,.hr,.ht,.hu,.id,.ie,.il,.im,.in,.io,.iq,.ir,.is,.it,.je,.jm,.jo,.jp,.ke,.kg,.kh,.ki,.km,.kn,.kp,.kr,.kw,.ky,.kz,.la,.lb,.lc,.li,.lk,.lr,.ls,.lt,.lu,.lv,.ly,.ma,.mc,.md,.me,.mg,.mh,.mk,.ml,.mm,.mn,.mo,.mp,.mq,.mr,.ms,.mt,.mu,.mv,.mw,.mx,.my,.mz,.na,.nc,.ne,.nf,.ng,.ni,.nl,.no,.np,.nr,.nu,.nz,.om,.pa,.pe,.pf,.pg,.ph,.pk,.pl,.pm,.pn,.pr,.ps,.pt,.pw,.py,.qa,.re,.ro,.rs,.ru,.rw,.sa,.sb,.sc,.sd,.se,.sg,.sh,.si,.sj,.sk,.sl,.sm,.sn,.so,.sr,.st,.su,.sv,.sy,.sz,.tc,.td,.tf,.tg,.th,.tj,.tk,.tl,.tm,.tn,.to,.tp,.tr,.tt,.tv,.tw,.tz,.ua,.ug,.uk,.us,.uy,.uz,.va,.vc,.ve,.vg,.vi,.vn,.vu,.wf,.ws,.ye,.yt,.yu,.za,.zm,.zw"
  Tmp2 = Split(n$, ",")
  ExtList = "|"
  For A& = 0 To UBound(Tmp)
    ExtList = ExtList & Tmp(A&) & "|"
    For B& = 0 To UBound(Tmp2)
      If (A& = 0) Then ExtList = ExtList & Tmp2(B&) & "|"  ' add each country code on its own, once only
      ExtList = ExtList & Tmp2(B&) & Tmp(A&) & "|"
      ExtList = ExtList & Tmp(A&) & Tmp2(B&) & "|"
    Next B&
  Next A&

The reason I need the domain name is that for the application I'm writing I need to know whether HTML elements on any particular webpage are hosted on the same domain as the webpage or on a different site. So for example if a webpage contains an image, is that image hosted somewhere on the same domain as the webpage? Obviously with subdomains that still counts as being on the same domain.
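
As a rough illustration of how a list like that could be put to work, here is a hedged VB6 sketch. DomainFromHost and SameSite are made-up names, and the code assumes the list is a pipe-delimited string in the same shape as ExtList above (for example "|.com|.co.uk|.org|") and that the inputs are bare host names.

```vb
Option Explicit

' Hypothetical sketch: find the longest extension of strHost that appears in
' strExtList and return it together with the one label to its left.
' Returns "" when no listed extension matches.
Public Function DomainFromHost(ByVal strHost As String, ByVal strExtList As String) As String
    Dim parts() As String
    Dim n As Long
    Dim i As Long
    Dim j As Long
    Dim suffix As String

    parts = Split(LCase$(strHost), ".")
    n = UBound(parts)
    ' For a.b.c.d try ".b.c.d" first, then ".c.d", then ".d".
    For i = 1 To n
        suffix = ""
        For j = i To n
            suffix = suffix & "." & parts(j)
        Next j
        If InStr(strExtList, "|" & suffix & "|") > 0 Then
            DomainFromHost = parts(i - 1) & suffix
            Exit Function
        End If
    Next i
End Function

' Two hosts count as the same site when both resolve to the same
' non-empty domain name.
Public Function SameSite(ByVal strHostA As String, ByVal strHostB As String, ByVal strExtList As String) As Boolean
    Dim a As String
    Dim b As String

    a = DomainFromHost(strHostA, strExtList)
    b = DomainFromHost(strHostB, strExtList)
    SameSite = (Len(a) > 0) And (a = b)
End Function
```

Trying the longest extension first means ".co.uk" wins over ".uk", so tech.groups.yahoo.co.uk resolves to yahoo.co.uk. The result is only as good as the list that is passed in.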

I suppose now someone will say there's a much easier way of doing it!

- Andy
 
Not to open up a can of worms, but images can be present on a webpage from another domain without being present in the HTML source of the page. It is done pretty often on more complex websites using JavaScript and AJAX. How are you handling that sort of thing? I'm also assuming you are ignoring A elements, since those are just links. And what about iFrames, what do those count as?

 
>breaks the rules of what I've been taught

I've only ever really been aware of an upper limit.

RFC1035 (Domain Names - Implementation and Specification), dating from 1987, states "There are also some restrictions on the length. Labels must be 63 characters or less"

RFC2181 (Clarifications to the DNS Specification), dating from 1997, states "The length of any one label is limited to between 1 and 63 octets"
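
To make the limits quoted above concrete, a minimal VB6 check might look like the following. LabelLengthsValid is a made-up name, and the sketch assumes a plain ASCII host name, so that one character is one octet. It tests only the per-label length and nothing else about the name.

```vb
Option Explicit

' Hypothetical sketch: True when every period-separated label in strHost
' is between 1 and 63 characters long, False otherwise.
Public Function LabelLengthsValid(ByVal strHost As String) As Boolean
    Dim parts() As String
    Dim i As Long

    ' An empty name has no labels; the function returns False by default.
    If Len(strHost) = 0 Then Exit Function

    parts = Split(strHost, ".")
    For i = 0 To UBound(parts)
        If Len(parts(i)) < 1 Or Len(parts(i)) > 63 Then Exit Function
    Next i
    LabelLengthsValid = True
End Function
```

Under this check a two-character label such as "bt" in bt.com passes, since the lower bound is 1 and not 3.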

>The only other option ...

Er, or my code ...


 
>not apart of the standards that you quoted.

Er, the RFCs I have quoted from are (basically) the full DNS standard.

However, all one-letter second-level domain names happen to be reserved by ICANN, and therefore cannot be registered*, hence the 'invalid domain name' message.

> assuming this isn't the first time you've answered this

It is the first time. I knocked the code together this morning whilst awaiting a delayed meeting. There is, therefore, no other thread to link to.



*Actually not completely true; one or two such domain names have been registered where a clear link between the letter and the company was demonstrated - for example X Bank - now owned by PayPal - were allowed x.com
 