Awk Script Problem - Printing file extensions

Status
Not open for further replies.

wotthe2004

Programmer
Apr 27, 2006
1
GB
How do I print the filename, followed by a list of each unique link on that web page and the number of times it occurs in the file?

I need links with the extension .html printed in one list, links with .jpg in another, and links with other file extensions in separate lists.

I have tried using three arrays, one for each file extension, with a counter to find out how many times each link occurs. I am having trouble doing it this way because it counts how many .htm links there are in total rather than how many times a particular link occurs. For example, if a link appears 5 times on the page, I want it to print the link followed by 5 to show that it appears 5 times.

Can anyone help?

Thanks.
 
Any chance you could post your actual code?

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
I thought this was an interesting problem so I had a go:

Code:
gawk '
        # Needs gawk: IGNORECASE and the 3-argument match() are GNU extensions
        BEGIN {
                IGNORECASE=1
                extcount=0
        }
        /href=/ {
                RSTART=RLENGTH=0
                REST=$0
                while (match(REST,"<a +href=([^>]*)>",url)) {

                        # Strip off quotes and any HTTP GET parameters
                        gsub("\"","",url[1])
                        sub("[?].*","",url[1])

                        # Set this before RSTART and RLENGTH are clobbered
                        # by further use of match()
                        REST=substr(REST,RSTART+RLENGTH)

                        # Strip off the host part of the URL to avoid ending
                        # up with extensions like .com, .org, etc.
                        temp=url[1]
                        sub("https?://[^/]*","",temp)
                        match(temp,".*[.]([^./]+)$",ext)

                        linkcount[url[1]]++
                        linkexts[url[1]]=ext[1]
                        exts[ext[1]]++
                }
        }
        END {
                for (extn in exts) {
                        print "Extension \"" extn "\" (" exts[extn] " matches)"
                        print ""
                        for (link in linkcount) {
                                if (linkexts[link]==extn) {
                                        print linkcount[link],link
                                }
                        }
                        print ""
                }
        }
' index.html

This would have been much cleaner if I could figure out how to iterate through a doubly subscripted array. The gawk man page mentions that you can use (i,j) in array, but I think that only applies to testing whether (i,j) is in the array, not to for loops?
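
For what it's worth, awk stores a subscript such as (i,j) as one string, with the parts joined by SUBSEP, so an ordinary for (key in array) loop can walk a doubly subscripted array as long as each key is split back apart. A minimal sketch, assuming any POSIX awk; the function name list_links_by_ext and the sample "extension link" input lines are made up for illustration and are not part of the script above:

```shell
# Hypothetical helper: reads "extension link" pairs on stdin and prints
# each unique (extension, link) pair followed by its count.
list_links_by_ext() {
    awk '
        { count[$1, $2]++ }                # key is stored as $1 SUBSEP $2
        END {
            for (key in count) {
                split(key, part, SUBSEP)   # part[1]=extension, part[2]=link
                print part[1], part[2], count[key]
            }
        }
    ' | sort                               # for-in order is unspecified
}

printf '%s\n' 'html a.html' 'jpg b.jpg' 'html a.html' | list_links_by_ext
```

The sort is only there because awk does not guarantee any traversal order for for (key in array). With this approach the separate linkexts array in the script would probably not be needed, since the extension travels with the key.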

The script will need modification if you also want to capture URLs from <img src=> and so on.
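
One possible way to widen the match is to look for the attribute rather than the tag. A rough sketch, assuming the values are always double-quoted and the attribute names are lowercase; the function name extract_urls is hypothetical, and the pattern will also pick up attributes that merely end in "src" or "href" (data-src, for instance):

```shell
# Hypothetical helper: prints every double-quoted href= or src= value,
# one per line, from the named files or from stdin.
extract_urls() {
    awk '
        {
            rest = $0
            # Match the attribute name, "=", and a double-quoted value.
            while (match(rest, /(href|src)="[^"]*"/)) {
                attr = substr(rest, RSTART, RLENGTH)
                rest = substr(rest, RSTART + RLENGTH)
                sub(/^[^"]*"/, "", attr)   # drop everything up to the opening quote
                sub(/"$/, "", attr)        # drop the closing quote
                print attr
            }
        }
    ' "$@"
}

printf '%s\n' '<a href="a.html"><img src="b.jpg"></a>' | extract_urls
```

This uses only the two-argument match() with RSTART and RLENGTH, so it should not depend on gawk. The printed URLs could then be fed into the same counting logic as the links.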

Annihilannic.
 