Counting links in html page

Status
Not open for further replies.

SotonStu

Programmer
May 7, 2003
33
GB
I'd like to create an awk script that counts the distinct links in an HTML source page and prints each link along with the number of times it appears in the page.

However, I really have no idea how to go about this. Should I change the field separator?

Thanks for any help!
 
There might be more elegant solutions, but I guess this will work:
[tt]
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
}
/href=/ {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ ) {
        # for each field in that line
        if ( $i ~ /href=/ ) {
            # if the field contains 'href='
            split( $i, temp_array_1, ">" )
            # use '>' to split up that field
            split( temp_array_1[1], temp_array_2, "=" )
            # use '=' to split up the 1st part
            gsub( "\"", "", temp_array_2[2] )
            # get rid of the "
            link = temp_array_2[2]
            # store that link
            link_array[link]++
            # count that link
        }
    }
}
END {
    print "SUMMARY"
    for ( link in link_array )
        printf( "%5d times link \"%s\"\n", link_array[link], link )
}
[/tt]
HTH
 
I just tried running your script with my sample HTML page and the only output I got was

SUMMARY
I'm pretty sure that the correct HTML syntax is being used, so why isn't it printing out the links?

I am very grateful for all of your help. Please excuse my ignorance, I've only been using awk for a few days!

 
How about...

awk '{
    strng=$0;
    split(strng, tmp_arry, "[<>]");
    for(indx in tmp_arry)
        if (match(tmp_arry[indx],"[Hh][Rr][Ee][Ff]=\"[^\"]*\"") > 0)
            array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
}
END {
    for(indx in array)
        printf "%5d\t%s\n", array[indx], indx;
}' eg1.html | sort -k 2
 
Could you supply a sample input file?
Just copy&paste the HTML source code you are trying to parse.
 
<HTML>
<HEAD>
<TITLE>NCSA Beginner's Guide--A Longer Example</TITLE>
</HEAD>
<BODY>

<H1>Something a Bit More Complex</H1>
<P>
This is a relatively simple HTML document. Simple
is in the eye of the beholder, but if you study what is included
in this beginner's guide, you can create documents like this
with no problem!
</P>

<H2>Special Effects (header 2)</H2>
<P>
The second paragraph shows some special effects: a
word in <I>italics</I> and a word in <B>bold</B>.
Here is an inlined GIF image: <IMG SRC="BarHotlist.gif">.
</P>
<P>
This is the third paragraph, which demonstrates links. A hypertext link
from the word <A HREF = " to a document called "/People/index.html" exists but if you
try to follow this link, you will get an error screen.
</P>

<H2>A Bit of Poetry (header 2)</H2>
<P>
<A HREF = " Here is a section of text that should display in a
fixed-width font when it is formatted:
</P>
<PRE>
On the stiff twig up there
Hunches a wet black rook
Arranging and rearranging its feathers in the rain ...
</PRE>

<H2>A List (header 2)</H2>
<P>
<A HREF = " This is a unordered list of my favorite fruit:
</P>
<UL>
<LI> cranberries
<LI> blueberries
</UL>

<H2>You're Done! (header 2)</H2>
<P>
<A HREF = " Tips!!</A>
This is the end of the longer sample document.
</P>

<HR>
<ADDRESS>Me (me@mycomputer.univ.edu)</ADDRESS>
</BODY>
</HTML>
 
This is my newest effort:

#!/bin/gawk -f
# Print list of word frequencies

BEGIN { FS = "\"" }
FNR == 1 {
    printf("%s\n%s", (NR==1) ? "" : ")", FILENAME)
    print "\n"
}
/href|HREF/ {
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for ( word in freq ) {
        if ( word ~ /http/ ) {
            link = word
            link_array[link] = freq[word]
        }
    }
    for (link in link_array)
        printf "%s\t%d\n", link, link_array[link]
}


This works fine for one file, but what I really want is to move the for loop that I have in the END block into the main body so that it executes at the end of every file. That way I can pass the script many files and it will print out the respective links for each file. Is there a built-in variable that indicates the end of a file, e.g. FNR==0?
 
How 'bout the following on the page:

#--------------------- countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"
}
$0 ~ tolower("href") {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    print "SUMMARY"
    for ( link in link_array )
        printf( "%5d times link [%s]\n", link_array[link], link )
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
This still just outputs
SUMMARY

with no links printed underneath
 
The 'END' block gets executed ONCE for ALL the supplied/specified files.

What you can do is move the logic out of the END block into the 'main' body with a condition like this:

FNR==1 && NR != 1 {
    # do your SUMMARY here per supplied file
    # and initialize/delete the summary link_array here
}

Actually you can leave the END block in place - it will pick up the LAST file for the summary report.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
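
For anyone following along, here is a minimal sketch of the FNR==1 && NR != 1 idea described above. It is only an illustration under a few assumptions: the two sample files and their contents are made up, summary() is a hypothetical helper name, and FS is simplified to a plain double quote.

```shell
# Two small sample files (made-up content) so the sketch runs on its own.
dir=$(mktemp -d)
printf '<a href="a.html">A</a>\n<a href="a.html">A again</a>\n' > "$dir/one.html"
printf '<a href="b.html">B</a>\n' > "$dir/two.html"

out=$(cd "$dir" && awk '
# summary() is a hypothetical helper: print the counts, then clear them.
function summary(file,    link) {
    printf("SUMMARY for [%s]\n", file)
    for (link in link_array)
        printf("%5d times link [%s]\n", link_array[link], link)
    for (link in link_array)
        delete link_array[link]
}
BEGIN { FS = "\"" }
# First line of every file except the first: report the file just finished.
FNR == 1 && NR != 1 { summary(prevFile) }
# Remember the current file name only after the report above has run.
FNR == 1 { prevFile = FILENAME }
tolower($0) ~ /href/ {
    # The field after the one containing href holds the quoted link.
    for (i = 1; i < NF; i++)
        if (tolower($i) ~ /href/)
            link_array[$(i + 1)]++
}
# The last file has no "next first line", so END reports it.
END { summary(prevFile) }
' one.html two.html)

printf '%s\n' "$out"
rm -rf "$dir"
```

The order of the two FNR == 1 rules matters: if prevFile were updated first, the summary would be labelled with the name of the file that is just starting rather than the one that just ended.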
 
ooops - try this one:

BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"
}
tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    print "SUMMARY"
    for ( link in link_array )
        printf( "%5d times link [%s]\n", link_array[link], link )
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
That works very nicely, thanks vlad. My next problem (I know, they just keep coming) is getting the summary printout after each file has been read. I tried the FNR==1 && NR != 1 condition but it didn't seem to work. Can I use getline() to determine when to run the print loop, e.g. if(getline()==0)?
 
I think that the spaces either side of the equals sign are the problem.

awk '{
    strng=$0;
    split(strng, tmp_arry, "[<>]");
    for(indx in tmp_arry)
        if (match(tmp_arry[indx],"[Hh][Rr][Ee][Ff] *= *\"[^\"]*\"") > 0)
            array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
}
END {
    for(indx in array)
        printf "%5d\t%s\n", array[indx], indx;
}' eg1.html | sort -k 2

Tested...

    2 HREF = "
    1 HREF = "
    1 HREF = "
 
OK, try this one on MULTIPLE files:

nawk -f countHREF.awk myFiles*


#--------------------- countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"
}

# first line of every file but the first: report the file just finished
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( link in link_array ) {
        printf( "%5d times link [%s]\n", link_array[link], link )
        delete link_array[link];
    }
}
# remember the current file only after the summary above has used prevFile
FNR == 1 { prevFile=FILENAME }
tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", FILENAME);
    for ( link in link_array )
        printf( "%5d times link [%s]\n", link_array[link], link )
}



vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Yet again you come up trumps, vlad. This may be wandering off topic, but on future occasions would it be easier if I wrote a shell script to run my awk script on individual files rather than letting awk handle it all?
 
Well... it depends on how much of an 'AWK purist' you are and what the context of the AWK invocation is. I personally tend to gravitate towards doing it all in AWK if I can see a way to specify ALL my input files at once. If there's no pattern that specifies the input files at once, then I'd wrap the AWK script in some kind of shell loop.

I guess it's a matter of personal taste and the given circumstances - your mileage may vary.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
While you guys progressed with the solution(s), I worked on my own little solution. Since I got it working, I'll let you take a look, even though it's too late:
[tt]
BEGIN {
    # now works with different style HTML e.g.
    # <A HREF = "
    # <a href="
}
/[Hh][Rr][Ee][Ff]/ {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ ) {
        # for each field in that line
        if ( $i ~ /^[Hh][Rr][Ee][Ff]$/ ) {
            lastfield = $i
            print "1 = " $i
            continue
        }
        if ( (lastfield ~ /^[Hh][Rr][Ee][Ff]$/) && ($i == "=") ) {
            lastfield = $i
            print "2 = " $i
            continue
        }
        if ( lastfield == "=" ) {
            lastfield = ""
            print "3 = " $i
            split( $i, temp_array, "\"" )
            link = temp_array[2]
            link_array[link]++
        }
        if ( $i ~ /[Hh][Rr][Ee][Ff]=/ ) {
            split( $i, temp_array_1, ">" )
            split( temp_array_1[1], temp_array_2, "=" )
            gsub( "\"", "", temp_array_2[2] )
            link = temp_array_2[2]
            link_array[link]++
        }
    }
}
END {
    print "SUMMARY"
    for ( link in link_array )
        printf( "%5d times link \"%s\"\n", link_array[link], link )
}
[/tt]

For your other question I have a solution too, because I had a similar problem.
Let's say you have 10 HTML documents you want to run your script on. Simply use a shell script/command line like:
[tt]
find ./ -name '*.html' -exec cat {} \; | awk -f MYAWK.awk
[/tt]
This will summarize all 10 files (assuming they all have the file ending "html"), giving you ONE summary.
If you want a per file summary, then use:
[tt]
find ./ -name '*.html' | while read -r filename
do
    cat "$filename" | awk -f MYAWK.awk
done
[/tt]
This will give you TEN summaries.
 
Thanks for that Baraka, I appreciate your help. It's always good to see different solutions, especially when learning like I am.
 
Re-reading the discussion, I would like to point out to SotonStu that your sample HTML code is not actually good HTML. That's why my first script wouldn't work and also why the "cnn.com page" solution failed.

Your code has a space between the "href" and the equal sign and another one between the equal sign and the actual link.

I believe my solution will work for your code and also for regular HTML. It should also work for mixed upper- and lowercase like "hREF" and "Href", regardless of whether there are spaces present or not (in the places I pointed out in the previous paragraph).

HTH
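
The same tolerance for spaces and mixed case can also be sketched with match() in a loop, which additionally picks up more than one link per line. This is only a rough illustration, not the script above: the sample HTML is made up, and it assumes every link is wrapped in double quotes.

```shell
# Made-up sample with mixed case, optional spaces, and two links on one line.
html='<P>See <A HREF = "one.html">one</A> and <a href="two.html">two</a>.
<p>Again: <a hREF ="one.html">one</a></p>'

out=$(printf '%s\n' "$html" | awk '
{
    rest = $0
    # Pull out every href on the line, not just the first one.
    while (match(rest, /[Hh][Rr][Ee][Ff][ \t]*=[ \t]*"[^"]*"/)) {
        attr = substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
        sub(/^[^"]*"/, "", attr)   # drop everything up to the opening quote
        sub(/"$/, "", attr)        # drop the closing quote
        count[attr]++
    }
}
END {
    for (link in count)
        printf("%5d times link \"%s\"\n", count[link], link)
}' | sort)

printf '%s\n' "$out"
```

Because the regular expression works on the whole line, it does not depend on how awk happens to split the line into fields.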
 
Just when you thought the fun had ended, I'm trying to refine the script further. I think it would be quite cool if I could sort the links into different categories and print them out separately. For example, first print out all links ending in .htm or .html, then print out all links to images (.jpg, .jpeg), and then print out the rest. Can it be done?
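
One possible way to do this, as a rough sketch only: count the links as before, then walk the categories in the wanted order in the END block and print the links that fall into each. The sample HTML, the category names (pages, images, other) and the helper function category() are all assumptions made up for illustration.

```shell
# Made-up sample input covering all three categories.
html='<a href="index.html">home</a> <a href="photo.jpg">pic</a>
<a href="notes.htm">notes</a> <a href="data.zip">data</a> <a href="index.html">home</a>'

out=$(printf '%s\n' "$html" | awk '
# category() is a hypothetical helper that maps a link to a group name.
function category(link,    l) {
    l = tolower(link)
    if (l ~ /\.html?$/) return "pages"
    if (l ~ /\.jpe?g$/) return "images"
    return "other"
}
{
    rest = $0
    while (match(rest, /[Hh][Rr][Ee][Ff][ \t]*=[ \t]*"[^"]*"/)) {
        link = substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
        sub(/^[^"]*"/, "", link)   # drop everything up to the opening quote
        sub(/"$/, "", link)        # drop the closing quote
        count[link]++
    }
}
END {
    # Print one category at a time, in the order asked for.
    n = split("pages images other", order, " ")
    for (k = 1; k <= n; k++) {
        print order[k] ":"
        for (link in count)
            if (category(link) == order[k])
                printf("%5d %s\n", count[link], link)
    }
}')

printf '%s\n' "$out"
```

The order of the links inside one category is whatever order awk walks the array in, so pipe each group through sort if a fixed order matters.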
 