Counting links in html page 3

Status
Not open for further replies.

SotonStu

Programmer
May 7, 2003
33
GB
I'd like to create an awk script that counts each distinct link in an HTML source page and prints the link along with the number of times it appears in the page.

However, I really have no idea how to go about this. Should I change the field separator?

Thanks for any help!
 
One last thing (then I'll go back to my regular work *g*): my solution, whether it produces one big summary or multiple summaries, works perfectly well for subdirectories too.

I use it to go through all my log files that are stored in a filesystem that uses something like:
[tt]
/year/month/day/logfiles
[/tt]

With my method I get all logfiles for, say, March 2003 by pointing at the directory /2003/03/, and "find" gives me every logfile for March even though they are in subdirectories.

In your case you could summarise all HTML documents in a given subdirectory on an "all-files" basis or on a "per-file" basis.
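
For HTML files, the find-based approach described above might look something like the sketch below. The function name count_links_tree, the mode argument and the tab-separated output are assumptions made for illustration; this is not the script the post refers to.

```shell
# Hypothetical helper: count href="..." links in every *.html file below a directory.
# Usage: count_links_tree DIR [all|per-file]
count_links_tree() {
    dir=$1
    mode=${2:-all}
    # find descends into subdirectories, so DIR/2003/03/x.html is included too.
    # Caveat: with a very large number of files, find may start awk more than
    # once, and each run would then print its own totals.
    find "$dir" -type f -name '*.html' -exec awk -v mode="$mode" '
        {
            line = $0
            # Pull out every href="..." attribute on the line, ignoring case.
            while (match(tolower(line), /href *= *"[^"]*"/)) {
                link = substr(line, RSTART, RLENGTH)
                sub(/^[^"]*"/, "", link)    # drop everything up to the opening quote
                sub(/"$/, "", link)         # drop the closing quote
                if (mode == "per-file")
                    key = FILENAME "\t" link
                else
                    key = link
                count[key]++
                line = substr(line, RSTART + RLENGTH)
            }
        }
        END {
            for (key in count)
                printf "%d\t%s\n", count[key], key
        }
    ' {} + | LC_ALL=C sort
}
```

Calling `count_links_tree /2003/03` would give one total per link across all files, while `count_links_tree /2003/03 per-file` keeps a separate count for each file.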
 
That is a useful hint for my general use as well as for this script. Thanks for the advice.
 
Regarding the sorting part: sure, it can be done!

Type "man sort" and simply pipe the output through sort.

Sorting is a wide field and you will have a lot of fun with it, I'm sure.

One (maybe) useful hint is what I used in my script at one point:
[tt]
# how to call a system shell command from within an awk file
# (sort -o may name the input file as its output, so no cat is needed)
system("sort -o OUTPUTFILE OUTPUTFILE")
[/tt]
Not tested, but meant to give you ideas ...
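
As an alternative to calling system() on a finished file, awk can pipe its output straight into sort. A minimal sketch, assuming one link per input line; the function name rank_links and the choice of sort keys are illustrative, not part of the script mentioned above.

```shell
# Hypothetical helper: read one link per line and print "count link",
# most frequent first (ties broken alphabetically).
rank_links() {
    awk '
        { count[$0]++ }
        END {
            # Everything printed to this command is fed to a single sort process.
            cmd = "sort -k1,1nr -k2"
            for (link in count)
                printf "%d %s\n", count[link], link | cmd
            close(cmd)   # wait for sort to finish and flush its output
        }
    '
}
```

Because awk's "for (link in count)" visits entries in no particular order, handing the lines to sort is what makes the report order predictable.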
 
OK, I'm still going here but haven't made much progress. What I'm trying to do is keep a separate associative array for each kind of link: if the link has .html in it, it is stored in web_array; if it has .jpg in it, it is stored in pic_array; and the rest are stored in link_array. I've tried using if statements that search for the pattern, but it's not working. My head is starting to hurt.
 
Try this one. Adjust your link 'types' at the top of the script as you wish.

#------------------------ countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"

    TYPEhtm="(.htm$)|(.html$)"
    TYPEgif="(.gif$)|(.jpg$)|(.jpeg$)"
    TYPEother=".*"
    typeNum=split(TYPEhtm SUBSEP TYPEgif SUBSEP TYPEother, typeA, SUBSEP);

}

FNR == 1 {prevFile=FILENAME};
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeA[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }

}
tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeA[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Or better yet, make sure that TYPEother is the LAST entry in the typeA array, since it's the 'catch-ALL' condition:

#------------------ countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"

    # the backslash is doubled so that the regex sees \. (a literal dot)
    TYPEhtm="(\\.htm$)|(\\.html$)"
    TYPEgif="(\\.gif$)|(\\.jpg$)|(\\.jpeg$)"
    TYPEother=".*"
    typeNum=split(TYPEhtm SUBSEP TYPEgif SUBSEP TYPEother, typeA, SUBSEP);
    typeNumiName=split("Html" SUBSEP "Graphics" SUBSEP "Other", typeAname, SUBSEP);

}

FNR == 1 {prevFile=FILENAME};
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }

}
tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

vlad
 
Not sure if you are still reading this thread, baraka, but I was just trying out your find commands and couldn't get them working. It kept telling me "paths must precede expression" when I entered the command on the command line. If I had 10 HTML files in my current directory, how could I run it so that it found all of those files?
 
We all like a puzzle. I'll include my solution here...

for file in *.html
do
  awk '{ strng=$0;
    split(strng, tmp_arry, "[<>]");
    for(indx in tmp_arry)
      if (match(tolower(tmp_arry[indx]),"href *= *\"[^\"]*\"")) {
        # lower-case the text first, then test it against the pattern
        if (tolower(tmp_arry[indx]) ~ "\\.htm[l]*\"$") {
          htm_array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
          htm_cnt++;
        } else { if (tolower(tmp_arry[indx]) ~ "\\.jp[e]*g\"$") {
          jpg_array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
          jpg_cnt++;
        } else {
          oth_array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
          oth_cnt++;
        }
        };
        cnt++;
      }
  }
  END { printf "\n%d links in %s\n", cnt, FILENAME;
    if (htm_cnt) {
      printf "\t%d document links\n", htm_cnt;
      for(indx in htm_array)
        printf "\t\t%5d of %s\n", htm_array[indx], indx;
    };
    if (jpg_cnt) {
      printf "\t%d image links\n", jpg_cnt;
      for(indx in jpg_array)
        printf "\t\t%5d of %s\n", jpg_array[indx], indx;
    };
    if (oth_cnt) {
      printf "\t%d other links\n", oth_cnt;
      for(indx in oth_array)
        printf "\t\t%5d of %s\n", oth_array[indx], indx;
    };
  }' "$file"
done

Tested...

0 links in df.html

4 links in eg2.html
    1 document links
            1 of HREF = "
    3 other links
            1 of HREF = "
            2 of HREF = "

4 links in eg3.html
    2 document links
            1 of HREF = "
            1 of HREF = "
    2 image links
            1 of HREF = "
            1 of HREF = "

0 links in stats.html

3 links in tab0.html
    2 document links
            1 of HREF="tab1.html"
            1 of HREF="tab2.html"
    1 other links
            1 of HREF="favicon.ico"
 
Cheers, Ygor. The more solutions I can see, the more chance I have of actually learning something, especially the little things that I see on websites but have no idea what they actually achieve in a script.
 
I've been trying out your solution, Ygor, but I can't get the format you're using to work. It keeps saying that ' is an invalid character when I try to run your script. How do I get this form of awk script to work?
 
My last post was a Korn shell script that uses the for loop construct to provide a list of files to awk, so it needs to be run by the shell (for example with ksh) rather than passed to awk itself.
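
To make the distinction concrete: the shell runs the for loop and hands awk one file at a time, with the awk program passed as a single quoted argument. A stripped-down sketch of the same shape, using a hypothetical function name href_lines_per_file and counting only the lines that mention href:

```shell
# Hypothetical wrapper: the shell loops over the files, awk handles one at a time.
href_lines_per_file() {
    for file in "$@"
    do
        # The awk program sits inside single quotes; the shell passes it to awk unchanged.
        awk -v name="$file" '
            tolower($0) ~ /href/ { n++ }
            END { printf "%d lines with href in %s\n", n, name }
        ' "$file"
    done
}
```

It could be called as `href_lines_per_file *.html` from the directory holding the pages.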
 
OK, here's your fix, SotonStu:

nawk -f countHREF.awk $(find . -type f -name '*.html')

#----------------- countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"

    # the backslash is doubled so that the regex sees \. (a literal dot)
    TYPEhtm="(\\.htm$)|(\\.html$)"
    TYPEgif="(\\.gif$)|(\\.jpg$)|(\\.jpeg$)"
    TYPEother=".*"
    typeNum=split(TYPEhtm SUBSEP TYPEgif SUBSEP TYPEother, typeA, SUBSEP);
    typeNumiName=split("Html" SUBSEP "Graphics" SUBSEP "Other", typeAname, SUBSEP);
}

# print the summary for the previous file before prevFile is updated below
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }

}

FNR == 1 {prevFile=FILENAME};

tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}
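
A note on the "paths must precede expression" error raised earlier in the thread: GNU find typically reports it when the -name pattern is left unquoted, because the shell then expands *.html into several file names before find ever runs. Quoting the pattern avoids that. A minimal sketch, with list_html as a hypothetical helper name:

```shell
# Hypothetical helper: list every *.html file at or below a directory.
list_html() {
    # The single quotes keep the shell from expanding *.html itself;
    # find receives the pattern and applies it in every directory it visits.
    find "$1" -type f -name '*.html'
}
```

With ten HTML files in the current directory, `list_html .` would print all ten, plus any in subdirectories. The unquoted $(...) form used to pass such a list to awk splits on whitespace, so it assumes file names without spaces.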



vlad
 