Counting links in html page 3

SotonStu · May 7, 2003

I'd like to create an awk script that counts the individual links in an HTML source page and prints each link along with the number of times it appears in the page.

However, I really have no idea how to go about this. Should I change the field separator?

Thanks for any help!

Baraka69 · May 8, 2003

One last thing (then I'll go back to my regular work *g*): my approach of using one big summary or multiple summaries works perfectly well for subdirectories too.

I use it to go through all my log files that are stored in a filesystem that uses something like:
[tt]
/year/month/day/logfiles
[/tt]

With my method I get all logfiles for, say, March 2003 by pointing to the directory /2003/03/, and "find" gives me all logfiles for March even though they are in subdirectories.

In your case you could summarise all html documents in a given subdirectory on an "all-files" basis or on a "per-file" basis.
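
For illustration, here is a rough sketch of the "all-files" versus "per-file" idea. The function names count_all and count_per_file are made up for this example, and counting lines that contain "href" is a simplification of real link counting; the find commands referred to in the thread are not shown on this page, so this is only one way it might look.

```shell
# Hypothetical helpers (not from the thread): count lines containing "href"
# in every .html file below the directory given as $1.

count_all() {
    # "All-files" basis: one awk process reads every file, giving one total.
    find "$1" -type f -name '*.html' -exec cat {} + |
        awk 'tolower($0) ~ /href/ { n++ } END { printf "%d\n", n }'
}

count_per_file() {
    # "Per-file" basis: one awk process per file, giving one total each.
    find "$1" -type f -name '*.html' | sort | while IFS= read -r f; do
        awk 'tolower($0) ~ /href/ { n++ }
             END { printf "%s: %d\n", FILENAME, n }' "$f"
    done
}
```

Calling count_all /2003/03 would print a single number, while count_per_file /2003/03 would print one line per file, including files in subdirectories.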

SotonStu · May 8, 2003

That is a useful hint for my general use as well as for this current script. Thanks for the advice.

Baraka69 · May 8, 2003

Regarding the sorting part, sure it can be done!

Type "man sort" and simply pipe the output through sort.

Sorting is a wide field and you will have a lot of fun with it, I'm sure.

One (maybe) useful hint is what I used in my script at one point:
[tt]
# how to call a system shell command from within an awk file
# (sort the file in place; -o may name the same file as the input)
system("sort -o OUTPUTFILE OUTPUTFILE")
[/tt]
Not tested, but meant to give you ideas ...
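
As a small sketch of the "pipe the output through sort" suggestion, awk can also send its own output to sort directly. The sample links below are invented for illustration, and the sort flags (-rn, highest count first) are just one possible choice.

```shell
# Made-up sample input: one link per line.
sorted=$(printf '%s\n' a.html b.jpg a.html c.html a.html b.jpg |
    awk '{ count[$0]++ }
         END {
             # Send each "count link" line to sort instead of stdout.
             for (link in count)
                 printf "%d %s\n", count[link], link | "sort -rn"
             # Closing the pipe lets sort finish and print its result.
             close("sort -rn")
         }')
echo "$sorted"
```

With this input the most frequent link (a.html, seen 3 times) comes out first.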

SotonStu · May 8, 2003

Ok, I'm still going here but haven't made much progress. What I'm trying to do is have a different associative array for each kind of link: if the link has .html in it, it is stored in web_array; if it has .jpg in it, it is stored in pic_array; and the rest just get stored in link_array. I've tried using if statements that search for the pattern, but it's not working. My head is starting to hurt.
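
A minimal sketch of the three-array idea described above, assuming each input line is already a bare link (pulling the links out of the HTML is a separate step). The array names come from the post; the sample links and the end-anchored patterns are assumptions for illustration.

```shell
# Made-up sample input: one link per line.
classified=$(printf '%s\n' index.html photo.jpg index.html notes.txt |
    awk '{
             link = $0
             # Test the lowercased link so .HTML and .JPG are caught too.
             if (tolower(link) ~ /\.html?$/) {
                 web_array[link]++
             } else if (tolower(link) ~ /\.jpe?g$/) {
                 pic_array[link]++
             } else {
                 link_array[link]++
             }
         }
         END {
             for (l in web_array)  printf "web  %d %s\n", web_array[l], l
             for (l in pic_array)  printf "pic  %d %s\n", pic_array[l], l
             for (l in link_array) printf "link %d %s\n", link_array[l], l
         }')
echo "$classified"
```

The if / else if / else chain matters: without the else branches a link could be counted in more than one array.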

vgersh99 · May 8, 2003

Try this one. Adjust your link 'types' at the top of the script as you wish.

#------------------------ countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"

    # "\\." in an awk string gives the regex \. (a literal dot)
    TYPEhtm="(\\.htm$)|(\\.html$)"
    TYPEgif="(\\.gif$)|(\\.jpg$)|(\\.jpeg$)"
    TYPEother=".*"
    typeNum=split(TYPEhtm SUBSEP TYPEgif SUBSEP TYPEother, typeA, SUBSEP);
}

# a new file starts: print the summary for the previous file
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeA[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

# remember the current file name only after the summary rule above has run
FNR == 1 {prevFile=FILENAME};

tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeA[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

vgersh99 · May 8, 2003

Or better yet - make sure that TYPEother is the LAST entry in the typeA array - it's the 'catch-ALL' condition:

#------------------ countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"

    # "\\." in an awk string gives the regex \. (a literal dot)
    TYPEhtm="(\\.htm$)|(\\.html$)"
    TYPEgif="(\\.gif$)|(\\.jpg$)|(\\.jpeg$)"
    TYPEother=".*"
    typeNum=split(TYPEhtm SUBSEP TYPEgif SUBSEP TYPEother, typeA, SUBSEP);
    typeNumiName=split("Html" SUBSEP "Graphics" SUBSEP "Other", typeAname, SUBSEP);
}

# a new file starts: print the summary for the previous file
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

# remember the current file name only after the summary rule above has run
FNR == 1 {prevFile=FILENAME};

tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

SotonStu · May 8, 2003

Not sure if you are still reading this thread, Baraka, but I was just trying out your find commands and I couldn't get them working: it kept telling me "paths must precede expression" when I entered it on the command line. If I had 10 html files in my current directory, how could I run it so that it found all of those files?

Ygor · May 8, 2003

We all like a puzzle. I'll include my solution here...

for file in *.html
do
  awk '{ strng=$0;
         # split the line at < and > so each tag lands in its own element
         split(strng, tmp_arry, "[<>]");
         for(indx in tmp_arry)
           if (match(tolower(tmp_arry[indx]),"href *= *\"[^\"]*\"")) {
             # RSTART and RLENGTH are set by match() above
             if (tolower(tmp_arry[indx]) ~ "\\.htm[l]*\"$") {
               htm_array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
               htm_cnt++;
             } else { if (tolower(tmp_arry[indx]) ~ "\\.jp[e]*g\"$") {
                 jpg_array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
                 jpg_cnt++;
               } else {
                 oth_array[substr(tmp_arry[indx], RSTART, RLENGTH)]++;
                 oth_cnt++;
               }
             };
             cnt++;
           }
       }
       END { printf "\n%d links in %s\n", cnt, FILENAME;
         if (htm_cnt) {
           printf "\t%d document links\n", htm_cnt;
           for(indx in htm_array)
             printf "\t\t%5d of %s\n", htm_array[indx], indx;
         };
         if (jpg_cnt) {
           printf "\t%d image links\n", jpg_cnt;
           for(indx in jpg_array)
             printf "\t\t%5d of %s\n", jpg_array[indx], indx;
         };
         if (oth_cnt) {
           printf "\t%d other links\n", oth_cnt;
           for(indx in oth_array)
             printf "\t\t%5d of %s\n", oth_array[indx], indx;
         };
       }' "$file"
done

Tested...

0 links in df.html

4 links in eg2.html
        1 document links
                    1 of HREF = "http://www.mytest.com/sample.htm"
        3 other links
                    1 of HREF = "http://www.tek-tips.com"
                    2 of HREF = "http://www.google.com"

4 links in eg3.html
        2 document links
                    1 of HREF = "http://www.mytest.com/sample.htm"
                    1 of HREF = "http://www.tek-tips.com/index.html"
        2 image links
                    1 of HREF = "http://www.google.com/test.jpg"
                    1 of HREF = "http://www.google.com/test2.jpeg"

0 links in stats.html

3 links in tab0.html
        2 document links
                    1 of HREF="tab1.html"
                    1 of HREF="tab2.html"
        1 other links
                    1 of HREF="favicon.ico"

SotonStu · May 8, 2003

Cheers, Ygor. The more solutions I can see, the more chance I have of actually learning something, especially the little things that I see on websites without any idea of what they actually achieve in a script.

SotonStu · May 8, 2003

I've been trying out your solution, Ygor, but I can't get the format you're using to work. It keeps saying that ' is an invalid character when I try to run your script. How do I get this form of awk script to work?

Ygor · May 8, 2003

My last post was a korn shell script, using the for loop construct to provide a list of files to awk.
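
To make the distinction concrete, here is a rough sketch of the two ways of running such code. The file names count.sh, count.awk and page.html are hypothetical, and the awk program is deliberately tiny (it only counts lines containing "href"). The assumption is that the "invalid character" error came from handing a shell script to awk -f, which the thread does not state outright.

```shell
workdir=$(mktemp -d)
printf '<a href="x.html">x</a>\n' > "$workdir/page.html"

# 1. A shell script: the awk program sits inside single quotes, so the file
#    has to be run by a shell (sh or ksh), not by "awk -f".
cat > "$workdir/count.sh" <<'EOF'
for file in *.html
do
    awk 'tolower($0) ~ /href/ { n++ }
         END { printf "%s: %d\n", FILENAME, n }' "$file"
done
EOF
from_shell=$(cd "$workdir" && sh count.sh)

# 2. A pure awk file: no surrounding quotes and no shell loop, so it is run
#    with "awk -f" and the file names are given on the command line.
cat > "$workdir/count.awk" <<'EOF'
tolower($0) ~ /href/ { n++ }
END { printf "%s: %d\n", FILENAME, n }
EOF
from_awk=$(cd "$workdir" && awk -f count.awk page.html)

echo "$from_shell"
echo "$from_awk"
```

Both variants print the same line for page.html; only the way they are launched differs.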

vgersh99 · May 8, 2003

ok, here's your fix, SotonStu

nawk -f countHREF.awk $(find . -type f -name '*.html')

#----------------- countHREF.awk
BEGIN {
    # count HTML href tags
    # assuming correct HTML syntax e.g. <a href="myLink">myText</a>
    FS="(=<>)|(\")"

    # "\\." in an awk string gives the regex \. (a literal dot)
    TYPEhtm="(\\.htm$)|(\\.html$)"
    TYPEgif="(\\.gif$)|(\\.jpg$)|(\\.jpeg$)"
    TYPEother=".*"
    typeNum=split(TYPEhtm SUBSEP TYPEgif SUBSEP TYPEother, typeA, SUBSEP);
    typeNumiName=split("Html" SUBSEP "Graphics" SUBSEP "Other", typeAname, SUBSEP);
}

# a new file starts: print the summary for the previous file
FNR == 1 && NR != 1 {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

# remember the current file name only after the summary rule above has run
FNR == 1 {prevFile=FILENAME};

tolower($0) ~ "href" {
    # a line contains 'href='
    for ( i=1; i<=NF; i++ )
        # for each field in that line
        if ( tolower($i) ~ "href" ) {
            # printf("link->[%s]\n", $(i+1));
            link_array[$(i+1)]++
        }
}
END {
    printf("SUMMARY for [%s]\n", prevFile);
    for ( i=1; i <= typeNum; i++) {
        printf(" TYPE->[%s]\n", typeAname[i]);
        for ( link in link_array ) {
            if ( link ~ typeA[i] ) {
                printf("\t%5d times link [%s]\n", link_array[link], link )
                delete link_array[link];
            }
        }
    }
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
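
A note on the quoting in the find call above, sketched under assumptions: the failing command from earlier in the thread is not shown on this page, so the explanation in the comments is only a likely cause. The directory and file names are invented for the demonstration.

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/a.html" "$demo/b.html" "$demo/sub/c.html"

# Quoted pattern: find itself receives *.html and tests it against every
# file name it visits, including those in subdirectories.
found=$(cd "$demo" && find . -type f -name '*.html' | sort)
echo "$found"

# Unquoted, the shell would expand *.html to "a.html b.html" before find
# starts, leaving a stray word after -name. GNU find reports that as
# "paths must precede expression" (likely the error seen earlier).
```

With the pattern quoted, all three files are listed, including the one in sub/.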
