This is an awk script I have written in an attempt to isolate all the links in an HTML page and print the frequency of each link:
#!/bin/gawk -f
# Print list of word frequencies
BEGIN { FS = "\"" }
NR == 1 { printf("%s\n%s", (NR==1) ? "" : "", FILENAME) }
/http/ {
    printf "\n"
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
This produces the following output on a simple HTML file containing two links to Google:
2
<A HREF = 1
from the word <A HREF = 1
>me!</A> 2
How can I get it to add JUST the links to the associative array, and not all of the tags surrounding them?
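For reference, here is a minimal sketch of one possible direction, not a definitive answer. It assumes every link appears as a double-quoted attribute value starting with http:// or https://, so that with FS set to a double quote the link sits in a field of its own. Instead of counting every field, it counts only the fields that match that prefix. The function name count_links and the sample page are made up for illustration; single-quoted or relative links would not be counted.

```sh
#!/bin/sh
# Hypothetical helper: count_links FILE prints "URL<TAB>count" for each quoted link.
count_links() {
    awk -F'"' '
    {
        # With FS set to a double quote, quoted attribute values become separate fields.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^https?:\/\//)   # keep only fields starting with http:// or https://
                freq[$i]++
    }
    END {
        for (link in freq)
            printf "%s\t%d\n", link, freq[link]
    }
    ' "$1"
}

# Made-up sample page with two links to the same address
sample=$(mktemp)
cat > "$sample" <<'EOF'
<A HREF = "http://www.google.com">me!</A>
from the word <A HREF = "http://www.google.com">me!</A>
EOF
count_links "$sample"
rm -f "$sample"
```

On the sample page this should print a single line: the address, a tab, and the count 2. The order of lines from the for-in loop is unspecified in awk, so with several distinct links the output may need to be piped through sort.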
Thanks again!