Using POSIX expressions in gawk

Status
Not open for further replies.

Chaiwalla

Technical User
Jul 28, 2010
3
NZ
I’m running gawk for Win32 on a Win XP machine.
I’ve been trying to write a program that will count word frequencies when “words” are defined as alpha strings with at most two digits at the end, e.g. mp3. With the program below I experimented with various delimiters.

# Output list of word frequencies with alternate field separator
BEGIN {
    FS = ","
}
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] > "freq_list.txt"
}

I created a test input file:
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,

With comma as delimiter FS="," the output file was:

nine9: ten0 4
five5: six6 4
4
seven7: eight8 4
three3: four4 4
one1: two2 4

With colon as delimiter FS=":" the output file was:

six6, seven7 4
one1 4
two2, three3 4
four4, five5 4
ten0, 4
eight8, nine9 4

Since I want a word to be a string of alpha characters optionally followed by one or two digits, the delimiters should be any punctuation, space, tab, or run of more than two digits. I was hoping to code this using POSIX expressions like [:punct:], [:blank:] and [:digit:].

When I set FS="/[:punct:]/", the output is:

one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0, 4

The POSIX expression FS="/[:digit:]/" gave the same results.

I’ve tried to find some mention of turning POSIX expressions on in gawk, but haven’t found anything saying they have to be explicitly activated. Does anyone know why these POSIX expressions seem to divide only by line?

Without any FS definition, the program outputs:

six6, 4
seven7: 4
five5: 4
eight8, 4
one1: 4
ten0, 4
nine9: 4
two2, 4
four4, 4
three3: 4

Thank you!
 
I'd try this:
FS="[[:punct:]]"

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
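
A likely reason the original attempt did not split anything: in a string assigned to FS the slashes are ordinary characters, and [:punct:] is only treated as a character class when it sits inside an outer bracket expression. On its own it is a plain bracket expression matching one of the characters :, p, u, n, c or t. So "/[:punct:]/" can only match a slash, one of those characters, and another slash, which never occurs in the test file. Here is a minimal sketch of the difference, using split() on one hard-coded sample line. The file name and variable names are made up for illustration.

```awk
# fs_demo.awk (hypothetical file name); run with: gawk -f fs_demo.awk
BEGIN {
    line = "one1: two2, three3: four4,"

    # Slashes are literal here and [:punct:] is just the set {: p u n c t},
    # so the pattern never matches and the line stays in one piece.
    n_bad = split(line, bad, "/[:punct:]/")

    # Doubled brackets make [:punct:] a real character class.
    n_good = split(line, good, "[[:punct:]]")

    print "pieces with /[:punct:]/ :", n_bad     # 1
    print "pieces with [[:punct:]] :", n_good    # 5
    for (i = 1; i <= n_good; i++)
        printf "  good[%d] = \"%s\"\n", i, good[i]
}
```

The pieces from the second split still carry leading spaces, and there is an empty piece after the final comma. The lone 4 in the comma-separated output earlier in the thread is presumably that empty field being counted.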
 
Thanks, PHV
I tried using several POSIX expressions like [[:punct:]] and [[:blank:]], but they were always ignored. Do I need to initialize POSIX in gawk? I thought these expressions were automatically available in gawk.
 
A contributor on another forum suggested defining the field separator as a string. Many thanks, DaWei!

Here's the complete program:

# gawk -f word_freq.awk --posix file_to_process
# Outputs a list of words and word frequencies
# Words are defined as a string of alpha characters which may end with up to two digits.
# Punctuation marks and white space serve as field separators.
BEGIN {
    FS = "[, . ; : ! ? & \* \' \" \\ \/ ( ) { } < > @ % $]+"
}
{
    # Count only fields made of letters followed by at most two digits
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[a-zA-Z]+[0-9]{0,2}$/)
            freq[$i]++
}
END {
    for (word in freq)
        printf "%-10s\t%d\n", word, freq[word] > "freq_list.txt"
}
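
The hand-written punctuation list above can probably be replaced by bracketed character classes. The following is only a sketch under stated assumptions and has not been tested on gawk for Win32. It assumes a gawk build that recognises POSIX classes, and it writes [[:digit:]]?[[:digit:]]? instead of {0,2} so that it does not depend on interval-expression support. The file name is made up. Note that [[:punct:]] covers more characters than the list in the program above (for example - and _), and [[:alpha:]] depends on the locale, so results may differ on real text.

```awk
# word_freq_classes.awk (hypothetical file name)
# Run with: gawk -f word_freq_classes.awk file_to_process
BEGIN {
    # One or more punctuation or whitespace characters separate fields.
    FS = "[[:punct:][:space:]]+"
}
{
    # Keep only fields that are letters followed by zero, one or two digits.
    # Empty fields (e.g. after a trailing comma) fail the test and are skipped.
    for (i = 1; i <= NF; i++)
        if ($i ~ /^[[:alpha:]]+[[:digit:]]?[[:digit:]]?$/)
            freq[$i]++
}
END {
    # Order of traversal in "for (word in freq)" is unspecified.
    for (word in freq)
        printf "%-10s\t%d\n", word, freq[word]
}
```

On the four-line test file from the first post this should report each of the ten words (one1 through ten0) with a count of 4. Output goes to standard output here; redirect it with > "freq_list.txt" as in the original program if a file is wanted.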
 