I’m running gawk for Win32 on a Win XP machine.
I’ve been trying to write a program that will count word frequencies when “words” are defined as alpha strings with at most two digits at the end, e.g. mp3. With the program below I experimented with various delimiters.
# Output list of word frequencies with alternate field separator
BEGIN {
    FS=",";
}
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] > "freq_list.txt"
}
I created a test input file:
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0,
With comma as delimiter FS="," the output file was:
nine9: ten0 4
five5: six6 4
4
seven7: eight8 4
three3: four4 4
one1: two2 4
With colon as delimiter FS=":" the output file was:
six6, seven7 4
one1 4
two2, three3 4
four4, five5 4
ten0, 4
eight8, nine9 4
Since I want a word to consist only of alpha characters, optionally followed by one or two digits at the end, I want the delimiters to be any punctuation, space, tab, or run of more than two digits. I was hoping to be able to code this using POSIX expressions like [:punct:], [:blank:] and [:digit:].
When a POSIX FS="/[:punct:]/" is used, it produces:
one1: two2, three3: four4, five5: six6, seven7: eight8, nine9: ten0, 4
The POSIX expression FS="/[:digit:]/" gave the same results.
I’ve tried to find some mention of turning POSIX expressions on in gawk, but haven’t found anything that says they have to be explicitly activated. Does anyone know why these POSIX expressions seem to divide the input only by line?
Without any FS definition, the program gives this output:
six6, 4
seven7: 4
five5: 4
eight8, 4
one1: 4
ten0, 4
nine9: 4
two2, 4
four4, 4
three3: 4
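A possible explanation, offered as a sketch rather than a confirmed diagnosis: when FS is assigned a string, the slashes are taken literally, and a class name such as [:punct:] only has its special meaning inside an outer bracket expression. So FS="/[:punct:]/" would match a slash, then one of the characters : p u n c t, then another slash, which never occurs in the test file, and each line stays a single field. The script below tries a separator written as [[:punct:][:blank:]] plus a run of three or more digits. The shell wrapper, the file name sample.txt and its contents are assumptions made only to keep the example self-contained; on the Win32 build the awk program would more likely live in its own file and be run with gawk -f. The three repeated [[:digit:]] stand in for {3,}, on the assumption that older gawk releases need --re-interval for interval expressions.

```shell
#!/bin/sh
# Hypothetical sample input, loosely modeled on the test file in the post.
printf '%s\n' 'one1: two2, three3: four4,' 'one1: mp3 abc12345def' > sample.txt

awk '
BEGIN {
    # FS is a plain string: no surrounding slashes.
    # Class names go inside an outer bracket pair: [[:punct:]], not [:punct:].
    # The outer ( ... )+ merges adjacent separators such as ": " into one.
    FS = "([[:punct:][:blank:]]|[[:digit:]][[:digit:]][[:digit:]]+)+"
}
{
    for (i = 1; i <= NF; i++)
        if ($i != "")        # a trailing or leading separator yields an empty field
            freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
' sample.txt | LC_ALL=C sort > freq_list.txt
```

With this sample input, one1 should be counted twice, mp3 should survive as a single word, and abc12345def should be split into abc and def. Note that the separator alone does not enforce the word definition: a token such as a1b would still be counted as a word.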
Thank you!