How do I write my first Awk program?

by futurelet
An Awk program consists of pairs like this:
[tt]
test { actions }
[/tt]
Awk automatically reads the files listed on the command line and runs every test-action pair once for each line read. If the test succeeds (its result is neither zero nor the empty string), the actions are performed. Example:
Code:
/Harold/ { print $0 }
The test portion is [tt]/Harold/[/tt]. It means, "Is 'Harold' found in the line just read?" If the name is found, the action is performed. The line just read is referred to as [tt]$0[/tt]. So this program simply prints every line containing "Harold".


This program can be shortened to
Code:
/Harold/ { print }
When we tell Awk to print but don't tell it what to print, it prints [tt]$0[/tt]. Let's shorten the program even more.
Code:
/Harold/
Here we have omitted the [tt]{[/tt] actions [tt]}[/tt] portion entirely. When we do that, Awk assumes we want [tt]{ print $0 }[/tt]. Knowing this, we can write a very short program that prints every line of the file we are reading:
[tt]
1
[/tt]
Any number other than [tt]0[/tt] will do. A less cryptic way of printing every line would be
[tt]
{ print }
[/tt]
Here we have omitted the test portion, so Awk assumes we want the action to be performed for every line read.
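
One way to convince yourself that these forms behave as described is to run them on the same input. The sketch below assumes a Unix-like shell; the three sample lines and the variable names are made up for illustration.

```shell
# Hypothetical sample input; any text would do.
input=$(printf 'Harold called\nMaude wrote\nAsk Harold\n')

# Test only, no action: prints the lines that contain "Harold".
matched=$(printf '%s\n' "$input" | awk '/Harold/')

# A test that is always true, and an action with no test,
# both print every line.
all_by_test=$(printf '%s\n' "$input" | awk '1')
all_by_action=$(printf '%s\n' "$input" | awk '{ print }')

printf '%s\n' "$matched"
```

Here [tt]matched[/tt] holds the two lines that mention Harold, while the other two variables hold the whole input unchanged.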

Let's make our previous program skip any line that contains "bogus".
Code:
/bogus/ { print "Skipping invalid line."
          next }
/Harold/
The [tt]next[/tt] command tells Awk to skip the rest of the test-actions pairs and to read the next line from the input file immediately.

A shorter way of printing all lines with "Harold" but without "bogus":
Code:
/Harold/ && !/bogus/
The [tt]&&[/tt] means "and"; the [tt]![/tt] means "not". So Awk will print the line if it contains "Harold" and does not contain "bogus".
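
The two approaches differ only in what happens to the "bogus" lines: the [tt]next[/tt] version reports them, the compound test drops them silently. A small sketch, assuming a Unix-like shell and made-up sample lines:

```shell
# Hypothetical sample input.
input=$(printf 'Harold is here\nbogus Harold entry\nMaude is here\n')

# With "next": a bogus line is reported, then skipped before
# the /Harold/ test is ever tried.
with_next=$(printf '%s\n' "$input" | awk '
/bogus/ { print "Skipping invalid line."; next }
/Harold/')

# With a compound test: a bogus line simply fails the test.
compound=$(printf '%s\n' "$input" | awk '/Harold/ && !/bogus/')

printf '%s\n' "$with_next"
printf '%s\n' "$compound"
```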

Now you can see that [tt]/Harold/ && /genuine/[/tt] will display lines that have "Harold" and "genuine". But what if we want the line only if "genuine" follows "Harold"? In that case, we can use the power of "regular expressions".


Regular expressions

On the command line of your computer, you have probably typed something like [tt]data*.txt[/tt] to refer to all files whose names start with "data" and that have the extension ".txt". Regular expressions extend that capability even further.

The most important regular expression "wild card" characters (metacharacters) are:

[tt]/[/tt] Begins and ends a regular expression.
[tt].[/tt] Stands for any single character.
[tt]*[/tt] Matches zero or more of the preceding item.
[tt]+[/tt] Matches one or more of the preceding item.
[tt]?[/tt] Matches 0 or 1 of the preceding item (makes the item optional).
[tt][[/tt] Begins a character set (also called "character class").
[tt]][/tt] Ends a character set.
[tt]|[/tt] Means "or".
[tt]^[/tt] Matches the beginning of the string.
[tt]$[/tt] Matches the end of the string.
[tt]([/tt] Begins a group.
[tt])[/tt] Ends a group.

So the solution to our problem is
Code:
/Harold.*genuine/
The [tt].[/tt] matches any character; [tt]*[/tt] matches any number of the preceding item; together they match any sequence of characters. So all of these lines in the file will be displayed:
[tt]
Harold genuine
Haroldgenuine
Harold is genuine
Is Harold genuine?
Harold certainly isn't genuine.
[/tt]
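
To see that the order matters, compare the single regular expression with the earlier [tt]&&[/tt] test. This is a sketch assuming a Unix-like shell; the sample lines are invented.

```shell
# Hypothetical sample input: the second line has the words reversed.
input=$(printf 'Harold is genuine\ngenuine Harold\nHarold only\n')

# "genuine" must come somewhere after "Harold".
ordered=$(printf '%s\n' "$input" | awk '/Harold.*genuine/')

# Both words must be present, in any order.
either=$(printf '%s\n' "$input" | awk '/Harold/ && /genuine/')

printf '%s\n' "$ordered"
```

Only the first line ends up in [tt]ordered[/tt]; [tt]either[/tt] also contains "genuine Harold".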
What if we actually want to find "Harold" followed by a period and an asterisk? When we want to search for a literal metacharacter, we have to "escape" it, that is, put a backslash in front of it:
Code:
/Harold\.\*/
To match either "Harold" or "harold", use a character set:
Code:
/[Hh]arold/
The [tt][Hh][/tt] will match either "H" or "h". Ranges can be used in character sets. To display only lines that contain a numeral:
Code:
/[0-9]/
A [tt]^[/tt] at the start of a character set "negates" it. To show lines that contain at least one non-numeral:
Code:
/[^0-9]/
To show lines that consist entirely of numerals:
Code:
/^[0-9]+$/
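
The character-set tests above can be tried side by side. The following sketch assumes a Unix-like shell and uses made-up input lines.

```shell
# Hypothetical sample input.
input=$(printf '12345\nabc\n12a45\nharold 7\n')

# At least one numeral somewhere in the line.
has_digit=$(printf '%s\n' "$input" | awk '/[0-9]/')

# At least one character that is not a numeral.
has_nondigit=$(printf '%s\n' "$input" | awk '/[^0-9]/')

# Nothing but numerals from start (^) to end ($).
all_digits=$(printf '%s\n' "$input" | awk '/^[0-9]+$/')

# Either capitalization of the name.
either_case=$(printf '%s\n' "$input" | awk '/[Hh]arold/')

printf '%s\n' "$all_digits"
```

Note that "12a45" passes both of the first two tests, since it contains a numeral and a non-numeral.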

Running your program

There are two ways to run an Awk program. You can save it in a file and then type something like this on the command line:
[tt]
awk -f myprog.awk infile.txt >outfile.txt
[/tt]
The [tt]>outfile.txt[/tt] makes the output go to a file instead of the screen.

If the program is short, it can be typed on the command line. Here's one that prints the line number of every line that has "foo" or a numeral.
For Unix:
[tt]
awk '/foo/ || /[0-9]/ {print "Line " NR}' infile.txt
[/tt]
For DOS:
[tt]
awk "/foo/ || /[0-9]/ {print \"Line \" NR}" infile.txt
[/tt]
[tt]NR[/tt] is a built-in variable that keeps track of how many records (lines) have been read. This tiny program illustrates how strings are concatenated (joined together) in Awk. By simply putting [tt]NR[/tt] after the string [tt]"Line "[/tt] we make Awk convert the number to a string and splice the two strings together before executing the [tt]print[/tt] command.
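
Concatenation is easy to confuse with the comma in a [tt]print[/tt] statement, which separates the items with a space by default. A short sketch of the difference, assuming a Unix-like shell and two made-up input lines:

```shell
# Adjacent values are concatenated with nothing in between.
joined=$(printf 'a\nb\n' | awk '{ print "Line" NR }')

# A comma makes print put a space between the items.
spaced=$(printf 'a\nb\n' | awk '{ print "Line", NR }')

printf '%s\n' "$joined"
printf '%s\n' "$spaced"
```

The first command prints "Line1" and "Line2"; the second prints "Line 1" and "Line 2".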


Fields

Earlier it was mentioned that the variable [tt]$0[/tt] holds the line just read. The variables [tt]$1[/tt], [tt]$2[/tt], [tt]$3[/tt], etc., are the fields into which Awk automatically splits [tt]$0[/tt]. Unless you change the variable [tt]FS[/tt], Awk uses whitespace (spaces and tabs) as the separator between fields. [tt]NF[/tt] holds the number of fields, so the last field can be obtained with [tt]$NF[/tt]. If the program is
[tt]
{ print $1 "-" $NF }
[/tt]
and the input file is
[tt]
Willy isn't nilly
Stop the growing gap
Good bye
[/tt]
the output will be
[tt]
Willy-nilly
Stop-gap
Good-bye
[/tt]
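
When the fields are not separated by whitespace, [tt]FS[/tt] can be set in a [tt]BEGIN[/tt] action or with the [tt]-F[/tt] command-line option. The sketch below assumes a Unix-like shell and uses invented colon-separated records.

```shell
# Set the field separator with the -F option.
with_option=$(printf 'root:x:0\nguest:x:405\n' |
  awk -F: '{ print $1 "-" $NF }')

# Set it by assigning FS before any line is read.
with_begin=$(printf 'root:x:0\nguest:x:405\n' |
  awk 'BEGIN { FS = ":" } { print $1 "-" $NF }')

printf '%s\n' "$with_option"
```

Both commands print "root-0" and "guest-405".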

Here's a longer program:
[tt]
BEGIN { print "Adding columns 1, 2, and 3." }
{ sum1 += $1; sum2 += $2
sum3 += $3
}
END { print sum1, sum2, sum3 }
[/tt]
[tt]BEGIN[/tt] is a special test that is used to perform actions before any lines are read. [tt]END[/tt] is used to designate actions that will be done after all lines have been read. Statements can be separated by [tt];[/tt] or by putting them on separate lines. [tt]sum1 += $1[/tt] is equivalent to [tt]sum1 = sum1 + $1[/tt].
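
Here is the same program run from a Unix-like shell on two made-up lines of numbers, as a quick check of the order in which the three parts execute.

```shell
# Hypothetical input: two records of three numbers each.
report=$(printf '1 2 3\n4 5 6\n' | awk '
BEGIN { print "Adding columns 1, 2, and 3." }
{ sum1 += $1; sum2 += $2; sum3 += $3 }
END { print sum1, sum2, sum3 }')

printf '%s\n' "$report"
```

The heading comes out first, then the totals "5 7 9" once the input is exhausted. The sums need no initialization because an unset Awk variable counts as zero in arithmetic.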


Looping with "for"

Loops are statements, so they belong inside an action, for example [tt]BEGIN { ... }[/tt]. To print the even integers 2 through 10:
Code:
for (i=2; i<=10; i+=2)
  print i
To print the integers 1 through 5 and flag the odd ones:
Code:
for (i=1; i<6; i++)
{ print i
  if ( i % 2 )
    print "  odd"
}
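
To run these loops on their own, they can be wrapped in a [tt]BEGIN[/tt] action, in which case Awk does not need an input file. A sketch assuming a Unix-like shell:

```shell
# Even integers 2 through 10.
evens=$(awk 'BEGIN { for (i = 2; i <= 10; i += 2) print i }')

# Integers 1 through 5, with the odd ones flagged.
flagged=$(awk 'BEGIN {
  for (i = 1; i < 6; i++) {
    print i
    if (i % 2)        # remainder 1 counts as true
      print "  odd"
  }
}')

printf '%s\n' "$evens"
printf '%s\n' "$flagged"
```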


Arrays

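Arrays in Awk are "associative"; an array's indexes are strings. A numeric index is converted to a string, so [tt]info[1][/tt] and [tt]info["1"][/tt] refer to the same entry. In the code below, the output is bold.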
Code:
info["Tom"] = "A workaholic."
info[1] = "Indexed by '1'."
print info["1"]
[b]Indexed by '1'.[/b]
info["year"] = 2005
for (i in info)
  print i "-->" info[i]
[b]Tom-->A workaholic.
year-->2005
1-->Indexed by '1'.[/b]
The order in which the entries are produced by [tt]for (i in info)[/tt] will not necessarily be the order in which the entries were added.
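
A common use of associative arrays is counting how often each word occurs. The sketch below assumes a Unix-like shell with the [tt]sort[/tt] utility; the input words and the array name [tt]count[/tt] are made up. Because the order of [tt]for (name in count)[/tt] is unspecified, the output is piped through [tt]sort[/tt] to make it predictable.

```shell
counts=$(printf 'tom ann tom\nann tom\n' | awk '
# Use every field of every line as an array index.
{ for (i = 1; i <= NF; i++) count[$i]++ }
# After the last line, print each word with its tally.
END { for (name in count) print name, count[name] }' | sort)

printf '%s\n' "$counts"
```

This prints "ann 2" and "tom 3".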


Additional techniques:
[tt]
BEGIN {
s = "abc;def;ghi"
# Print string starting at 2nd character.
print substr( s, 2 )
# Print 3rd character.
print substr( s, 3, 1 )
# Make an array from the string, splitting at ";".
split( s, array, /;/ )
# Print the members of the array.
for (i=1; i in array; i++)
  print array[i]
# Print the location of "gh" in s.
print index( s, "gh" )
# Print the uppercased string.
print toupper( s )
}

# Now we're reading the input file.
# If this is the first line (in Awk-speak,
# lines are called "records"), print it.
1==NR

# If there's more than one field in the
# line, print sum of first 2 fields,
# padding with blanks for a width of 9.
NF > 1 { printf "%9g\n", $1+$2 }
[/tt]
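
A few of these techniques are shown below with their results, as a sketch assuming a Unix-like shell. It uses the value that [tt]split()[/tt] returns (the number of pieces) as the loop limit, and the string [tt]";"[/tt] as the separator instead of a regular expression; the input line "3 4" is made up.

```shell
pieces=$(awk 'BEGIN {
  s = "abc;def;ghi"
  # split() returns how many pieces it stored in the array.
  n = split(s, array, ";")
  for (i = 1; i <= n; i++)
    print i, array[i]
  # Position of "gh" in s, counting from 1.
  print index(s, "gh")
}')

# Sum of the first two fields, right-aligned in 9 columns.
padded=$(printf '3 4\n' | awk 'NF > 1 { printf "%9g\n", $1 + $2 }')

printf '%s\n' "$pieces"
printf '%s\n' "$padded"
```

The first command prints the three numbered pieces followed by 9, the position of "gh"; the second prints 7 preceded by eight blanks.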

A useful function

The built-in function [tt]split()[/tt] produces an array of the non-matching parts of the string. Here's a function that makes an array containing both the matching and the non-matching parts, in this order:
<nonmatching><matching><nonmatching>...<matching><nonmatching>
Code:
# Produces array of nonmatching and matching substrings.
# The size of the array will always be an odd number.
# The first and the last item will always be nonmatching.
function shatter( s, array, re )
{ gsub( re, "\1&\1", s  )
  return split( s, array, "\1" )
}
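
Here is one way the function might be called, as a sketch assuming a Unix-like shell. The test string, the array name [tt]piece[/tt], and the pattern are made up for illustration. The function relies on the character "\1" (octal 1) not occurring in the string being shattered.

```shell
shattered=$(awk '
function shatter(s, array, re) {
  gsub(re, "\1&\1", s)             # wrap every match in \1 markers
  return split(s, array, "\1")     # then split at the markers
}
BEGIN {
  n = shatter("ab12cd345", piece, "[0-9]+")
  for (i = 1; i <= n; i++)
    print i, "[" piece[i] "]"
}')

printf '%s\n' "$shattered"
```

The five pieces are "ab", "12", "cd", "345", and an empty string, because the input ends with a match and the last item is always a nonmatching part.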
