Awk

From LQWiki

awk is a tool for processing text line by line. For example, it can print only the second column of a file. awk is not a simple command but a small programming language in its own right. gawk is the GNU implementation of awk; on many Linux systems the awk command runs gawk, so the two names are often used interchangeably.

Usage

When awk is run, it is given two inputs: the program and the data. The program can be typed directly on the command line or stored in a file and loaded with the -f option. The data comes from files listed on the command line, or from stdin if none are listed. In the first example below, the script is on the command line and the input comes from a file. In the second, an external program creates the input data and pipes it into awk, which reads its program from an external script.

$ awk '{ print $1; }' datafile
$ makedata | awk -f myscript.awk

awk scripts that are saved in files can be executed directly by placing the proper shebang line at the beginning and making the file executable (chmod +x):

#!/bin/awk -f

Important note: use the exact path of your awk (shown by typing "which awk") if it is not /bin/awk; on many systems it is /usr/bin/awk.
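As a minimal sketch of the shebang approach, the following writes a one-rule script, marks it executable, and runs it directly. The file name first_col.awk is hypothetical, and the interpreter path is taken from `command -v awk` instead of being assumed.

```shell
# Write a one-rule awk script; the shebang path comes from `command -v awk`.
printf '#!%s -f\n{ print $1 }\n' "$(command -v awk)" > first_col.awk
chmod +x first_col.awk    # the file must be executable to be run directly

# Feed two lines on stdin and keep what the script prints.
first_cols=$(printf 'alpha 1\nbeta 2\n' | ./first_col.awk)
echo "$first_cols"
```

The script behaves like any other command: it reads stdin or the files named as its arguments.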

How to

How to merge two files skipping some columns

cat >file_1<<eof
O11 0.081
O12 0.341
O13 0.343
eof
cat >file_2<<eof
O11 0.105
O12 0.415
O13 0.327
eof
paste file_1 file_2 | awk '{print $1" "$2" "$4}'
O11 0.081 0.105
O12 0.341 0.415
O13 0.343 0.327

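The last three lines above are the output. paste joins file_1 and file_2 line by line, giving four columns; the awk program prints columns 1, 2 and 4, skipping column 3, which only repeats the key from column 1.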
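The paste approach relies on both files listing their rows in the same order. When that cannot be guaranteed, one alternative (a different technique, shown here only as a sketch) is to let awk join the files on the key in column 1. The array name first is an arbitrary choice for this illustration.

```shell
# Recreate the two sample files from the example above.
printf 'O11 0.081\nO12 0.341\nO13 0.343\n' > file_1
printf 'O11 0.105\nO12 0.415\nO13 0.327\n' > file_2

# While reading the first file (NR == FNR), remember column 2 under its key.
# For the second file, print the key, the remembered value, and the new value.
merged=$(awk 'NR == FNR { first[$1] = $2; next }
              $1 in first { print $1, first[$1], $2 }' file_1 file_2)
echo "$merged"
```

NR counts lines across all input files while FNR restarts for each file, so the two are equal only while the first file is being read.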

Language structure

An awk program consists of a series of rules, each made up of a pattern and an action. Awk reads the input (whether files or data piped from stdin) line by line automatically. For each line of data, if the pattern is true, the action is executed. There are a few special patterns. The BEGIN rule is executed first, before any input is read, and the END rule is executed last, after the end of all input. Some complicated awk scripts consist of only a BEGIN rule and use getline to read the input data. If the pattern is omitted, the action is executed for every line. If the action is omitted, awk prints the line.
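A small sketch of the three kinds of rule described above, run on three lines of made-up input; the variable name count is arbitrary.

```shell
summary=$(printf 'a\nb\nc\n' | awk '
  BEGIN { print "start" }          # runs once, before any input is read
        { count++ }                # no pattern: runs for every input line
  END   { print count " lines" }   # runs once, after the last line
')
echo "$summary"
```

awk variables start out empty (treated as 0 in arithmetic), so count needs no initialisation in the BEGIN rule.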

The pattern can be a regular expression enclosed in slashes ('/'), in which case it is true if the input line contains text matching the expression. The expression /^[^#]/ selects all non-empty lines that do not begin with a pound sign (an empty line has no first character, so it is not selected either). The pattern can also be an awk expression, e.g. (NF>5) to select all lines with more than 5 words.
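The two kinds of pattern can be tried on a few lines of invented sample text. Neither rule below has an action, so awk simply prints the lines that are selected.

```shell
sample='# a comment
one two three
a b c d e f'

# Regular-expression pattern: lines whose first character is not "#".
not_comments=$(printf '%s\n' "$sample" | awk '/^[^#]/')

# Expression pattern: lines with more than 5 words.
long_lines=$(printf '%s\n' "$sample" | awk 'NF > 5')

echo "$not_comments"
echo "$long_lines"
```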

Whenever a line of input is read (whether automatically or with getline [1]), the line is split into words. The first word is assigned to $1, the second $2, etc. This makes it easy for awk to deal with columns of data. The variable NF is set to the number of words. $ is an awk operator, so the "number" can be the result of any expression. $NF is the last word on the line.
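As a brief illustration of field access on a made-up line, the following prints the number of words, the first word, the last word, and the next-to-last word. Because $ is an operator, $(NF-1) is evaluated as an expression.

```shell
fields=$(echo 'alpha beta gamma delta' | awk '{ print NF, $1, $NF, $(NF-1) }')
echo "$fields"
```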

Truly remarkable power in awk comes from the use of dynamic (associative) arrays, especially when combined with regular expressions. They allow complex queries across many files, with collection and collation of results, as shown in the following example for the query "what is the first word of each line, and how often does each one occur?"

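This example shows several powerful features (it needs gawk, because asorti() and the three-argument form of match() are gawk extensions):

  • selects all lines not starting with #
  • extracts the first word of a matching line with match()
  • uses the word, converted to upper case, as index into the wordcounts array
  • END clause, summary processing when all input is done
  • sort indices using asorti() and output counts

The script:

/^[^#]/{
  # match() returns the position of the match (0 if none);
  # thisline[1] holds the text matched by the parenthesized group
  if (match($0,/([a-zA-Z0-9_$]+)/,thisline)) {
    wordcounts[toupper(thisline[1])]++;
  }
}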
END {
  n = asorti(wordcounts, words);
  for (i = 1; i <= n; i++) {
    printf("%14s - %4d\n",words[i],wordcounts[words[i]]);
  }
}

If you save the above example as a file, in this case words.awk, then scanning a group of files can be as easy as:

  awk -f words.awk *.txt
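Where gawk is not available, a similar count can be sketched with only standard awk features by leaving the sorting to the external sort command. This variant is an approximation, not the same script: it takes the first whitespace-separated field ($1) as the word instead of using a regular-expression match, and the sample input is invented.

```shell
counts=$(printf 'foo 1\n# skipped\nbar 2\nFoo 3\n' | awk '
  /^[^#]/ { wordcounts[toupper($1)]++ }                    # count the first field
  END     { for (w in wordcounts) print w, wordcounts[w] } # order is unspecified
' | sort)
echo "$counts"
```

The for (w in wordcounts) loop visits indices in no particular order, which is why the output is piped through sort.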

From here you can add more complex regex criteria, use printf() for debugging, collect several arrays of results, or use split() for further parsing. These and many more features make awk one of the most powerful scripting tools.
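As a small sketch of split() used for further parsing, the following breaks each word of an invented key=value line into its two parts. The array name kv and the sample text are illustrative only.

```shell
parts=$(echo 'host=example.org port=8080' | awk '{
  for (i = 1; i <= NF; i++) {
    n = split($i, kv, "=")       # kv[1] is the key, kv[2] the value
    if (n == 2) print kv[1] " -> " kv[2]
  }
}')
echo "$parts"
```

split() returns the number of pieces it produced, which makes it easy to skip words that do not have the expected shape.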

For a complete description of the language, see the GNU awk manual [2].

GNU Awk extensions

Things to be careful about when using a gawk script in a non-GNU awk include:

  • Special files like /dev/stderr, useful for printing error messages.
  • The systime() and strftime() functions.
  • The nextfile statement.
  • delete array (with no subscript) to delete an entire array.
  • The gensub() function.
  • Bidirectional pipes to coprocesses.

This list is not comprehensive; the gawk manual (below) has more info.
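For the first item, a commonly used workaround that does not rely on the /dev/stderr special file is to pipe the message to a shell command that redirects it. The sketch below uses invented message text and discards stderr at the end only so that the captured value shows what went to stdout.

```shell
out=$(awk 'BEGIN {
  print "normal output"
  print "warning: something odd" | "cat 1>&2"   # routed to stderr via the shell
}' 2>/dev/null)
echo "$out"
```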

See also

External links