Awk

awk is a command-line tool for processing text, particularly text organized in columns. For example, it can print only the second column of a file. awk is not a simple command but a programming language in its own right. The names awk and gawk (GNU awk) are often used interchangeably, although gawk adds extensions of its own.

Usage
When awk is run, it is given two forms of input: the program and the data. The program can be typed directly on the command line or stored in a file and accessed with the -f option. The data comes from files listed on the command line, or from stdin if none are listed. The first example below gives the program on the command line and reads its input from a file. The second uses an external program to create the input data and pipes it into awk, which reads its program from an external script file.

$ awk '{ print $1; }' datafile

$ makedata | awk -f myscript.awk

awk scripts that are saved in files can be executed directly by placing the proper shebang line at the beginning:

#!/bin/awk -f

Important note: use the exact path of your awk (available from typing "which awk") if it is not /bin/awk.
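As a minimal sketch of the shebang mechanism, the following writes a one-rule script, takes the interpreter path for the shebang line from `command -v awk`, makes the file executable, and runs it directly. The file name firstcol.awk and the sample input are hypothetical, and running the script directly assumes the scratch directory allows execution.

```shell
# Work in a scratch directory; firstcol.awk is a hypothetical script name.
tmpdir=$(mktemp -d)
awk_path=$(command -v awk)

# Write the script: the shebang line first, then the awk program itself.
printf '#!%s -f\n{ print $1 }\n' "$awk_path" > "$tmpdir/firstcol.awk"
chmod +x "$tmpdir/firstcol.awk"

# Run the script directly; it prints the first word of each input line.
first_words=$(printf 'alpha 1\nbeta 2\n' | "$tmpdir/firstcol.awk")
printf '%s\n' "$first_words"
```

Because awk treats the shebang line as a comment, the same file still works when passed explicitly with awk -f.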

How to merge two files skipping some columns
$ cat >file_1 <<eof
O11 0.081
O12 0.341
O13 0.343
eof

$ cat >file_2 <<eof
O11 0.105
O12 0.415
O13 0.327
eof

$ paste file_1 file_2 | awk '{print $1" "$2" "$4}'
O11 0.081 0.105
O12 0.341 0.415
O13 0.343 0.327

This pastes file_1 and file_2 side by side and drops column 3 (the label repeated from file_2) from the result.
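The same merge can also be sketched in awk alone, without paste. This is one possible approach rather than the method used above: it keys on the first column, so it does not depend on both files listing their rows in the same order. The sample values are taken from the output shown above, and the array name first is arbitrary.

```shell
tmpdir=$(mktemp -d)
printf 'O11 0.081\nO12 0.341\nO13 0.343\n' > "$tmpdir/file_1"
printf 'O11 0.105\nO12 0.415\nO13 0.327\n' > "$tmpdir/file_2"

# NR == FNR is true only while the first file is being read:
# remember its value under the row label, then skip to the next line.
# For the second file, print the label, the remembered value, and the new value.
merged=$(awk 'NR == FNR { first[$1] = $2; next }
              { print $1, first[$1], $2 }' "$tmpdir/file_1" "$tmpdir/file_2")
printf '%s\n' "$merged"
```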

Language structure
An awk program consists of a series of rules, each consisting of a pattern and an action. Awk reads the input (whether files or data piped from stdin) line by line automatically. For each line of data, if the pattern is true, the action is executed. There are a few special patterns. The BEGIN rule is executed first, before any input is read, and the END rule is executed last, after the end of all input. Some complicated awk scripts consist of only a BEGIN rule and use getline to read the input data. If the pattern is omitted, the action is executed for every line. If the action is omitted, awk prints the line.
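A small sketch of these rule types, using made-up input lines: a BEGIN rule, a rule with both pattern and action, an END rule, and finally a rule that has only a pattern.

```shell
# BEGIN runs before any input, the middle rule runs for lines whose first
# field is greater than 1, and END runs after the last line has been read.
report=$(printf '1 apple\n2 pear\n3 plum\n' | awk '
    BEGIN  { print "start" }
    $1 > 1 { print "matched:", $2; count++ }
    END    { print "lines matched:", count }
')
printf '%s\n' "$report"

# A pattern with no action prints every line that matches it.
only_pattern=$(printf '1 apple\n2 pear\n' | awk '/pear/')
printf '%s\n' "$only_pattern"
```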

The pattern can be a regular expression enclosed in slashes ('/'), in which case it is considered true if the input line matches the pattern (i.e. contains matching text). The expression /^[^#]/ selects all non-empty lines that do not begin with a pound sign. The pattern can also be an awk expression, e.g. (NF>5) to select all lines with more than 5 words.
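To illustrate both kinds of pattern, the sketch below applies each one to a small sample file. The file name sample.txt and its contents are invented for the example.

```shell
tmpdir=$(mktemp -d)
printf '# a comment line\nshort line\none two three four five six\n' > "$tmpdir/sample.txt"

# Regular-expression pattern: lines whose first character is not "#".
not_comments=$(awk '/^[^#]/' "$tmpdir/sample.txt")
printf '%s\n' "$not_comments"

# Expression pattern: lines with more than 5 words.
long_lines=$(awk 'NF > 5' "$tmpdir/sample.txt")
printf '%s\n' "$long_lines"
```

Neither rule has an action, so awk prints each selected line unchanged.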

Whenever a line of input is read (whether automatically or with getline), the line is split into words. The first word is assigned to $1, the second to $2, etc. This makes it easy for awk to deal with columns of data. The variable NF is set to the number of words. $ is an awk operator, so the "number" can be the result of any expression. $NF is the last word on the line.
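The field variables can be seen at work in a short sketch like the following, which runs on a single made-up input line.

```shell
fields=$(printf 'alpha beta gamma delta\n' | awk '{
    print $1          # first word
    print NF          # number of words on the line
    print $NF         # last word
    print $(NF - 1)   # $ applied to an expression: second-to-last word
}')
printf '%s\n' "$fields"
```

For this input the four print statements produce alpha, 4, delta and gamma, one per line.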

Truly remarkable power in awk comes from the use of dynamic arrays, especially when combined with regular expressions. This allows complex queries across many files, with collection and collation of results, as shown in the following example for the query "what are the first words of all lines, and how often do they occur?"

This example shows several powerful features:
 * selects all lines not starting with #
 * extracts the first word of each matching line with match (the three-argument form is a gawk extension)
 * uses that word, converted to upper case, as an index into the wordcounts array
 * END clause, summary processing when all input is done
 * sorts the indices using asorti (also a gawk extension) and outputs the counts

/^[^#]/ {
    # thisline[1] receives the first word matched on the line
    if (match($0, /([a-zA-Z0-9_$]+)/, thisline))
        wordcounts[toupper(thisline[1])]++
}
END {
    # asorti stores the sorted indices of wordcounts in words
    n = asorti(wordcounts, words)
    for (i = 1; i <= n; i++)
        printf("%14s - %4d\n", words[i], wordcounts[words[i]])
}

If you save the above example as a file, in this case words.awk, then scanning a group of files can be as easy as:

awk -f words.awk *.txt

You can add more complex regex criteria, use printf for debugging, collect results in several different arrays, or use split for further parsing. These and many more features make awk one of the most powerful scripting tools.
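As a brief sketch of the split function mentioned above, the following breaks a date-like field into its parts and reports them with printf. The input line and the array name ymd are invented for the example.

```shell
# split() breaks a string on a separator, fills the array, and returns the number of pieces.
parts=$(printf '2024-01-15 ok\n' | awk '{
    n = split($1, ymd, "-")
    printf("%d parts: year=%s month=%s day=%s\n", n, ymd[1], ymd[2], ymd[3])
}')
printf '%s\n' "$parts"
```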

For a complete description of the language, see the GNU awk manual.

GNU Awk extensions
When running a gawk script under a non-GNU awk, be careful with the following features, which may not be available. This list is not comprehensive; the gawk manual (below) has more info.
 * Special files like /dev/stderr, useful for printing error messages.
 * The systime and strftime functions.
 * The nextfile statement.
 * delete ARRAY to delete an entire array.
 * The gensub function.
 * Bidirectional pipes to coprocesses.
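Where a script must also run under other awk implementations, some of these features have commonly used portable stand-ins. The sketch below shows two of them, relying only on features believed to be available in POSIX awk: writing an error message through a pipe to a shell command instead of /dev/stderr, and clearing an array with split instead of deleting the whole array. The variable names, the message text, and the err.txt file are illustrative assumptions.

```shell
tmpdir=$(mktemp -d)

# Standard error is redirected to a file here so the warning can be inspected afterwards.
out=$(printf 'a\nb\n' | awk '
    { seen[$1] = 1 }
    END {
        # Instead of printing to /dev/stderr, pipe the message to a shell command.
        print "warning: example message" | "cat 1>&2"
        close("cat 1>&2")

        # Instead of "delete seen", split an empty string into the array to clear it.
        split("", seen)
        n = 0
        for (k in seen) n++
        print "entries left:", n
    }
' 2> "$tmpdir/err.txt")
printf '%s\n' "$out"
```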