Awk

2009-08-03T19:35:36Z

Chaosless: /* Language structure */

[[awk]] is a [[command]] for text processing. For example, it can print only the second ''column'' of a file. awk is not just a simple command but a programming language in its own right. The names awk and gawk (GNU awk) are often used interchangeably, although gawk adds extensions that other awk implementations lack.
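As a minimal sketch of the column example (the sample input is made up for illustration), the following prints only the second whitespace-separated column of each line:

```shell
# Hypothetical data: three whitespace-separated columns per line.
printf 'alice 30 london\nbob 25 paris\n' | awk '{ print $2 }'
# prints:
# 30
# 25
```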

== Usage ==

When <tt>awk</tt> is run, it is given two forms of input, the ''program'' and the ''data''. The program can be typed directly on the [[command line]] or stored in a file and accessed with the <tt>-f</tt> option. The data comes from [[file]]s listed on the command line or from [[stdin]] if none are listed. In the first example below, the script is on the command line and the input comes from a file. In the second, an external program creates the input data and pipes it into <tt>awk</tt>, which reads its program from an external script.

 $ awk '{ print $1; }' ''datafile''

 $ ''makedata'' | awk -f ''myscript.awk''

<tt>awk</tt> scripts saved in files can be executed directly by placing the proper [[shebang]] line at the beginning and making the file executable:
 #!/bin/awk -f
'''Important note:''' if your awk is not installed as <tt>/bin/awk</tt>, use its actual path instead (shown by typing "<tt>[[which]] awk</tt>"). On many systems it is <tt>/usr/bin/awk</tt>.

== Language structure ==

An awk program consists of a series of rules, each consisting of a ''pattern'' and an ''action''. Awk automatically reads the input (whether [[file]]s or data [[pipe]]d from [[stdin]]) line by line. For each line of data, if the ''pattern'' is true, the ''action'' is executed. There are a few special patterns: the <tt>BEGIN</tt> rule is executed first, before any input is read, and the <tt>END</tt> rule is executed last, after the end of all input. Some complicated awk scripts consist of only a <tt>BEGIN</tt> rule and use <tt>getline</tt> to read the input data. If the ''pattern'' is omitted, the ''action'' is executed for every line. If the ''action'' is omitted, awk prints the line.
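The pattern/action structure can be sketched as follows. The input lines and the variable name <tt>hits</tt> are made up for illustration:

```shell
# Hypothetical input: three lines, one of them containing "error".
printf 'start ok\ndisk error\nend ok\n' | awk '
BEGIN   { print "scanning" }              # runs once, before any input is read
/error/ { hits++ }                        # action runs only on lines matching the pattern
/^end/                                    # pattern without an action: the line is printed
END     { print hits " error line(s)" }   # runs once, after all input
'
# prints:
# scanning
# end ok
# 1 error line(s)
```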

The pattern can be a [[regular expression]] enclosed in slashes ('/'), in which case it is considered true if the input line matches the pattern (i.e. contains matching text). The expression <tt>/^[^#]/</tt> selects all non-empty lines that do not begin with a pound sign; empty lines are skipped too, because the bracket expression must match one character. The pattern can also be an awk expression, e.g. <tt>(NF>5)</tt> to select all lines with more than 5 words.
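A small sketch of both kinds of pattern, using made-up input that contains a comment line, an empty line, a short line and a long line (the variable name <tt>kept</tt> is only illustrative):

```shell
printf '# comment\n\nshort line\na b c d e f\n' | awk '
/^[^#]/ { kept++ }                 # regular expression pattern
NF > 5  { print "long: " $0 }      # expression pattern: more than 5 words
END     { print kept " line(s) kept" }
'
# prints:
# long: a b c d e f
# 2 line(s) kept
```

Only "short line" and "a b c d e f" are counted: the comment line starts with <tt>#</tt>, and the empty line has no first character for <tt>[^#]</tt> to match.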

Whenever a line of input is read (whether automatically or with <tt>getline</tt> [http://www.gnu.org/software/gawk/manual/html_node/Getline.html]), the line is split into words (called ''fields''), separated by whitespace by default. The first word is assigned to <tt>$1</tt>, the second to <tt>$2</tt>, etc., while <tt>$0</tt> holds the whole line. This makes it easy for awk to deal with columns of data. The variable <tt>NF</tt> is set to the number of words. <tt>$</tt> is an awk operator, so the "number" can be the result of any expression; for example, <tt>$NF</tt> is the last word on the line.
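The following sketch, with made-up input, shows <tt>NF</tt> and the <tt>$</tt> operator applied to expressions. The second command is a <tt>BEGIN</tt>-only variant that reads the same input with plain <tt>getline</tt>, which splits each line in the same way:

```shell
# Hypothetical input with a different number of words on each line.
printf 'a b c\nd e\n' | awk '{ print NF, $1, $NF, $(NF-1) }'
# prints:
# 3 a c b
# 2 d e d

# BEGIN-only variant: getline returns 1 while there is input, 0 at end of input.
printf 'a b c\nd e\n' | awk 'BEGIN { while ((getline) > 0) print $NF }'
# prints:
# c
# e
```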

Much of awk's power comes from its dynamic arrays, especially when combined with regular expressions. They allow complex queries across many files, with collection and collation of results, as in the following example, which answers the query "what is the first word of each line, and how often does each one occur?"

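This example demonstrates several powerful features:
<ul>
<li>selects all non-empty lines not starting with <tt>#</tt></li>
<li>extracts the first word of each matching line with <tt>match()</tt></li>
<li>uses that word, converted to upper case, as an index into the <tt>wordcounts</tt> array</li>
<li>uses an <tt>END</tt> rule for summary processing once all input has been read</li>
<li>sorts the indices with <tt>asorti()</tt> and prints the counts</li>
</ul>

The script (which requires <tt>gawk</tt>, because <tt>asorti()</tt> and the three-argument form of <tt>match()</tt> are GNU extensions):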
<pre>
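# Requires gawk: three-argument match() and asorti() are GNU extensions.
/^[^#]/ {
    # match() returns the position of the first match (0 if there is none),
    # not a count; thisline[0] receives the matched text, i.e. the first word.
    if (match($0, /[a-zA-Z0-9_$]+/, thisline)) {
        wordcounts[toupper(thisline[0])]++
    }
}
END {
    # Sort the indices (the words) into words[1..n], then print each count.
    n = asorti(wordcounts, words)
    for (i = 1; i <= n; i++) {
        printf("%14s - %4d\n", words[i], wordcounts[words[i]])
    }
}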
</pre>

If you save the above example as a file, in this case <tt>words.awk</tt>, then scanning a group of files can be as easy as:

<pre>
awk -f words.awk *.txt
</pre>

From here you can add more complex regular expression criteria, use <tt>printf()</tt> for debugging, collect results in several different arrays, or use <tt>split()</tt> for further parsing. These and many other features make <tt>awk</tt> one of the most powerful scripting tools.
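As a brief sketch of <tt>split()</tt> and <tt>printf()</tt>, assume a made-up record whose second word is a date to be broken up further on "-" (the array name <tt>part</tt> is only illustrative):

```shell
echo 'backup 2009-08-03 ok' | awk '{
    n = split($2, part, "-")          # returns the number of pieces, here 3
    printf("debug: %d parts\n", n)    # quick debugging output
    printf("year=%s month=%s day=%s\n", part[1], part[2], part[3])
}'
# prints:
# debug: 3 parts
# year=2009 month=08 day=03
```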

For a complete description of the language, see the GNU awk manual [http://www.gnu.org/software/gawk/manual/html_node/index.html].

== GNU Awk extensions ==

Things to be careful about when using a <tt>gawk</tt> script in a non-GNU awk include:
* Special files like <tt>/dev/stderr</tt>, useful for printing error messages.
* The <tt>systime()</tt> and <tt>strftime()</tt> functions.
* The <tt>nextfile</tt> statement.
* <tt>delete ''array''</tt> to delete an entire array.
* The <tt>gensub()</tt> function.
* Bidirectional pipes to coprocesses.
This list is not comprehensive; the gawk manual (below) has more info.
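Where portability matters, some of these extensions have common workarounds. The following sketch, with made-up input, is meant to use only features found in POSIX awk: an external <tt>sort</tt> in place of <tt>asorti()</tt>, <tt>split("", array)</tt> to empty an array, and a pipe to <tt>cat 1>&2</tt> to write to standard error. The names <tt>count</tt>, <tt>w</tt> and <tt>n</tt> are only illustrative.

```shell
printf 'b\na\nb\n' | awk '
{ count[$1]++ }
END {
    for (w in count) print w, count[w] | "sort"    # external sort instead of asorti()
    close("sort")
    split("", count)                               # portable way to empty an array
    n = 0
    for (w in count) n++
    print "left after clearing: " n | "cat 1>&2"   # standard error without /dev/stderr
}'
# standard output:
# a 1
# b 2
# standard error:
# left after clearing: 0
```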

== See also ==
* [[Programming#Scripting languages|Scripting languages section]]

== External links ==

* [http://www.gnu.org/software/gawk/manual/html_node/index.html gawk manual] - Excellent reference, especially the sections on Reading Files, Expressions, and Functions.
* [http://man-wiki.net/index.php/Awk gawk man page]
* [http://www.tek-tips.com/threadminder.cfm?pid=271 awk forum] - when you have questions
* [http://sparky.rice.edu/~hartigan/awk.html How to Use AWK] - quick intro
* [http://www.cs.uu.nl/docs/vakken/st/nawk/nawk_toc.html awk manual] - shorter than the gawk manual above
