Sort
The sort command sorts lines of text. It accepts a file or list of files to sort, and if none or '-' is given, it reads from stdin.
Example usage
$ echo -e "b\nc\na" |sort a b c
Some of the most useful options:
-f ignores case
-k sorts on a key (whitespace separated text-block)
-n sorts numbers
-r reverses sort
-u unique lines - sometimes makes a pipe to uniq unnecessary
$ cat sortdemo foo 100 foo 95 bar 12 Foo 200 baz 156 Foo 100
$ sort -fru sortdemo foo 95 Foo 200 foo 100 baz 156 bar 12
$ sort -nk2 sortdemo bar 12 foo 95 Foo 100 foo 100 baz 156 Foo 200
Internals and Tuning
Ordinarily, sorting is fast enough that you do not need to consider how its done. However, if the data to be sorted is large enough, it may be beneficial to use command-line parameters to tune its operation. To do this effectively, you may need to consider the internal workings of the program. Since there are multiple versions of the program, it is important to know which one you are considering.
Generally, the TMPDIR environment variable overrides the default /tmp directory for storage of temporary files which sort generally creates. Necessary if your data is larger than /tmp, and really useful to make /tmp on a different drive from either the input or the output.
GNU sort
GNU sort uses a pure merge sort algorithm. Initially, it sorts data in an internal memory buffer (using a merge-sort algorithm,) but if the space is exhausted the sorted data are written to one or more temporary files which are later merged together to form the output. If the data does fit in memory, the sorted output is written directly.
There are several parameters that tune GNU sort's internal operations.
- --buffer-size the size of the internal memory area used for sorting or merging; if not specified the default size depends on the operating environment and particularly on the size of physical memory.
- --batch-size the number of temporary files to merge at once, if required. The minimum is 2. If not specified, the default is 16.
- --parallel the number of threads to use in the sort. Particularly useful on multi-core machines. The default is 1.
When temporary files are required, they are merged batch-size at a time into new temporaries. This continues until a batch contains all of the remaining temporaries, and the output file is written with the result. If more than one batch is formed, some of the data will thus be merged in multiple batches.
Experiments on a 1.3 terrabyte file on a machine with 32GB of RAM showed that with default values the temporaries were about 11 gigabytes, and there were 117 of them. Using a buffer-size of 11g and batch-size of 250 produced temporaries of a bit over 5 gb, and brought the time down from 21 hours to 11 hours. You may want the temporaries even smaller, but be sure they can all be merged at once. You may also want to use the --parallel option if your machine has multiple cores.