Regular expression

From LQWiki

Regular expressions describe a pattern, so that strings matching the pattern can be filtered out of text from a file or from standard input. Regular expressions (also called regexes or regexps) can be used with a large number of programs, most notably grep (named after the ed command g/re/p, "global regular expression print") and sed (the stream editor), and they are built into a number of programming languages including the bash shell (and possibly other shells), Perl, Java, Python and so on.

These programs sometimes differ in the details of how regular expressions are specified. Many programs try to follow the forms used in Perl, but others predate that language and are unlikely to change for compatibility reasons, so we can expect to find differing dialects of regexps for the foreseeable future.

Example: You have a file myCode.cpp and you want to find all occurrences of printf and cout. Here's the command:

grep -E "printf|cout" myCode.cpp

In this case, we say you match any line containing printf or cout.
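
The -E option matters in this command. As a minimal sketch (the sample lines are made up for illustration and stand in for myCode.cpp), the following runs the same pattern with and without -E; in the basic syntax used by plain grep, | is an ordinary character:

```shell
# Made-up sample input standing in for myCode.cpp.
sample=$(printf '%s\n' 'printf("hi");' 'std::cout << "hi";' 'int x = 1;' '// printf|cout')

# Extended syntax (-E): | separates two alternatives.
ere_count=$(printf '%s\n' "$sample" | grep -E -c 'printf|cout')

# Basic syntax (no -E): | is taken literally, so only the text "printf|cout" matches.
bre_count=$(printf '%s\n' "$sample" | grep -c 'printf|cout')

echo "with -E: $ere_count lines, without -E: $bre_count line"
```

With this input, three lines match with -E and only the comment line matches without it.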

Use cases

Regular expressions can do the following for you:

  • Match any of a set of strings, e.g. lines containing tom or dick or harry in the file test:
grep -E "tom|dick|harry" test
  • Match one of a set of characters, e.g. match lines containing LINUX or L1NUX in the file test (the characters are listed without separators; a comma inside the brackets would itself be matched):
grep -E "L[I1]NUX" test
  • Exclude a set of characters, e.g. match lines containing forget or forgive but not "for you", foresee etc.:
grep -E "for[^ e]" test
  • Match a range of characters, e.g. lines containing foo1, foo2 and so on up to foo9 in the file test:
grep -E "foo[1-9]" test
  • Invert matches, e.g. match any line in the file test that does not contain gettimeofday:
grep -E -v "gettimeofday" test
  • Match any lines starting with a text, e.g. Unix in the file test:
grep -E "^Unix" test
  • Match any lines ending with a text, e.g. Unix in the file test:
grep -E "Unix$" test
  • Match any character that is not a newline, e.g. Unix, Enix, and so on in the file test:
grep -E ".nix" test
  • Match if a character is there at least once, e.g. the i in Linux in the file test:
grep -E "L[i]+nux" test

The + here means that i occurs one or more times. To accept zero or more occurrences instead, replace the + with a *. See Quantifiers.

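  • Filter out lines containing blahblah with a Perl-style negative lookahead (this needs a grep with -P support, such as GNU grep); the result is the same as with grep -v "blahblah" test:
grep -P '^(?!(.*blahblah.*))' test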
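
To try several of these use cases in one go, here is a small sketch that builds a throwaway file with made-up contents (it stands in for the file test) and stores what each pattern finds. Characters inside brackets are listed without separators, since a comma there would itself be matched.

```shell
# Hypothetical sample data standing in for the file "test".
test_file=$(mktemp)
printf '%s\n' 'LINUX' 'L1NUX' 'L,NUX' 'Unix rules' 'I like Unix' 'foo7' > "$test_file"

set_match=$(grep -E 'L[I1]NUX' "$test_file")     # one character out of a set
start_match=$(grep -E '^Unix' "$test_file")      # anchored at the start of the line
end_match=$(grep -E 'Unix$' "$test_file")        # anchored at the end of the line
inverted=$(grep -E -v 'Unix|NUX' "$test_file")   # lines that do NOT match

rm -f "$test_file"

printf 'set:\n%s\nstart: %s\nend: %s\ninverted: %s\n' \
  "$set_match" "$start_match" "$end_match" "$inverted"
```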

Literals & Meta Characters

Here's how you find all lines containing "Linux" in the file test:

grep -E "Linux" test

Note that the dot (.) is a metacharacter that matches any character except a newline. So, if you want to match inux preceded by any character, you can do

grep -E ".inux" test

This will match Linux as well as tinux and ~inux. If you want the dot to be taken literally, i.e. inux preceded by a dot, you have to escape it:

grep -E "\.inux" test

The backslash strips the dot of its special meaning as a metacharacter.
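
The difference can be checked with a short sketch on made-up input. It also shows grep -F, which treats the entire pattern as a fixed string and is an alternative when no metacharacters are wanted at all:

```shell
sample=$(printf '%s\n' 'Linux' 'tinux' '.inux')

# Unescaped dot: any character may precede "inux", so all three lines match.
any_char=$(printf '%s\n' "$sample" | grep -E -c '.inux')

# Escaped dot: only a literal "." may precede "inux".
literal_dot=$(printf '%s\n' "$sample" | grep -E '\.inux')

# grep -F treats the whole pattern as a fixed string, so no escaping is needed.
fixed=$(printf '%s\n' "$sample" | grep -F '.inux')

echo "$any_char / $literal_dot / $fixed"
```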

Common syntax

There is a syntax for regular expressions defined by the POSIX 1003.2 standard. This definition is used by grep, sed and the older ereg functions of PHP, although there may be small differences.

Branches

A pattern matches if one or more of its branches does. A branch consists of one or more pieces that match next to each other. Branches are separated by the metacharacter |. In the command

grep -E "printf|cout"

printf|cout is the expression, and printf and cout are each a branch.
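
Because a branch extends as far as it can, | binds more loosely than anything else in the pattern. A small sketch with made-up words shows how parentheses (see Subexpressions) confine the alternation to part of the pattern:

```shell
words=$(printf '%s\n' 'gray' 'grey' 'graphite' 'obey')

# Two branches, "gra" and "ey": every word contains one of them.
loose=$(printf '%s\n' "$words" | grep -E -c 'gra|ey')

# The alternation is confined to the parentheses: "gr", then a or e, then "y".
tight=$(printf '%s\n' "$words" | grep -E 'gr(a|e)y')

echo "loose: $loose lines"
echo "tight: $tight"
```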

Pieces

A piece is merely an atom, optionally followed by a quantifier. It matches if the atom matches as many times as required by the quantifier, or once if there is no quantifier.

In the example

grep -E "Li*nux" test

L, i*, n, u and x are each a piece.

Atoms

Atoms include:

  • Any character that is not a metacharacter, as well as metacharacters made to match literally by escaping them (or, in some dialects, by leaving them unescaped)
  • The . (period) character, to match any character that is not a newline
  • ^ or $, to match the null string at the beginning and end of a string, respectively (sometimes they also or only match at the beginning and end of a line, rather than a string; often however, the distinction is meaningless as with programs such as grep which only operate on entire lines)
  • Bracket expressions
  • Subexpressions, which are patterns in their own right and as such can contain branches, pieces, atoms and quantifiers of their own.

So any of the following lines is an atom:

a
b
(this is a subexpression)
[0-9]
[a-z]
[aeiou]
[^ ]
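
As an illustration of the ^ and $ atoms, the following sketch (made-up input) anchors a bracket expression at both ends so that the whole line has to match, and uses ^$ to find empty lines:

```shell
# Made-up input: the third line is empty.
lines=$(printf '%s\n' '42' 'abc' '' '4x2' '7')

# ^ and $ around a bracket expression: the whole line must consist of digits.
numbers=$(printf '%s\n' "$lines" | grep -E '^[0-9]+$')

# ^ immediately followed by $ matches only empty lines.
blank_count=$(printf '%s\n' "$lines" | grep -c '^$')

echo "numbers: $numbers"
echo "blank lines: $blank_count"
```

Without the anchors, [0-9]+ would also match the line 4x2, because grep only needs the pattern to match somewhere in the line.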

Quantifiers

Quantifiers let us match the preceding atom a certain number of times. Some of the most common quantifiers are:

  • * Match 0 or more times
  • + Match 1 or more times
  • ? Match 1 or 0 times

In the example

grep -E "Li*nux" test

The pattern Li*nux matches L, then any number of i (including none), then nux, for example:

Lnux 
Linux 
Liiiiinux

If we exchange the * for a +, Lnux would no longer match.

There are also quantifiers for telling the exact number of times to match or a range.

  • {n} Match exactly n times
  • {n,} Match at least n times
  • {x,y} Match at least x times, but no more than y times

In basic regular expressions, as used by grep without -E and by sed without -r, the braces are not metacharacters by default, requiring \{n\}, \{n,\}, etc. Less often, this will only apply to the opening brace, allowing just \{n}, \{n,}, etc.
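
The quantifiers can be compared side by side. This sketch counts how many of four sample words each pattern matches; the helper function count is introduced here only for illustration:

```shell
words=$(printf '%s\n' 'Lnux' 'Linux' 'Liinux' 'Liiiiinux')

# Hypothetical helper: number of sample words matched by the extended pattern $1.
count() { printf '%s\n' "$words" | grep -E -c "$1"; }

star=$(count 'Li*nux')          # 0 or more i
plus=$(count 'Li+nux')          # 1 or more i
optional=$(count 'Li?nux')      # 0 or 1 i
exactly2=$(count 'Li{2}nux')    # exactly 2 i
atleast2=$(count 'Li{2,}nux')   # 2 or more i

# The same interval in basic syntax (no -E) needs escaped braces.
bre_exactly2=$(printf '%s\n' "$words" | grep -c 'Li\{2\}nux')

echo "$star $plus $optional $exactly2 $atleast2 $bre_exactly2"
```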

Bracket expressions

So called because they are written between [ and ], bracket expressions match any single character listed between the brackets. They can be inverted by placing a caret (^) as the first character. Ranges can be specified using hyphens: [a-d] means the same as [abcd]. Character classes can be specified inside them with another set of brackets enclosing colons and the name of the class: [:classname:], as in [[:digit:]]. Class names include alnum (the same as a-zA-Z0-9 in the C locale) and digit (the same as 0-9).
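
A short sketch on made-up names shows a character class, a range and an inverted bracket expression. Setting LC_ALL=C is an assumption made here so that the range follows plain ASCII order:

```shell
# LC_ALL=C makes ranges such as [a-d] follow plain ASCII order.
export LC_ALL=C
names=$(printf '%s\n' 'foo1' 'fooA' 'foo_' 'food')

digit=$(printf '%s\n' "$names" | grep -E 'foo[[:digit:]]')            # class inside brackets
alnum_count=$(printf '%s\n' "$names" | grep -E -c 'foo[[:alnum:]]')   # letters and digits
range=$(printf '%s\n' "$names" | grep -E 'foo[a-d]')                  # lowercase a to d only
negated=$(printf '%s\n' "$names" | grep -E 'foo[^[:alnum:]]')         # anything but alnum

echo "$digit $alnum_count $range $negated"
```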

Subexpressions

Surrounded with ( and ) in extended syntax or \( and \) in basic syntax, these are full patterns in their own right, but treated as atoms by the including pattern. As such, any pattern can be parenthesised and optionally followed by a quantifier to become a piece, which can be included in a branch, which can be part of its own subexpression, ad infinitum.

Examples:

$ echo "banana" | sed -r "s;(an)+;o;"
boa
$ echo "/tmp/whatever/foo" | sed "s;.*/\(.*\);\1;"
foo

The second command extracts the base file name from the fully qualified file name.
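
What the parentheses change is the scope of a following quantifier. A minimal sketch on made-up words:

```shell
words=$(printf '%s\n' 'ab' 'abab' 'abb' 'aba')

# Without parentheses, + applies only to the atom "b".
ungrouped=$(printf '%s\n' "$words" | grep -E '^ab+$')

# With parentheses, + applies to the whole subexpression "ab".
grouped=$(printf '%s\n' "$words" | grep -E '^(ab)+$')

printf 'ungrouped:\n%s\ngrouped:\n%s\n' "$ungrouped" "$grouped"
```

The first pattern accepts ab and abb, the second accepts ab and abab.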

Backreferences

One of the more useful constructs is grouping and backreferencing, which lets us group and reuse matches. Backreferencing is especially handy when doing substitutions, where we can use the matched part in the substitution string.

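Grouping can be done with or without capturing. When capturing, the matched text is saved for later use, called backreferencing. In the POSIX syntaxes every group captures; Perl-compatible dialects (for example grep -P) additionally offer non-capturing groups, which are useful when you only need the group for alternation, as in the pattern (?:apple|banana) for matching apple or banana.

  • \( \) Group the text and capture the result (basic syntax, e.g. sed or grep without -E)
  • ( ) Group the text and capture the result (extended syntax, e.g. grep -E or sed -r)
  • (?: ) Group the text without capturing (Perl-compatible syntax)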

Examples:

Add a surname

sed -e 's;\(Linus\);\1 Torvalds;g' test

The substitution command of sed has the syntax s;old text;new text;options (any character can serve as the delimiter; / is the most common choice).
The substitution above searches the file test for occurrences of Linus and captures each match. It replaces the match with the captured text (\1) followed by a space and the text Torvalds.

Extract hostname

$ echo "http://wiki.linuxquestions.org/wiki/Regex" | sed "s;\(http://[^/]*\)/.*;\1;"
http://wiki.linuxquestions.org

This extracts the hostname part, including the scheme, from a URL like http://wiki.linuxquestions.org/wiki/Regex. It searches for http:// followed by any number of characters that are not a slash ([^/]*). In our example, this is http://wiki.linuxquestions.org. It is remembered as \1 because it is enclosed in \( and \). The rest, /wiki/Regex, is not remembered. The whole string is then replaced by \1.
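
More than one group can be captured and the groups can be reordered in the replacement. The following sketch uses made-up input and assumes a sed that accepts -E for extended syntax (GNU sed also accepts -r):

```shell
# Swap two captured parts (basic syntax, escaped parentheses).
swapped=$(echo 'Torvalds, Linus' | sed 's;\(.*\), \(.*\);\2 \1;')

# Reorder a made-up ISO date; -E switches sed to extended syntax,
# so parentheses and braces need no backslashes.
date_dmy=$(echo '2024-01-31' | sed -E 's;([0-9]{4})-([0-9]{2})-([0-9]{2});\3/\2/\1;')

echo "$swapped"
echo "$date_dmy"
```

Using ; as the delimiter keeps the slashes in the replacement free of escaping.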
