Regular expression

From LQWiki

Regular expressions are patterns used to find, filter, or transform the strings in a file or on standard input that match them. Regular expressions (also called regexes or RegExs) are supported by a wide range of programs, most notably grep (whose name comes from the ed command g/re/p, "globally search for a regular expression and print") and sed (the stream editor).

Example: You have a file myCode.cpp and you want to find all occurrences of printf and cout. Here's the command:

grep -E "printf|cout" myCode.cpp

The -E option selects extended regular expressions, in which | separates alternatives, so this command prints every line containing printf or cout.

Features

Regular expressions can do the following for you:

  • Match any of a set of strings, e.g. lines containing tom, dick or harry in the file test:
grep -E "tom|dick|harry" test
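  • Match one of a set of characters, e.g. match lines containing LINUX or L1NUX in the file test (no comma between the characters, since a comma inside the brackets would itself be matched literally):
grep -E "L[I1]NUX" test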
  • Invert a set of characters, e.g. match "forget" and "forgive" but not "for you" or "foresee" (the character after "for" must be neither a space nor an e):
grep -E "for[^ e]" test
  • Match a range of characters, e.g. lines containing foo1, foo2 and so on up to foo9 in the file test:
grep -E "foo[1-9]" test
  • Invert matches (this uses grep's -v option rather than the pattern itself), e.g. match any line in the file test that does not contain gettimeofday:
grep -E -v "gettimeofday" test
  • Match lines starting with a given text, e.g. Unix in the file test:
grep -E "^Unix" test
  • Match lines ending with a given text, e.g. Unix in the file test:
grep -E "Unix$" test
  • Match any character that is not a newline, e.g. Unix, Enix, and so on in the file test:
grep -E ".nix" test
  • Match if a character occurs at least once, e.g. the i in Linux in the file test:
grep -E "Li+nux" test

The + here means that i occurs one or more times. To accept zero or more occurrences instead, replace the + with a *. See the Quantifiers section below.
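Several of the features above can be tried out together. The following is a minimal sketch, assuming a POSIX shell with grep and mktemp available; the sample file contents and variable names are made up for illustration.

```shell
# Hypothetical sample file; the contents are invented for this demo.
sample=$(mktemp)
printf '%s\n' 'Unix is old' 'I like Unix' 'forget it' 'for you' 'Liinux' > "$sample"

starts=$(grep -E '^Unix' "$sample")         # line beginning with Unix
ends=$(grep -E 'Unix$' "$sample")           # line ending with Unix
negated=$(grep -E 'for[^ e]' "$sample")     # "for" followed by neither space nor e
repeated=$(grep -E 'Li+nux' "$sample")      # L, one or more i, then nux
inverted=$(grep -c -E -v 'Unix' "$sample")  # number of lines without Unix

printf '%s\n' "$starts" "$ends" "$negated" "$repeated" "$inverted"
rm -f "$sample"
```

Single quotes are used around the patterns here so that the shell leaves characters such as $ alone.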

Literals & Meta Characters

Here's how you find all lines containing "Linux" in the file test:

grep -E "Linux" test

Note that the dot (.) is a metacharacter that matches any single character except a newline. So, if you want to match inux preceded by any character, you can use

grep -E ".inux" test

This matches Linux as well as tinux and ~inux. If you want the dot to be taken literally, i.e. to match inux preceded by a dot, you have to escape it:

grep -E "\.inux" test

The backslash escapes the dot, removing its special meaning as a metacharacter.
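The difference between the dot as a metacharacter and as a literal can be checked on a few made-up input lines. This is only a sketch; the variable names are hypothetical, and the bracket form [.] is shown as an alternative way to make the dot literal.

```shell
# Three invented input lines, one per line.
lines=$(printf '%s\n' 'Linux' 'tinux' '.inux')

# Unescaped dot: any character before inux, so all three lines match.
any=$(printf '%s\n' "$lines" | grep -c -E '.inux')

# Escaped dot: only a literal dot before inux matches.
literal=$(printf '%s\n' "$lines" | grep -E '\.inux')

# Inside a bracket expression the dot is also literal.
bracket=$(printf '%s\n' "$lines" | grep -E '[.]inux')

printf '%s\n' "$any" "$literal" "$bracket"
```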

Common syntax

This is the most common syntax for regular expressions, defined by the POSIX 1003.2 standard and adopted, with variations, by almost every program that uses them. Implementations differ in details, such as whether certain metacharacters need escaping, and many add extra features, but a regex written for one program in this syntax is mostly portable to another.

Branches

A pattern consists of one or more branches, usually separated by the metacharacter |, and matches if at least one of its branches does. A branch consists of one or more pieces that must match one after another.

Pieces

A piece is merely an atom, optionally followed by a quantifier. It matches if the atom matches as many times as required by the quantifier, or once if there is no quantifier.

Atoms

An atom is the smallest unit that can match on its own. Atoms include:

  • Any character that is not a metacharacter, including metacharacters that match literally by means of escaping (or lack thereof)
  • The . (period) character, to match any character that is not a newline
  • ^ or $, to match the null string at the beginning and end of a string, respectively (sometimes they also or only match at the beginning and end of a line, rather than a string; often however, the distinction is meaningless as with programs such as grep which only operate on entire lines)
  • Bracket expressions
  • Subexpressions, which are patterns in their own right and as such can contain branches, pieces, atoms and quantifiers of their own.
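The terms branch, piece and atom can be seen together in a single pattern. The sketch below uses a made-up pattern and made-up input lines, and assumes grep -E and paste are available; the comments show one way to read the pattern in these terms.

```shell
# Pattern under test, decomposed:
#   ^            atom: matches the null string at the start of the line
#   (foo|ba+r)   atom: a subexpression with two branches, foo and ba+r;
#                in ba+r, "a+" is a piece (the atom a plus the quantifier +)
#   [0-9]        atom: a bracket expression
#   $            atom: matches the null string at the end of the line
pattern='^(foo|ba+r)[0-9]$'

# Invented input; matching lines are joined with spaces for display.
matched=$(printf '%s\n' 'foo1' 'baaar7' 'br2' 'foo' 'xfoo1' \
  | grep -E "$pattern" | paste -s -d ' ' -)

printf '%s\n' "$matched"
```

Here br2 fails because a+ needs at least one a, foo fails because the digit is missing, and xfoo1 fails because of the ^ anchor.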

Quantifiers

Quantifiers specify how many times the preceding atom (a character, metacharacter, bracket expression or subexpression) must match. Some of the most common quantifiers are:

  • * Match 0 or more times
  • + Match 1 or more times
  • ? Match 1 or 0 times

In the example

grep -E "Li*nux" test

the pattern Li*nux matches any number of i, including none, as in the following:

Lnux 
Linux 
Liiiiinux

If we exchange the * for a +, Lnux is no longer matched.
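The three quantifiers can be compared side by side on the words listed above. This is a small sketch that counts matching lines with grep -c; the variable names are hypothetical.

```shell
# The three example words from the text, one per line.
words=$(printf '%s\n' 'Lnux' 'Linux' 'Liiiiinux')

star=$(printf '%s\n' "$words" | grep -c -E 'Li*nux')      # zero or more i
plus=$(printf '%s\n' "$words" | grep -c -E 'Li+nux')      # one or more i
question=$(printf '%s\n' "$words" | grep -c -E 'Li?nux')  # zero or one i

printf '%s\n' "$star" "$plus" "$question"
```

With * all three words match, with + only Linux and Liiiiinux match, and with ? only Lnux and Linux match.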

There are also quantifiers for telling the exact number of times to match or a range.

  • {n} Match exactly n times
  • {n,} Match at least n times
  • {x,y} Match at least x times, but no more than y times

With grep -E, the braces work as shown. In programs that use basic regular expressions, such as grep without -E or sed by default, the braces are not metacharacters unless escaped, requiring \{n\}, \{n,\}, etc. Less often, this will only apply to the opening brace, allowing just \{n}, \{n,}, etc.
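The interval quantifiers can be illustrated with grep in both its extended and basic modes. The following sketch uses invented input and the -x option, which requires the whole line to match, so that the upper bound is visible in the results.

```shell
# Invented input: f followed by one to four o.
words=$(printf '%s\n' 'fo' 'foo' 'fooo' 'foooo')

# Extended syntax (grep -E): braces are metacharacters as written.
exact=$(printf '%s\n' "$words" | grep -x -E 'fo{2}' | paste -s -d ' ' -)
atleast=$(printf '%s\n' "$words" | grep -x -E 'fo{2,}' | paste -s -d ' ' -)
range=$(printf '%s\n' "$words" | grep -x -E 'fo{2,3}' | paste -s -d ' ' -)

# Basic syntax (grep without -E): the braces must be escaped.
basic=$(printf '%s\n' "$words" | grep -x 'fo\{2,3\}' | paste -s -d ' ' -)

printf '%s\n' "$exact" "$atleast" "$range" "$basic"
```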

Bracket expressions

So called because they reside within [ and ], bracket expressions match any single one of the characters listed between the brackets. They can be inverted by placing a caret (^) as the first character. Ranges can be specified using hyphens: [a-d] means the same as [abcd]. Character classes can be specified inside them with another set of brackets enclosing colons and the name of the class: [:classname:], so that a complete bracket expression looks like [[:digit:]]. Class names include alnum (the same as a-zA-Z0-9 in the C locale) and digit (the same as 0-9).
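Ranges, inversion and character classes can be tried as follows. This is a sketch on made-up input; LC_ALL=C is set as a precaution, since ranges such as [a-d] can behave differently in other locales.

```shell
# Invented two-character input lines.
lines=$(printf '%s\n' 'a1' 'b2' 'e5' 'zz' 'A9')

# A letter from a to d, followed by a digit given as a character class.
in_range=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '^[a-d][[:digit:]]$' | paste -s -d ' ' -)

# Inverted set: any character except a to d, followed by a digit range.
inverted=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '^[^a-d][0-9]$' | paste -s -d ' ' -)

# Lines made up only of letters and digits.
alnum=$(printf '%s\n' "$lines" | LC_ALL=C grep -c -E '^[[:alnum:]]+$')

printf '%s\n' "$in_range" "$inverted" "$alnum"
```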

Subexpressions

Surrounded with \( and \) in basic regular expressions (where unescaped parentheses are literal), or with plain ( and ) in extended regular expressions such as those of grep -E, these are full patterns in their own right, but treated as atoms by the including pattern. As such, any pattern can be parenthesised and optionally followed by a quantifier to become a piece, which can be included in a branch, which can be part of its own subexpression, ad infinitum.

Example:

echo "/tmp/whatever/foo" | sed "s;.*/\(.*\);\1;" 

This extracts the base file name (foo) from the fully qualified path.
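A subexpression followed by a quantifier behaves like a single repeated atom. The sketch below shows this on invented input in both extended and basic syntax, and repeats the sed example with its result captured in a variable; the variable names are hypothetical.

```shell
# Invented input lines.
words=$(printf '%s\n' 'ab' 'abab' 'aab' 'ababab')

# Extended syntax: the group (ab) repeated one or more times, whole line.
repeated=$(printf '%s\n' "$words" | grep -x -E '(ab)+' | paste -s -d ' ' -)

# Basic syntax: the same group, repeated at least twice.
twice=$(printf '%s\n' "$words" | grep -x '\(ab\)\{2,\}' | paste -s -d ' ' -)

# The sed example from the text: keep only what follows the last slash.
base=$(echo "/tmp/whatever/foo" | sed 's;.*/\(.*\);\1;')

printf '%s\n' "$repeated" "$twice" "$base"
```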

Backreferences

One of the more useful constructs is grouping and backreferencing, which lets us group parts of a pattern and reuse what they matched. Backreferencing is especially handy in substitutions, where the matched part can be used in the replacement string.

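Grouping a part of a pattern also captures the text it matched, so that it can be referred to later as \1, \2 and so on; this is called backreferencing. Grouping also limits the scope of alternation, as in the extended pattern (apple|banana) for matching apple or banana. POSIX syntax has no grouping without capturing; some other flavours, such as Perl-compatible regular expressions, write a non-capturing group as (?: ).

  • \( \) Group the text and capture the result in basic regular expressions (the default for sed and for grep without -E); unescaped ( ) are literal there
  • ( ) Group the text and capture the result in extended regular expressions (grep -E); escaped \( \) are literal there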

Examples:

Add a surname

sed -e 's;\(Linus\);\1 Torvalds;g' test

The substitution command of sed has the syntax s;old text;new text;options (any single character can serve as the delimiter; / is the most common).
The substitution above searches the file test for occurrences of Linus and captures each match. It replaces the match with the captured text (\1), followed by a space and the text Torvalds.

Extract hostname

$ echo "http://wiki.linuxquestions.org/wiki/Regex" | sed "s;\(http://[^/]*\)/.*;\1;"
http://wiki.linuxquestions.org

This extracts the scheme and hostname from a URL such as http://wiki.linuxquestions.org/wiki/Regex. The pattern matches http:// followed by any number of characters that are not a slash ([^/]*); in our example, this is http://wiki.linuxquestions.org. This part is remembered as \1 because it is enclosed in \( and \). The rest, /wiki/Regex, is matched but not remembered. The whole string is then replaced by \1.
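Backreferences can also be used inside the pattern itself, and more than one group can be captured at a time. The following sketch uses invented input and hypothetical variable names; the last command assumes a sed that accepts -E for extended syntax, as GNU and BSD sed do.

```shell
# Backreference inside a pattern: \1 must repeat whatever \(.\) matched,
# so this finds lines containing a doubled character.
doubled=$(printf '%s\n' 'hello' 'world' 'book' | grep '\(.\)\1' | paste -s -d ' ' -)

# Two captured groups, swapped in the replacement string.
swapped=$(echo 'Torvalds Linus' | sed 's;\([A-Za-z]*\) \([A-Za-z]*\);\2 \1;')

# The surname example in extended syntax: plain parentheses capture.
# Assumption: this sed supports the -E option.
extended=$(echo 'Linus' | sed -E 's;(Linus);\1 Torvalds;')

printf '%s\n' "$doubled" "$swapped" "$extended"
```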
