Regular Expressions

Links & references:

http://linuxreviews.org/beginner/tao_of_regular_expressions
http://en.wikipedia.org/wiki/Regular_expressions

Introduction

To fully utilize shell scripting and certain commands and utilities commonly used in scripts (expr, sed, awk, etc.), you need to know how to use regular expressions.

Do not confuse regular expressions with shell globbing (filename expansion).
- sh/ksh/bash do not normally use regular expressions, but they do perform filename globbing, which uses conventions that resemble regular expressions.
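
To make the distinction concrete, the following sketch feeds the same pattern to the shell's own pattern matching (a case statement) and to grep. The helper names glob_match and regex_match and the sample strings are invented for this illustration.

```shell
# The pattern "a*" means different things as a glob and as a regexp.
# Glob:   "a" followed by anything.
# Regexp: zero or more "a" characters.

glob_match() {
    # $1 = string, $2 = glob pattern (left unquoted so the shell treats it as a pattern)
    case "$1" in
        $2) echo yes ;;
        *)  echo no ;;
    esac
}

regex_match() {
    # $1 = string, $2 = regexp; -x requires the whole line to match
    if printf '%s\n' "$1" | grep -qx -- "$2"; then echo yes; else echo no; fi
}

glob_abc=$(glob_match abc 'a*')      # yes: "a" plus anything
regex_abc=$(regex_match abc 'a*')    # no: "b" and "c" are not a's
regex_aaa=$(regex_match aaa 'a*')    # yes: three a's

echo "glob   a* vs abc: $glob_abc"
echo "regexp a* vs abc: $regex_abc"
echo "regexp a* vs aaa: $regex_aaa"
```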

Regular expressions are sets of characters and/or metacharacters that represent text patterns.

The main uses for regular expressions are text searches and string manipulation.
- A regular expression matches a single character or a set of characters (a substring or an entire string).

Regexp (regular expression) meta-characters

The asterisk * matches any number of repeats of the character or regexp preceding it, including zero.

     "1133*" matches 11 + one or more 3's, wherever that occurs in a line:
     113, 1133, 111312, and so forth.

The dot . matches any one character, except a newline.

     "13." matches 13 plus any one additional character (including a
     space): 1133, 11333, but not 13 (additional character missing).

     ".*" matches any number of any characters.
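
The asterisk and dot examples can be checked with grep, which prints every input line that contains a match. This is a small sketch; the sample lines are made up for the illustration.

```shell
# Sample input, one candidate per line.
samples='113
1133
111312
13
12'

# "1133*" = "113" followed by zero or more 3's, anywhere in the line.
star_hits=$(printf '%s\n' "$samples" | grep '1133*')

# "13." = "13" followed by any one character.
dot_hits=$(printf '%s\n' "$samples" | grep '13.')

echo "1133* matches:"; echo "$star_hits"
echo "13. matches:";   echo "$dot_hits"
```

Note that 113 matches the first pattern but not the second: nothing follows its 13, so the dot has no character to consume.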

The caret ^ matches the beginning of a line, but as the first character inside brackets it instead negates the set of characters that follows.

The dollar sign $ at the end of a regexp matches the end of a line.

     "^$" matches blank lines.
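
As a rough illustration of the two anchors, grep -c counts the lines that match each pattern. The sample text is invented for this sketch.

```shell
# Five lines of sample text; the second line is empty.
text='alpha

beta alpha
alpha beta
gamma alpha'

blank_count=$(printf '%s\n' "$text" | grep -c '^$')       # empty lines
starts_count=$(printf '%s\n' "$text" | grep -c '^alpha')  # lines beginning with alpha
ends_count=$(printf '%s\n' "$text" | grep -c 'alpha$')    # lines ending with alpha

echo "blank: $blank_count, start with alpha: $starts_count, end with alpha: $ends_count"
```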

Brackets […] enclose a set of characters to match in a single regexp.

       "[xyz]" matches the characters x, y, or z.

       "[c-n]" matches any of the characters in the range c to n.

       "[B-Pk-y]" matches any of the characters in the ranges B to P and k to y.

       "[a-z0-9]" matches any lowercase letter or any digit.

       "[^b-d]" matches all characters except those in the range b to d.
                (This is an instance of ^ negating or inverting the meaning of
                the following regexp, taking on a role similar to ! in a different
                context.)

       Combined sequences of bracketed characters match common word
       patterns.

       "[Yy][Ee][Ss]" matches yes, Yes, YES, yEs, and so forth.

       "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]" matches any
       string in Social Security number format (ddd-dd-dddd).
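
The bracket patterns above can be tried out with a small helper. This is a sketch under a few assumptions: the helper name is_match and the test strings are invented, and LC_ALL=C is set because the meaning of ranges such as b-d can vary between locales.

```shell
LC_ALL=C; export LC_ALL   # make character ranges follow plain ASCII order

is_match() {
    # $1 = regexp, $2 = string; -x requires the whole string to match
    if printf '%s\n' "$2" | grep -qx -- "$1"; then echo yes; else echo no; fi
}

ssn='[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'

yes_mixed=$(is_match '[Yy][Ee][Ss]' 'yEs')   # yes
yes_wrong=$(is_match '[Yy][Ee][Ss]' 'yep')   # no: p is not in [Ss]
ssn_ok=$(is_match "$ssn" '123-45-6789')      # yes
ssn_bad=$(is_match "$ssn" '123-456-789')     # no: digits grouped 3-3-3
not_bcd=$(is_match '[^b-d]' 'a')             # yes: a is outside b-d
is_bcd=$(is_match '[^b-d]' 'c')              # no: c is inside b-d

echo "$yes_mixed $yes_wrong $ssn_ok $ssn_bad $not_bcd $is_bcd"
```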

The backslash \ escapes a special character, which means that character gets interpreted literally.

     A "\$" reverts to its literal meaning of "$", rather than its
     regexp meaning of end-of-line. Likewise, a "\\" has the literal meaning
     of "\".
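
A short sketch of escaping in practice follows. The sample lines are invented; the patterns are single-quoted so that the shell passes the backslashes to grep unchanged.

```shell
# Sample lines containing a literal dollar sign and a literal backslash.
lines='cost: $5
cost: 5
path: C:\work'

# "\$" matches a literal dollar sign instead of anchoring the end of line.
dollar_count=$(printf '%s\n' "$lines" | grep -c '\$5')

# "\\" matches one literal backslash.
backslash_hit=$(printf '%s\n' "$lines" | grep 'C:\\work')

echo "lines with \$5: $dollar_count"
echo "line with backslash: $backslash_hit"
```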

( ) - treats the expression between ( and ) as a group and saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression; they can be referenced as \1 through \9. On the shell command line or in scripts, the ( and ) metacharacters have to be escaped like this: \( \).

| - “or” two conditions together

       "him|her" matches "it belongs to him" and "it belongs to her"

       "(Memo|Report)20.\.txt" matches Memo201.txt, Report20a.txt, and
       Report209.txt; note use of grouping ().  Certain applications
       require the parens () to be escaped:  \( and \)

       $ w | grep "jchung\|clayton" # Note the "\|" in the grep regexp.
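
The grouping and alternation examples can be sketched with grep -E, where ( ) and | need no backslashes, and with sed, where the parentheses are escaped and \1, \2 refer back to the saved matches. The file names and the key=value string are invented for this illustration.

```shell
files='Memo201.txt
Report20a.txt
Memo20.txt
Notes201.txt'

# Extended syntax (grep -E): grouping and alternation without backslashes.
# ^ and $ are added here so that the whole name has to match.
ere_hits=$(printf '%s\n' "$files" | grep -E '^(Memo|Report)20.\.txt$')

# Basic syntax (sed): \( \) save matches, \1 and \2 recall them.
swapped=$(printf '%s\n' 'key=value' | sed 's/\(.*\)=\(.*\)/\2=\1/')

echo "$ere_hits"
echo "$swapped"
```

Memo20.txt is rejected because the dot after 20 consumes the "." of the extension, leaving nothing for "\." to match.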

Extended regular expressions

Used in egrep, awk, and Perl

The question mark ? matches zero or one of the previous regexp. It is generally used for matching single characters.

     im?ing matches swiing, swiming, but not swimming

The plus + matches one or more of the previous regexp. It serves a role similar to the *, but does not match zero occurrences.

     9+ matches 9, 99, 999, but not 88

{i}, {i,j} - match exactly i instances, or between i and j instances, of the preceding character or regexp.
- If used on the command line the {} chars may have to be escaped with “\”: \{ \}

       A[0-9]{3} matches "A" followed by 3 digits (A123, and also A1234,
                 whose first four characters match, but not A12 34).

       [0-9]{4,6} matches any sequence of 4, 5 or 6 digits
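
The extended operators can be tried with grep -E. The sketch below reuses the examples from this section; the last command shows the escaped \{ \} form that plain grep expects.

```shell
# ? : zero or one of the preceding character
optional_hits=$(printf '%s\n' swiing swiming swimming | grep -E 'im?ing')

# + : one or more of the preceding character
plus_hits=$(printf '%s\n' 9 99 999 88 | grep -E '9+')

# {3} : exactly three of the preceding bracket expression
interval_hits=$(printf '%s\n' 'A123' 'A1234' 'A12 34' | grep -E 'A[0-9]{3}')

# The same interval in basic syntax, with escaped braces
interval_bre_hits=$(printf '%s\n' 'A123' 'A1234' 'A12 34' | grep 'A[0-9]\{3\}')

echo "$optional_hits"
echo "$plus_hits"
echo "$interval_hits"
```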

Simple regexp examples using the %s (search and replace) command in vi

    :%s/  */ /g          Change 1 or more spaces into a single space.
    :%s/ *$//            Remove all spaces from the end of each line.
    :%s/^/ /             Insert a space at the beginning of every line.
    :%s/^[0-9][0-9]* //  Remove a leading number, and the space after it,
                         from the beginning of each line.
    :%s/b[aeio]g/bug/g   Change all occurrences of bag, beg, big, and bog to
                         bug.
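
The same substitutions can be tried outside vi with sed, whose s command uses the same basic regexp syntax for these patterns. This is only a sketch; the input strings are invented.

```shell
squeezed=$(printf '%s\n' 'a   b    c' | sed 's/  */ /g')             # runs of spaces become one space
trimmed=$(printf '%s\n' 'end   ' | sed 's/ *$//')                    # trailing spaces removed
indented=$(printf '%s\n' 'line' | sed 's/^/ /')                      # one space inserted at the start
unnumbered=$(printf '%s\n' '42 answer' | sed 's/^[0-9][0-9]* //')    # leading number and space removed
bugged=$(printf '%s\n' 'bag beg big bog' | sed 's/b[aeio]g/bug/g')   # every b?g word becomes bug

printf '[%s]\n' "$squeezed" "$trimmed" "$indented" "$unnumbered" "$bugged"
```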

Medium regexp example using search and replace in vi

Change all instances of foo(a,b,c) to foo(b,a,c), where a, b, and c can be any parameters supplied to foo(). That is, we must be able to make changes like the following:

 Before                   After
 ------                   -----
 foo(10,7,2)              foo(7,10,2)
 foo(x+13,y-2,10)         foo(y-2,x+13,10)
 foo(bar(8),x+y+z,5)      foo(x+y+z,bar(8),5)

 The following substitution command will do the trick:

 :%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g

 [^,]  means any character which is not a comma.

 [^,]*  means 0 or more characters which are not commas.

 \([^,]*\)  uses grouping, \( \), to tag the non-comma characters as \1 for use
 in the replacement part of the command.

 \([^,]*\),  means that we must match 0 or more non-comma characters
 which are followed by a comma. The non-comma characters are tagged.

 foo(\([^,]*\),  translates to "after you find foo(, tag all characters up to
 the next comma as \1".
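
 The substitution can also be checked from the shell by handing the same pattern to sed. The wrapper name swap_args is invented for this sketch; the inputs are the three "Before" rows from the table above.

```shell
swap_args() {
    # Same pattern as the vi command: swap the first two arguments of foo()
    sed 's/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g'
}

r1=$(printf '%s\n' 'foo(10,7,2)' | swap_args)
r2=$(printf '%s\n' 'foo(x+13,y-2,10)' | swap_args)
r3=$(printf '%s\n' 'foo(bar(8),x+y+z,5)' | swap_args)

echo "$r1"
echo "$r2"
echo "$r3"
```

 The third case works because [^,]* is free to match parentheses, so bar(8) is captured whole as \1.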

Joe Chung
Monmouth U. Homepage
