====== Regular Expressions ======

----

Links & references:

  * http://linuxreviews.org/beginner/tao_of_regular_expressions
  * http://en.wikipedia.org/wiki/Regular_expressions
  * Web tools for testing and learning:
    * http://www.regexr.com/
    * http://regexpal.com/
    * https://regexone.com/

----

===== Introduction =====

    * To fully utilize shell scripting and certain commands and utilities commonly used in scripts (expr, sed, awk, etc.), you need to know how to use regular expressions.

    * Do not confuse regular expressions with shell globbing (filename expansion).
          * sh/ksh/bash do not normally use regular expressions, but can do file globbing, which uses conventions that look similar to regular expressions but have different meanings.

    * Regular expressions are sequences of characters and/or metacharacters that represent text patterns.

    * The main uses for regular expressions are text searches and string manipulation.
          * A regular expression matches a single character or a set of characters (a substring or an entire string).
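The difference between globbing and regular expressions can be sketched with a small shell test. This is only an illustration: the file name report.txt and the variable names are made up for the example, and grep is assumed to be available.

```shell
# Glob: in a case pattern, * matches any string of characters.
glob_result="no match"
case "report.txt" in
  rep*) glob_result="match" ;;
esac
echo "glob rep* against report.txt: $glob_result"

# Regex: * repeats the preceding character, so ^rep*$ only matches
# "re" followed by zero or more p's and nothing else.
if printf '%s\n' "report.txt" | grep -q '^rep*$'; then
  regex_result="match"
else
  regex_result="no match"
fi
echo "regex ^rep*\$ against report.txt: $regex_result"

# The regex that behaves like the glob rep* is ^rep.*
if printf '%s\n' "report.txt" | grep -q '^rep.*'; then
  regex_equiv="match"
else
  regex_equiv="no match"
fi
echo "regex ^rep.* against report.txt: $regex_equiv"
```

The same characters (rep*) match as a glob but not as a regexp, which is why the two should not be confused.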

===== Regexp (regular expression) meta-characters =====

    * The asterisk * matches any number of repeats of the character or regexp preceding it, including zero. 

       "1133*" matches 11 + one or more 3's + possibly other characters:
       113, 1133, 111312, and so forth.                                 

    * The dot . matches any one character, except a newline. 

       "13." matches 13 plus exactly one more character of any kind
       (including a space): 1133, 11333, but not 13 (additional character missing).

       ".*" matches any number of any characters.

    * The caret ^ matches the beginning of a line. As the first character inside brackets, it instead negates the set of characters in a regexp.

    * The dollar sign $ at the end of a regexp matches the end of a line. 

       "^$" matches blank lines.                                       

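The *, ., ^ and $ examples above can be checked with grep, which uses basic regular expressions by default. The sketch below is only an illustration: count_matches is a hypothetical helper written for this page, not a standard command, and the test strings beyond those listed above are made up.

```shell
# count_matches (hypothetical helper): prints how many of the given
# lines contain a match for the basic regular expression in $1.
count_matches() {
  pattern=$1
  shift
  printf '%s\n' "$@" | grep -c -e "$pattern" || true
}

# 113, 1133 and 111312 all contain 113; 1123 does not.
star=$(count_matches '1133*' 113 1133 111312 1123)

# 1133 and 11333 contain 13 plus one more character; 13 alone does not.
dot=$(count_matches '13.' 1133 11333 13)

# Only abc starts with ab.
start=$(count_matches '^ab' abc cab)

# Two of the four input lines are empty.
blank=$(count_matches '^$' one '' two '')

echo "star=$star dot=$dot start=$start blank=$blank"
```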
    * Brackets [...] enclose a set of characters, any one of which matches at that position in a regexp.

<code>
       "[xyz]" matches any one of the characters x, y, or z.

       "[c-n]" matches any one character in the range c to n.

       "[B-Pk-y]" matches any one character in the ranges B to P and k to y.

       "[a-z0-9]" matches any lowercase letter or any digit.

       "[^b-d]" matches any character except those in the range b to d.
                (This is an instance of ^ negating or inverting the meaning of
                the set of characters that follows it, taking on a role similar
                to ! in a different context.)

       Combined sequences of bracketed characters match common word
       patterns.

       "[Yy][Ee][Ss]" matches yes, Yes, YES, yEs, and so forth.

       "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]" matches any string
       in Social Security number format, such as 123-45-6789.
</code>
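The bracket examples above can be tried with grep. This is a sketch under a few assumptions: count_matches is a hypothetical helper written for this page, the patterns are anchored with ^ and $ so that each test string has to match as a whole, and LC_ALL=C is set because the meaning of ranges such as c-n can depend on the locale.

```shell
# Use plain ASCII ordering for ranges such as [c-n].
LC_ALL=C
export LC_ALL

# count_matches (hypothetical helper): prints how many of the given
# lines contain a match for the basic regular expression in $1.
count_matches() {
  pattern=$1
  shift
  printf '%s\n' "$@" | grep -c -e "$pattern" || true
}

# fox contains x; cat contains none of x, y, z.
set_xyz=$(count_matches '[xyz]' fox cat)

# c, h and n are inside the range; b and o are outside it.
in_range=$(count_matches '^[c-n]$' c h n b o)

# a and e are outside b-d, so only they match the negated set.
negated=$(count_matches '^[^b-d]$' a b c d e)

# Four spellings of yes match; no and yess do not.
yes_forms=$(count_matches '^[Yy][Ee][Ss]$' yes Yes YES yEs no yess)

# Only the first string has the ddd-dd-dddd layout.
ssn_like=$(count_matches '^[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]$' \
  123-45-6789 123-456-789 12-345-6789)

echo "$set_xyz $in_range $negated $yes_forms $ssn_like"
```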

    * The backslash \ escapes a special character, which means that character gets interpreted literally. 

       A "\$" reverts back to its literal meaning of "$", rather than its
       regexp meaning of end-of-line. Likewise a "\\" has the literal meaning
       of "\".

    * ( ) - treats the expression between ( and ) as a group and saves the characters matched by the expression into a temporary holding area. Up to nine such groups can be saved in a single regular expression. They can be referenced as \1 through \9. In tools that use basic regular expressions (grep and sed by default, vi), the ( and ) metacharacters have to be escaped like this: \( \). 

    * | - matches either the regexp before it or the regexp after it ("or"). 

<code>
       "him|her" matches "it belongs to him" and "it belongs to her"

       "(Memo|Report)20.\.txt" matches Memo201.txt, Report20a.txt, and
       Report209.txt; note use of grouping ().  Certain applications
       require the parens () to be escaped:  \( and \)

       $ w | grep "jchung\|clayton" # Note the "\|": without -E, GNU grep needs the | escaped.
</code>
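Escaping, grouping and alternation can be sketched with grep and sed. The input strings and variable names below are made up for illustration. The first three commands use basic regular expressions, where groups are written \( \); the last one uses grep -E, where ( ) and | are written without backslashes.

```shell
# Escaped $: match a literal dollar sign instead of end-of-line.
price_lines=$(printf '%s\n' 'cost: $5' 'cost: 5' | grep -c 'cost: \$5')

# Saved groups in the replacement: \1 and \2 recall the two words.
swapped=$(printf '%s\n' 'first second' \
  | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

# Saved group in the pattern itself: \1 must repeat the character
# that \(.\) matched, so only book (with its "oo") matches.
doubled=$(printf '%s\n' book desk | grep -c '\(.\)\1')

# Extended regexp: Memo201.txt and Report20a.txt match; Memo20.txt
# lacks the extra character and Note201.txt has the wrong prefix.
memo_hits=$(printf '%s\n' Memo201.txt Report20a.txt Memo20.txt Note201.txt \
  | grep -c -E '(Memo|Report)20.\.txt')

echo "$price_lines / $swapped / $doubled / $memo_hits"
```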

===== Extended regular expressions =====

    * Used in egrep (grep -E), awk, and Perl 

    * The question mark ? matches zero or one of the previous regexp. It is generally used for matching single characters. 

       im?ing matches swiing, swiming, but not swimming

    * The plus + matches one or more of the previous regexp. It serves a role similar to the *, but does not match zero occurrences. 

       9+ matches 9, 99, 999, but not 88

    * {i}, {i,j} - match exactly i instances, or between i and j instances, of the preceding character or regexp. 
      * In tools that use basic regular expressions (such as plain grep or sed), the {} chars have to be escaped with "\":  \{  \}

<code>
       A[0-9]{3} matches "A" followed by 3 digits (A123, and the A123 at the
                 start of A1234, but not A12 34).

       [0-9]{4,6} matches any sequence of 4, 5 or 6 digits
</code>
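The ?, + and {} examples above can be tried with grep -E. As before, this is only a sketch: count_matches_ere is a hypothetical helper written for this page, and the anchored {4,6} test and its input strings are additions for illustration.

```shell
# count_matches_ere (hypothetical helper): prints how many of the given
# lines contain a match for the extended regular expression in $1.
count_matches_ere() {
  pattern=$1
  shift
  printf '%s\n' "$@" | grep -c -E -e "$pattern" || true
}

# swiing and swiming match; swimming has one m too many.
optional=$(count_matches_ere 'im?ing' swiing swiming swimming)

# 9, 99 and 999 match; 88 contains no 9.
plus=$(count_matches_ere '9+' 9 99 999 88)

# A123 and A1234 contain A plus three digits; "A12 34" does not.
exact=$(count_matches_ere 'A[0-9]{3}' A123 A1234 'A12 34')

# Anchored, only the 4-digit and 6-digit strings match.
ranged=$(count_matches_ere '^[0-9]{4,6}$' 123 1234 123456 1234567)

# Basic regexp form of the same repeat count: braces escaped.
bre_exact=$(printf '%s\n' A123 A12 | grep -c 'A[0-9]\{3\}')

echo "$optional $plus $exact $ranged $bre_exact"
```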

----

===== Simple regexp examples using the %s (search and replace) command in vi =====

      :%s/  */ /g          Change 1 or more spaces into a single space.
      :%s/ *$//            Remove all spaces from the end of every line.
      :%s/^/ /             Insert a space at the beginning of every line.
      :%s/^[0-9][0-9]* //  Remove a leading number, and the space after it,
                           from every line that starts with one.
      :%s/b[aeio]g/bug/g   Change all occurrences of bag, beg, big, and bog to
                           bug.
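The same substitutions can be tried outside vi with sed, which accepts these patterns unchanged in its s command. This is a sketch with made-up input strings, not a replacement for the vi commands.

```shell
# One or more spaces become a single space.
squeezed=$(printf '%s\n' 'a   b    c' | sed 's/  */ /g')

# Trailing spaces are removed.
trimmed=$(printf '%s\n' 'end   ' | sed 's/ *$//')

# A space is inserted at the start of the line.
indented=$(printf '%s\n' 'line' | sed 's/^/ /')

# A leading number and the space after it are removed.
unnumbered=$(printf '%s\n' '42 answer' | sed 's/^[0-9][0-9]* //')

# bag, beg, big and bog all become bug.
bugged=$(printf '%s\n' 'bag beg big bog' | sed 's/b[aeio]g/bug/g')

printf '[%s]\n' "$squeezed" "$trimmed" "$indented" "$unnumbered" "$bugged"
```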

===== Medium regexp example using search and replace in vi =====

  * Change all instances of foo(a,b,c) to foo(b,a,c), where a, b, and c can be any parameters supplied to foo(). That is, we must be able to make changes like the following:

   Before                   After
   ------                   -----
   foo(10,7,2)              foo(7,10,2)
   foo(x+13,y-2,10)         foo(y-2,x+13,10)
   foo(bar(8),x+y+z,5)      foo(x+y+z,bar(8),5)
  
   The following substitution command will do the trick:
  
   :%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g
  
   [^,]  means any character which is not a comma.
  
   [^,]*  means 0 or more characters which are not commas.
  
   \([^,]*\)  using grouping \( \), tags the non-comma characters as \1 for use
   in the replacement part of the command.
  
   \([^,]*\),  means that we must match 0 or more non-comma characters
   which are followed by a comma. The non-comma characters are tagged.
  
   foo(\([^,]*\),  translates to "after you find foo(, tag all characters up to
   the next comma as \1".
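The substitution can be checked against the Before/After table by running the same pattern through sed. This is a sketch: swap_args is a hypothetical wrapper name chosen for this example.

```shell
# swap_args (hypothetical wrapper): applies the substitution from this
# section to standard input, swapping the first two arguments of foo().
swap_args() {
  sed 's/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g'
}

r1=$(printf '%s\n' 'foo(10,7,2)' | swap_args)
r2=$(printf '%s\n' 'foo(x+13,y-2,10)' | swap_args)
r3=$(printf '%s\n' 'foo(bar(8),x+y+z,5)' | swap_args)

printf '%s\n' "$r1" "$r2" "$r3"
```

Because the pattern splits the arguments at the first two commas, an argument that itself contains a comma would be split in the wrong place.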

----