
Text Processing Utilities

Examples Setup

Run the following command to begin setting up a subdirectory structure for the text processing utility examples:

for topic in grep sed sort uniq awk tr; do echo mkdir -p ~/cs370/examples/text/$topic; done # | bash

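As written, the loop only echoes the mkdir commands so that you can check them first. Remove the trailing # to pipe (|) the mkdir commands to bash, which then runs them.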

Note: The above directories could have been created without a for loop, using shell brace expansion:

mkdir -p ~/cs370/examples/text/{grep,sed,sort,uniq,awk,tr}


grep - print lines matching a pattern

     grep [options] PATTERN [FILE...]
     grep [options] [-e PATTERN | -f FILE] [FILE...]

Basic grep

$ cat grepfile        # see grepfile contents

Well you know it's your bedtime,
So turn off the light,
Say all your prayers and then,
Oh you sleepy young heads dream of wonderful things,
Beautiful mermaids will swim through the sea,
And you will be swimming there too.

$ grep the grepfile     # look for pattern "the" in grepfile

So turn off the light,
Say all your prayers and then,
Beautiful mermaids will swim through the sea,
And you will be swimming there too.

$ cat grepfile | grep the       # pipe grepfile to grep

So turn off the light,
Say all your prayers and then,
Beautiful mermaids will swim through the sea,
And you will be swimming there too.

# look for whole word "the" in grepfile and number lines found
$ grep -wn the grepfile

2:So turn off the light,
5:Beautiful mermaids will swim through the sea,

# look for lines without "the", number lines
$ grep -wnv the grepfile

1:Well you know it's your bedtime,
3:Say all your prayers and then,
4:Oh you sleepy young heads dream of wonderful things,
6:And you will be swimming there too.

Regular expressions with grep

$ grep .nd grepfile

Say all your prayers and then,
Oh you sleepy young heads dream of wonderful things,
And you will be swimming there too.

$ grep ^.nd grepfile

And you will be swimming there too.

$ grep sw.*ng grepfile

And you will be swimming there too.

$ grep [A-D] grepfile 

Beautiful mermaids will swim through the sea,
And you will be swimming there too.

$ grep "\." grepfile

And you will be swimming there too.

$ grep a. grepfile  

Say all your prayers and then,
Oh you sleepy young heads dream of wonderful things,
Beautiful mermaids will swim through the sea,

$ grep a.$ grepfile

Beautiful mermaids will swim through the sea,

$ grep [a-m]nd grepfile

Say all your prayers and then,

$ grep [^a-m]nd grepfile

Oh you sleepy young heads dream of wonderful things,
And you will be swimming there too.

$ egrep s.+w grepfile

Oh you sleepy young heads dream of wonderful things,
Beautiful mermaids will swim through the sea,

$ egrep "off|will" grepfile

So turn off the light,
Beautiful mermaids will swim through the sea,
And you will be swimming there too.

$ egrep im*ing grepfile    

And you will be swimming there too.

$ egrep im?ing grepfile

# No output. Why does this pattern match nothing?
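
One way to investigate the question is to try several quantifiers against the single word "swimming". The sketch below is only an illustration: it uses grep -E (the current spelling of egrep), and the one-word input and the extra patterns im+ing and imm?ing are assumptions added for comparison, not part of the original examples.

```shell
# Compare the extended-regex quantifiers ?, * and + on the word "swimming",
# which has two m's between the first i and "ing".
word=swimming

for pattern in 'im?ing' 'im*ing' 'im+ing' 'imm?ing'; do
    if printf '%s\n' "$word" | grep -Eq "$pattern"; then
        echo "$pattern matches $word"
    else
        echo "$pattern does not match $word"
    fi
done
```

Quoting each pattern also keeps the shell from treating ?, * or [ ] as filename wildcards before grep sees them.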

grep pattern match context options

# -C 1 option below means grep will show 1 line above
# and 1 line below the matching lines:
$ grep -C 1 sleepy grepfile

Say all your prayers and then,
Oh you sleepy young heads dream of wonderful things,
Beautiful mermaids will swim through the sea, 

# -A 2 option means show up to 2 lines AFTER the matching lines:
$ grep -A 2 sleepy grepfile

Oh you sleepy young heads dream of wonderful things,
Beautiful mermaids will swim through the sea,
And you will be swimming there too.

# -B 2 option means show up to 2 lines BEFORE the matching lines:
$ grep -B 2 sleepy grepfile

So turn off the light,
Say all your prayers and then,
Oh you sleepy young heads dream of wonderful things,

sed - Stream EDitor

     sed [ -e command ] [ -f scriptfile ] { fileName }

The sed commands

Substituting Text

# The sed input file:
$ cat fiction

The lone monarch butterfly flew flutteringly through
the cemetery, dancing on and glancing against headstone
after headstone before alighting atop Willie Mitchell's
already lowered casket, causing gasps of awe to fly
from the open mouths of five or six lingering mourners,
until a big shovelful of dirt landed on it and it died.

$ sed 's/^/ /' fiction > fiction.indented

# contents of 'fiction' indented by one space:
$ cat fiction.indented

 The lone monarch butterfly flew flutteringly through
 the cemetery, dancing on and glancing against headstone
 after headstone before alighting atop Willie Mitchell's
 already lowered casket, causing gasps of awe to fly
 from the open mouths of five or six lingering mourners,
 until a big shovelful of dirt landed on it and it died.

$ sed 's/^ *//' fiction.indented     # removes leading spaces

# To insert the indentations directly into 'fiction' means
# doing an "in-place" edit of 'fiction', using sed's '-i' option:
$ sed -i 's/^/ /' fiction
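
Because an in-place edit overwrites the input file, it can be safer to keep a backup. The following is a minimal sketch, assuming a sed that accepts a suffix attached to -i (GNU and BSD sed both do); the file name sample.txt and the temporary directory are made up for the illustration.

```shell
# Work on a throwaway copy so that nothing real is modified.
workdir=$(mktemp -d)
printf 'first line\nsecond line\n' > "$workdir/sample.txt"

# -i.bak edits sample.txt in place and keeps the original as sample.txt.bak
sed -i.bak 's/^/ /' "$workdir/sample.txt"

cat "$workdir/sample.txt"        # the indented version
cat "$workdir/sample.txt.bak"    # the untouched original
```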

Deleting Text

$ sed '/a/d' fiction      # remove all lines containing char 'a'.

from the open mouths of five or six lingering mourners,

$ sed '/\<a\>/d' fiction  # remove lines containing the word 'a'.

The lone monarch butterfly flew flutteringly through
the cemetery, dancing on and glancing against headstone
after headstone before alighting atop Willie Mitchell's
already lowered casket, causing gasps of awe to fly
from the open mouths of five or six lingering mourners,

Appending/Inserting Text

# Sed accepts sed scripts with the '-f' option;
# sed5 is a sed script containing sed commands;
# It will insert 2 lines at line 1:
$ cat sed5

1i\
Copyright 2002 Joe Chung\
All rights reserved\

$ sed -f sed5 fiction

Copyright 2002 Joe Chung
All rights reserved

The lone monarch butterfly flew flutteringly through
the cemetery, dancing on and glancing against headstone
after headstone before alighting atop Willie Mitchell's
already lowered casket, causing gasps of awe to fly
from the open mouths of five or six lingering mourners,
until a big shovelful of dirt landed on it and it died.

Append text after a line that contains pattern with
   sed '/pattern/a line of text here' filename

Insert text before a line that contains pattern with
   sed '/pattern/i line of text here' filename

Examples of appending and inserting a line of text:

$ cat test

$ sed '/option/a append text here' test
append text here

$ sed '/option/i insert text here' test
insert text here
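
The two commands can be tried on a small file. This is a sketch under stated assumptions: the file test.txt and its two lines are hypothetical, and the commands are written in the multi-line a\ and i\ form, which should be accepted by sed implementations that do not support the one-line form shown above.

```shell
workdir=$(mktemp -d)
printf 'first option\nsecond setting\n' > "$workdir/test.txt"   # hypothetical input

# a\ appends the text after every line matching /option/
appended=$(sed '/option/a\
append text here' "$workdir/test.txt")

# i\ inserts the text before every line matching /option/
inserted=$(sed '/option/i\
insert text here' "$workdir/test.txt")

printf '%s\n---\n%s\n' "$appended" "$inserted"
```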

Replacing (Changing) Text

# Another sed script, containing a sed change text directive:
$ cat sed6

1,3c\
Lines 1-3 are censored.

$ sed -f sed6 fiction

Lines 1-3 are censored.
already lowered casket, causing gasps of awe to fly
from the open mouths of five or six lingering mourners,
until a big shovelful of dirt landed on it and it died.

# Another sed script, with a separate change text directive for each line:
$ cat sed7

1c\
Line 1 is censored.
2c\
Line 2 is obfuscated.
3c\
Line 3 is kaput.

$ sed -f sed7 fiction

Line 1 is censored.
Line 2 is obfuscated.
Line 3 is kaput.
already lowered casket, causing gasps of awe to fly
from the open mouths of five or six lingering mourners,
until a big shovelful of dirt landed on it and it died.

Inserting files

# We want to insert a file called 'fin' using sed:
$ cat fin

The End

# Direct sed to insert 'fin' at end of 'fiction'
$ sed '$r fin' fiction

The lone monarch butterfly flew flutteringly through
the cemetery, dancing on and glancing against headstone
after headstone before alighting atop Willie Mitchell's
already lowered casket, causing gasps of awe to fly
from the open mouths of five or six lingering mourners,
until a big shovelful of dirt landed on it and it died.
The End

Multiple sed Commands

# Use sed's '-e' option to perform multiple sed operations
# per line:
$ sed -e 's/^/<< /' -e 's/$/ >>/' fiction

<< The lone monarch butterfly flew flutteringly through >>
<< the cemetery, dancing on and glancing against headstone >>
<< after headstone before alighting atop Willie Mitchell's >>
<< already lowered casket, causing gasps of awe to fly >>
<< from the open mouths of five or six lingering mourners, >>
<< until a big shovelful of dirt landed on it and it died. >>

sort - sort lines of text files or stdin

     sort [OPTION]... [FILE]...

# Sort input file:
$ cat sortfile

jan  Start chapter 3  10th
Jan  Start chapter 1  30th
Jan  Start chapter 5  23rd
Jan  End chapter 3  23rd
Mar  Start chapter 7  27
may  End chapter 7  17th
Apr  End Chapter 5  1
Feb  End chapter 5  14

$ sort sortfile         

Apr  End Chapter 5  1
Feb  End chapter 5  14
Jan  End chapter 3  23rd
Jan  Start chapter 1  30th
jan  Start chapter 3  10th
Jan  Start chapter 5  23rd
Mar  Start chapter 7  27
may  End chapter 7  17th

# Force reverse or descending sort:
$ sort -r sortfile      

may  End chapter 7  17th
Mar  Start chapter 7  27
Jan  Start chapter 5  23rd
jan  Start chapter 3  10th
Jan  Start chapter 1  30th
Jan  End chapter 3  23rd
Feb  End chapter 5  14
Apr  End Chapter 5  1

# Sort on the 1st field only (start at field 1, end at field 1);
# alternatively: sort --key=1,1 sortfile.
# The obsolete form "sort +0 -1 sortfile" is no longer accepted by default in current GNU sort:
$ sort -k 1,1 sortfile

Apr  End Chapter 5  1
Feb  End chapter 5  14
jan  Start chapter 3  10th
Jan  End chapter 3  23rd
Jan  Start chapter 1  30th
Jan  Start chapter 5  23rd
Mar  Start chapter 7  27
may  End chapter 7  17th

# Sort by month name (M) in the 1st field:
$ sort -k 1,1M sortfile

Jan  End chapter 3  23rd
Jan  Start chapter 1  30th
jan  Start chapter 3  10th
Jan  Start chapter 5  23rd
Feb  End chapter 5  14
Mar  Start chapter 7  27
Apr  End Chapter 5  1
may  End chapter 7  17th

# Sort by the 5th (last) field numerically;
# alternatively: sort --key=5 -n sortfile
$ sort -k 5,5n sortfile

Apr  End Chapter 5  1
jan  Start chapter 3  10th
Feb  End chapter 5  14
may  End chapter 7  17th
Jan  End chapter 3  23rd
Jan  Start chapter 5  23rd
Mar  Start chapter 7  27
Jan  Start chapter 1  30th
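
Keys can also be combined. The sketch below is an illustration rather than part of the original examples: it sorts the same data by month and then by day, and it assumes a sort that supports the M and n key modifiers (GNU sort does). LC_ALL=C is set so that the month names are read as English abbreviations.

```shell
workdir=$(mktemp -d)
cat > "$workdir/sortfile" <<'EOF'
jan  Start chapter 3  10th
Jan  Start chapter 1  30th
Jan  Start chapter 5  23rd
Jan  End chapter 3  23rd
Mar  Start chapter 7  27
may  End chapter 7  17th
Apr  End Chapter 5  1
Feb  End chapter 5  14
EOF

# Primary key: field 1 compared as a month name (M).
# Secondary key: field 5 compared as a number (n), so 10th sorts before 23rd.
sorted=$(LC_ALL=C sort -k 1,1M -k 5,5n "$workdir/sortfile")
printf '%s\n' "$sorted"
```

Lines that tie on both keys (the two entries for the 23rd) are ordered by comparing the whole line.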

uniq - remove duplicate lines from a sorted file or stdin

     uniq [OPTION]... [INPUT [OUTPUT]]

# Input file for uniq:
$ cat animals

cat snake
monkey snake
dolphin elephant
dolphin elephant
goat elephant
pig pig
pig pig
monkey pig

# Default mode collapses each run of adjacent duplicate lines into one line:
$ uniq animals

cat snake
monkey snake
dolphin elephant
goat elephant
pig pig
monkey pig

# Prefix each line with the number of times it occurred consecutively:
$ uniq -c animals

  1 cat snake
  1 monkey snake
  2 dolphin elephant
  1 goat elephant
  2 pig pig
  1 monkey pig

# Ignore the first field of each line when
# looking for duplicates (-f 1; older versions spelled this -1):
$ uniq -f 1 animals

cat snake
dolphin elephant
pig pig
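
uniq compares only neighbouring lines, which is why its input is normally sorted first. A small sketch with made-up input (the pig/cat lines are an assumption for the illustration) shows the difference:

```shell
# uniq alone: the two "pig" lines are not adjacent, so both survive.
adjacent_only=$(printf 'pig\ncat\npig\n' | uniq)

# Sort first so that duplicates become adjacent, then count them
# and list the most frequent line first.
counted=$(printf 'pig\ncat\npig\n' | sort | uniq -c | sort -rn)

printf '%s\n' "$adjacent_only"
printf '%s\n' "$counted"
```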

awk - pattern scanning and processing language

     $ awk -F "." '{ print "mkdir " $2 }'
     $ awk -F "." -f makedirs

     where makedirs contains

        { print "mkdir " $2 }

     Each awk command has the form:

       [ condition ] [ { action } ]

       condition can be:

          - special token BEGIN or END
          - expression using logical or relational operators and/or regular expression

       action is performed on every line of input that matches the
       condition and can be one or more C-like programming statements:

          - if (conditional) statement [ else statement ]
          - while (conditional) statement
          - for (expression; conditional; expression ) statement
          - break/continue
          - variable = expression
          - print [ list of expressions ] [ > expression ]
          - printf format [ , list of expressions ] [ > expression ]
          - next (skips the remaining patterns on the current line of input)
          - exit (skips the rest of the input; an END action, if any, still runs)
          - { list of statements }
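
Several of these statements can be combined in one program. The following is a small sketch, not taken from the original examples; the two input lines and the variable names longwords and total are made up for the illustration.

```shell
summary=$(printf 'one two three\nfour five\n' |
awk '
    {
        for (i = 1; i <= NF; i++)        # loop over the fields of this line
            if (length($i) > 3)          # conditional inside the loop
                longwords++
        total += NF                      # variable = expression
    }
    END { printf "%d of %d words are longer than 3 characters\n", longwords, total }
')
echo "$summary"
```

The action without a condition runs on every input line, and the END action runs once after the last line has been read.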

Accessing individual fields of lines of text

# Say we have this input file:
$ cat float

Wish I was floating in blue across the sky,
My imagination is strong,
And I often visit the days
When everything seemed so clear.
Now I wonder what I'm doing here at all...

$ awk '{print NF, $0}' float

9 Wish I was floating in blue across the sky,
4 My imagination is strong,
6 And I often visit the days
5 When everything seemed so clear.
9 Now I wonder what I'm doing here at all...

# Awk fields are delimited using white space by default.

BEGIN and END conditions applied to lines of text

# Say that the file awk2 contains these awk statements:
$ cat awk2

BEGIN { print "Start of file" }
{ print $1 $3 $NF }
END { print "End of file" }

$ awk -f awk2 float

Start of file
Wishwassky,
Myisstrong,
Andoftendays
Whenseemedclear.
Nowwonderall...
End of file

# Equivalently, on the command line:
$ awk 'BEGIN { print "Start of file" } { print $1 $3 $NF } END { print "End of file" }' float

Logical operators in awk conditions

$ awk 'NR > 1 && NR < 4 { print NR, $1, $3, $NF }' float

2 My is strong,
3 And often days

$ awk '/t.+e/ { print $0 }' float

Wish I was floating in blue across the sky,
And I often visit the days
When everything seemed so clear.
Now I wonder what I'm doing here at all...

Awk condition ranges

$ awk '/strong/,/clear/ { print $0 }' float

My imagination is strong,
And I often visit the days
When everything seemed so clear.

Awk delimiters

# See contents of /etc/passwd (delimited file using : as the delimiter)
$ cat /etc/passwd

# Extract fields of /etc/passwd using awk:
$ awk -F ":" '{ print $1, $3, $NF }' /etc/passwd # 1st, 3rd and last fields

Using cut instead of awk
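
A minimal sketch of what this heading suggests, assuming a passwd-style record (the sample line below is made up): cut selects fields by position with -d and -f, and keeps the delimiter between the fields it prints.

```shell
line='root:x:0:0:root:/root:/bin/bash'   # hypothetical passwd-style record

# 1st and 3rd fields, using : as the delimiter
picked=$(printf '%s\n' "$line" | cut -d: -f1,3)
printf '%s\n' "$picked"

# cut has no counterpart to $NF, so awk is still the simpler way
# to get the last field when the number of fields is not known
last=$(printf '%s\n' "$line" | awk -F: '{ print $NF }')
printf '%s\n' "$last"
```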

tr - TRanslating Characters

     tr [-cds] string1 [string2]

# Input file:
$ cat go.cart
go cart

# Translating case: probably the most common use of tr
$ tr a-z A-Z < go.cart
GO CART

# Replace character ranges
$ tr a-c D-E < go.cart
go EDrt

# Replace every non-"a" (including the newline at the end) with "X"
$ tr -c a X < go.cart
XXXXaXXX

# Replace non-"a-z" with (new line)
# Could substitute '\n' for '\012'
$ tr -c a-z '\012' < go.cart
go
cart

# Just delete characters
$ tr -d a-c < go.cart
go rt
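
The -s (squeeze) option from the synopsis has no example above. Here is a short sketch with made-up input strings, shown as an illustration only:

```shell
# -s squeezes each run of a repeated character from the set into one
squeezed=$(printf 'go    cart\n' | tr -s ' ')
printf '%s\n' "$squeezed"

# -c and -s together: turn every run of non-letters into a single newline,
# which prints one word per line
words=$(printf 'go cart, go!\n' | tr -cs 'a-z' '\n')
printf '%s\n' "$words"
```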


1. nospace

Create a script nospace to look for filenames with spaces in them in the current directory and to rename those files, converting the spaces to _ (underscore).

In a separate nospace directory, use touch to create a bunch of files that have spaces in the file names:

mkdir nospace
cd nospace
touch "report one" "report two" "report three" "reports four and five"

Link to nospace code
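
One possible shape for such a script is sketched below. It is not the linked solution; it is a minimal version written as a shell function, and the choice to skip a file when the new name already exists is an assumption.

```shell
# nospace: rename files in the current directory, replacing spaces with "_"
nospace() {
    for name in *' '*; do
        [ -e "$name" ] || continue          # the glob matched nothing
        newname=$(printf '%s\n' "$name" | tr ' ' '_')
        if [ -e "$newname" ]; then
            echo "nospace: $newname already exists, skipping $name" >&2
            continue
        fi
        mv -- "$name" "$newname"            # -- protects names starting with -
    done
}
```

Run it from inside the nospace directory created above, for example with: cd nospace; nospace; ls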

2. wget pipeline

Download the following file using wget:

Write a pipeline to extract only the PPP ip address “” from this file. Incorporate wget in the pipeline.

Complete the pipeline using sed, and later, awk.

Solution using sed:

# wget: Quiet (-q) wget output while sending fetched modem.out to stdout (-O -)
# grep: Match 1 line of modem.out containing "PPP"
# sed: Delete all information before the IP address
wget -q -O - |
grep PPP |
sed 's/.*PPP *//'

Solution using awk:

# wget: Quiet (-q) wget output while sending fetched modem.out to stdout (-O -)
# grep: Match 1 line of modem.out containing "PPP"
# awk: Extract IP address, which is the 5th field ($5) in the line,
#           IP Network Address   PPP    
wget -q -O - |
grep PPP |
awk '{print $5}'

3. randlines

Write a script randlines to randomize the order of lines in standard input. Here's a start:

# randlines: Randomize lines in standard input
# Uses $RANDOM shell variable (found at the Advanced BASH
# Shell Scripting Guide).

while read -r myline   # Read one line of stdin at a time.
do
   echo "$RANDOM $myline"   # Prefix the line with a random number.
done

Using either the head or tail command, create a variant of randlines that outputs one line at random from standard input.

Note: We are just re-implementing the functionality of the shuf command which randomizes lines of files and stdin.
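
One way to finish the idea is to sort on the random prefix and then cut it off again. The sketch below is an assumption about how the pieces fit together, not the course solution; it relies on bash's $RANDOM, and the function names randlines and randline are chosen for the illustration.

```shell
# randlines: decorate each line with a random number, sort on it, undecorate
randlines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$RANDOM" "$line"
    done |
    sort -n |               # order by the random prefix
    cut -d ' ' -f 2-        # drop the prefix, keep the rest of the line
}

# randline: one line at random is the first line of the shuffled input
randline() {
    randlines | head -n 1
}
```

For example: printf 'a\nb\nc\n' | randlines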

4. wordfreq

Create a script called wordfreq to print the number of occurrences of all words in a file or standard input. Output must be sorted descending by number of occurrences.

Sample output if input is

        738 the
        519 I
        508 to
        472 of
        434 and
        387 a
        305 in
        210 his
        204 that
        193 was
        191 my
        189 he
        169 you
        162 not
        150 with
        146 it
        141 me
        139 him
        121 Bartleby

We want wordfreq to be able to handle both STDIN and files given as arguments. So, it should be able to do something like

fortune | ./wordfreq      # wordfreq on STDIN

and also

./wordfreq input.txt      # wordfreq an input file

(and also)

./wordfreq input*.txt     # wordfreq multiple input files together

Link to wordfreq code
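
A possible pipeline for this exercise is sketched below as a shell function. It is not the linked solution. It assumes that a word is any run of ASCII letters (so "it's" counts as "it" and "s") and that case is kept, as in the sample output.

```shell
wordfreq() {
    cat -- "$@" |              # with no arguments, cat reads standard input
    tr -cs 'A-Za-z' '\n' |     # every run of non-letters becomes one newline
    sed '/^$/d' |              # drop the empty line left by leading punctuation
    sort |                     # bring identical words together for uniq
    uniq -c |                  # count each word
    sort -rn                   # most frequent first
}
```

Passing "$@" to cat is what lets the same code handle standard input, one file, or several files together.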

5. makeuserids

Study the cut text processing command. Apply cut to a file containing this list of names:

Wehman, John
Wehner, Monk
Weid, Kahn
Weigner, Ray
Weimann, Joseph
Weimmer, Nottingham
Weinberg, John
Weiner, Stephanie
Weiner, Joseph
Weinert, Molly
Weingarten, Joyce
Weinraub, John

Use cut to extract the first letters of the first names, convert to lower case, and write the letters to a file called firstinit.

       and so on

Use cut again to extract the first 7 letters of the last names, convert to lower case, and write to a file called lastname.

       and so on

Study the paste text processing utility. Use paste to paste firstinit and lastname together, eliminating any spaces.

       and so on

and redirect the result to a file called userids.

Write a script to perform the above tasks on an input file.

Link to code | (alternative version that uses process substitution)
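
The steps above can be sketched as a single function. This is an illustration and not the linked code; it assumes every input line has the form "Lastname, Firstname" and a paste that treats \0 as an empty delimiter (POSIX specifies this, and GNU paste supports it).

```shell
# makeuserids FILE: write firstinit, lastname and userids in the current directory
makeuserids() {
    names=$1
    # first initial: 2nd space-separated field, first character, lower-cased
    cut -d ' ' -f 2 "$names" | cut -c 1 | tr 'A-Z' 'a-z' > firstinit
    # last name: text before the comma, at most 7 characters, lower-cased
    cut -d , -f 1 "$names" | cut -c 1-7 | tr 'A-Z' 'a-z' > lastname
    # \0 asks paste for an empty delimiter, so the columns join with no space
    paste -d '\0' firstinit lastname > userids
}
```

With the list above saved in a file called names, running makeuserids names would turn "Wehman, John" into jwehman.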

6. grep context pipeline

Write a pipeline to turn the following input (saved in a file called 'servers'):

# comment blah
	  host MA-FXDWF-14
		   hardware ethernet 00:13:21:5C:11:16;

	  host MA-FXDWF-15
		   hardware ethernet 00:13:21:5D:12:17;

	  host MA-FXDWF-16
		   hardware ethernet 00:13:21:5E:13:18;

	  # repeats 4000 times

into this (for import into a spreadsheet):

grep -A 3 "host" servers | # find lines that contain "host", list 3 lines following each matching line
tr -d '\n' |               # delete new lines to put everything on one line
sed "s/--/\n/g" |          # insert a new line where "--" occurs ("--" separates the grep matches)
awk '{ print $2, $6, $8 }' | # print 2nd, 6th and 8th tokens, using default awk delimiter
tr -d ';' |                # delete semicolons
sed "s/ /???/g"            # replace single spaces with ???

6. roster processing

Download a class roster. Using sed search and replace operations, convert the raw roster file to a list with the following format:


The list would be even better if Lastname and Firstname were both lower case, like this:

awk -F ", " '{ print $1"-"$2":"$3 }' roster | # using ", " as delimiter, extract and print last"-"first":"id
sed "s/ [A-Z]\.//" |                          # search for and delete middle initials (space, uppercase letter, period)
tr A-Z a-z                                    # lowercase

7. webadvisor2roster

In the script webadvisor2roster take a roster from webadvisor and transform it into

Last, First [MI], ID

format, writing to the file roster.

Link to webadvisor2roster code

(SKIP) 8. randomseating (SKIP)

In the script randomseating, combine last names from a roster (see 7. above) and a seats file to randomize seating in HH 305.

Link to randomseating code

Link to randomseating-v2 code (preferred)

9. Sum the points in quiz1

Sum and display the total points in quiz1-answers.txt.

expression=$(cat quiz1-answers.txt | grep "(. point" | sed "s/[^0-9]//g" | tr '\n' '+' | sed "s/+$//")
answer=$(( expression ))
echo $answer


echo $(( $(cat quiz1-answers.txt | grep "(. point" | sed "s/[^0-9]//g" | tr '\n' '+' | sed "s/+$//") ))


# use bc, a command line calculator
echo $(cat quiz1-answers.txt | grep "(. point" | sed "s/[^0-9]//g" | tr '\n' '+' | sed "s/+$//") | bc

# pipeline breakdown
grep "(. point"      | # find lines that contain "(n point(s))"
sed "s/[^0-9]//g"    | # delete all non-digit chars, leaving only a column of numbers
tr '\n' '+'          | # put all on single line, separated by "+"
sed "s/+$//"           # delete last "+" at end
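
As a variation, awk can add up the column directly, so the numbers do not have to be joined with "+" first. This sketch uses a made-up quiz file; its name, its three questions, and the assumption that the point value is the only number on each line are all hypothetical.

```shell
workdir=$(mktemp -d)
cat > "$workdir/quiz.txt" <<'EOF'
(2 points) Name a stream editor.
(1 point) Which option makes grep print line numbers?
(3 points) Explain what uniq -c prints.
EOF

total=$(grep '(. point' "$workdir/quiz.txt" |      # lines with "(n point(s))"
        sed 's/[^0-9]//g' |                        # keep only the digits
        awk '{ sum += $1 } END { print sum }')     # add up the column
echo "$total"
```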

10. Sort a string

Sort the following string from a scavenger hunt challenge:


See the fold core text processing utility.

echo 22fl6abbz7yaabcdeezez99178 |
fold -w 1                       | # lines can be only 1 char wide (print string vertically)
sort                            |
tr -d '\n'                        # remove newlines to return to horizontal 

11. text2png

Write the text2png script that turns standard input into a large wallpaper-type image file.

This will be a fairly long shell script that demonstrates:

Link to text2png code | text2png_getopts (alternate version that uses getopts)