Subshells and Parallel Processing


$ mkdir -p ~/cs370/examples/subshells_parallel_processing
$ cd ~/cs370/examples/subshells_parallel_processing

Subshells

  • Definition: A subshell is a child process launched by a shell (or shell script).
  • Running a shell script launches a new process, a subshell.
  • A shell script can itself launch subprocesses (subshells).
    • These subshells let the script do parallel processing, in effect executing multiple subtasks simultaneously.
  • Example:
        #!/bin/bash
        # subshell-test.sh

        (
        # Inside parentheses, and therefore a subshell . . .
        while [ 1 ]   # Endless loop.
        do
          echo "Subshell running . . ."
        done
        )

        #  Script will run forever,
        #  or at least until terminated by a Ctrl-C.

        exit $?  # End of script (but will never get here).
  • Run the script: $ ./subshell-test.sh
  • And, while the script is running, from a different xterm: $ ps -ef | grep subshell-test.sh
        UID       PID   PPID  C STIME TTY      TIME     CMD
        500       2698  2502  0 14:26 pts/4    00:00:00 sh subshell-test.sh
        500       2699  2698 21 14:26 pts/4    00:00:24 sh subshell-test.sh

                  ^^^^
  • Analysis: PID 2698, the shell script, launched PID 2699, the subshell.
    • Note: The “UID …” line would be filtered out by the “grep” command, but is shown here for illustrative purposes.
  • In general, an external command in a script forks off a subprocess, whereas a Bash built-in command does not. (For this reason, built-ins execute more quickly and use fewer system resources than their external command equivalents.)
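  • A quick way to see whether a given command is a built-in or an external program is bash's type built-in (illustrative output; the path shown for find may differ on your system):
        $ type cd
        cd is a shell builtin
        $ type find
        find is /usr/bin/find
        $ type -t cd ; type -t find
        builtin
        file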

Command List within Parentheses

      ( command1; command2; command3; ... )
  • A command list (command chain) embedded between parentheses runs in a subshell.
  • Variable scope in a subshell
    • Variables defined in a subshell are NOT visible outside the subshell's block of code.
    • They are not accessible to the parent process, i.e., the shell that launched the subshell.
      • They are, in effect, variables local to the child process (see the short sketch after the example below).
  • Redirecting I/O to a subshell uses the “|” pipe operator, as in
      ls -al | ( command1; command2; ... ).
  • Example: run a group of echo and cat commands in a subshell if grep fails:
        grep -qf id_rsa.pub authorized_keys ||
        (
           echo
           echo "SSH public key (id_rsa.pub) contents not found in ~/.ssh/authorized_keys."
           echo "Appended id_rsa.pub to authorized_keys."
           echo
           cat id_rsa.pub >> authorized_keys
        )
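  • A minimal sketch of subshell variable scope (the script and variable names are made up for illustration):
        #!/bin/bash
        # subshell-scope.sh (illustrative sketch, not part of the course archive)

        outer=visible
        (
          inner=hidden        # Set only inside the subshell.
          echo "In subshell: outer=$outer, inner=$inner"
        )
        echo "In parent:   outer=$outer, inner=$inner"
        # The parent's echo prints an empty $inner:
        # the subshell's variables never propagate back to the parent.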

Subshell Sandboxing

  • A subshell may be used to set up a “dedicated environment” for a command group.
  • For example, you might want to suppress an environment variable temporarily:
      # temporarily unset $http_proxy variable before a wget
      $ (unset http_proxy; wget ...)
  • Or you might have a task that must run in a specific working directory, without the remainder of the script (or your interactive shell) being affected by that; see the short sketch after the example below.
  • Another example:
        COMMAND1
        COMMAND2
        COMMAND3
        # Start of subshell
        (
          IFS=:
          PATH=/bin
          unset TERMINFO
          set -C
          shift 5
          COMMAND4
          COMMAND5
          exit 3 # Only exits the subshell!
        )
        # The parent shell has not been affected, and the environment is preserved.
        COMMAND6
        COMMAND7
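  • A concrete sketch of the working-directory case mentioned above (the paths shown are just for illustration):
        $ pwd
        /home/user/cs370
        $ (cd /tmp && ls > /dev/null)    # The cd happens only inside the subshell.
        $ pwd
        /home/user/cs370                 # The parent shell's working directory is unchanged.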

Shell Parallel Processing

  • Also referred to as asynchronous execution.
  • Uses background processing and the bash wait shell built-in command.
  • The bash wait command stops script execution until all jobs running in the background have terminated, or, if a job specification or process ID is given as an argument, until that particular job terminates.
    • wait returns the exit status of the waited-for command.
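  • A minimal sketch of wait with a specific process ID (the sleep commands just stand in for real work):
        #!/bin/bash
        # wait-status.sh (illustrative sketch)

        (sleep 2; exit 7) &   # Background subshell that "fails" with status 7.
        pid=$!                # $! holds the PID of the most recent background job.

        (sleep 1) &           # A second background job.

        wait "$pid"           # Block until that specific PID terminates.
        echo "Job $pid exited with status $?"   # Prints 7.

        wait                  # Block until all remaining background jobs finish.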

Using Subshells

  • Process chains and pipelines can be made to execute in parallel within different subshells.
  • This permits breaking a complex task into subcomponents processed concurrently, though subshells aren't necessarily needed to do this.
  • Example: Running parallel processes in subshells
        (cat list1 list2 list3 | sort | uniq > list123) &
        (cat list4 list5 list6 | sort | uniq > list456) &
        # Merges and sorts both sets of lists simultaneously.
        # Running in background ensures parallel execution.
        #
        # Could also have been done without subshells:
        #   cat list1 list2 list3 | sort | uniq > list123 &
        #   cat list4 list5 list6 | sort | uniq > list456 &

        wait   # Don't execute the next command until subshells finish.

        diff list123 list456

Examples

Download and extract the parallel_shell_proc.tar.xz archive to try the following parallel processing examples.
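For example, assuming the archive has been downloaded into the current (examples) directory:

$ tar -xJf parallel_shell_proc.tar.xz
$ ls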

Using State Files or Logs

Use log or state files that are generated by parallel processes:

#!/bin/sh

# run_scripts.sh

# Spawn off two sub-scripts to perform some function.  This script
# will wait until they both signal that their job is complete by
# creating empty state files named sub_script1.done and sub_script2.done.

# First, make sure the ".done" files don't exist, in case there was an
# abrupt end and we didn't get a chance to clean up the files:
sub_script1_file="sub_script1.done"
sub_script2_file="sub_script2.done"

rm -f ${sub_script1_file} ${sub_script2_file}

# Launch the two scripts in the background w/ &:
./sub_script1.sh ${sub_script1_file} &
./sub_script2.sh ${sub_script2_file} &

# Every 10 seconds, check whether both sub-script state files have been
# created; keep waiting while either one is still missing:
while [ ! -e ${sub_script1_file} -o ! -e ${sub_script2_file} ] ; do
   sleep 10
done

# At this point, both sub-scripts are done, so clean up the .done
# files:
rm -f ${sub_script1_file} ${sub_script2_file}

sub_script1.sh (in the above example):

#!/bin/sh

# sub_script1.sh

# This script finds and writes a list of all the files in the /usr
# directory and all subdirectories:
echo "Running sub_script1.sh..."
find /usr -type f > sub_script1.data 2> /dev/null

# The above find command is finished. Now create an empty state file
# to signal we're done. We were given the path and the name of the
# file to create as a command-line argument ($1):
touch $1

echo "sub_script1.sh done"

sub_script2.sh (in the above example):

#!/bin/sh

# sub_script2.sh

# This script grabs the slashdot.org homepage:
echo "Running sub_script2.sh..."
wget --quiet http://slashdot.org -O sub_script2.data

# We're done, create the "done" state file that was given as a
# command-line argument ($1):
touch $1

echo "sub_script2.sh done"

However, the above example could have been greatly simplified just by using wait in bash.

#!/bin/bash

# run_scripts_wait.sh

# Spawn off two sub-scripts to perform some function.
# This script will wait until they both signal their job is complete

# Launch the scripts to create state files 1 & 2, though state files
# aren't actually needed for this version of the script:
./sub_script1.sh 1 &
./sub_script2.sh 2 &

# wait for both scripts to complete:
wait

# Clean up state files 1 & 2:
rm 1 2

Background Remote Programs

  • Simple distributed parallel processing in scripts
  • STDIN can be passed to remote programs through ssh

Example: wordcounts of large files

  • See the parallel-wordcount subdirectory.
  • Original script
    • Simple sequential processing of large1.txt, followed by large2.txt
        #!/bin/bash
        # sequentialwc: Simple sequential processing of large1.txt,
        # followed by large2.txt
        
        cat large1.txt | ./wordcount > large1.wc 
        cat large2.txt | ./wordcount > large2.wc

        
        Run time (localhost):
            $ time ./sequentialwc

            real    0m8.473s
  • Modified script
    • Local parallel processing of large1.txt and large2.txt
        #!/bin/bash
        # parallelwc: Simple local background parallel processing of large1.txt,
        # and large2.txt, with a wait statement.

        cat large1.txt | ./wordcount > large1.wc &
        cat large2.txt | ./wordcount > large2.wc &

        wait

        
        Run time (localhost):
            $ time ./parallelwc

            real    0m6.800s
  • Modified script
    • distribwc script to execute wordcount script on localhost plus a remotehost with two large input files
      • Remote parallel processing of large1.txt
        • large1.txt is processed on the remotehost (Linux machine), via ssh.
        • The “cat large1.txt | ssh $remotehost …” pipeline is put in the background, like a local process.
        • Contents of large1.txt are piped from localhost to remotehost's instance of the wordcount script via ssh.
          • Then, STDOUT from the remotehost's wordcount instance is redirected back to large1.wc on the localhost.
        #!/bin/bash
        # distribwc: script to execute the wordcount script on localhost
        # plus a remotehost with two large input files

        remotehost=$1
        # remotehost name, e.g. plato, rockhopper, csselin01, etc.

        workdir=$(pwd)
        # work directory where the wordcount script is located on both
        # localhost and remotehost; assumes wordcount script and input
        # files are in the same directory

        cat large1.txt | ssh $remotehost "$workdir/wordcount" > large1.wc &
        cat large2.txt | $workdir/wordcount > large2.wc &

        wait


        Run time (localhost + csselin09):
            $ time ./distribwc csselin09   # Ran on localhost, processed large1.txt on csselin09

            real    0m4.580s
