====== Subshells and Parallel Processing ======
----
* Some information taken from the Advanced Bash-Scripting Guide section on subshells (http://en.tldp.org/LDP/abs/html/subshells.html)
* (**Do in class**) Setup:
$ mkdir -p ~/cs370/examples/subshells_parallel_proc
$ cd ~/cs370/examples/subshells_parallel_proc
----
===== Subshells =====
* Definition: A subshell is a child process launched by a shell (or shell script).
* Running a shell script launches a new process, a subshell.
* A shell script can itself launch subprocesses (subshells).
* These subshells let the script do parallel processing, in effect executing multiple subtasks simultaneously.
* Example:
#!/bin/bash
# subshell-test.sh
(
# Inside parentheses, and therefore a subshell . . .
while [ 1 ] # Endless loop.
do
echo "Subshell running . . ."
done
)
# Script will run forever,
# or at least until terminated by a Ctrl-C.
exit $? # End of script (but will never get here).
* Run the script: $ ''./subshell-test.sh''
* And, while the script is running, from a different xterm: $ ''ps aux | grep subshell-test.sh''
UID PID PPID C STIME TTY TIME CMD
500 2698 2502 0 14:26 pts/4 00:00:00 sh subshell-test.sh
500 2699 2698 21 14:26 pts/4 00:00:24 sh subshell-test.sh
^^^^
* Analysis: PID 2698, the shell script, launched PID 2699, the subshell.
* Note: The "UID ..." line would be filtered out by the "grep" command, but is shown here for illustrative purposes.
* In general, an external command in a script forks off a subprocess, whereas a Bash built-in command does not. (For this reason, built-ins execute more quickly and use fewer system resources than their external command equivalents.)
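A quick way to see the built-in/external distinction is bash's ''type'' built-in, which reports how a command name will be resolved (a small sketch, run under bash):

```shell
# 'type -t' reports how bash resolves a command name.
# A built-in runs inside the current shell process, while an
# external command must be fork/exec'd as a child process.
builtin_kind=$(type -t echo)       # echo is a bash built-in
external_kind=$(type -t /bin/echo) # /bin/echo is an external file
echo "echo is a $builtin_kind; /bin/echo is a $external_kind"
```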
==== Command List within Parentheses ====
( command1; command2; command3; ... )
* A command list (command chain) embedded between parentheses runs in a subshell.
* Variable scope in a subshell
* Variables used in a subshell are NOT visible outside the block of code in the subshell.
* They are not accessible to the parent process, the shell that launched the subshell.
* These are, in effect, variables local to the child process.
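* A minimal sketch demonstrating subshell variable scope:
```shell
#!/bin/bash
outer="parent value"
(
    outer="changed in subshell"  # modifies only the subshell's copy
    inner="subshell only"        # exists only in the subshell
    echo "inside:  outer=$outer"
)
# Back in the parent shell, the subshell's changes are gone:
echo "outside: outer=$outer"
echo "outside: inner=${inner:-<unset>}"
```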
* Redirecting I/O to a subshell uses the "|" pipe operator, as in
ls -al | ( command1; command2; ... ).
* Example: Last section of [[https://cssegit.monmouth.edu/jchung/csse370repo/-/blob/main/scripts/ssh_setup.sh|ssh_setup.sh]]
# Run group of echo and cat commands through a subshell if grep fails:
grep -qf id_rsa.pub authorized_keys ||
(
echo
echo "SSH public key (id_rsa.pub) contents not found in ~/.ssh/authorized_keys."
echo "Appended id_rsa.pub to authorized_keys."
echo
cat id_rsa.pub >> authorized_keys
)
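The ''||'' operator runs the parenthesized command group only when the command before it fails (returns a nonzero exit status), which is how the grep above acts as a guard. A small sketch of the same pattern, with ''false''/''true'' standing in for a failing and a succeeding grep:

```shell
#!/bin/bash
# 'false' always fails, so the group after || runs;
# 'true' always succeeds, so its group is skipped.
out1=$(false || ( echo "group ran" ))
out2=$(true  || ( echo "group ran" ))
echo "after false: $out1"
echo "after true:  ${out2:-<nothing>}"
```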
==== Subshell Sandboxing ====
* A subshell may be used to set up a "dedicated environment" for a command group.
* For example, you might want to suppress an environment variable temporarily:
# temporarily unset $http_proxy variable before a wget
$ (unset http_proxy; wget ...)
* Or you might have a task that must run in a specific working directory, and you want the remainder of the script (or your interactive shell) not to be affected by that.
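* A sketch of that working-directory case: a ''cd'' inside a subshell leaves the parent shell's working directory untouched.
```shell
#!/bin/bash
start_dir=$(pwd)
# Do the directory-specific work inside a subshell:
(
    cd /tmp || exit 1
    echo "working in: $(pwd)"
)
# The parent shell's working directory is unaffected:
end_dir=$(pwd)
echo "still in: $end_dir"
```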
* Another example:
COMMAND1
COMMAND2
COMMAND3
# Start of subshell
(
IFS=:
PATH=/bin
unset TERMINFO
set -C
shift 5
COMMAND4
COMMAND5
exit 3 # Only exits the subshell!
)
# The parent shell has not been affected, and the environment is preserved.
COMMAND6
COMMAND7
----
===== Shell Parallel Processing =====
* Also referred to as asynchronous execution.
* Utilize background processing and the bash //wait// shell built-in command.
* The bash //wait// command pauses script execution until all jobs running in the background have terminated, or until the job number or process ID given as an argument to //wait// terminates.
* //wait// returns the exit status of the waited-for command.
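* For example, capturing a background job's PID with ''$!'' and passing it to //wait// retrieves that job's exit status (the status 7 below is arbitrary, chosen just to make the result visible):
```shell
#!/bin/bash
# Launch a background subshell that exits with status 7:
(sleep 1; exit 7) &
job_pid=$!        # $! holds the PID of the last background job
wait $job_pid     # blocks until that job terminates
status=$?         # wait's exit status is the job's exit status
echo "background job $job_pid exited with status $status"
```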
==== Using Subshells ====
* Process chains and pipelines can be made to execute in parallel within different subshells.
* This permits breaking a complex task into subcomponents processed concurrently, though subshells aren't necessarily needed to do this.
* Example: Running parallel processes in subshells
(cat list1 list2 list3 | sort | uniq > list123) &
(cat list4 list5 list6 | sort | uniq > list456) &
# Merges and sorts both sets of lists simultaneously.
# Running in background ensures parallel execution.
#
# Could also have been done without subshells:
# cat list1 list2 list3 | sort | uniq > list123 &
# cat list4 list5 list6 | sort | uniq > list456 &
wait # Don't execute the next command until subshells finish.
diff list123 list456
----
==== (**Do in class**) Examples ====
Download and extract the [[https://piazza.com/class_profile/get_resource/lwd125nsggo6gv/lxtdx4sloji7bp|parallel_shell_proc.tar.xz]] archive to try the following parallel processing examples. You can also download the archive with the command
wget https://plato.monmouth.edu/~jchung/download/parallel_shell_proc.tar.xz
==== Using State Files or Logs ====
Use log or state files that are generated by parallel processes:
#!/bin/sh
# run_scripts.sh
# Spawn off two sub-scripts to perform some function. This script
# will wait until they both signal their job is complete by creating
# empty state files named sub_script1.done and sub_script2.done.
# First, make sure the ".done" files don't exist, in case there was
# an abrupt end and we didn't get a chance to clean up the files:
sub_script1_file="sub_script1.done"
sub_script2_file="sub_script2.done"
rm -f ${sub_script1_file} ${sub_script2_file}
# Launch the two scripts in the background w/ &:
./sub_script1.sh ${sub_script1_file} &
./sub_script2.sh ${sub_script2_file} &
# Every 10 seconds, check whether both subscript state files have been
# created:
while [ ! -e ${sub_script1_file} -o ! -e ${sub_script2_file} ] ; do
sleep 10
done
# At this point, both sub-scripts are done, so clean up the .done
# files:
rm -f ${sub_script1_file} ${sub_script2_file}
sub_script1.sh (in the above example):
#!/bin/sh
# sub_script1.sh
# This script finds and writes a list of all the files in the /usr
# directory and all subdirectories:
echo "Running sub_script1.sh..."
find /usr -type f > sub_script1.data 2> /dev/null
# The above find command is finished. Now create an empty state file
# to signal we're done. We were given the path and the name of the
# file to create as a command-line argument ($1):
touch $1
echo "sub_script1.sh done"
sub_script2.sh (in the above example):
#!/bin/sh
# sub_script2.sh
# This script grabs the slashdot.org homepage:
echo "Running sub_script2.sh..."
wget --quiet http://slashdot.org -O sub_script2.data
# We're done, create the "done" state file that was given as a
# command-line argument ($1):
touch $1
echo "sub_script2.sh done"
__However__, the above example could have been greatly simplified just by using //wait// in bash.
#!/bin/bash
# run_scripts_wait.sh
# Spawn off two sub-scripts to perform some function.
# This script will wait until they both signal their job is complete
# Launch the scripts to create state files 1 & 2, though state files
# aren't actually needed for this version of the script:
./sub_script1.sh 1 &
./sub_script2.sh 2 &
# wait for both scripts to complete:
wait
# Clean up state files 1 & 2:
rm 1 2
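A further refinement (a sketch, with ''sleep''/''exit'' stand-ins in place of the two sub-scripts): wait on each background PID individually, so the parent script can tell which sub-task, if any, failed.

```shell
#!/bin/bash
# Stand-ins for ./sub_script1.sh and ./sub_script2.sh:
(sleep 1; exit 0) &   # succeeds
pid1=$!
(sleep 1; exit 2) &   # fails with status 2
pid2=$!
# wait on each PID to collect each job's exit status:
wait $pid1; status1=$?
wait $pid2; status2=$?
echo "sub-task 1 exited $status1; sub-task 2 exited $status2"
```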
==== Background Remote Programs ====
* Simple distributed parallel processing in scripts
* Can be accomplished using ssh after [[cs_370_-_introduction_unix_fundamentals#secure_shell_ssh|password-less logins]] have been set up.
* STDIN can be passed to remote programs through ssh
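* The general shape is ''producer | ssh remotehost command > localfile'' (''remotehost'' here is a placeholder for any machine with password-less login set up). A sketch of the same pipeline shape, run locally so it works anywhere:
```shell
#!/bin/bash
# Remote form (placeholder hostname, requires password-less ssh):
#   cat bigfile | ssh remotehost "wc -w" > result.txt
# Local equivalent of the same pipeline shape:
result=$(echo "one two three" | wc -w)
echo "word count: $result"
```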
=== Example: wordcounts of large files ===
* See the ''parallel-wordcount'' subdirectory.
* Based on ''wordcount'' script developed in [[cs_370_-_text_processing_utilities#wordfreq|previous lab activity]]
* Original script
* Simple __sequential__ processing of ''large1.txt'', followed by ''large2.txt''
#!/bin/bash
# sequentialwc: Simple sequential processing of large1.txt,
# followed by large2.txt
cat large1.txt | ./wordcount > large1.wc
cat large2.txt | ./wordcount > large2.wc
Run time (localhost):
$ time ./sequentialwc
real 0m8.473s
* Modified script
* Local __parallel__ processing of ''large1.txt'' and ''large2.txt''
#!/bin/bash
# parallelwc: Simple local background parallel processing of large1.txt,
# and large2.txt, with a wait statement.
cat large1.txt | ./wordcount > large1.wc &
cat large2.txt | ./wordcount > large2.wc &
wait
Run time (localhost):
$ time ./parallelwc
real 0m6.800s
* Modified script
* ''distribwc'' script to execute the ''wordcount'' script on localhost plus a remotehost, with two large input files
* Remote parallel processing of ''large1.txt''
* ''large1.txt'' is processed on the remotehost (Linux machine), via ssh.
* The ''"cat large1.txt | ssh $remotehost ..."'' pipeline is put in the background, like a local process.
* Contents of ''large1.txt'' are piped from localhost to remotehost's instance of the ''wordcount'' script via ssh.
* Then, STDOUT from the remotehost's ''wordcount'' instance is redirected back to ''large1.wc'' on the localhost.
#!/bin/bash
# distribwc: script to execute the wordcount script on localhost
# plus a remotehost, with two large input files
remotehost=$1
# remotehost name, e.g. plato, rockhopper, csselin01, etc.
workdir=$(pwd)
# work directory where the wordcount script is located on both
# localhost and remotehost; assumes wordcount script and input
# files are in the same directory
cat large1.txt | ssh $remotehost "$workdir/wordcount" > large1.wc &
cat large2.txt | $workdir/wordcount > large2.wc &
wait
Run time (localhost + csselin09):
$ time ./distribwc csselin09 # Ran on localhost, processed large1.txt on csselin09
real 0m4.580s
----