====== Subshells and Parallel Processing ======

----

  * Some information taken from the Advanced Bash-Scripting Guide section on subshells (http://en.tldp.org/LDP/abs/html/subshells.html)
  * (**Do in class**) Setup:

  $ mkdir -p ~/cs370/examples/subshells_parallel_proc
  $ cd ~/cs370/examples/subshells_parallel_proc

----

===== Subshells =====

  * Definition: A subshell is a child process launched by a shell (or shell script).
    * Running a shell script launches a new process, a subshell.
    * A shell script can itself launch subprocesses (subshells).
    * These subshells let the script do parallel processing, in effect executing multiple subtasks simultaneously.
  * Example:

  #!/bin/bash
  # subshell-test.sh

  (
  # Inside parentheses, and therefore a subshell . . .
  while [ 1 ]   # Endless loop.
  do
    echo "Subshell running . . ."
  done
  )

  # Script will run forever,
  # or at least until terminated by a Ctl-C.

  exit $?  # End of script (but will never get here).

  * Run the script: $ ''./subshell-test.sh''
  * And, while the script is running, from a different xterm: $ ''ps -ef | grep subshell-test.sh''

  UID       PID   PPID  C STIME TTY      TIME     CMD
  500       2698  2502  0 14:26 pts/4    00:00:00 sh subshell-test.sh
  500       2699  2698 21 14:26 pts/4    00:00:24 sh subshell-test.sh
            ^^^^

  * Analysis: PID 2698, the shell script, launched PID 2699, the subshell.
  * Note: The "UID ..." header line would be filtered out by the "grep" command, but is shown here for illustrative purposes.
  * In general, an external command in a script forks off a subprocess, whereas a Bash built-in command does not. (For this reason, built-ins execute more quickly and use fewer system resources than their external command equivalents.)

==== Command List within Parentheses ====

  ( command1; command2; command3; ... )

  * A command list (command chain) embedded between parentheses runs in a subshell.
  * Variable scope in a subshell
    * Variables set or changed in a subshell are NOT visible outside the block of code in the subshell.
    * They are not accessible to the parent process, the shell that launched the subshell.
    * These are, in effect, variables local to the child process.
  * Redirecting I/O to a subshell uses the "|" pipe operator, as in ''ls -al | ( command1; command2; ... )''.
  * Example: Last section of [[https://cssegit.monmouth.edu/jchung/csse370repo/-/blob/main/scripts/ssh_setup.sh|ssh_setup.sh]]

  # Run group of echo and cat commands through a subshell if grep fails:
  grep -qf id_rsa.pub authorized_keys || (
    echo
    echo "SSH public key (id_rsa.pub) contents not found in ~/.ssh/authorized_keys."
    echo "Appended id_rsa.pub to authorized_keys."
    echo
    cat id_rsa.pub >> authorized_keys
  )

==== Subshell Sandboxing ====

  * A subshell may be used to set up a "dedicated environment" for a command group.
  * For example, you might want to suppress an environment variable temporarily:

  # temporarily unset $http_proxy variable before a wget
  $ (unset http_proxy; wget ...)

  * Or you might have a task that must run in a specific working directory, and you want the remainder of the script (or your interactive shell) not to be affected by that.
  * Another example:

  COMMAND1
  COMMAND2
  COMMAND3

  # Start of subshell
  (
    IFS=:
    PATH=/bin
    unset TERMINFO
    set -C
    shift 5
    COMMAND4
    COMMAND5
    exit 3 # Only exits the subshell!
  )

  # The parent shell has not been affected, and the environment is preserved.
  COMMAND6
  COMMAND7

----

===== Shell Parallel Processing =====

  * Also referred to as asynchronous execution.
  * Utilize background processing and the bash //wait// shell built-in command.
    * The bash //wait// command stops script execution until all jobs running in the background have terminated, or until the job number or process ID given as a //wait// argument terminates.
    * When given a job number or process ID, //wait// returns the exit status of the waited-for command; with no arguments it returns 0.

==== Using Subshells ====

  * Process chains and pipelines can be made to execute in parallel within different subshells.
    * This permits breaking a complex task into subcomponents processed concurrently, though subshells aren't necessarily needed to do this.
  * Example: Running parallel processes in subshells

  (cat list1 list2 list3 | sort | uniq > list123) &
  (cat list4 list5 list6 | sort | uniq > list456) &
  # Merges and sorts both sets of lists simultaneously.
  # Running in background ensures parallel execution.
  #
  # Could also have been done without subshells:
  # cat list1 list2 list3 | sort | uniq > list123 &
  # cat list4 list5 list6 | sort | uniq > list456 &

  wait # Don't execute the next command until subshells finish.

  diff list123 list456

----

==== (**Do in class**) Examples ====

Download and extract the [[https://piazza.com/class_profile/get_resource/lwd125nsggo6gv/lxtdx4sloji7bp|parallel_shell_proc.tar.xz]] archive to try the following parallel processing examples.

You can also download the archive with the command

  wget https://plato.monmouth.edu/~jchung/download/parallel_shell_proc.tar.xz

==== Using State Files or Logs ====

Use log or state files that are generated by parallel processes:

  #!/bin/sh
  # run_scripts.sh
  # Spawn off two sub-scripts to perform some function. This script
  # will wait until they both signal their job is complete by creating
  # an empty state file named sub_script1.done and sub_script2.done

  # First, make sure the ".done" files don't exist, in case there was an
  # abrupt end and we didn't get a chance to clean up the files:
  sub_script1_file="sub_script1.done"
  sub_script2_file="sub_script2.done"
  rm -f ${sub_script1_file} ${sub_script2_file}

  # Launch the two scripts in the background w/ &:
  ./sub_script1.sh ${sub_script1_file} &
  ./sub_script2.sh ${sub_script2_file} &

  # Every 10 seconds, check whether both sub-script state files have been
  # created; keep looping while either one is still missing:
  while [ ! -e ${sub_script1_file} ] || [ ! -e ${sub_script2_file} ] ; do
    sleep 10
  done

  # At this point, both sub-scripts are done, so clean up the .done
  # files:
  rm -f ${sub_script1_file} ${sub_script2_file}

sub_script1.sh (in the above example):

  #!/bin/sh
  # sub_script1.sh
  # This script finds and writes a list of all the files in the /usr
  # directory and all subdirectories:
  echo "Running sub_script1.sh..."
  find /usr -type f > sub_script1.data 2> /dev/null

  # The above find command is finished. Now create an empty state file
  # to signal we're done. We were given the path and the name of the
  # file to create as a command-line argument ($1):
  touch "$1"
  echo "sub_script1.sh done"

sub_script2.sh (in the above example):

  #!/bin/sh
  # sub_script2.sh
  # This script grabs the slashdot.org homepage:
  echo "Running sub_script2.sh..."
  wget --quiet http://slashdot.org -O sub_script2.data

  # We're done, create the "done" state file that was given as a
  # command-line argument ($1):
  touch "$1"
  echo "sub_script2.sh done"

__However__, the above example could have been greatly simplified just by using //wait// in bash.

  #!/bin/bash
  # run_scripts_wait.sh
  # Spawn off two sub-scripts to perform some function.
  # This script will wait until both of them have finished.

  # Launch the scripts to create state files 1 & 2, though state files
  # aren't actually needed for this version of the script:
  ./sub_script1.sh 1 &
  ./sub_script2.sh 2 &

  # wait for both scripts to complete:
  wait

  # Clean up state files 1 & 2:
  rm -f 1 2

==== Background Remote Programs ====

  * Simple distributed parallel processing in scripts
  * Can be accomplished using ssh after [[cs_370_-_introduction_unix_fundamentals#secure_shell_ssh|password-less logins]] have been set up.
  * STDIN can be passed to remote programs through ssh

=== Example: wordcounts of large files ===

  * See the ''parallel-wordcount'' subdirectory.
  * Based on the ''wordcount'' script developed in a [[cs_370_-_text_processing_utilities#wordfreq|previous lab activity]]
  * Original script
    * Simple __sequential__ processing of ''large1.txt'', followed by ''large2.txt''

  #!/bin/bash
  # sequentialwc: Simple sequential processing of large1.txt,
  # followed by large2.txt
  cat large1.txt | ./wordcount > large1.wc
  cat large2.txt | ./wordcount > large2.wc

Run time (localhost):

  $ time ./sequentialwc
  real    0m8.473s

  * Modified script
    * Local __parallel__ processing of ''large1.txt'' and ''large2.txt''

  #!/bin/bash
  # parallelwc: Simple local background parallel processing of large1.txt,
  # and large2.txt, with a wait statement.
  cat large1.txt | ./wordcount > large1.wc &
  cat large2.txt | ./wordcount > large2.wc &
  wait

Run time (localhost):

  $ time ./parallelwc
  real    0m6.800s

  * Modified script
    * ''distribwc'' script to execute ''wordcount'' script on localhost plus a remotehost with two large input files
    * Remote parallel processing of ''large1.txt''
      * ''large1.txt'' is processed on the remotehost (Linux machine), via ssh.
      * The ''"cat large1.txt | ssh $remotehost ..."'' pipeline is put in the background, like a local process.
      * Contents of ''large1.txt'' are piped from localhost to remotehost's instance of the ''wordcount'' script via ssh.
      * Then, STDOUT from the remotehost's ''wordcount'' instance is redirected back to ''large1.wc'' on the localhost.

  #!/bin/bash
  # distribwc: script to execute wordcount script on localhost
  # plus a remotehost with two large input files
  remotehost=$1  # remotehost name, e.g. plato, rockhopper, csselin01, etc.
  workdir=$(pwd) # work directory where the wordcount script is located on both
                 # localhost and remotehost; assumes wordcount script and input
                 # files are in the same directory

  cat large1.txt | ssh "$remotehost" "$workdir/wordcount" > large1.wc &
  cat large2.txt | "$workdir/wordcount" > large2.wc &
  wait

Run time (localhost + csselin09):

  $ time ./distribwc csselin09  # Ran on localhost, processed large1.txt on csselin09
  real    0m4.580s

----
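A bare //wait//, as used in ''parallelwc'' and ''distribwc'', returns 0 even if one of the background jobs failed (for example, if the ssh connection was refused). The following is a minimal sketch, not part of the course archive, of how a script could record each background job's PID with ''$!'' and then collect each exit status with ''wait PID''. The script name ''wait_status.sh'', the variable names, and the ''sleep''/''exit'' stand-in jobs are assumptions made for illustration only.

```shell
#!/bin/bash
# wait_status.sh (hypothetical name): collect per-job exit statuses with wait.

# Two stand-in background jobs. In a script like distribwc these would be
# the ssh pipeline and the local pipeline.
( sleep 1; exit 0 ) &    # stand-in for a job that succeeds
pid1=$!                  # $! holds the PID of the most recent background job
( sleep 1; exit 3 ) &    # stand-in for a job that fails with status 3
pid2=$!

# "wait PID" blocks until that job ends and returns that job's exit status.
wait "$pid1"; status1=$?
wait "$pid2"; status2=$?

echo "job 1 exited with status $status1"
echo "job 2 exited with status $status2"

if [ "$status1" -ne 0 ] || [ "$status2" -ne 0 ]; then
    echo "at least one background job failed" >&2
fi
```

Both jobs still run in parallel, because both are started before the first //wait//. Waiting on them one at a time only affects the order in which their statuses are collected.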