Distributed computing


Material for the lesson

Video: Distributed computing
Video: Profiling and subprocess
Powerpoint: Distributed computing
Video: Exercises

Do the midterm evaluation. Important: do this during this week, Friday at the latest.

Exercises

Use the Queueing System (QS) to do distributed computing. Warning: The most difficult part of these exercises is actually using the Queueing System. The Python code itself is fairly easy, but getting the QS to work requires patience and experience. Check the QS examples in the PowerPoint from the last lecture.

1)
Return to last time's exercise: read a FASTA file, find the reverse complement of each entry, count the bases in the entry, put the counts in the header line, and save it all in one file. Now solve this using the method on slide 5, i.e. distributed programming in an embarrassingly parallel fashion.
Test your programs on the small-scale humantest.fsa file. When ready, try the real human.fsa.
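The per-entry computation that each worker performs can be sketched as below. This is only a minimal sketch under assumptions: the function names and the header layout (A:/C:/G:/T:/Other:) are made up for illustration, and it counts the bases of whatever sequence you pass in, so decide yourself whether the counts should describe the original entry or its reverse complement.

```python
from collections import Counter

# Translation table for complementing DNA. Symbols outside ACGT (e.g. N)
# are not in the table and are therefore left unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def base_counts(seq):
    """Count A, C, G and T (case-insensitive); anything else is lumped as 'other'."""
    counts = Counter(seq.upper())
    result = {base: counts[base] for base in "ACGT"}
    result["other"] = len(seq) - sum(result.values())
    return result


def annotate_header(header, seq):
    """Append base counts to a FASTA header line (hypothetical layout)."""
    c = base_counts(seq)
    return f"{header} A:{c['A']} C:{c['C']} G:{c['G']} T:{c['T']} Other:{c['other']}"
```

A worker would read its single entry, join the sequence lines, call these functions and write the annotated header plus the reverse complement to its output file.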

You have to make several programs: the administrator, the worker and the collector.

The administrator splits the original input FASTA file into several pieces (one FASTA sequence per piece) and submits one job per piece (the worker) with the relevant file piece as input. The worker reads a file with one FASTA sequence (file name given), computes the reverse complement and the base counts, and writes the result to a file (file name given). The collector gathers all the result pieces and puts them together in the original order in one file. You run the collector yourself after the worker jobs have finished. The structure of the administrator is like

foreach fastasequence in inputfile
    save fastasequence in file.x
    submit job with file.x

By naming/numbering the files in some systematic way, it is easier to collect them afterwards. Realize that you can test your code without using the QS, by simply running the worker directly. Also understand that this is an exercise in using the Queueing System, not in simple programming.
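The administrator loop above could be sketched in Python as follows. This is a sketch, not a reference solution. It assumes that the Queueing System is Slurm (so that jobs are submitted with sbatch), that a hypothetical script worker.py takes an input file and an output file as arguments, and that the pieces are named piece_000000.fsa, piece_000001.fsa and so on. All of these names are illustrative.

```python
import shlex
import subprocess
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each FASTA entry to its own numbered file; return the paths in input order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces = []
    out = None
    with open(fasta_path) as infile:
        for line in infile:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Zero-padded numbers sort in the original order, which
                # makes the collector's job easy.
                piece = out_dir / f"piece_{len(pieces):06d}.fsa"
                pieces.append(piece)
                out = open(piece, "w")
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return pieces


def submit_worker(piece, result, job_name="revcomp"):
    """Submit one worker job with sbatch and return its job id as a string.

    Assumes Slurm and a hypothetical worker.py in the current directory.
    """
    command = shlex.join(["python3", "worker.py", str(piece), str(result)])
    cmd = ["sbatch", "--parsable", f"--job-name={job_name}", f"--wrap={command}"]
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # With --parsable, sbatch prints "jobid" or "jobid;cluster".
    return done.stdout.strip().split(";")[0]
```

Because split_fasta does not touch the queue, it can be tested on humantest.fsa on its own, and the worker can be run directly on one of the pieces before any job is submitted.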


2)
Make the administrator and collector into one program.

foreach fastasequence in inputfile
    save fastasequence in file.x
    submit job with file.x
wait for all jobs to finish
collect data

This exercise is more difficult, so do number 1 first. You need to find a way of waiting for your worker jobs to finish before you start collecting. Once that works, you can see whether this distributed method is faster than last week's sequential method.

Some ideas for waiting until the jobs are done:

  • Waiting and checking for all output files to appear. Cons: If a worker job breaks during execution, you wait forever since the output file never appears. If the output file is big, the worker might not have finished writing to it before collection starts. A trick to avoid this is for the worker to create an extra empty file at the end of the job, and for you to check for the presence of that file instead. An alternative is to write the output under a temporary filename and rename it to the correct name when writing is done.
  • Using squeue to check that the jobs are gone from the queue. You need a way to recognize your jobs. That can be to name them, to save the job ID you get when you submit the job, or perhaps just to show your own jobs (not everybody's). When the list is empty, you are done. Cons: If you check right after submitting, the jobs might not have had time to show up in the queue, misleading you to think you are done. If you just use the "your own jobs" method, you can only run one main job at a time, i.e. compute on one project only.
  • Using sacct to show completed jobs. When all your jobs are listed there as completed, they are done. Check the State column, since sacct also lists jobs that are still pending or running, as well as jobs that failed.
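Two of the ideas above can be sketched as follows. This is a hedged sketch with made-up function names. write_atomically is the rename trick for the worker, wait_for_files is a polling loop with a timeout so that a broken worker cannot make you wait forever, and queued_job_ids assumes that the Queueing System is Slurm and that all workers were submitted under one job name.

```python
import getpass
import os
import subprocess
import time
from pathlib import Path


def write_atomically(path, text):
    """Write under a temporary name, then rename to the final name."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as out:
        out.write(text)
    # The rename is atomic when both names are on the same POSIX filesystem;
    # on a shared cluster filesystem other nodes may see it with some delay.
    os.replace(tmp, path)


def wait_for_files(paths, timeout=3600.0, poll=5.0):
    """Wait until every path exists; return the paths still missing at timeout."""
    deadline = time.monotonic() + timeout
    missing = [Path(p) for p in paths]
    while True:
        missing = [p for p in missing if not p.exists()]
        if not missing or time.monotonic() >= deadline:
            return missing
        time.sleep(poll)


def queued_job_ids(job_name):
    """Return ids of your own jobs with this name that squeue still lists (Slurm assumed)."""
    cmd = ["squeue", "--noheader", f"--user={getpass.getuser()}",
           f"--name={job_name}", "--format=%i"]
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout.split()
```

An empty list from wait_for_files means that all result files are present. A non-empty list names the pieces whose workers probably failed, which is more useful than waiting forever.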