Distributed computing and Coding against databases: Difference between pages

From 22112
{| width=500  style="float:right; margin-left: 10px; margin-top: -56px;"
|Previous: [[Queueing System]]
|Next: [[What affects performance]]
|}
== Material for the lesson ==
Video: [https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=a67ca717-9598-4169-9b52-af2701246519 Distributed computing]<br>
Video: [https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=e99121a6-df37-4d45-9891-af2701243c8a Profiling and subprocess]<br>
Powerpoint: [https://teaching.healthtech.dtu.dk/material/22112/22112_06-Distributed.ppt Distributed computing]<br>
Video: [https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=88d328c4-990a-4ce7-80cb-af1700719959 Exercises]
 
[https://evaluering.dtu.dk/ Do midterm evaluation] <span style="color:red"><-- Important to do this week - Friday latest</span>


== Exercises ==
Using the Queueing System to do distributed computing. '''Warning''': The most difficult part of these exercises is actually using the Queueing System. The python code itself is fairly easy, but getting the QS to work requires patience and experience. Check the QS examples in the powerpoint from last lecture. Don't be afraid to consult the ''man'' pages of the slurm commands; great options are hidden there, e.g. how to remove the header from the output.


'''1)'''<br>
Return to last time's exercise: read a fasta file, find the reverse complement of each entry, count the bases in the entry, put the numbers in the header line, and save it all in one file.
Now solve this using the method on slide 5, i.e. distributed programming in an embarrassingly parallel fashion.<br>
'''Test your programs on the small scale humantest.fsa file.''' When ready, try the real human.fsa.


You have to make several programs: the administrator, the worker and the collector.


The administrator splits up the original input fasta file into several pieces (one fasta sequence per piece) and submits a job per piece (the worker) with the relevant file piece as input.
The worker reads a file containing one fasta sequence (given as input), computes the complement strand and the base count, and writes the result to an output file (also given).
The collector collects all the result pieces and puts them together in the original order in one file. You run this yourself after the worker jobs have finished. The structure of the administrator is like
<pre>
foreach fastasequence in inputfile
    save fastasequence in file.x
    submit job with file.x
</pre>
By naming/numbering the files in some systematic way, it is easier to collect them afterwards. Realize that you can test your code without using the QS by simply running the worker directly. Also understand that this is an exercise in using the Queueing System, not in simple programming.
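The administrator's splitting step can be sketched as below. This is a minimal sketch: the commented-out submission loop refers to a hypothetical <code>submit()</code> helper (or your own qsub/sbatch wrapper), and the <code>piece.N.fsa</code> naming is just one systematic choice.

```python
import os

def split_fasta(inputfile, outdir='.'):
    """Split a multi-fasta file into one numbered file per entry.
    Returns the piece filenames in the original order, so the
    collector can concatenate the results back in that order."""
    pieces = []
    outfile = None
    with open(inputfile) as infile:
        for line in infile:
            if line.startswith('>'):
                if outfile is not None:
                    outfile.close()
                name = os.path.join(outdir, 'piece.{}.fsa'.format(len(pieces)))
                pieces.append(name)
                outfile = open(name, 'wt')
            if outfile is not None:
                outfile.write(line)
    if outfile is not None:
        outfile.close()
    return pieces

# Administrator main loop - submission sketched only; replace the
# commented lines with your own submit helper or a qsub/sbatch call:
# for piece in split_fasta('human.fsa'):
#     submit('worker.py {} {}.out'.format(piece, piece), runtime=10, cores=1, ram=2)
```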


<!--
In order to lighten your workload and lessen the confusion and grief the Queueing System can give you, I have made a fairly general '''submit to QS''' python function for you to use if you wish.
The command you submit is a string (which you can construct). The command must be a "simple" one, meaning a program with options and parameters as you would write it on the command line.
It does not offer the convenience of the shell, meaning that piping, IO-redirection and wildcards do not work. Realize that a great source of error is not using the full path to programs and files.
<pre>
import os
import subprocess
import sys

def submit(command, runtime, cores, ram, directory='', modules='', group='pr_course',
    jobscript='jobscript', output='/dev/null', error='/dev/null'):
    """
    Function to submit a job to the Queueing System - with jobscript file
    Parameters are:
    command:  The command/program you want executed together with any parameters.
              Must use full path unless the directory is given and program is there.
    directory: Working directory - where should your program run, place of your data.
              If not specified, uses current directory.
    modules:  String of space separated modules needed for the run.
    runtime:  Time in minutes set aside for execution of the job.
    cores:    How many cores are used for the job.
    ram:      How much memory in GB is used for the job.
    group:    Accounting - which group pays for the compute.
    jobscript: Standard name for the jobscript that needs to be made.
              You should number your jobscripts if you submit more than one.
    output:    Output file of your job.
    error:    Error file of your job.
    """
    runtime = int(runtime)
    cores = int(cores)
    ram = int(ram)
    if cores > 10:
        print("Can't use more than 10 cores on a node")
        sys.exit(1)
    if ram > 120:
        print("Can't use more than 120 GB on a node")
        sys.exit(1)
    if runtime < 1:
        print("Must allocate at least 1 minute runtime")
        sys.exit(1)
    minutes = runtime % 60
    hours = int(runtime/60)
    walltime = "{:d}:{:02d}:00".format(hours, minutes)
    if directory == '':
        directory = os.getcwd()
    # Making a jobscript
    script = '#!/bin/sh\n'
    script += '#PBS -A ' + group + ' -W group_list=' + group + '\n'
    script += '#PBS -e ' + error + ' -o ' + output + '\n'
    script += '#PBS -d ' + directory + '\n'
    script += '#PBS -l nodes=1:ppn=' + str(cores) + ',mem=' + str(ram) + 'GB' + '\n'
    script += '#PBS -l walltime=' + walltime + '\n'
    if modules != '':
        script += 'module load ' + modules + '\n'
    script += command + '\n'
    if not jobscript.startswith('/'):
        jobscript = directory + '/' + jobscript
    with open(jobscript, 'wt') as jobfile:
        jobfile.write(script)
    # The submit
    job = subprocess.run(['qsub', jobscript], stdout=subprocess.PIPE, universal_newlines=True)
    jobid = job.stdout.split('.')[0]
    return jobid


def submit2(command, runtime, cores, ram, directory='', modules='', group='pr_course',
    output='/dev/null', error='/dev/null'):
    """
    Function to submit a job to the Queueing System - without jobscript file
    Parameters are:
    command:  The command/program you want executed together with any parameters.
              Must use full path unless the directory is given and program is there.  
    directory: Working directory - where should your program run, place of your data.
              If not specified, uses current directory.
    modules:  String of space separated modules needed for the run.
    runtime:  Time in minutes set aside for execution of the job.
    cores:    How many cores are used for the job.
    ram:      How much memory in GB is used for the job.
    group:    Accounting - which group pays for the compute.
    output:    Output file of your job.
    error:    Error file of your job.
    """
    runtime = int(runtime)
    cores = int(cores)
    ram = int(ram)
    if cores > 10:
        print("Can't use more than 10 cores on a node")
        sys.exit(1)
    if ram > 120:
        print("Can't use more than 120 GB on a node")
        sys.exit(1)
    if runtime < 1:
        print("Must allocate at least 1 minute runtime")
        sys.exit(1)
    minutes = runtime % 60
    hours = int(runtime/60)
    walltime = "{:d}:{:02d}:00".format(hours, minutes)
    if directory == '':
        directory = os.getcwd()
    # Making a jobscript
    script = '#!/bin/sh\n'
    script += '#PBS -A ' + group + ' -W group_list=' + group + '\n'
    script += '#PBS -e ' + error + ' -o ' + output + '\n'
    script += '#PBS -d ' + directory + '\n'
    script += '#PBS -l nodes=1:ppn=' + str(cores) + ',mem=' + str(ram) + 'GB' + '\n'
    script += '#PBS -l walltime=' + walltime + '\n'
    if modules != '':
        script += 'module load ' + modules + '\n'
    script += command + '\n'
    # The submit
    job = subprocess.run(['qsub'], input=script, stdout=subprocess.PIPE, universal_newlines=True)
    jobid = job.stdout.split('.')[0]
    return jobid




# Is used like
jobid = submit('myprogram myfile.fsa', directory='/home/projects/pr_course/people/pwsa',
              modules='tools anaconda3/4.0.0', jobscript='job.1',  runtime=10, cores=1, ram=2)
jobid = submit2('myprogram myfile.fsa', directory='/home/projects/pr_course/people/pwsa',
                modules='tools anaconda3/4.0.0', runtime=10, cores=1, ram=2)
</pre>
You can make almost the entire program on your own laptop if you substitute the function below for the function above. The function below just calls the worker sequentially without using the QS. You lose all the distribution the QS offers,
but avoid using computerome (and waiting for time). It is great in the development phase.
<pre>
import subprocess

# Runs the command locally instead of submitting it to the queueing system
def unix_call(command):
    job = subprocess.run(command.split())
 
# Is used like
unix_call('myprogram myfile.fsa')
</pre>
-->
 
'''2)'''<br>
Make the administrator and collector into one program.
<pre>
foreach fastasequence in inputfile
    save fastasequence in file.x
    submit job with file.x
wait for all jobs to finish
collect data
</pre>
 
It is more difficult to solve this exercise - so do number 1 first. You need to find a way of waiting for your worker jobs to be done before you start collecting. However, you can see if this distributed method is faster than last week's sequential method.
 
Some ideas to wait for the jobs to be done.
* Waiting and checking for all output files to appear. Cons: if a worker job breaks during execution, you wait forever since its output file never appears. If the output file is big, the worker might not have finished writing it before collection starts. A trick to avoid this is for the worker to create an extra empty file at the end of the job, and to check for the presence of that file. An alternative is to write the file under a temporary filename and rename it to the correct name when done writing.
* Using '''squeue''' to check that the jobs are gone from the queue. You need a way to recognize your jobs: name them, record the jobids when you submit, or perhaps just show your own jobs (not everybody's). When the list is empty, you are done. Cons: if you submit fast enough, the jobs might not have had time to show up in the queue yet, misleading you into thinking you are done. If you use the "your own jobs" method, you can only run one main job at a time, i.e. compute on one project only.
* Using '''sacct''' to show completed jobs. When your jobs are in the list, they are done.
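The first strategy above (sentinel files, with a timeout so a broken worker cannot hang the collector forever) can be sketched like this; the filenames and polling interval are up to you:

```python
import os
import time

def wait_for_files(filenames, poll=5, timeout=3600):
    """Poll until every sentinel file exists.
    Returns True when all have appeared, False on timeout, so a
    broken worker job cannot make the administrator wait forever."""
    deadline = time.time() + timeout
    missing = set(filenames)
    while missing:
        missing = {f for f in missing if not os.path.exists(f)}
        if not missing:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll)
    return True
```

Remember that an appeared sentinel only says the worker finished; the actual result file still has to be read by the collector.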

Revision as of 07:12, 21 September 2025

Previous: MySQL Next: Computer design

Material for the lesson

Note: The powerpoint has been updated with new library and transactions, but not the video.
Video: [https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=a891e09a-bec1-4ab1-8869-af2701258352 Using Python with MySQL]
Powerpoint: [https://teaching.healthtech.dtu.dk/material/22112/22112_03-CodingDB.ppt Coding with Databases]
Video: [https://panopto.dtu.dk/Panopto/Pages/Viewer.aspx?id=68d743a1-9691-4690-b3a9-af170071ba0d Exercises]

Exercises

Create one or two programs that can answer the following questions about the database you created last lesson. The data is from 2016 and all records are "complete" for the time period they were collected in. Specifically, all children's parents are in the database until we go so far back that records did not exist (the information is lost in the mists of time).
The programs should run on the server pupil1, not your own machine.
The mysql connect call looks like this.
cnx = mysql.connector.connect(user='DTUusername', passwd='YOUR_PASSWORD', db='DTUusername', host='localhost')
You can find your password in the .my.cnf file in your home.
You will mostly use python to solve the exercises. It is not intended that you use advanced mysql queries, but feel free.
Note, use this python: /home/ctools/anaconda3-2024.10-1/bin/python3

1)
Who are born out of wedlock? Definition: A child is born out of wedlock if the biological father and mother are not married when the child is born. In the absence of information, you cannot claim a child is born out of wedlock (that means if you only know the mother from the database, you cannot claim the child is a bastard).
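The decision rule can be separated from the SQL. Below is a sketch over a hypothetical data shape (you must map it onto the actual columns of your own tables); dates are ISO yyyy-mm-dd strings, which compare correctly as plain strings:

```python
def born_out_of_wedlock(birth, both_parents_known, marriages):
    """birth: the child's birth date (ISO string).
    both_parents_known: True only if the biological father AND mother
    are in the database - without both, we cannot claim anything.
    marriages: list of (wedding, divorce) ISO-date tuples for the two
    biological parents with each other; divorce None = still married.
    Returns True only when we positively know the parents were not
    married on the birth date."""
    if not both_parents_known:
        return False          # absence of information: no claim
    for wedding, divorce in marriages:
        if wedding <= birth and (divorce is None or birth <= divorce):
            return False      # the parents were married at the birth
    return True
```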

2)
Are there families who have had the same disease through 3 generations? Who, and which disease, in that case? It is not a trick question - you will find some.
In answering this question, any type of cancer should just be regarded as cancer, i.e. the same disease (this is not true in reality). 3 generations should be understood as a "straight line of inheritance", i.e. child, father, father's father/mother. Child, mother and father's father is not a straight line.
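The straight-line rule is easy to get wrong. A sketch over a hypothetical in-memory mirror of the person table (ids for father/mother, a set of diseases per person, all cancers already normalized to 'cancer'):

```python
def diseases_through_3_generations(people, child_id):
    """people: {id: {'father': id or None, 'mother': id or None,
    'diseases': set of disease names}} - a hypothetical shape, map it
    onto your own tables. Returns the diseases shared by the child,
    one parent, and one of THAT parent's parents (a straight line:
    child, father, father's father/mother - never across lines)."""
    hits = set()
    child = people[child_id]
    for parent_key in ('father', 'mother'):
        parent_id = child[parent_key]
        if parent_id is None:
            continue
        parent = people[parent_id]
        for grand_key in ('father', 'mother'):
            grand_id = parent[grand_key]
            if grand_id is None:
                continue
            hits |= (child['diseases'] & parent['diseases']
                     & people[grand_id]['diseases'])
    return hits
```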

If you look closely at your results, you will find some very interesting family patterns - such is the curse of (badly) generated data.

3) Somewhat optional
Run your programs again, but this time locally on your laptop, while you use the database on the server.
This is an interesting challenge, which can be solved by SSH tunneling the mysql port 3306 to the server. The requirements for solving the exercise are understanding and setting up proper SSH tunneling as taught in lesson 1, and having mysql.connector installed locally in your python distribution. In principle, this is done by one unix command setting up the SSH tunnel.
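That one command could look like the sketch below (assuming the server answers to the name pupil1 from your network; DTUusername is a placeholder). While the tunnel is open, mysql.connector on your laptop connects to host='127.0.0.1' as if the database were local.

```shell
# Forward local port 3306 to port 3306 on the server's own loopback;
# keep this ssh session open while your programs run.
ssh -L 3306:localhost:3306 DTUusername@pupil1
```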

=== Totally optional extra exercise I invented in my dreams Sunday morning: The Chat Board ===

This is a fun cooperative exercise with databases. You should all make your own Chat client, where you can communicate with the other course participants. The exercise utilizes the concurrent features of databases - the easy update without worry if somebody else is accessing the database at the same time.

You use a database called chatboard that is common to you all. In that database there will be just one table, messages, with the fields: id (primary key autoincrement), mesg (varchar(500)), name (varchar(15), default NULL), channel (varchar(12), default NULL), timestamp (timestamp). This database and table is NOT CREATED YET.
mesg is the message on the chat.
name is name of the message sender. If NULL, then it is anonymous.
channel is the channel for this message. You can use it to create a topic for a group of people. If NULL, it is the common channel.
timestamp is when the message is created.
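As a sketch, the table and the two core operations can be prototyped locally with sqlite3 before touching the shared MySQL database. Note the dialect differences: MySQL writes AUTO_INCREMENT instead of AUTOINCREMENT, and mysql.connector uses %s placeholders instead of ?.

```python
import sqlite3

# Local stand-in for the shared chatboard database (sqlite3 dialect).
cnx = sqlite3.connect(':memory:')
cur = cnx.cursor()
cur.execute('''CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    mesg VARCHAR(500),
    name VARCHAR(15) DEFAULT NULL,
    channel VARCHAR(12) DEFAULT NULL,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

def post(mesg, name=None, channel=None):
    """Insert one message; name None = anonymous, channel None = common."""
    cur.execute('INSERT INTO messages (mesg, name, channel) VALUES (?, ?, ?)',
                (mesg, name, channel))
    cnx.commit()

def pull_since(last_id, channel=None):
    """Fetch messages newer than last_id on one channel (IS handles NULL)."""
    cur.execute('SELECT id, mesg, name FROM messages '
                'WHERE id > ? AND channel IS ?', (last_id, channel))
    return cur.fetchall()
```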

The Chat client you make should have the following structure.

connect to database chatboard
pull last