Exercises and Organization

From 22112
Jump to navigation Jump to search
Previous: Hash usage Next: Main page

Material for the lesson

Video: Organization and Exercises
Powerpoint: Exercises
PDF: Organization of projects
PDF: Guide to Reproducible Code
PDF: HPC Analysis Framework by Leon Jessen
Tweet: Use Twitter for academic networking. You needn’t tweet anything yourself, but it’s a great way to follow bioinformatics influencers and news, Leon Jessen
Video: Alexander Schönhuth: "Snakemake: Reproducible and Scalable Data Analysis" Optional, but good to know about. Just get the idea.

Exercises

1) In this exercise we will work with an Illumina sequencing file: ngs1.fastq.gz. There is a test version, ngs1test.fastq.gz, and an even smaller version, ngs1small.fastq.gz, of this file, which you should use when developing your programs. You need to use the queueing system, or log in on an interactive node if you are just doing "small stuff".

All NGS fastq files are horrendously big. This is why they are always compressed. When working with these kind of files, you always use (de)compression, simply to save disk space. If you can manage to (de)compress on the fly (this means decompressing while reading) you might even save time since the file is being read compressed (i.e. small) and unfolds directly in RAM on the computer. The (de)compression takes CPU time and the reading/writing takes IO time.

The content of the file looks like these 4 lines repeated a lot:

@ST-K00127:16:H32JKBBXX:8:1101:1111:1246 1:N:0:AGCCATGC
NAAAAGGCTCGACGCCTGTTATTATCGGAGTAGAACCGGAAATATCCGGATTAAGTTTTTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGCCA
+
#AAFFKKKKKKKKKKFK<KFKKKKKKKKFKFFKKKKKKKKKKKKKKKKK7FKKKKKKKKK<FF<KKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKK

This is called a read. The lines are in order: header line, sequence, comment line (usually empty, just a +) and quality score. There is a char on the quality score for each base in the sequence, i.e. sequence and quality lines have the same length.

The file contains microbiome sequencing data from a number of persons. This is common practice. Each person has a unique barcode, which is the red DNA string in the header.

The act of splitting the file up in several files - one for each person - is called demultiplexing. We will be doing this in next exercise.

Any current sequencing technology creates sequencing errors. The barcode can actually be sequenced wrong, not just the dna in the read. This means that some of the reads in the file belongs to people we can not identify.

Your job in this exercise is therefore to identify which barcodes are real, and which are due to sequencing errors. The erronous barcodes and their data are simply discarded. You can assume that ”right” barcodes occur much more frequently than ”wrong” barcodes, but it is also true that there can be a significant difference in the frequency of "right" barcodes.
Another strong/useful observation is that barcodes are "spread" across the spectrum. The distance from a real barcode to another real barcode is big. The distance is counted as bases that do not match at the same position in the barcodes.

Real: CACTTCGA
Real: TGAAGAGA
Distance: 6

The distance between a real barcode and one that is sequenced wrong in one base is obviously 1.

Real:  CACTTCGA
Error: TACTTCGA
Distance: 1

This means that it is possible to assign some of the sequences to the right barcode even if the barcode is sequenced wrong. You are not required to do so in these exercises, but it is good to remember this in real life.

Be careful in how much CPU time you are using to solve the task.

You must report

The program
A list (file) of correct barcodes.
Report the number of persons.

For your benefit, I have made a function that opens a gzipped file, like the ngs file.

import gzip, sys
def openfile(filename):
    try:
        if filename.endswith('.gz'):
            fh = gzip.open(filename, mode='rb')
        else:
            fh = open(filename, 'rb')
    except:
        print("Can't open file:", filename)
    return fh

myfilehandle = openfile('mygzippedfile.gz')
for line in myfilehandle:
    print(line)
myfilehandle.close()

2) Now that you have a list of correct barcodes, your task is to write a program that can demultiplex the ngs1.fastq.gz and save the resulting fastq files in /home/projects/pr_course/people/yourname. This could be combined with the previous exercise.

You need to focus on performance and disk space. This means that the solution to the problem contains a jobscript (for submission to the queueing system) and likely a program or two, which handles the various aspects of the job.
Hint: you might look into the Popen call in subprocess.

You need to report the following:

Your programs.
How much disk space are you using to store your results?
How long time did it take to complete the job?

To conserve disk space in computerome, delete your results.

My numbers are: 30 GB and 97 minutes.