More parallelism

Next: MapReduce and Binary representation

Material for the lesson

Video: Concurrency and semaphores
Video: Semaphore demonstration - see video to prepare
Powerpoint: Semaphores and Concurrency
Video: Fair distribution
Powerpoint: Fair distribution
Video: Exercises

Exercises

1) Looking at human.fsa we have the human genome as as fasta file – one entry per chromosome. Let us search for functional (could be regulatory) elements in the human DNA. A candidate for such an element is an over-represented k-mer.
A k-mer is a stretch of DNA or protein sequence of size k. Hence a 10-mer represents 10 nucleotides or amino acids in a row.

Find which 5-mers, 6-mers and 7-mers are over-represented. This is simply done by counting all the different k-mers, and counting the different nucleotides.

Example for k-mer: ’atgca’ , P for probability
P(a) = number of a/number of nucleotides
count(atgca-mers)/count(5-mers) > P(a)*P(t)*P(g)*P(c)*P(a)

If this inequality is true, there is an over-representation of 'atgca'. We only want to find the significantly over-represented k-mers, so we arbitrarily decide that a k-mer is significantly overrepresented when the frequency with which it appears is double what we expect from the probability calculation.

Issues: In the fasta file there are stretches of unknown bases (n and other letters than atcg). All k-mers in these regions are disregarded.

2) Parallelize the counting. The main process finds stretches of DNA that should be given to worker processes, which return counting results to the main process, that sums them up. This could be done on entry basis, but does not have to be.
Idea for parallelizing (only one program using joblib):

# count function/worker
def countme (file, start, end):
     find sequence at start to end in file
     count everything
     return results
# main program
read fasta file and index sequence positions (start/end).
for positions in seqindex:
    allcounts = Parallel(n_jobs=4)(delayed(countme)(file, x[0], x[1]) for x in seqindex)
combine allcounts
compute over-representaion

Issues to consider:

If you split DNA in twain for different processes to handle, there are k-mers that won’t be counted in the split, unless you handle it in some way.
What happens with memory usage – especially when using 8, 9, 10-mers. How to deal with it?
Some sequences are perhaps larger than others. How will you spread the load on the worker processes, so they all end at approximately the same time. This is a hard problem to solve optimally, but rather easy to solve well, but not optimal. This problem is called load balancing.

More parallelism

Material for the lesson

Exercises

Navigation menu

Search