MapReduce and Binary representation

From 22112
Revision as of 11:57, 6 March 2024 by WikiSysop (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search
Previous: More parallelism Next: Hash usage

Material for the lesson

Video: MapReduce Framework, short intro
Video: Binary numbers, representation and operations
Powerpoint: MapReduce
Youtube: Binary numbers - watch from 2:00 to 6:00
Video: Exercises

Exercises

Questions to answer: How many 10-mers in the human genome (file human.fsa) occur only once? How many different 10-mers are in the human genome? How many different 10-mers are possible?
Any 10-mer containing anything else but A T C or G are to be disregarded in any form of counting in this exercise.
You do not need to look at the complement strand for this exercise, although in Real Life it would be mandatory.
Use the file humantest.fsa for your testing. The hard question to answer is the first once, the others follow from that.

You will discover that some of the difficulty in the exercise is that the sequence contains other chars than ATCG. If you want you can start out "soft" by using the humanchr1.fsa file which only contains one entry with a sequence containing only ATCG.

You must solve this problem in two ways, making two programs.
1) Use a dict.
2) Use a bytevector as described in the powerpoint.
First you must realize that a kmer containing only ATCG can be converted to a binary number - this also why we only look at kmers with only ATCG. I have made a function below to demonstrate how you can convert from a sequence to a number. Warning: It is for demonstration purpose only. If you use it in your program it will be horribly slow.
It should also be clear that every kmer only corresponds to one single number - there is a 1-1 connection between kmer and number, and you can go back and forth between these representations.
Now if you have a list (bytevector) then every position in that list will correspond to a specific kmer, which again means that you can convert a kmer to a position in a list and increment the number at that position for counting purposes.

# Computes a number from a DNA sequence
def dna2num(seq):
    if len(seq) > 31:
        print("Sequence is to large to be contained in a number")
        sys.exit(1)
    num = 0
    for char in seq:
        # Bitshift two bits to the left
        num <<= 2
        if char == 'A':
            pass
        elif char == 'T':
            num |= 0b11
        elif char == 'C':
            num |= 0b01
        elif char == 'G':
            num |= 0b10
        else:
            print("Illegal base in DNA sequence:", char)
            sys.exit(1)
    return num

print(dna2num('GTACGTACGTACG'))

# Creating a bytevector of a certain size - filled with 0
vector = bytearray(1000)

In both cases you must measure how much memory your main data structure uses and long it takes for the program to finish. Consider this a performance contest. The fastest one of you wins. :-)
It could be a good idea to measure the time each step takes.
Also develop your program so it will work with any size of k-mer. Simply have a variable called mersize that you can set to any positive integer value.

Questions: How does changing that to 16 affect your programs? Can you use any positive value as mersize?

My fastest time using a dict is 2670 seconds and a bytevector is 1228 seconds using purely sequential implementations. The number of 10-mers only occurring once is below 10.