Hash usage
Previous: MapReduce and Binary representation | Next: Exercises and Organization |
Material for the lesson
Video: Hashes, properties and use
Video: Bloom filter, a probabilistic set
Video: Count-min Sketch, a probabilistic counter
Video: MinHash, a probabilistic set similarity score
Powerpoint: The use of hash
Video: Exercises
Exercises
Determine if unknown DNA is human or not
The way to go about this is to create a Bloom filter containing all 30-mers from the human.fsa file and then query that with the unknown DNA in the mixeddna.fsa file. A 30-mer is chosen, as it is very likely unique.
Make two programs; one that creates the Bloom filter from the human DNA and saves it to a file, and one program that reads in the filter from file, and queries it based on an input fasta file.
1) Computing the Bloom filter parameters
The false positive rate, p must be 1% or lower. Figure out the size of the input, i.e. approximately how many elements will be put in the filter, n. Compute the size of the filter, m, and the number of hash functions, k necessary. Adjust values so they fit your purpose, and make a final calculation of p. Pick hash function(s) and explain your choice.
2) Making the Bloom filter
You can use previous functions and ideas. The Bloom filter must be a bytearray and it can be written directly to a binary file.
Since Bloom filters can be added together then you can parallelize the filter construction per chromosome. This is not required. Realize that using joblib creates several filters and they are sizable, i.e. running out of memory is possible.
To help out a bit I have created 3 functions: one that takes a k-mer (as a bytes object) and gives back an integer representing a 32 byte (256 bit) integer number - the hash-number. One that finds the next position of the bit to set, given a hash-number, and one that sets a bit in a bytearray given positions. These functions are here to increase your understanding, not the programs performance. Using them as they are will not give you the performance boost you need. I used 7 hours mainly in calculating the Bloom filter in a not-parallel program.
import sys, hashlib # Computes a hash based on a kmer def hashit(kmer): myhashobj = hashlib.sha256() myhashobj.update(kmer) return int.from_bytes(myhashobj.digest(), byteorder=sys.byteorder) # The size of the bit field (the slice of bits taken from the hash, how many bits you need) bitfieldsize = xxxx # your number that correlates with m. # "eats" a slice from the hash and returns the position in the bloom filter that slice corresponds to. # Also returns the "leftover" hash, so it can be used again for a new position. def nextposition(hashnumber): position = hashnumber & (2**bitfieldsize - 1) byteposition = position >> 3 bitposition = position & 7 hashnumber >>= bitfieldsize # discarding the used hash slice return(hashnumber, byteposition, bitposition) # Sets the bit in the bloom filter def setbit(bloomfilter, byteposition, bitposition): bloomfilter[byteposition] |= 1 << bitposition
3) Querying the Bloom filter.
The Bloom filter can/should be loaded with one read. When querying for sequence, also try the reverse complement. Query all 30-mers in the input fasta file, mixeddna.fsa,
and output the no_of_hits/no_of_kmers ratio in percent per sequence. If near 100%, it is human DNA, else not.
# This function return True if a bit is set on the position given. def isbitset(bloomfilter, byteposition, bitposition): return (bloomfilter[byteposition] & (1 << bitposition)) != 0