What affects performance
Material for the lesson
Video: File Systems
Video: Improve performance
Video: Random access in Python
PowerPoint: Efficiency
Resource: Example code - memoization
Video: What is expected in exercises
Exercises
1)
Make a program that can index a fasta file, that is, find every position in the file where a header starts and ends and where a sequence starts and ends, so 4 numbers per entry.
The result (the first numbers from human.fsa are shown) should be printed like this:
0 71 72 253105767 253105768 253105839 253105840 499335927 499335928 499335999 499336000 700936484 700936485 700936556 700936557 894321354 894321355 894321426 894321427 1078885323
Use the *.fsa files for practice.
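To make the four numbers concrete, here is a minimal, deliberately slow sketch that indexes a tiny in-memory fasta file line by line. It assumes that an "end" position is the offset of the newline that terminates the header or the sequence, which matches the pattern in the numbers above (0 71 72 ...) but is an interpretation, not something the exercise states. The name `index_fasta` and the demo data are made up for illustration.

```python
import io

def index_fasta(handle):
    """Return a flat list of offsets, 4 per entry:
    header start, header end, sequence start, sequence end.
    The handle must be opened in binary mode so offsets are byte offsets."""
    positions = []
    offset = 0                      # byte offset of the start of the current line
    seq_start = seq_end = None
    for line in handle:
        # Offset just past the last non-newline byte of this line (assumed "end").
        content_end = offset + len(line.rstrip(b"\r\n"))
        if line.startswith(b">"):
            if seq_start is not None:
                positions += [seq_start, seq_end]   # close the previous entry
            positions += [offset, content_end]
            seq_start = seq_end = offset + len(line)
        elif seq_start is not None and content_end > offset:
            seq_end = content_end   # extend the sequence, ignoring blank lines
        offset += len(line)
    if seq_start is not None:
        positions += [seq_start, seq_end]           # close the last entry
    return positions

# Hypothetical two-entry file, kept in memory for the demo.
demo = io.BytesIO(b">id1 first\nACGT\nAC\n>id2\nGG\n")
index = index_fasta(demo)
print(*index)
```

This is a baseline to check faster versions against, not a fast solution; iterating over lines in Python is exactly the part exercise 3 asks you to improve.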
2)
Make a program that is given a fasta file and the 4 numbers from above on the command line, and that reads and prints that single entry.
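The random access part can be sketched with `seek` and `read` on a file opened in binary mode. The sketch assumes that each end position is exclusive (the offset of the terminating newline); `read_entry` and the temporary demo file are hypothetical names used only for illustration.

```python
import os
import tempfile

def read_entry(path, header_start, header_end, seq_start, seq_end):
    """Jump straight to one entry and return (header, sequence) as text.
    End offsets are assumed to be exclusive."""
    with open(path, "rb") as handle:
        handle.seek(header_start)
        header = handle.read(header_end - header_start)
        handle.seek(seq_start)
        sequence = handle.read(seq_end - seq_start)
    return header.decode("ascii"), sequence.decode("ascii")

# Demo on a small temporary file standing in for a real *.fsa file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".fsa") as tmp:
    tmp.write(b">id1 first\nACGT\nAC\n>id2\nGG\n")
    demo_path = tmp.name

header, sequence = read_entry(demo_path, 19, 23, 24, 26)
print(header)
print(sequence)
os.remove(demo_path)
```

In the actual exercise the file name and the four numbers would come from `sys.argv` instead of being hard-coded, and the numbers must be converted with `int()` first.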
3)
Think about speed. Insert ways to measure the performance. Don't use real profiling from lecture 3; instead, time the code using the Python time module.
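One simple way to do the timing is to read a clock before and after the code of interest. The sketch below uses `time.perf_counter()` from the time module; the helper name `timed` is made up, and summing a range is only a stand-in for your own indexing function.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; replace with a call to your own program's main function.
result, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

Run each measurement several times, since a single timing can vary a lot between runs.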
Make the programs in exercises 1 and 2 work faster. Exercise 1 can be tricky, since the fast ways to read data do not give precise information about the position of the file pointer.
Perhaps reading the file in chunks is a good approach.
Consider that some fasta files have small entries, others have large entries. What impact does that have on your method?
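As a hedged illustration of chunk reading, the sketch below finds only the header start positions. Because the program counts the bytes it has read, it always knows the absolute offset of a match inside a chunk. The function name `header_starts` and the default chunk size are arbitrary choices, and the sketch assumes headers are lines beginning with `>`.

```python
import io

def header_starts(handle, chunk_size=1024 * 1024):
    """Yield the byte offset of every '>' that begins a line.
    The handle must be opened in binary mode."""
    offset = 0          # absolute offset of the first byte of the current chunk
    prev_byte = b"\n"   # pretend a newline precedes the file, so a leading '>' counts
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        # A '>' at the very start of a chunk is a header only if the
        # previous chunk (or the start of the file) ended a line.
        if chunk[:1] == b">" and prev_byte == b"\n":
            yield offset
        # All other headers show up as a newline followed by '>'.
        pos = chunk.find(b"\n>")
        while pos != -1:
            yield offset + pos + 1
            pos = chunk.find(b"\n>", pos + 1)
        offset += len(chunk)
        prev_byte = chunk[-1:]

demo_data = b">id1 first\nACGT\nAC\n>id2\nGG\n"
starts = list(header_starts(io.BytesIO(demo_data), chunk_size=8))
print(starts)
```

The tiny chunk size in the demo is there to exercise the case where a header begins exactly at a chunk boundary; finding the remaining three numbers per entry is left as part of the exercise.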
For the record, I could do the indexing of human.fsa in 2.5 seconds on Computerome and in 6.5 seconds on my laptop. I would have expected Computerome to be slower.
There can be many reasons for the speed difference, but the main one here is that Computerome has more memory and therefore more file buffers. It has simply kept the file in memory,
so when I test repeatedly it does not need to go to the disk except the first time.
Another test on Computerome gave me 34.6 seconds for indexing when the file was not in the file buffers and 1.6 seconds when the file was in the file buffers.
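You can look for the same effect yourself by timing two full reads of a file in a row. This is only a sketch: `time_full_read` is a made-up helper, and the small temporary file used here was just written, so it is almost certainly in the file buffers already and both reads will be fast. To see a real difference, point the function at a large file that has not been read recently.

```python
import os
import tempfile
import time

def time_full_read(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks; return (bytes read, seconds used)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start

# Small stand-in file so the example runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT" * 250_000)
    demo_path = tmp.name

for attempt in (1, 2):
    size, seconds = time_full_read(demo_path)
    print(f"read {attempt}: {size} bytes in {seconds:.4f} s")
os.remove(demo_path)
```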