Dictionaries

From 22101
Previous: Sets Next: How to Python

Required course material for the lesson

Powerpoint: Dictionaries
Video: Dictionaries
Video: Tips and Tricks
Resource: Clean Code. Every time you read it, you will take something from it.
Resource: Example code - Dicts
Video: Live Coding

Subjects covered

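  • Dictionaries, which are tables of key/value pairs. They are not sorted; since Python 3.7 they keep insertion order.
  • Functions and operators relevant to dictionaries:
    • keys, which returns a view of the keys in the dictionary (wrap it in list() to get a list),
    • values, which returns a view of the values in the dictionary,
    • in, which determines whether a key exists in the dictionary,
    • del, which deletes a key/value pair,
    • items, which returns a view of all key/value pairs in the dictionary, typically used for iteration.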
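
The operations listed above can be tried on a small dictionary. This is only a minimal sketch, assuming Python 3; the variable names and the three-codon toy table are chosen for illustration and are not part of the course material.

```python
# Toy look-up table with three codons from the standard genetic code.
codon_table = {"ATG": "M", "TGG": "W", "AAA": "K"}

# keys() and values() return views; list() turns them into lists.
key_list = list(codon_table.keys())
value_list = list(codon_table.values())

# "in" tests the keys, not the values.
has_atg = "ATG" in codon_table   # "ATG" is a key
has_m = "M" in codon_table       # "M" is only a value, so this is False

# del removes one key/value pair.
del codon_table["AAA"]

# items() yields (key, value) pairs, which can be unpacked in a loop.
pairs = []
for codon, amino_acid in codon_table.items():
    pairs.append((codon, amino_acid))

print(key_list, value_list, has_atg, has_m, pairs)
```

Note that looking up a key that does not exist, such as codon_table["AAA"] after the del, raises a KeyError, which is why testing with in first is often useful.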

Exercises to be handed in

  1. Create a dictionary where the keys are codons and the values are the one-letter codes for the amino acids. The dictionary will function as a look-up table. You can find a codon list here.
  2. Use the dictionary from the previous exercise in a program that translates all the nucleotide fasta entries in dna7.fsa to amino acid sequences. Save the results in a file aa7.fsa in fasta format. Since the sequences now consist of amino acids, add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.
  3. The file ex5.acc contains a lot of accession numbers, some of which are duplicates. Earlier we just removed the duplicates, now we should count them. Make a program that reads the file once, and writes a file noorder5.acc with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ex5.acc.
  4. Improve the previous exercise by saving the accessions sorted by number of occurrences, with the top counts first, in the file order5.acc.
  5. In the tab-separated files slinger.txt and hoist.txt are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file combined.txt. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run on the same set of accession numbers, so some of the results are only available in one of the input files. In such cases you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.
  6. Using the above method gives you too little data. This time you try to combine your two input sets differently. If an accession is in both input files you use the average; if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.
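
The difference between an intersection and a union of dictionary keys, as mentioned in the last two exercises, can be sketched on in-memory data. This is only an illustration under stated assumptions: the accession names and numbers are made up, and reading and writing the tab-separated files is left out.

```python
# Hypothetical results; in the exercises these would come from the input files.
slinger = {"AC001": 0.25, "AC002": 0.5}
hoist = {"AC002": 1.0, "AC003": 0.75}

# Intersection: keep only accessions that are keys in both dictionaries.
both = {}
for acc, prob in slinger.items():
    if acc in hoist:
        both[acc] = (prob + hoist[acc]) / 2

# Union: start from a copy of one dictionary, then average or add the rest.
either = dict(slinger)
for acc, prob in hoist.items():
    if acc in either:
        either[acc] = (either[acc] + prob) / 2
    else:
        either[acc] = prob

print(both)
print(either)
```

The only difference between the two loops is what happens when a key is missing from the other dictionary: the intersection skips it, the union keeps the single value.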

Exercises for extra practice

  • Given a tab-separated file with 3 columns; StudentID, CourseNumber, Grade. Can you find a way to load the grades for a student in a retrievable manner into (some of) the python data structures learned so far? Retrievable means here that you can find the grades for a student if you know the studentID.
    Explain your approach. Hint: It is not necessarily efficient.
  • The geneA-E.txt files all have the same structure on each line; first number is a float between 0 and 1, second number is an integer. For all files (the combined data set) find the average of the float, given the integer and display in ascending order of the integer. You need to add all the floats for a given integer together and divide by the number of floats for the integer, then you have the average for the integer. To succeed at this, you must use two dicts where the integer is the key in both. The corresponding values are the sum of the floats (for that key) and the number of times the key has been encountered in the files.
  • This exercise requires that you did the last two practice exercises in Simple Pattern Matching. In the data1-4.gb files count how many times the different codons in the coding sequence occur. Display the counts.
  • This exercise builds on mandatory exercise 2. You must read the dna7.fsa file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there are of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot.
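
The two-dictionary approach described in the geneA-E.txt exercise (one dict for the running sum, one for the count, both keyed by the integer) might look roughly like the sketch below. The records list is invented stand-in data, not the contents of the real files, and the dict method get (which returns a default when the key is missing) is one possible way to handle first-time keys; testing with in works just as well.

```python
# Hypothetical (float, integer) pairs standing in for lines of the input files.
records = [(0.5, 2), (0.25, 1), (1.0, 2), (0.75, 1), (0.5, 3)]

float_sum = {}   # integer -> sum of the floats seen for that integer
count = {}       # integer -> number of times the integer was seen

for value, key in records:
    # get() supplies 0 the first time a key is encountered.
    float_sum[key] = float_sum.get(key, 0.0) + value
    count[key] = count.get(key, 0) + 1

# sorted() on a dict gives its keys in ascending order.
averages = []
for key in sorted(float_sum):
    averages.append((key, float_sum[key] / count[key]))

for key, average in averages:
    print(key, f"{average:.4f}")
```

The same pattern of counting with a dictionary applies to the codon and amino acid counting exercises, where only the count dictionary is needed.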