Sets


Required course material for the lesson

Powerpoint: Sets
Video: Sets

Subjects covered

  • Sets, which are unordered data collections with no duplicates.
  • Methods relevant to sets:
    • clear, removes all elements from a set,
    • add, adds an element to the set,
    • update, adds several elements to the set in one call (relevant for performance),
    • remove, removes an element from the set, error if not present,
    • discard, removes an element from the set, no error if not present,
    • in, determines if an element exists in the set,
    • mathematical set operations, intersection (&), union (|).
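
The operations listed above can be sketched as follows. This is a minimal illustration, assuming Python's built-in set type; the codon values are made up for the example and are not taken from any course file.

```python
# Illustrative values only; not taken from any course file.
codons = set()                         # empty set ({} would create a dict)
codons.add("ATG")                      # add a single element
codons.add("ATG")                      # adding a duplicate changes nothing
codons.update(["TAA", "TGA", "TAG"])   # add several elements in one call

codons.remove("TAG")                   # would raise KeyError if "TAG" were absent
codons.discard("CCC")                  # absent element, silently ignored

has_start = "ATG" in codons            # membership test
stops = {"TAA", "TAG", "TGA"}
common = codons & stops                # intersection: elements in both sets
combined = codons | stops              # union: elements in either set

print(sorted(codons))                  # sets are unordered, so sort for display
print(has_start, sorted(common), sorted(combined))

codons.clear()                         # remove all elements
print(len(codons))
```

Because a set is unordered, the example sorts before printing; the order you get from printing a set directly should not be relied upon.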

Exercises to be handed in

  1. Read the sequences in the file dna7.fsa. Find out which and how many of the 64 codons are not used anywhere in the sequences. Print the unused codons.
  2. You have made a program (let's call it the X-program), which takes a file of accession numbers, start10.dat, as input and produces some output, which is in res10.dat. Now you count the lines in your input file and your output file and you discover that the line counts do not match. Horror - your program does not produce output for some input. The assignment is to discover which accession numbers did not produce output. This can be done in various ways, but here you have to use a set. Print the results.
  3. The file ex5.acc contains a lot of accession numbers, some of which are duplicates. We have earlier cleaned this file of duplicates. Let's do that again using a set. Make a program that reads the file once, finds the unique accession numbers and writes them to the file uniq5.acc.
  4. In the data1.gb file there are 6 references (to articles). Make a program that extracts all authors from the references, eliminates those that are duplicates and prints the list of authors. You will notice that some authors seem to be the same person using different initials. You should only consider a person a duplicate if the name matches exactly. This should also work for the other GenBank entries: data2.gb, data3.gb & data4.gb.
    Beware: there are traps in this exercise, so check your output carefully.

Exercises for extra practice

A k-mer is a piece of sequence k units long. It can be both protein and DNA sequence. As an example I have this DNA sequence: GACTAC. It contains the following 4-mers: GACT, ACTA, CTAC.
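
The definition above can be sketched in Python as a sliding window over the sequence. The function name kmers is hypothetical, chosen only for this illustration.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of sequence, in order of appearance."""
    # A sequence of length n has n - k + 1 windows of length k
    # (none if k is longer than the sequence).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("GACTAC", 4))   # ['GACT', 'ACTA', 'CTAC']
```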

  • From a GenBank file like data1-4.gb, extract the DNA sequence (been there - done that). Now insert all 5-mers in the DNA sequence into a set. Compute and display the following numbers: how many 5-mers you inserted into the set, how many are in the set, and how many different 5-mers there could possibly be. It may seem like the first two numbers are the same, but this is not guaranteed to be true. Why?
  • Calculate the overlap of 5-mers between any two of the data1-4.gb files. Just ask for two filenames and calculate for these.
  • Make a program that asks for a number (integer), then reads the dna7.fsa file (which contains insulin-like genes) and saves the entry selected by the number in selected.fsa and the rest of the entries in rest.fsa.
  • Now that you can select a FASTA entry, read selected.fsa and create the 5-mers from its sequence. Then read the entries from rest.fsa and create the 5-mers for every entry. Report which sequence in rest.fsa has the greatest overlap with the selected sequence, and how large that overlap is. This must be the sequence that looks most like the selected one.
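
One possible reading of "overlap" in the exercises above is the number of distinct k-mers two sequences share, which is a set intersection. The sketch below rests on that assumption; the helper name kmer_set and the two short sequences are made up for illustration and do not come from the course files.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

# Made-up sequences, using 4-mers to keep the example small.
a = kmer_set("GACTACGT", 4)
b = kmer_set("TTACTACG", 4)

shared = a & b                     # k-mers present in both sequences
print(len(shared), sorted(shared))
```

Counting shared distinct k-mers ignores how often each k-mer occurs; whether that is the right measure for a given comparison is a design choice.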