Set techniques: Difference between revisions

Revision as of 15:47, 3 September 2025

Required course material for the lesson

Powerpoint: Sets
Video: Sets

Subjects covered

Sets, which are unordered data collections with no duplicates.
Set methods
Set uses
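
As a quick illustration of the subjects above, here is a minimal sketch of Python's built-in set type. The variable names and the example string are chosen only for illustration and are not taken from the course material.

```python
# Build a set from an iterable; repeated elements are stored only once.
bases = set("GATTACA")
print(len(bases))                # 4 distinct letters: G, A, T, C

# add() inserts one element; adding an existing element changes nothing.
bases.add("N")
bases.add("A")
print(len(bases))                # 5

# Membership tests do not depend on any ordering of the elements.
print("T" in bases)              # True

# Set algebra: intersection (&) and difference (-).
purines = {"A", "G"}
print(sorted(bases & purines))   # ['A', 'G']
print(sorted(bases - purines))   # ['C', 'N', 'T']
```

Since a set has no defined order, the results are sorted before printing to get a stable display.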

Exercises to be handed in

Many modern bioinformatics algorithms use k-mers. A k-mer is a piece of sequence k units long. The value of k, i.e. the length of the piece, varies with the algorithm, and the sequence can be either protein or DNA. As an example, take this DNA sequence: GACTAC. It contains the following 4-mers: GACT, ACTA, CTAC. The first 4 exercises work with k-mers and sets and lead up to an algorithm for finding the sequence in a group that looks most like a target sequence. Would they be homologs? Good bet.
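
The 4-mer example above can be reproduced with a sliding window over the sequence. This is a minimal sketch; the function name `kmers` is a hypothetical helper, not something defined in the course material. A sequence of length n contains n - k + 1 windows of length k.

```python
def kmers(sequence, k):
    """Return all k-mers of sequence as a list, in order of appearance."""
    # The last window starts at position len(sequence) - k.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("GACTAC", 4))   # ['GACT', 'ACTA', 'CTAC']
```

The sketch returns a list so that repeated k-mers are still visible; what happens when they are put into a set is a separate question.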

From a GenBank file like data1-4.gb, extract the DNA sequence (been there - done that). Now insert all 5-mers in the DNA sequence into a set. Compute and display the following numbers: how many 5-mers you inserted into the set, how many are in the set, and how many different 5-mers there could possibly be. It may seem like the first two numbers are the same, but this is not guaranteed to be true. Explain why using your Python knowledge.
Calculate the overlap of 5-mers between any two of the data1-4.gb files. Just ask for two filenames and calculate for these.
Make a program that asks for a number (integer), then reads the dna7.fsa file (which contains insulin-like genes) and saves the entry selected by the number in selected.fsa and the rest of the entries in the file rest.fsa. This should be a fairly easy task since you have your fastaread and fastawrite functions from lesson 8, exercises 5 and 6.
Now that you can select a FASTA entry, read selected.fsa and create 5-mers from the sequence. Then read the entries from rest.fsa and create the 5-mers from each entry's sequence. Report which sequence in rest.fsa has the greatest overlap with the selected sequence, and how large that overlap is. This must be the sequence that looks most like the selected one.
Read the sequences in the file dna7.fsa. Find out which and how many of the 64 codons are not used anywhere in the sequences. Print the unused codons.
You have made a program (let's call it the X-program), which takes a file of accession numbers, start10.dat, as input and produces some output, which is in res10.dat. Now you count the lines in your input file and your output file and discover that the line counts do not match. Horror - your program does not produce output for some input. The assignment is to discover which accession numbers did not produce output. This can be done in various ways, but here you have to use a set. Print the results.
In the file ex5.acc are a lot of accession numbers, some of which are duplicates. We have earlier cleaned this file of duplicates. Let's do that again using a set. Make a program that reads the file once, finds the unique accession numbers and writes them to the file uniq5.acc.
In the data1.gb file there are 6 references (to articles). Make a program that extracts all authors from the references, eliminates those that are duplicates and prints the list of authors. You will notice that some authors seem to be the same person using different initials. You should only consider a person a duplicate if the name matches exactly. This should also work for the other GenBank entries: data2.gb, data3.gb & data4.gb.
Beware: there are traps in this exercise, so check your output properly.
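
Several of the exercises above come down to two set operations: intersection (what two collections share) and difference (what one collection has that the other lacks). The sketch below shows both on small made-up data. The sequences, the identifiers and the helper name `kmer_set` are hypothetical and are not taken from the course files; reading the real files is left to the exercises.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers found in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

# Toy data, invented for illustration only.
target = "GACTACGT"
candidates = {"seq_a": "GACTACAA", "seq_b": "TTTTGACT"}

# Intersection: count the k-mers each candidate shares with the target.
target_kmers = kmer_set(target, 4)
overlaps = {name: len(target_kmers & kmer_set(seq, 4))
            for name, seq in candidates.items()}

# The candidate with the largest shared k-mer count is the closest match.
best = max(overlaps, key=overlaps.get)
print(best, overlaps[best])      # seq_a 3

# Difference: identifiers that were submitted but never answered.
submitted = {"ID001", "ID002", "ID003"}   # made-up identifiers
answered = {"ID001", "ID003"}
missing = submitted - answered
print(sorted(missing))           # ['ID002']
```

Counting shared k-mers is only one possible similarity measure; it ignores how often a k-mer occurs and where it occurs in the sequence.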
