Dict techniques - Revision history

WikiSysop: /* Exercises to be handed in */

2025-10-03T13:58:35Z

Exercises to be handed in

← Older revision		Revision as of 15:58, 3 October 2025
Line 19:		Line 19:

	== Exercises to be handed in ==		== Exercises to be handed in ==
	# Create a dictionary where the keys are codons and the values are the one-letter codes for the amino acids. The dictionary will function as a look-up table. You can find a [[codon list]] here. You are meant to make the dict "by hand" as there is a structure lesson in that. Add a bit of smart code that tests if you made the dict right.<br> Extra: If you feel like it, you can also make a program that constructs the dict from a file, which you are responsible for making.		# Create a dictionary where the keys are codons and the values are the one-letter codes for the amino acids. The dictionary will function as a look-up table. You can find a [[codon list]] here. You are meant to make the dict "by hand" as there is a structure lesson in that. Add a bit of smart code that tests if you made the dict right.<br> Extra: If you feel like it, you can also make a program that constructs the dict from a file, which you are responsible for making.<br><br>
	# Use the dictionary from the previous exercise and your previous functions '''fastaread()''' and '''fastawrite()''' in a program, that translates all the nucleotide fasta entries in ''dna7.fsa'' to amino acid sequence. Save the results in a file ''aa7.fsa'' in fasta format. Since the sequences are now consisting of amino acids add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.		# Use the dictionary from the previous exercise and your previous functions '''fastaread()''' and '''fastawrite()''' in a program, that translates all the nucleotide fasta entries in ''dna7.fsa'' to amino acid sequence. Save the results in a file ''aa7.fsa'' in fasta format. Since the sequences are now consisting of amino acids add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.<br><br>
	# In the file ''ex5.acc'' are a lot of accession numbers, where some are duplicates. Earlier we just removed the duplicates, but now we should count them. Make a program that reads the file once, and writes a file ''order5.acc'' with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ''ex5.acc''. The accession numbers must be written in order, which means the accession with most duplicates is on top (the beginning) and the least on bottom. If two accessions have the same amount of duplicates, they need to be ordered according to the accession name, i.e. AC543322 is before BG001110.<br>Note: This is quite a tricky exercise. If you are absolutely stuck, then at least order the accessions by the number of duplicates and hand in.		# In the file ''ex5.acc'' are a lot of accession numbers, where some are duplicates. Earlier we just removed the duplicates, but now we should count them. Make a program that reads the file once, and writes a file ''order5.acc'' with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ''ex5.acc''. The accession numbers must be written in order, which means the accession with most duplicates is on top (the beginning) and the least on bottom. If two accessions have the same amount of duplicates, they need to be ordered according to the accession name, i.e. AC543322 is before BG001110.<br>Note: This is quite a tricky exercise. If you are absolutely stuck, then at least order the accessions by the number of duplicates and hand in.<br><br>
	# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>		# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font><br><br>
	# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.		# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.<br><br>
	# In the files ''geneA.txt'', ''geneB.txt'', all the way down to ''geneE.txt'' you have normalized mRNA expression data taken at the time of discovery of colon cancer for a number of patients and their survival. This is basically 2 columns in each file; The mRNA expression (x) and the number of months (y) the patient survived. For each gene you have to make a [https://en.wikipedia.org/wiki/Simple_linear_regression simple linear regression] analysis and find 3 numbers; the '''α''' (the intercept - where the line cuts the Y-axis) and '''β''' (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient ('''r''') which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. For every gene you start calculating these values.<br>[[File:calc.png\|220px]]              ''n'' = number of observations.<br>From the values you can compute the required parameters.<br>[[File:alfabeta.png\|130px]]          [[File:correlation.png\|200px]]<br>Remember to say which gene best describes survival - and why. A survival prediction can be made by calculating β * x + α, given x which is the mRNA expression.<br>Note: The genes will in reality interact with each other in ways that totally destroys our basic assumption for making a linear regression: That the data (gene expressions) are independent.<br>Make your code in a general way - there can for example be more data files. Make it easy to add them.<br>The gene with the best correlation coefficient is geneD, with a CC of 83.75%.		# In the files ''geneA.txt'', ''geneB.txt'', all the way down to ''geneE.txt'' you have normalized mRNA expression data taken at the time of discovery of colon cancer for a number of patients and their survival. This is basically 2 columns in each file; The mRNA expression (x) and the number of months (y) the patient survived. 
For each gene you have to make a [https://en.wikipedia.org/wiki/Simple_linear_regression simple linear regression] analysis and find 3 numbers; the '''α''' (the intercept - where the line cuts the Y-axis) and '''β''' (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient ('''r''') which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. For every gene you start calculating these values.<br>[[File:calc.png\|220px]]              ''n'' = number of observations.<br>From the values you can compute the required parameters.<br>[[File:alfabeta.png\|130px]]          [[File:correlation.png\|200px]]<br>Remember to say which gene best describes survival - and why. A survival prediction can be made by calculating β * x + α, given x which is the mRNA expression.<br>Note: The genes will in reality interact with each other in ways that totally destroys our basic assumption for making a linear regression: That the data (gene expressions) are independent.<br>Make your code in a general way - there can for example be more data files. Make it easy to add them.<br>The gene with the best correlation coefficient is geneD, with a CC of 83.75%.<br><br>
	# Repeat the previous exercise with a new type of data file ''gene_combined.txt'', which is more typical in real life. All genes are in one tab-separated file. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other.<br>Again, make general code. There can be more or fewer genes, and you do not need to know their names beforehand.		# Repeat the previous exercise with a new type of data file ''gene_combined.txt'', which is more typical in real life. All genes are in one tab-separated file. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other.<br>Again, make general code. There can be more or fewer genes, and you do not need to know their names beforehand.
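A minimal sketch of the regression exercises, assuming the standard simple linear regression formulas from the linked Wikipedia article (the formulas in ''calc.png'', ''alfabeta.png'' and ''correlation.png'' are not reproduced here and may be written differently). The function names, the gene names and the data lines are made up for illustration, and "best" is assumed to mean the largest absolute value of '''r'''.

```python
import math
from collections import defaultdict


def linear_regression(points):
    """Return (alpha, beta, r) for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_yy = sum(y * y for _, y in points)
    # Slope, intercept and correlation from the accumulated sums.
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    alpha = (sum_y - beta * sum_x) / n
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return alpha, beta, r


def group_by_gene(lines):
    """Collect (expression, months) pairs per gene from tab-separated lines."""
    data = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        gene, expression, months = fields
        data[gene].append((float(expression), float(months)))
    return data


# Hypothetical lines in the three-column layout, genes deliberately mixed.
example_lines = [
    "geneX\t0.1\t3\n",
    "geneY\t0.2\t10\n",
    "geneX\t0.2\t5\n",
    "geneY\t0.4\t9\n",
    "geneX\t0.3\t7\n",
    "geneY\t0.6\t11\n",
]

results = {gene: linear_regression(points)
           for gene, points in group_by_gene(example_lines).items()}
# Assumption: the "best" gene is the one with the largest |r|.
best = max(results, key=lambda gene: abs(results[gene][2]))
for gene in sorted(results):
    alpha, beta, r = results[gene]
    print(f"{gene}\talpha={alpha:.3f}\tbeta={beta:.3f}\tr={r:.4f}")
print("Best:", best)
```

Because the data is keyed by gene name as it is read, the order of the lines and the number of genes do not matter. Reading from a real file would replace example_lines with the open file object.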

WikiSysop: /* Exercises for extra practice */

2025-09-06T15:28:55Z

Exercises for extra practice

← Older revision		Revision as of 17:28, 6 September 2025
Line 29:		Line 29:
	== Exercises for extra practice ==		== Exercises for extra practice ==
	* The ''geneA-E.txt'' files all have the same structure on each line; first number is a float between 0 and 1, second number is an integer (representing months of survival after discovery of the cancer). For all files (the combined data set) find the average of the float, given the integer and display in ascending order of the integer. You need to add all the floats for a given integer together and divide by the number of floats for the integer, then you have the average for the integer. To succeed at this, you must use two dicts where the integer is the key in both. The corresponding values are the sum of the floats (for that key) and the number of times the key has been encountered in the files. Hint: Unfortunately, this does not make much biological sense, but is more in the nature of a programming exercise.		* The ''geneA-E.txt'' files all have the same structure on each line; first number is a float between 0 and 1, second number is an integer (representing months of survival after discovery of the cancer). For all files (the combined data set) find the average of the float, given the integer and display in ascending order of the integer. You need to add all the floats for a given integer together and divide by the number of floats for the integer, then you have the average for the integer. To succeed at this, you must use two dicts where the integer is the key in both. The corresponding values are the sum of the floats (for that key) and the number of times the key has been encountered in the files. Hint: Unfortunately, this does not make much biological sense, but is more in the nature of a programming exercise.
	* This exercise requires exercise 6 from [[Simple pattern matching]]. Modify the code a bit so you only compute what you have to. In the ''data1-4.gb'' files count ~~who~~ many times the different codons in the coding sequence occurs. Display.		* This exercise requires exercise 6 from [[Simple pattern matching]]. Modify the code a bit so you only compute what you have to. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occurs. Display.
	* This exercise builds on exercise 2 for this lesson. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there is of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot. You do not need to save the fasta file.		* This exercise builds on exercise 2 for this lesson. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there is of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot. You do not need to save the fasta file.
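The two-dict technique from the first extra exercise could be sketched as follows. This is an illustration only: the function name and the data lines are made up, and the columns are assumed to be separated by whitespace.

```python
def average_by_month(lines):
    """Average the float (column 1) for each integer (column 2)."""
    sums = {}    # months -> sum of the floats seen for that key
    counts = {}  # months -> number of times the key was encountered
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        value = float(fields[0])
        months = int(fields[1])
        sums[months] = sums.get(months, 0.0) + value
        counts[months] = counts.get(months, 0) + 1
    # Build the result in ascending order of the integer key.
    return {months: sums[months] / counts[months] for months in sorted(sums)}


# Hypothetical data lines; in the exercise they would come from all gene files.
example_lines = ["0.25 12", "0.75 12", "0.5 3", "0.1 7", "0.3 7"]
averages = average_by_month(example_lines)
for months, mean in averages.items():
    print(months, mean)
```

Both dicts share the same keys, so a single pass over the input is enough, and the division happens only once per key at the end.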

WikiSysop: /* Exercises to be handed in */

2025-09-06T11:36:07Z

Exercises to be handed in

← Older revision		Revision as of 13:36, 6 September 2025
Line 24:		Line 24:
	# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>		# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>
	# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.		# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.
	# In the files ''geneA.txt'', ''geneB.txt'', all the way down to ''geneE.txt'' you have normalized mRNA expression data taken at the time of discovery of colon cancer for a number of patients and their survival. This is basically 2 columns in each file; The mRNA expression (x) and the number of months (y) the patient survived. For each gene you have to make a [https://en.wikipedia.org/wiki/Simple_linear_regression simple linear regression] analysis and find 3 numbers; the '''α''' (the intercept - where the line cuts the Y-axis) and '''β''' (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient ('''r''') which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. For every gene you start calculating these values.<br>[[File:calc.png\|220px]]              ''n'' = number of observations.<br>From the values you can compute the required parameters.<br>[[File:alfabeta.png\|130px]]          [[File:correlation.png\|200px]]<br>Remember to say which gene best describes survival - and why. A survival prediction can be made by calculating β * x + α, given x which is the mRNA expression.<br>Note: The genes will in reality interact with each other in ways that totally destroys our basic assumption for making a linear regression: That the data (gene expressions) are independent.<br>Make your code in a general way - there can for example be more data files. Make it easy to add them.		# In the files ''geneA.txt'', ''geneB.txt'', all the way down to ''geneE.txt'' you have normalized mRNA expression data taken at the time of discovery of colon cancer for a number of patients and their survival. This is basically 2 columns in each file; The mRNA expression (x) and the number of months (y) the patient survived. 
For each gene you have to make a [https://en.wikipedia.org/wiki/Simple_linear_regression simple linear regression] analysis and find 3 numbers; the '''α''' (the intercept - where the line cuts the Y-axis) and '''β''' (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient ('''r''') which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. For every gene you start calculating these values.<br>[[File:calc.png\|220px]]              ''n'' = number of observations.<br>From the values you can compute the required parameters.<br>[[File:alfabeta.png\|130px]]          [[File:correlation.png\|200px]]<br>Remember to say which gene best describes survival - and why. A survival prediction can be made by calculating β * x + α, given x which is the mRNA expression.<br>Note: The genes will in reality interact with each other in ways that totally destroys our basic assumption for making a linear regression: That the data (gene expressions) are independent.<br>Make your code in a general way - there can for example be more data files. Make it easy to add them.<br>The gene with the best correlation coefficient is geneD, with a CC of 83.75%.
	# Repeat the previous exercise with a new type of data file ''gene_combined.txt'', which is more typical in real life. All genes are in one tab-separated file. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other.<br>Again, make general code. There can be more or fewer genes, and you do not need to know their names beforehand.		# Repeat the previous exercise with a new type of data file ''gene_combined.txt'', which is more typical in real life. All genes are in one tab-separated file. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other.<br>Again, make general code. There can be more or fewer genes, and you do not need to know their names beforehand.

WikiSysop: /* Exercises to be handed in */

2025-09-06T09:55:44Z

Exercises to be handed in

← Older revision		Revision as of 11:55, 6 September 2025
Line 21:		Line 21:
	# Create a dictionary where the keys are codons and the values are the one-letter codes for the amino acids. The dictionary will function as a look-up table. You can find a [[codon list]] here. You are meant to make the dict "by hand" as there is a structure lesson in that. Add a bit of smart code that tests if you made the dict right.<br> Extra: If you feel like it, you can also make a program that constructs the dict from a file, which you are responsible for making.		# Create a dictionary where the keys are codons and the values are the one-letter codes for the amino acids. The dictionary will function as a look-up table. You can find a [[codon list]] here. You are meant to make the dict "by hand" as there is a structure lesson in that. Add a bit of smart code that tests if you made the dict right.<br> Extra: If you feel like it, you can also make a program that constructs the dict from a file, which you are responsible for making.
	# Use the dictionary from the previous exercise and your previous functions '''fastaread()''' and '''fastawrite()''' in a program, that translates all the nucleotide fasta entries in ''dna7.fsa'' to amino acid sequence. Save the results in a file ''aa7.fsa'' in fasta format. Since the sequences are now consisting of amino acids add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.		# Use the dictionary from the previous exercise and your previous functions '''fastaread()''' and '''fastawrite()''' in a program, that translates all the nucleotide fasta entries in ''dna7.fsa'' to amino acid sequence. Save the results in a file ''aa7.fsa'' in fasta format. Since the sequences are now consisting of amino acids add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.
	# In the file ''ex5.acc'' are a lot of accession numbers, where some are duplicates. Earlier we just removed the duplicates, but now we should count them. Make a program that reads the file once, and writes a file ''order5.acc'' with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ''ex5.acc''. The accession numbers must be written in order, which means the accession with most duplicates is on top (the beginning) and the least on bottom. If two accessions have the same amount of duplicates, they need to be ordered according to the accession name, i.e. AC543322 is before BG001110.		# In the file ''ex5.acc'' are a lot of accession numbers, where some are duplicates. Earlier we just removed the duplicates, but now we should count them. Make a program that reads the file once, and writes a file ''order5.acc'' with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ''ex5.acc''. The accession numbers must be written in order, which means the accession with most duplicates is on top (the beginning) and the least on bottom. If two accessions have the same amount of duplicates, they need to be ordered according to the accession name, i.e. AC543322 is before BG001110.<br>Note: This is quite a tricky exercise. If you are absolutely stuck, then at least order the accessions by the number of duplicates and hand in.
	# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>		# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>
	# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.		# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.
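The ordering rule in the accession counting exercise (most occurrences first, ties broken by accession name) can be expressed with a single sort key. The sketch below is only one possible approach; the function name and the input list are made up, and reading ''ex5.acc'' and writing ''order5.acc'' are left out.

```python
def count_and_order(accessions):
    """Count each accession and order by count (descending), then name."""
    counts = {}
    for acc in accessions:
        counts[acc] = counts.get(acc, 0) + 1
    # Negating the count sorts high counts first while names stay ascending.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical accession list standing in for the lines of the input file.
example = ["BG001110", "AC543322", "AC24677", "BG001110",
           "AC24677", "AC543322", "ZZ000001"]
ordered = count_and_order(example)
for acc, n in ordered:
    print(acc, n)
```

The tuple key works because Python compares tuples element by element, so the name is only consulted when two counts are equal.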

WikiSysop at 09:01, 5 September 2025

2025-09-05T09:01:42Z

← Older revision		Revision as of 11:01, 5 September 2025
Line 24:		Line 24:
	# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>		# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>
	# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.		# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.
	# ~~The~~ ''geneA-E.txt'' ~~files~~ all have the ~~same structure on~~ each ~~line~~; ~~first~~ number is a ~~float between 0~~ and ~~1, second number is an integer~~ (~~representing months of survival after discovery of~~ the ~~cancer~~)~~. For all files~~ (the ~~combined data set~~) ~~find~~ the ~~average of~~ the ~~float~~, ~~given~~ the ~~integer and display in ascending order~~ of the ~~integer~~. You ~~need to add all~~ the ~~floats for a given integer together and divide by~~ the number of ~~floats for~~ the ~~integer, then~~ you ~~have~~ the ~~average for the integer~~. ~~To succeed at this~~, ~~you must use two dicts where the integer~~ is the ~~key in both~~. The ~~corresponding values are~~ the ~~sum of the floats~~ (for ~~that key) and~~ the ~~number~~ of ~~times the key has been encountered~~ in the ~~files~~.		# In the files ''geneA.txt'', ''geneB.txt'', all the way down to ''geneE.txt'' you have normalized mRNA expression data taken at the time of discovery of colon cancer for a number of patients and their survival. This is basically 2 columns in each file; The mRNA expression (x) and the number of months (y) the patient survived. For each gene you have to make a [https://en.wikipedia.org/wiki/Simple_linear_regression simple linear regression] analysis and find 3 numbers; the '''α''' (the intercept - where the line cuts the Y-axis) and '''β''' (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient ('''r''') which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. For every gene you start calculating these values.<br>[[File:calc.png\|220px]]              ''n'' = number of observations.<br>From the values you can compute the required parameters.<br>[[File:alfabeta.png\|130px]]          [[File:correlation.png\|200px]]<br>Remember to say which gene best describes survival - and why. 
A survival prediction can be made by calculating β * x + α, given x which is the mRNA expression.<br>Note: The genes will in reality interact with each other in ways that totally destroys our basic assumption for making a linear regression: That the data (gene expressions) are independent.<br>Make your code in a general way - there can for example be more data files. Make it easy to add them.
			# Repeat the previous exercise with a new type of data file ''gene_combined.txt'', which is more typical in real life. All genes are in one tab-separated file. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other.<br>Again, make general code. There can be more or fewer genes, and you do not need to know their names beforehand.
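The intersection and union variants of the ''slinger.txt'' / ''hoist.txt'' exercises differ only in which keys are kept. A minimal sketch, assuming each file has already been read into a list of tab-separated lines; the function names, accession names and numbers are made up for illustration.

```python
def parse_scores(lines):
    """Turn tab-separated 'accession<TAB>probability' lines into a dict."""
    scores = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        scores[fields[0]] = float(fields[1])
    return scores


def combine(first, second, union=False):
    """Average shared accessions; with union=True also keep unshared ones."""
    if union:
        keys = first.keys() | second.keys()
    else:
        keys = first.keys() & second.keys()
    combined = {}
    for acc in sorted(keys):
        if acc in first and acc in second:
            combined[acc] = (first[acc] + second[acc]) / 2
        elif acc in first:
            combined[acc] = first[acc]
        else:
            combined[acc] = second[acc]
    return combined


# Hypothetical contents of the two input files.
slinger = parse_scores(["AC1\t0.25\n", "AC2\t0.5\n"])
hoist = parse_scores(["AC2\t1.0\n", "AC3\t0.75\n"])

both = combine(slinger, hoist)
either = combine(slinger, hoist, union=True)
for acc, value in either.items():
    print(f"{acc}\t{value}")
```

Dict key views support the set operators & and |, which is why no explicit set conversion is needed here.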

	== Exercises for extra practice ==		== Exercises for extra practice ==
			* The ''geneA-E.txt'' files all have the same structure on each line; first number is a float between 0 and 1, second number is an integer (representing months of survival after discovery of the cancer). For all files (the combined data set) find the average of the float, given the integer and display in ascending order of the integer. You need to add all the floats for a given integer together and divide by the number of floats for the integer, then you have the average for the integer. To succeed at this, you must use two dicts where the integer is the key in both. The corresponding values are the sum of the floats (for that key) and the number of times the key has been encountered in the files. Hint: Unfortunately, this does not make much biological sense, but is more in the nature of a programming exercise.
	* This exercise requires exercise 6 from [[Simple pattern matching]]. Modify the code a bit so you only compute what you have to. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occur. Display.		* This exercise requires exercise 6 from [[Simple pattern matching]]. Modify the code a bit so you only compute what you have to. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occur. Display.
	* This exercise builds on exercise 2 for this lesson. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there are of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot. You do not need to save the fasta file.		* This exercise builds on exercise 2 for this lesson. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there are of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot. You do not need to save the fasta file.
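
The two-dict approach described for the ''geneA-E.txt'' averaging exercise can be sketched as follows. This is only an illustration under assumptions: the function name <code>average_by_month</code> and the <code>sample</code> lines are invented, and the columns are assumed to be separated by whitespace (the exercise text does not say which separator the files use).

```python
def average_by_month(lines):
    """Average the float per integer key, using one dict for sums and one for counts."""
    sums = {}
    counts = {}
    for line in lines:
        fields = line.split()          # assumed: whitespace-separated columns
        value = float(fields[0])
        months = int(fields[1])
        sums[months] = sums.get(months, 0.0) + value
        counts[months] = counts.get(months, 0) + 1
    # Ascending order of the integer key.
    return [(months, sums[months] / counts[months]) for months in sorted(sums)]


# Hypothetical lines standing in for the combined content of all files.
sample = ["0.25 12", "0.75 3", "0.75 12", "0.5 3"]

for months, mean in average_by_month(sample):
    print(months, mean)
```

To cover several files, the lines of each file would be fed through the same two dicts before the averages are computed.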

WikiSysop: /* Exercises for extra practice */

2025-09-03T11:57:13Z

Exercises for extra practice

← Older revision		Revision as of 13:57, 3 September 2025
Line 27:		Line 27:

	== Exercises for extra practice ==		== Exercises for extra practice ==
	* Given a tab-separated file with 3 columns; StudentID, CourseNumber, Grade. Can you find a way to load the grades for a student in a retrievable manner into (some of) the python data structures learned so far? Retrievable means here that you can find the grades for a student if you know the studentID.<br>Explain your approach. Hint: It is not necessarily efficient.		* This exercise requires exercise 6 from [[Simple pattern matching]]. Modify the code a bit so you only compute what you have to. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occur. Display.
	* This exercise requires ~~that you did the last two practice exercises in~~ [[Simple ~~Pattern Matching~~]]. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occur. Display.		* This exercise builds on exercise 2 for this lesson. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there are of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot. You do not need to save the fasta file.
	* This exercise builds on ~~mandatory~~ exercise 2. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there is of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot.

WikiSysop at 11:51, 3 September 2025

2025-09-03T11:51:39Z

← Older revision		Revision as of 13:51, 3 September 2025
Line 24:		Line 24:
	# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>		# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>
	# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.		# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.
			# The ''geneA-E.txt'' files all have the same structure on each line; first number is a float between 0 and 1, second number is an integer (representing months of survival after discovery of the cancer). For all files (the combined data set) find the average of the float, given the integer and display in ascending order of the integer. You need to add all the floats for a given integer together and divide by the number of floats for the integer, then you have the average for the integer. To succeed at this, you must use two dicts where the integer is the key in both. The corresponding values are the sum of the floats (for that key) and the number of times the key has been encountered in the files.
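
The intersection and union variants of the ''slinger.txt'' / ''hoist.txt'' exercises differ only in what happens to accessions found in a single file. A minimal sketch, assuming each input line is <code>accession&lt;TAB&gt;probability</code>; the helper names <code>read_scores</code> and <code>combine</code>, the <code>union</code> flag and the sample accessions are illustrative, not part of the exercise text.

```python
def read_scores(lines):
    """Map accession -> probability from tab-separated lines."""
    scores = {}
    for line in lines:
        accession, value = line.rstrip("\n").split("\t")
        scores[accession] = float(value)
    return scores


def combine(first, second, union=False):
    """Average shared accessions; with union=True also keep the unshared ones."""
    combined = {}
    for accession in first:
        if accession in second:
            combined[accession] = (first[accession] + second[accession]) / 2
        elif union:
            combined[accession] = first[accession]
    if union:
        for accession in second:
            if accession not in first:
                combined[accession] = second[accession]
    return combined


# Hypothetical input lines; a real program would read the two files.
slinger = read_scores(["AC001\t0.25", "AC002\t0.5"])
hoist = read_scores(["AC002\t1.0", "AC003\t0.75"])

for accession, value in sorted(combine(slinger, hoist, union=True).items()):
    print(f"{accession}\t{value}")
```

Writing ''combined.txt'' is then a matter of sending the same tab-separated lines to a file instead of the screen.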

	== Exercises for extra practice ==		== Exercises for extra practice ==
	* Given a tab-separated file with 3 columns; StudentID, CourseNumber, Grade. Can you find a way to load the grades for a student in a retrievable manner into (some of) the python data structures learned so far? Retrievable means here that you can find the grades for a student if you know the studentID.<br>Explain your approach. Hint: It is not necessarily efficient.		* Given a tab-separated file with 3 columns; StudentID, CourseNumber, Grade. Can you find a way to load the grades for a student in a retrievable manner into (some of) the python data structures learned so far? Retrievable means here that you can find the grades for a student if you know the studentID.<br>Explain your approach. Hint: It is not necessarily efficient.
	* The ''geneA-E.txt'' files all have the same structure on each line; first number is a float between 0 and 1, second number is an integer. For all files (the combined data set) find the average of the float, given the integer and display in ascending order of the integer. You need to add all the floats for a given integer together and divide by the number of floats for the integer, then you have the average for the integer. To succeed at this, you must use two dicts where the integer is the key in both. The corresponding values are the sum of the floats (for that key) and the number of times the key has been encountered in the files.
	* This exercise requires that you did the last two practice exercises in [[Simple Pattern Matching]]. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occur. Display.		* This exercise requires that you did the last two practice exercises in [[Simple Pattern Matching]]. In the ''data1-4.gb'' files count how many times the different codons in the coding sequence occur. Display.
	* This exercise builds on mandatory exercise 2. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there is of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot.		* This exercise builds on mandatory exercise 2. You must read the ''dna7.fsa'' file and translate the DNA sequences to protein sequence. Report the frequencies of the various amino acids for the entire file - all sequences (not individual sequences). That is - count how many there is of each amino acid (a total) in the translated sequences, compute the frequency of each (Number_of_this_amino_acid/Total_number_of_amino_acids) and print the results as "S 0.0123", i.e. 4 digits after the dot.
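
The frequency computation in the amino acid exercise can be sketched like this. It assumes the translation step is already done and starts from a list of protein strings; the function name <code>amino_acid_frequencies</code> and the two short sequences are made up for illustration.

```python
def amino_acid_frequencies(proteins):
    """Return amino acid -> frequency over all sequences taken together."""
    counts = {}
    total = 0
    for protein in proteins:
        for amino_acid in protein:
            counts[amino_acid] = counts.get(amino_acid, 0) + 1
            total += 1
    # Number_of_this_amino_acid / Total_number_of_amino_acids
    return {amino_acid: counts[amino_acid] / total for amino_acid in counts}


# Hypothetical translated sequences.
freqs = amino_acid_frequencies(["MSS", "MK"])

for amino_acid in sorted(freqs):
    # The .4f format gives 4 digits after the dot, as in "S 0.0123".
    print(f"{amino_acid} {freqs[amino_acid]:.4f}")
```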

WikiSysop: /* Exercises to be handed in */

2025-09-03T11:46:56Z

Exercises to be handed in

← Older revision		Revision as of 13:46, 3 September 2025
Line 19:		Line 19:

	== Exercises to be handed in ==		== Exercises to be handed in ==
	# Create a dictionary where the keys are codons and the value are the one-letter-code for the amino acids. The dictionary will function as a look-up table. You can find a [[codon list]] here. You are meant to make the dict "by hand" as there is a structure lesson in that. If you feel like it you can in addition make a program that constructs the dict from a file.		# Create a dictionary where the keys are codons and the value are the one-letter-code for the amino acids. The dictionary will function as a look-up table. You can find a [[codon list]] here. You are meant to make the dict "by hand" as there is a structure lesson in that. Add a bit of smart code that tests if you made the dict right.<br> Extra: If you feel like it you can in addition make a program that constructs the dict from a file, which you are responsible for making.
	# Use the dictionary from the previous exercise in a program, that translates all the nucleotide fasta entries in ''dna7.fsa'' to amino acid sequence. Save the results in a file ''aa7.fsa'' in fasta format. Since the ~~sequence is~~ now consisting of amino acids add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.		# Use the dictionary from the previous exercise and your previous functions '''fastaread()''' and '''fastawrite()''' in a program, that translates all the nucleotide fasta entries in ''dna7.fsa'' to amino acid sequence. Save the results in a file ''aa7.fsa'' in fasta format. Since the sequences are now consisting of amino acids add 'Amino Acid Sequence' to each header. The STOP codon is NOT a part of the amino acid sequence. Think about what STOP means.
	# In the file ''ex5.acc'' are a lot of accession numbers, where some are duplicates. Earlier we just removed the duplicates, now we should count them. Make a program that reads the file once, and writes a file ''~~noorder5~~.acc'' with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ''ex5.acc''.		# In the file ''ex5.acc'' are a lot of accession numbers, where some are duplicates. Earlier we just removed the duplicates, but now we should count them. Make a program that reads the file once, and writes a file ''order5.acc'' with the unique accession numbers and the number of occurrences in the file. A line should look like this: "AC24677 2", if this accession occurs twice in ''ex5.acc''. The accession numbers must be written in order, which means the accession with most duplicates is on top (the beginning) and the least on bottom. If two accessions have the same amount of duplicates, they need to be ordered according to the accession name, i.e. AC543322 is before BG001110.
	~~# Improve~~ the ~~previous exercise by saving~~ the accessions ~~in order~~ of ~~occurrences with~~ the ~~top counts first in the file ''order5~~.~~acc''~~.
	# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>		# <font color="#AA00FF">In the tab-separated files ''slinger.txt'' and ''hoist.txt'' are two columns with an accession number and a numeric result; a probability between 0 and 1. The numbers are from running 2 different programs (slinger and hoist, if you are in doubt). You must combine these probabilities - basically taking the average of the two numbers - for each accession number and write the result in a file ''combined.txt''. The file should look like the sources, i.e. tab-separated with accession in column 1 and number in column 2. Unfortunately, the two programs have not been run from the same set of accession numbers, so some of the results are only available in one of the input files. In such case you ignore/discard the data for that accession. Only save results in the output file when the accession is in both of the input files.</font>
	# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.		# Using above method gives you too little data. You try this time to combine your two input sets differently. If an accession is in both input files you use the average, if it is in only one, you just use the number straight in the output file. This is effectively making a union of the input instead of an intersection.
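
The ordering rule in the accession-counting exercise (most duplicates first, ties broken by accession name) maps onto a single sort with a two-part key. A minimal sketch; the function name <code>count_accessions</code> and the sample list are illustrative, and a real program would read ''ex5.acc'' and write ''order5.acc''.

```python
def count_accessions(lines):
    """Count each accession, then order by count (descending) and name (ascending)."""
    counts = {}
    for line in lines:
        accession = line.strip()
        if accession:                  # skip blank lines
            counts[accession] = counts.get(accession, 0) + 1
    # Negating the count sorts high counts first while names stay ascending.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical file content, one accession per line.
sample = ["BG001110", "AC543322", "AC24677", "AC24677", "BG001110"]

for accession, count in count_accessions(sample):
    print(accession, count)
```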

WikiSysop: Created page with "NOTOC {| width=500 style="font-size: 10px; float:right; margin-left: 10px; margin-top: -56px;" |Previous: Set techniques |Next: Regular expressions |} == Required course material for the lesson == Powerpoint: [https://teaching.healthtech.dtu.dk/material/22116/22116_11-Dicts.ppt Dictionaries]