Python object model

Latest revision as of 16:40, 3 October 2025


Required course material for the lesson

Powerpoint: Object model and complex data (https://teaching.healthtech.dtu.dk/material/22116/22116_13-Objects.ppt)
Resource: Example code - Complex data

Subjects covered

  • Python objects
  • Identity
  • Mutable vs immutable
  • Complex data
  • Exam format
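
As a small illustration of identity and mutability, here is a minimal sketch. The function names doubled and double_in_place are made up for this example and are not part of the course material; the point is only the difference between returning a new object and changing the object the caller passed in.

```python
def doubled(values):
    """Return a NEW list; the caller's list is left untouched."""
    return [2 * v for v in values]


def double_in_place(values):
    """Change the caller's list object itself; nothing is returned."""
    for i in range(len(values)):
        values[i] = 2 * values[i]


numbers = [1, 2, 3]
alias = numbers                       # a second name for the same object
result = doubled(numbers)             # numbers is still [1, 2, 3] here

same_object = alias is numbers        # True: both names share one identity
new_object = result is not numbers    # True: doubled() built a new list

double_in_place(numbers)
# alias and numbers are the same object, so both now show the doubled values.

text = "abc"
upper = text.upper()                  # strings are immutable: a new string is returned
```

Lists are mutable, so a function can change them without returning anything. Strings and numbers are immutable, so any "change" always produces a new object.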

Exercises to be handed in

Important - read this before starting
This exercise set will be similar to the format of the exam. The content will obviously be different.
You have to download this Python file (https://teaching.healthtech.dtu.dk/material/22116/22116_13.py). It contains some framework code, but mostly unfinished functions. Each exercise is about finishing one of the function groups in the file. You can write the functions directly in the Python file, or write them in VS Code or another Jupyter Notebook editor, but then they have to be copied over to the Python file. You must hand in the finished Python file, not a .ipynb file.
Inability to understand or perform this process will make you fail the exam, so it is worth spending some time on the procedure.

  1. The file test1.dat contains the results from an experiment, where every line has the form:
    AccessionNumber Number Number Number ....
    The files test2.dat and test3.dat contain results from similar experiments, but with a slightly different gene set. You want to find the average of the numbers from all experiments for each accession number. Save your results in the file combinedresults.txt in the form:
    AccessionNumber SingleAverageNumberOfAll3Experiments
    Of course, a certain gene might be present in only one or two experiments; in that case you calculate the average for those. You must use one of the complex data structures to store this data. Hint: a dict of lists.

  2. Create a main function that reads a tab-separated file with numbers, matrix.dat (to be understood as a matrix), and stores the numbers in a matrix (list of lists). Having read the matrix from the file, it should transpose it (rows to columns and columns to rows) and save the transposed matrix in the file trans1matrix.dat. The output should look like the input, that is, not like a Python data structure.
    You must construct a function like transpose(matrix), which takes a matrix as input and returns the transposed matrix, without using any global variables.
    matrix = transpose(matrix)
    This is the easiest method, but momentarily the most memory-consuming one: you just return the transposed matrix, i.e. a new data structure.
    How do you easily check if it works? Well, transposing twice yields the original matrix. Check out Wikipedia's entry on transposing a matrix (http://en.wikipedia.org/wiki/Transpose).

  3. This is the same problem as the previous exercise, except that your transpose function has to transpose the matrix in place, with no returned matrix, i.e. the original matrix data structure is changed.
    transpose(matrix)
    This time the output file is trans2matrix.dat, which should be identical to trans1matrix.dat.

  4. Study the file dna-array.dat a bit. This is real DNA array data taken from a number of persons, some controls and some suffering from colon cancer. If you look at the second line, there are a lot of 0s and 1s. A '0' means that the values in that column are from a cancer patient, and a '1' means that the data are from a control (healthy person). The data are all log(intensity), i.e. the logarithm of the measured intensity of the relevant spot on the DNA chip. The data in this file will be used in the coming exercises. The columns are tab-separated. The second item on each line is the accession number for that particular gene.
    Now make a main function that extracts data from one file and saves it in another, given the accession number, the input file dna-array.dat and the output file column.tab. Search the file for the data concerning that accession number. If the program does not find it (you gave a wrong accession number), it should complain and stop. Otherwise it shall output the data in two tab-separated columns. The first column shall be the data from the cancer patients, the second column the data from the controls. The number of sick and healthy people is not the same, so be able to handle that.

  5. The numbers in the input file dna-array.dat should be normalized between 0 and 1 for each line with an accession number, i.e. normalization only for the individual line - not across the data set. Write the result out in the file dna-array-norm.dat, but NOT the control lines, i.e. lines where the annotation says 'control'. The resulting file will be similar to the original, but control lines are removed and the numbers are different. The problem can (and should) be solved one line at a time.

  6. Read the file dna-array-norm.dat and transform all the numbers less than 0.5 to 0, and numbers at 0.5 or more to 1. Now for each line/accession calculate the average of the control group numbers and the cancer group numbers. If the two averages are more than 0.4 from each other, this is considered significant and the accession should be saved in the file regulation.txt along with a message up or down if it is an up regulation or a down regulation of the cancer group compared to the control. That means each output line looks like "H80240 up" or "H34534 down".
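
The "dict of lists" hinted at in exercise 1 can be sketched as follows. This is only an illustration on made-up in-memory lines, not a solution to the exercise: the accession numbers, the values and the helper name add_lines are invented, reading the actual files and writing combinedresults.txt are left out, and the sketch assumes that the average is taken over all numbers pooled together for an accession.

```python
def add_lines(lines, data):
    """Add the numbers on each line to data, a dict mapping accession -> list of floats."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip empty lines
        accession = fields[0]
        numbers = [float(x) for x in fields[1:]]
        # setdefault creates the empty list the first time an accession is seen
        data.setdefault(accession, []).extend(numbers)


# Made-up stand-ins for the contents of two experiment files
experiment_a = ["ACC1 1.0 2.0 3.0", "ACC2 4.0 6.0"]
experiment_b = ["ACC1 6.0", "ACC3 5.0 7.0"]

data = {}
add_lines(experiment_a, data)             # data is mutable, so it is filled in place
add_lines(experiment_b, data)

# One average per accession, over all numbers collected for it
averages = {acc: sum(nums) / len(nums) for acc, nums in data.items()}
```

Note that add_lines returns nothing: the dict is a mutable object, so the function changes the very object the caller passed in. An accession that occurs in only one experiment simply ends up with a shorter list.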

Exercises for extra practice