Scientific Libraries, Pandas, Numpy
Revision as of 15:02, 4 April 2024
Required course material for the lesson
Powerpoint: Scientific libraries, Pandas & NumPy
Online: https://pandas.pydata.org/ Pandas documentation
Online: https://numpy.org/ NumPy documentation
Subjects covered
General intro to scientific libraries
Pandas
NumPy
Exercises to be handed in
Pandas
During this part of the exercise, you will be working with the data that was used to validate the tool ResFinder (https://pubmed.ncbi.nlm.nih.gov/32780112/). For the validation, different centers around the world (Denmark, Germany, Belgium, UK and USA) isolated several bacterial species found in clinical and surveillance environments, and searched for antimicrobial resistance both in the laboratory and with ResFinder. In the laboratory, the isolated bacteria were subjected to MIC (Minimum Inhibitory Concentration) testing with different antimicrobials; in other words, how much antimicrobial has to be given to the bacterial isolates until they stop growing. If the MIC value is higher than certain standards, the bacterium is considered resistant to that antimicrobial. Usually, when bacteria that should be killed by an antimicrobial turn out to be resistant, it is because they have acquired a gene or mutation that makes them resistant to that substance. ResFinder is a bioinformatic tool that tries to find those genes/mutations in the sequenced DNA of bacteria.
A big part of the ResFinder tool validation was to receive the reports from the different centers (reports from laboratories and bioinformatic teams) and analyze them together. You will carry out this step during this exercise. The necessary data is in the zip file pandas_exercise.zip.
- Load the metadata files (ending in _ids.txt) from Belgium, Denmark, Germany, UK and USA, and create a dataframe by stacking the five dataframes. The final dataframe should include an extra column indicating which country each sample comes from. Get the number of samples that come from Surveillance and from Clinical origins according to the Source column (Hint: the groupby function is your friend).
- Do the same as in the first exercise with the lab files (ending in lab_results.txt) and the bioinformatic files (ending in bioinf_results.txt) for all countries. The columns of the bioinformatic results should be strings or objects, while the lab results should be strings (samples) and floats (the rest of the columns). As you might have noticed, USA and UK did not follow the format that we asked for. You will have to go from [MIC: <mic_value>] to [<mic_value>], where mic_value is a float. UPDATE: It seems UK also added a sneaky "<". Remove it with the same method.
- Join the three dataframes row-wise, using the IDs dataframe to map the read IDs (bioinformatic results) to the sample IDs (laboratory results). Notice that you might lose data on the way; that is fine. Hint: the merge or join function is your friend here.
- Not all the laboratories have performed analysis on all the antimicrobials. Try to get the antimicrobials that USA has not performed analysis on. (Hint: When a cell in a column of float numbers is empty, pandas uses the value "np.nan".)
- Save the final dataset that you got from the last exercise under the name resfinder_project.tsv. The file must be tab-separated, and the index should not be included.
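The pandas operations named in the exercises above can be tried out on a toy example first. The sketch below is only an illustration and not a solution: all column names (sample_id, read_id, ampicillin, colistin, prediction), all values, and the helper clean_lab are hypothetical, and the real files in pandas_exercise.zip may be laid out differently.

```python
import pandas as pd

# Hypothetical per-country metadata; real column names may differ.
ids_by_country = {
    "Denmark": pd.DataFrame({"sample_id": ["s1", "s2"], "read_id": ["r1", "r2"],
                             "Source": ["Clinical", "Surveillance"]}),
    "UK": pd.DataFrame({"sample_id": ["s3"], "read_id": ["r3"], "Source": ["Surveillance"]}),
    "USA": pd.DataFrame({"sample_id": ["s4"], "read_id": ["r4"], "Source": ["Clinical"]}),
}
# Stack the tables and add a column telling which country each row comes from.
metadata = pd.concat(
    [df.assign(Country=country) for country, df in ids_by_country.items()],
    ignore_index=True,
)
source_counts = metadata.groupby("Source").size()  # number of samples per Source


def clean_lab(df):
    """Hypothetical helper: strip 'MIC:' and '<' so every MIC column becomes float."""
    out = df.copy()
    for column in out.columns.drop("sample_id"):
        text = out[column].astype(str)
        text = text.str.replace("MIC:", "", regex=False).str.replace("<", "", regex=False)
        out[column] = pd.to_numeric(text.str.strip()).astype(float)
    return out


# Toy lab results: UK and USA deviate from the plain-number format.
lab_by_country = {
    "Denmark": pd.DataFrame({"sample_id": ["s1", "s2"],
                             "ampicillin": [2.0, 8.0], "colistin": [1.0, 4.0]}),
    "UK": pd.DataFrame({"sample_id": ["s3"], "ampicillin": ["<0.5"], "colistin": ["MIC: 2.0"]}),
    "USA": pd.DataFrame({"sample_id": ["s4"], "ampicillin": ["MIC: 16.0"]}),
}
lab = pd.concat(
    [clean_lab(df).assign(Country=country) for country, df in lab_by_country.items()],
    ignore_index=True,
)

# Antimicrobials without any USA measurement show up as all-NaN columns.
usa = lab[lab["Country"] == "USA"]
untested_in_usa = [column for column in usa.columns if usa[column].isna().all()]

# Toy bioinformatic results; r2 is missing on purpose, so one row is lost in the merge.
bioinf = pd.DataFrame({"read_id": ["r1", "r3", "r4"],
                       "prediction": ["Resistant", "Susceptible", "Resistant"]})
merged = metadata.merge(lab, on=["sample_id", "Country"]).merge(bioinf, on="read_id")

# Tab-separated text without the index; passing a file name as the first
# argument of to_csv would write it to disk instead of returning a string.
tsv_text = merged.to_csv(sep="\t", index=False)
```

Both merges here are inner joins (the default), which is why the sample without bioinformatic results disappears from the final table.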
The following exercises should not be started before Thursday.
Numpy
- You are now going to work with gene expression data. Your employer has given you the results of the analysis from two different machines, run on the same samples. The analysis has been done on ten samples, and 5000 genes have been analyzed. In other words, you have the data from two machines (gene_expression1.txt and gene_expression2.txt), each holding an array of shape (10, 5000) (samples, genes). Read the gene_expression1 file and store it in an array.
- It seems the second machine outputs the results in the format (genes, samples), i.e. (5000, 10). Read the file, store it in an array and turn it into an array with shape (10, 5000).
- Your employer wants to normalize each sample. In other words, you need to subtract the mean of each row from that row (Sample_normalized_n = Sample_n - Mean_sample_n).
- Your employer asks you to save both arrays in the same file, first stacking them row-wise and then saving them in a .npy file: normalized_array.npy.
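The array operations in the steps above can be sketched on small random arrays. This is a minimal illustration under stated assumptions, not a solution: the toy shapes (3 samples, 4 genes), the helper normalize_rows and the temporary folder are all hypothetical. For the real text files, a reader such as np.loadtxt may be suitable, depending on how the files are delimited.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
machine1 = rng.normal(size=(3, 4))      # toy stand-in with shape (samples, genes)
machine2_raw = rng.normal(size=(4, 3))  # toy stand-in with shape (genes, samples)

# Transpose so that both arrays share the (samples, genes) layout.
machine2 = machine2_raw.T


def normalize_rows(array):
    """Hypothetical helper: subtract each row's mean from that row."""
    # keepdims=True keeps the means as a column so broadcasting works row by row.
    return array - array.mean(axis=1, keepdims=True)


# Stack row-wise: the result has shape (6, 4) for the toy data.
stacked = np.vstack([normalize_rows(machine1), normalize_rows(machine2)])

# Save to a .npy file in a temporary folder and read it back to check the round trip.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "normalized_array.npy")
    np.save(path, stacked)
    reloaded = np.load(path)
```

After normalization every row mean is zero up to floating-point error, which gives a quick sanity check on the real data as well.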