WikiSysop: /* Exercises to be handed in */

2024-03-13T14:05:00Z

2024-03-06T14:08:20Z
__NOTOC__
{| width=500 style="font-size: 10px; float:right; margin-left: 10px; margin-top: -56px;"
|Previous: [[Runtime evaluation of algorithms]]
|Next: [[Scientific Libraries, Plotting]]
|}
== Required course material for the lesson ==
Powerpoint: [https://teaching.healthtech.dtu.dk/material/22113/22113_11-Statistics_SciPy.ppt Statistics, SciPy] 


== Subjects covered ==
Simple statistics with standard Python<br>
Statistics with the SciPy library

== Exercises to be handed in ==
In these exercises you should set the '''seed''' of the random number generator used in NumPy, so you can repeat your experiments.
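A minimal sketch of what setting the seed can look like, assuming NumPy's <code>default_rng</code> generator is used (the legacy <code>np.random.seed</code> call is an alternative). The seed value 42 and the sample size are arbitrary choices for illustration, not part of the exercise.

```python
import numpy as np

# The seed value 42 is an arbitrary choice; any fixed integer works.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=5)

# A second generator created with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(42)
sample_again = rng_again.normal(loc=0, scale=1, size=5)

print(np.array_equal(sample, sample_again))
```

Creating the generator once at the top of the program and passing it around keeps the whole experiment repeatable from a single seed.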
# The reliability of statistics. NumPy can generate lists of random numbers drawn from a normal distribution (just use loc=0 and scale=1). These numbers will "obviously" also follow the normal distribution. SciPy can test how well a list of numbers follows the normal distribution. The task is to check how well the random numbers follow the normal distribution.
#: SciPy's normaltest returns 2 values: 1) ''statistic'', which is a value for how "regular" (non-skewed and not too many outliers) the numbers are. The lower the value the better, i.e. more regular. 2) ''pvalue'', which is the probability of getting a ''statistic'' at least this large if the sample really is normally distributed (the null hypothesis). A low value is evidence against the null hypothesis and suggests that the sample is not normally distributed. The higher the value the better, i.e. the sample is consistent with a normal distribution - use the standard cutoff of 0.05. You are welcome to experiment with the cutoff. ''statistic'' and ''pvalue'' are in reality two values that express the same thing - to be clear, there is a linear relationship between ''statistic'' and log(''pvalue'').
#: You must test if the size of your random list has any influence on the quality (as in being normally distributed) of the generated numbers. Test with list sizes from 20 to 10000 - use appropriate intervals. To make sure you have a good sample, generate 10000 samples for each list size. You should find out how many of your 10000 samples "make the cut", i.e. have an acceptable ''pvalue''.
#: When you have found that number, you should consider the statistical value of checking if something is normally distributed. Do you feel you can convincingly say that "this" is a normal distribution? Perhaps read [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/] and [https://www.simplypsychology.org/p-value.html https://www.simplypsychology.org/p-value.html] before you answer.
# Since your datasets in the previous exercise in essence originate from the same distribution (and you have seen how different the samples can be), is it possible to find two samples that look so different that you can confidently (but in error) say that they come from two different distributions, specifically that they have different means? Hint: ttest_ind
# Some may recognize this exercise: You have a data file ''gene_combined.txt'', which is a tab-separated file - perfect for pandas. There are 3 columns: gene name, normalized mRNA expression and survival in months. The data appears in no particular order, and data lines for several genes might be mixed with each other. For each gene you have to make a simple linear regression analysis and find 3 numbers: α (the intercept - where the line cuts the Y-axis) and β (the slope), the two coefficients that describe the line running best through the data points, and the correlation coefficient (r), which describes how well the line fits the data. You must identify the gene that best indicates how long the patient survives. Hint: SciPy linregress.
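The three SciPy functions hinted at above can be called roughly as sketched here. This is only an illustration of the call shapes on small made-up data, not a solution to the exercises: the seed, the sample sizes and the toy <code>x</code>/<code>y</code> arrays are assumptions chosen for the example, and reading and grouping ''gene_combined.txt'' is left out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, for illustration only

# normaltest: the null hypothesis is that the sample is normally distributed.
sample = rng.normal(loc=0, scale=1, size=200)
statistic, pvalue = stats.normaltest(sample)
makes_the_cut = pvalue >= 0.05  # not rejected at the standard cutoff

# The statistic is compared against a chi-squared distribution with 2 degrees
# of freedom, so log(pvalue) is linear in the statistic: pvalue = exp(-statistic / 2).
print(statistic, pvalue, np.exp(-statistic / 2), makes_the_cut)

# ttest_ind: the null hypothesis is that two independent samples have equal means.
a = rng.normal(loc=0, scale=1, size=50)
b = rng.normal(loc=0, scale=1, size=50)
t_statistic, t_pvalue = stats.ttest_ind(a, b)
print(t_statistic, t_pvalue)

# linregress: least-squares line through (x, y); toy data lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
fit = stats.linregress(x, y)
print(fit.intercept, fit.slope, fit.rvalue)  # alpha, beta and r
```

In the exercises themselves, these calls would sit inside loops over list sizes, samples or genes, with the results collected for comparison.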

== Exercises for extra practice ==

Scientific Libraries, Statistics - Revision history