Scientific Libraries, Statistics
Required course material for the lesson
Powerpoint: Statistics, SciPy
Subjects covered
Simple statistics with standard Python
Statistics with the SciPy library
Exercises to be handed in
In these exercises you should set the seed of the random number generator used in NumPy, so you can repeat your experiments.
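Seeding could look like the sketch below. It assumes NumPy's Generator interface (np.random.default_rng); the seed value 42 and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

SEED = 42  # arbitrary value; any fixed integer makes the run repeatable

# Two generators created with the same seed produce the same numbers.
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)

sample_a = rng_a.normal(loc=0, scale=1, size=5)
sample_b = rng_b.normal(loc=0, scale=1, size=5)

same = np.array_equal(sample_a, sample_b)
print(same)
```

The older global call np.random.seed(SEED) followed by np.random.normal(...) should serve the same purpose if you prefer that style.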
- The reliability of statistics. NumPy can generate lists of random numbers drawn from a normal distribution (just use loc=0 and scale=1). These numbers will "obviously" also follow the normal distribution. SciPy can test how well a list of numbers follows the normal distribution. The task is to check how well the random numbers follow the normal distribution. SciPy's normaltest returns 2 values:
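1) statistic, which is a value for how "regular" (non-skewed and not too many outliers) the numbers are. The lower the value the better, i.e. more regular.
2) pvalue, which is the probability of getting a statistic at least this large if the sample really is normally distributed (the null hypothesis). A pvalue below the cutoff is strong evidence against the null hypothesis and suggests that the sample is not normally distributed. The higher the value the better, i.e. more consistent with a normal distribution - use the standard cutoff of 0.05.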
You are welcome to experiment with the cutoff. statistic and pvalue are in reality two values that express the same thing; to be clear, there is a linear relationship between statistic and log(pvalue). You must test whether the size of your random list has any influence on the quality (as in being normally distributed) of the generated numbers. Test with list sizes from 20 to 10000 - use appropriate intervals. To make sure you have a good sample, generate 10000 samples for each list size. You should find out how many of your 10000 samples "make the cut", i.e. have an acceptable pvalue.
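As a starting point, the counting for a single list size might be sketched as follows. The function name fraction_passing, the seed and the reduced number of samples are assumptions made for illustration, not a prescribed solution.

```python
import numpy as np
from scipy import stats


def fraction_passing(size, n_samples=10000, cutoff=0.05, seed=0):
    """Test n_samples normal samples of length `size` for normality.

    Returns the fraction whose pvalue exceeds `cutoff`, plus the raw result.
    """
    rng = np.random.default_rng(seed)
    # One row per sample, so every sample is tested in a single call.
    data = rng.normal(loc=0, scale=1, size=(n_samples, size))
    result = stats.normaltest(data, axis=1)
    return np.mean(result.pvalue > cutoff), result


# Fewer samples than the exercise asks for, to keep the demo quick.
fraction, result = fraction_passing(size=100, n_samples=1000)
print(f"{fraction:.3f} of the samples make the cut")

# statistic and log(pvalue) lie on a straight line with slope -1/2.
linear = np.allclose(np.log(result.pvalue), -result.statistic / 2)
print(linear)
```

Calling fraction_passing in a loop over your chosen list sizes gives one fraction per size, which you can then compare.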
When you find that number, you should consider the statistical value of checking whether something is normally distributed. Do you feel you can convincingly say that "this" is a normal distribution? Perhaps read https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/ and https://www.simplypsychology.org/p-value.html before you answer.
- Since your datasets in the previous exercise in essence originate from the same distribution (and you have seen how different the samples can be), is it possible to find two samples that look so different that you can confidently (but in error) say that they come from two different distributions, specifically that they have different means? Hint: ttest_ind
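One possible way to use ttest_ind here is sketched below: many pairs of samples are drawn from the same distribution and the pairs that still fall below the cutoff are counted. The number of pairs, the sample size and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_pairs, size, cutoff = 2000, 50, 0.05  # illustrative values only

# Both groups come from the same N(0, 1) distribution.
first = rng.normal(loc=0, scale=1, size=(n_pairs, size))
second = rng.normal(loc=0, scale=1, size=(n_pairs, size))

# One t-test per row, comparing the means of each pair of samples.
result = stats.ttest_ind(first, second, axis=1)

# Pairs that would wrongly be declared to have different means.
false_positives = int(np.sum(result.pvalue < cutoff))
print(f"{false_positives} of {n_pairs} pairs look different at cutoff {cutoff}")
```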
- Some may recognize this exercise: You have a data file gene_combined.txt, which is a tab-separated file - perfect for pandas. There are 3 columns: gene name, normalized mRNA expression and survival in months. The data appears in no particular order, and data lines for several genes may be mixed together. For each gene you have to make a simple linear regression analysis and find 3 numbers: α (the intercept, where the line cuts the Y-axis) and β (the slope), the coefficients that describe the line that best fits the data points, and the correlation coefficient (r), which describes how well the line fits. You must identify the gene that best indicates how long the patient survives. Hint: use SciPy linregress. Answer is geneD.
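The per-gene regression could be organised roughly as in this sketch. The column names are hypothetical, whether the real file has a header row is not known here, and the small data frame is made-up stand-in data rather than the contents of gene_combined.txt.

```python
import pandas as pd
from scipy import stats

# Hypothetical column names; the real file may or may not have a header row.
columns = ["gene", "expression", "survival"]
# df = pd.read_csv("gene_combined.txt", sep="\t", names=columns)

# Made-up stand-in data with the lines for two genes mixed together.
df = pd.DataFrame(
    [
        ("geneX", 1.0, 10.0),
        ("geneY", 1.0, 30.0),
        ("geneX", 2.0, 20.0),
        ("geneY", 2.0, 12.0),
        ("geneX", 3.0, 30.0),
        ("geneY", 3.0, 25.0),
    ],
    columns=columns,
)

rows = []
for gene, group in df.groupby("gene"):
    # Regress survival (y) on expression (x) for this gene only.
    fit = stats.linregress(group["expression"], group["survival"])
    rows.append((gene, fit.intercept, fit.slope, fit.rvalue))

fits = pd.DataFrame(rows, columns=["gene", "alpha", "beta", "r"])

# Strongest linear relationship = largest |r|, whatever the sign.
best = fits.loc[fits["r"].abs().idxmax(), "gene"]
print(fits)
print(best)
```

Ranking by the absolute value of r is a design choice: a strongly negative correlation predicts survival just as well as a strongly positive one.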