<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22113/index.php?action=history&amp;feed=atom&amp;title=Scientific_Libraries%2C_Statistics</id>
	<title>Scientific Libraries, Statistics - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22113/index.php?action=history&amp;feed=atom&amp;title=Scientific_Libraries%2C_Statistics"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22113/index.php?title=Scientific_Libraries,_Statistics&amp;action=history"/>
	<updated>2026-05-01T11:42:20Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22113/index.php?title=Scientific_Libraries,_Statistics&amp;diff=76&amp;oldid=prev</id>
		<title>WikiSysop: /* Exercises to be handed in */</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22113/index.php?title=Scientific_Libraries,_Statistics&amp;diff=76&amp;oldid=prev"/>
		<updated>2024-03-13T14:05:00Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Exercises to be handed in&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 16:05, 13 March 2024&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l16&quot;&gt;Line 16:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 16:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;# The reliability of statistics. NumPy can generate lists of random numbers drawn from normal distributed numbers (just use loc=0 and dev=1). These numbers will &amp;quot;obviously&amp;quot; also follow the normal distribution. SciPy can test how well a list of numbers follows the normal distribution. The task is to check how well the random numbers follows the normal distribution. SciPy&amp;#039;s normaltest returns 2 values:&amp;lt;br&amp;gt;1) &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; which is a value for how &amp;quot;regular&amp;quot; (non-skewed and not too many outliers) the numbers are. The lower value the better, i.e. more regular.&amp;lt;br&amp;gt;2) &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039; which is strong evidence against the null hypothesis and suggests that the sample is not normally distributed. The higher value the better, i.e. is normal distributed - use the standard cutoff of 0.05.&amp;lt;br&amp;gt;You are welcome to experiment with the cutoff. &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; and &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039; are in reality two values that express the same thing - to be clear; there is a linear relationship between &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; and log(&amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;). You must test if the size of your random list has any influence on the quality (as in being normal distributed) of the generated numbers. Test with list sizes from 20 to 10000 - use appropriate intervals. To make sure you have a good sample, generate 10000 samples for each list size. You should find out how many of your 10000 samples &amp;quot;makes the cut&amp;quot;, i.e. 
has an acceptable &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;.&amp;lt;br&amp;gt;When you find that number, you should consider the statistical value of checking if something is normal distributed. Do you feel you can convincingly say that &amp;quot;this&amp;quot; is a normal distribution? Perhaps read [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/] and [https://www.simplypsychology.org/p-value.html https://www.simplypsychology.org/p-value.html] before you answer.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;# The reliability of statistics. NumPy can generate lists of random numbers drawn from normal distributed numbers (just use loc=0 and dev=1). These numbers will &amp;quot;obviously&amp;quot; also follow the normal distribution. SciPy can test how well a list of numbers follows the normal distribution. The task is to check how well the random numbers follows the normal distribution. SciPy&amp;#039;s normaltest returns 2 values:&amp;lt;br&amp;gt;1) &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; which is a value for how &amp;quot;regular&amp;quot; (non-skewed and not too many outliers) the numbers are. The lower value the better, i.e. more regular.&amp;lt;br&amp;gt;2) &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039; which is strong evidence against the null hypothesis and suggests that the sample is not normally distributed. The higher value the better, i.e. is normal distributed - use the standard cutoff of 0.05.&amp;lt;br&amp;gt;You are welcome to experiment with the cutoff. 
&amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; and &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039; are in reality two values that express the same thing - to be clear; there is a linear relationship between &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; and log(&amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;). You must test if the size of your random list has any influence on the quality (as in being normal distributed) of the generated numbers. Test with list sizes from 20 to 10000 - use appropriate intervals. To make sure you have a good sample, generate 10000 samples for each list size. You should find out how many of your 10000 samples &amp;quot;makes the cut&amp;quot;, i.e. has an acceptable &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;.&amp;lt;br&amp;gt;When you find that number, you should consider the statistical value of checking if something is normal distributed. Do you feel you can convincingly say that &amp;quot;this&amp;quot; is a normal distribution? Perhaps read [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/] and [https://www.simplypsychology.org/p-value.html https://www.simplypsychology.org/p-value.html] before you answer.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;# Since your datasets in previous exercise in essence originates from the same distribution (and you have seen how different the samples can be), is it possible to find two samples that look so different, that you can confidently (but in error) say that they come from two different distributions, specifically have different means? Hint: ttest_ind&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;# Since your datasets in previous exercise in essence originates from the same distribution (and you have seen how different the samples can be), is it possible to find two samples that look so different, that you can confidently (but in error) say that they come from two different distributions, specifically have different means? Hint: ttest_ind&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;# Some may recognize this exercise: You have a data file &#039;&#039;gene_combined.txt&#039;&#039; which is a tab separated file - perfect for pandas. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other. For each gene you have to make a simple linear regression analysis and find 3 numbers; the α (the intercept - where the line cuts the Y-axis) and β (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient (r) which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. Hint: SciPy linregress.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;# Some may recognize this exercise: You have a data file &#039;&#039;gene_combined.txt&#039;&#039; which is a tab separated file - perfect for pandas. There are 3 columns; gene name, normalized mRNA expression and survival in months. There is no particular order in which the data appears and data lines for several genes might be mixed within each other. 
For each gene you have to make a simple linear regression analysis and find 3 numbers; the α (the intercept - where the line cuts the Y-axis) and β (the slope) coefficient that describes the line running through the data points best, and the correlation coefficient (r) which describes the fitness of the line. You must identify the gene that best indicates how long the patient survives. Hint: &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;use &lt;/ins&gt;SciPy linregress&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;. Answer is geneD&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Exercises for extra practice ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Exercises for extra practice ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22113/index.php?title=Scientific_Libraries,_Statistics&amp;diff=39&amp;oldid=prev</id>
		<title>WikiSysop: Created page with &quot;__NOTOC__ {| width=500  style=&quot;font-size: 10px; float:right; margin-left: 10px; margin-top: -56px;&quot; |Previous: Runtime evaluation of algorithms |Next: Scientific Libraries, Plotting |} == Required course material for the lesson == Powerpoint: [https://teaching.healthtech.dtu.dk/material/22113/22113_11-Statistics_SciPy.ppt Statistics, SciPy]&lt;br&gt; &lt;!-- Resource: Example code - File Reading&lt;br&gt; --&gt;  == Subjects covered == Simple statistics with standard python&lt;br...&quot;</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22113/index.php?title=Scientific_Libraries,_Statistics&amp;diff=39&amp;oldid=prev"/>
		<updated>2024-03-06T14:08:20Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;__NOTOC__ {| width=500  style=&amp;quot;font-size: 10px; float:right; margin-left: 10px; margin-top: -56px;&amp;quot; |Previous: &lt;a href=&quot;/22113/index.php/Runtime_evaluation_of_algorithms&quot; title=&quot;Runtime evaluation of algorithms&quot;&gt;Runtime evaluation of algorithms&lt;/a&gt; |Next: &lt;a href=&quot;/22113/index.php/Scientific_Libraries,_Plotting&quot; title=&quot;Scientific Libraries, Plotting&quot;&gt;Scientific Libraries, Plotting&lt;/a&gt; |} == Required course material for the lesson == Powerpoint: [https://teaching.healthtech.dtu.dk/material/22113/22113_11-Statistics_SciPy.ppt Statistics, SciPy]&amp;lt;br&amp;gt; &amp;lt;!-- Resource: &lt;a href=&quot;/22113/index.php?title=Example_code_-_File_Reading&amp;amp;action=edit&amp;amp;redlink=1&quot; class=&quot;new&quot; title=&quot;Example code - File Reading (page does not exist)&quot;&gt;Example code - File Reading&lt;/a&gt;&amp;lt;br&amp;gt; --&amp;gt;  == Subjects covered == Simple statistics with standard python&amp;lt;br...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;__NOTOC__&lt;br /&gt;
{| width=500  style=&amp;quot;font-size: 10px; float:right; margin-left: 10px; margin-top: -56px;&amp;quot;&lt;br /&gt;
|Previous: [[Runtime evaluation of algorithms]]&lt;br /&gt;
|Next: [[Scientific Libraries, Plotting]]&lt;br /&gt;
|}&lt;br /&gt;
== Required course material for the lesson ==&lt;br /&gt;
Powerpoint: [https://teaching.healthtech.dtu.dk/material/22113/22113_11-Statistics_SciPy.ppt Statistics, SciPy]&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;!-- Resource: [[Example code - File Reading]]&amp;lt;br&amp;gt; --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Subjects covered ==&lt;br /&gt;
Simple statistics with standard Python&amp;lt;br&amp;gt;&lt;br /&gt;
Statistics with the SciPy library&lt;br /&gt;
&lt;br /&gt;
== Exercises to be handed in ==&lt;br /&gt;
In these exercises you should set the &amp;#039;&amp;#039;&amp;#039;seed&amp;#039;&amp;#039;&amp;#039; of the random number generator used in NumPy, so you can repeat your experiments.&lt;br /&gt;
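A minimal sketch of what setting the seed buys you, assuming NumPy's Generator interface (default_rng). The seed value 42 and the names first and second are arbitrary choices for illustration, not part of the exercise.

```python
import numpy as np

# Assumption: 42 is an arbitrary seed; any fixed integer works the same way.
rng = np.random.default_rng(42)
first = rng.normal(loc=0, scale=1, size=5)

# Re-creating the generator with the same seed repeats the same numbers.
rng = np.random.default_rng(42)
second = rng.normal(loc=0, scale=1, size=5)

print(np.array_equal(first, second))
```

Creating the generator once at the top of a script, and passing it to the functions that need random numbers, keeps a whole experiment repeatable.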
# The reliability of statistics. NumPy can generate lists of random numbers drawn from a normal distribution (just use loc=0 and scale=1). These numbers will &amp;quot;obviously&amp;quot; also follow the normal distribution. SciPy can test how well a list of numbers follows the normal distribution. The task is to check how well the random numbers follow the normal distribution. SciPy&amp;#039;s normaltest returns 2 values:&amp;lt;br&amp;gt;1) &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039;, which measures how &amp;quot;regular&amp;quot; (non-skewed and without too many outliers) the numbers are. The lower the value the better, i.e. more regular.&amp;lt;br&amp;gt;2) &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;, which is the probability of getting a &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; at least this large if the sample really is normally distributed. A low value is strong evidence against the null hypothesis and suggests that the sample is not normally distributed. The higher the value the better, i.e. the sample is consistent with a normal distribution - use the standard cutoff of 0.05.&amp;lt;br&amp;gt;You are welcome to experiment with the cutoff. &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; and &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039; are in reality two values that express the same thing - to be clear, there is a linear relationship between &amp;#039;&amp;#039;statistic&amp;#039;&amp;#039; and log(&amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;). You must test whether the size of your random list has any influence on the quality (as in being normally distributed) of the generated numbers. Test with list sizes from 20 to 10000 - use appropriate intervals. To make sure you have a good sample, generate 10000 samples for each list size. You should find out how many of your 10000 samples &amp;quot;make the cut&amp;quot;, i.e. have an acceptable &amp;#039;&amp;#039;pvalue&amp;#039;&amp;#039;.&amp;lt;br&amp;gt;When you find that number, you should consider the statistical value of checking whether something is normally distributed. Do you feel you can convincingly say that &amp;quot;this&amp;quot; is a normal distribution? 
Perhaps read [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5017929/] and [https://www.simplypsychology.org/p-value.html https://www.simplypsychology.org/p-value.html] before you answer.&lt;br /&gt;
# Since your datasets in the previous exercise in essence originate from the same distribution (and you have seen how different the samples can be), is it possible to find two samples that look so different that you can confidently (but in error) say that they come from two different distributions, specifically that they have different means? Hint: ttest_ind&lt;br /&gt;
# Some may recognize this exercise: You have a data file &amp;#039;&amp;#039;gene_combined.txt&amp;#039;&amp;#039; which is a tab-separated file - perfect for pandas. There are 3 columns: gene name, normalized mRNA expression and survival in months. The data appears in no particular order, and data lines for several genes may be interleaved. For each gene you have to make a simple linear regression analysis and find 3 numbers: the coefficients α (the intercept - where the line cuts the Y-axis) and β (the slope) that describe the line running best through the data points, and the correlation coefficient (r), which describes how well the line fits. You must identify the gene that best indicates how long the patient survives. Hint: SciPy linregress.&lt;br /&gt;
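The three SciPy calls named in the exercises above can be tried on a small scale first. This is only a sketch under stated assumptions, not a solution: the seed, the 200 samples of size 50, the column names gene, expression and survival, and the toy rows for geneX and geneY are all invented for illustration, and the real file may be laid out differently.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, chosen only for repeatability

# Exercise 1 idea: many samples of one size, one normality test per sample.
# 200 samples of size 50 are illustrative, far fewer than the task asks for.
samples = rng.normal(loc=0, scale=1, size=(200, 50))
statistic, pvalue = stats.normaltest(samples, axis=1)
fraction_passing = np.mean(pvalue >= 0.05)

# Exercise 2 idea: two samples from the same distribution, compared by a t-test.
t_result = stats.ttest_ind(samples[0], samples[1])

# Exercise 3 idea: one regression per gene, with rows for the genes interleaved.
# Hypothetical toy data; the real data would come from gene_combined.txt.
toy = pd.DataFrame({
    "gene": ["geneX", "geneY", "geneX", "geneY", "geneX", "geneY"],
    "expression": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],
    "survival": [10.0, 5.0, 20.0, 3.0, 30.0, 6.0],
})
fits = {
    name: stats.linregress(group["expression"], group["survival"])
    for name, group in toy.groupby("gene")
}
# Rank genes by the absolute correlation coefficient of their fitted line.
best_gene = max(fits, key=lambda name: abs(fits[name].rvalue))

print(fraction_passing, t_result.pvalue, best_gene)
```

Passing axis=1 to normaltest tests every row in one call, which avoids a Python loop over the samples. Ranking by the absolute value of r is one possible reading of "best indicates"; a strongly negative correlation is treated as just as informative as a positive one.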
&lt;br /&gt;
== Exercises for extra practice ==&lt;/div&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
</feed>