22140 - User contributions [en]

22140/22141 - Introduction to Systems Biology

2024-03-19T12:03:27Z

WikiSysop: /* Lecture plan */

'' Previously known as 36040 and 27040: Introduction to Systems Biology (bachelor course) ''

[[Image:Network_example1.png|600px|center|Network example: Protein-Protein interaction network for biomarkers involved in depression - Larsen & Wernersson, 2012.]]

== About the course ==
The ''"Introduction to Systems Biology"'' course is a '''Bachelor level''' course with the following aims:
* To give a conceptual introduction to the core of Systems Biology.
* To focus strongly on '''Network Biology''', including hands-on experience in working with, and critically interpreting, biological networks.
* To keep all teaching closely tied to actual biology.
* To work with examples of how the Systems Biology approach can be applied to biomedical research and help find new biomarkers and potential drug targets.
* To teach the students how to use biological networks as a scaffold for more refined analyses using data integration (e.g. filtering the network using tissue-specific expression, or mapping in protein metadata).

'''Course organizers:''' Lars Rønn Olsen (lronn@dtu.dk), Rasmus Wernersson (rawe@dtu.dk), and Kristoffer Vitting-Seerup (krivi@dtu.dk)

== Lecture plan ==
'''Current:'''
* [[Autumn2023|Plan for autumn 2023]] (Course 22140) - Thursday afternoon

'''Recent past:'''
* [[Autumn2022|Plan for autumn 2022]] (Course 22140) - Thursday afternoon
* [[Autumn2021|Plan for autumn 2021]] (Course 22140) - Thursday afternoon

Autumn2021

2024-03-18T09:27:41Z

WikiSysop: Created page

= Course 22140 (previously 36040) - plan for autumn 2021 =

'''Teachers:'''
* [https://www.biosurf.org Lars Rønn Olsen] (course organizer) - '''contact:''' [mailto:lronn@dtu.dk lronn@dtu.dk]
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* [http://www.cbs.dtu.dk/~raz/ Rasmus Wernersson] (lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Giorgia Moranzoni (teaching assistant) - '''contact:''' [mailto:gimo@dtu.dk gimo@dtu.dk]



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/81276 Course 22140, Autumn 2021 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge about basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 27611 or 27622). If you need to read up on some bioinformatics topics, please use the links below.
* [http://teaching.healthtech.dtu.dk/36611 Course 36611] - ''Introduktion til Bioinformatik'' (in Danish)
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= Cytoscape =
[[Image:Cytoscape_icon.png|right]]
For many of the computer exercises we will be using Cytoscape for inspecting and analyzing the biological networks. Cytoscape is Open Source and freely available for Windows, Mac and Linux. Make sure to have Cytoscape installed on your laptop prior to the course: http://www.cytoscape.org/



= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise, the reports are handed in using the Learn system. We collect the reports and give general feedback to the entire class the following week.

'''Important:''' The reports are not mandatory as such, but handing them in is HIGHLY recommended, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.



= Lecture plan, autumn 2021 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>


== Block #1: Introduction ==
'''Responsible for this block:''' Rasmus Wernersson, Lars Rønn Olsen
----
=== Lecture 01 (Sep 2nd) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' To appear on CampusNet
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W01_Lazebnik_CancerCell2002.pdf PDF]) NOTE: Review paper - easy to read

:'''Exercise:''' [[ExCytoscapeIntro_v2|Introduction to Cytoscape and working with networks]] - '''Answers:''' [[ExCytoscapeIntro_Answers|Exercise #1 answers]]

=== Lecture 02 (Sep 9th) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Rasmus Wernersson

:'''Slides:''' To appear on CampusNet
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Cell 2011 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* Building protein-protein interaction networks from experimental data ([[file:W02_exercises_v7_corrected.pdf|PDF]]) - '''NEW LINK! (Sept. 9th 2021)''' - '''Answers''': (NEW - updated Feb 16th) [[file:W02_exercises_with_answers_2021_CORRECTED.pdf]] (We will also show some slides about this on Tuesday)
:* Note taking sheet for help with ex. 5,7,8,9 ([https://teaching.healthtech.dtu.dk/27040/exercises/Exercise_help_sheet.pdf PDF]) - PRINT OUT and take notes.
:* Computer-exercise: [[ExPpiDataVisualization|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[ExPpiDataVisualization_Answers|Ex. 10+11 answers]]

=== Lecture 03 (Sep 16th) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Lars Rønn Olsen
:'''Slides:''' On CampusNet.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Cytoscape, topology/statistics/modules [[ExTopology1|Network topology and statistics]] - '''Answers''': See slides on DTU Learn (week 3). Wiki answers coming as soon as we have updated the plots to show Cytoscape 3.8 results. [[ExTopology1_answers|Answers to Cytoscape exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson, Lars Rønn Olsen
----

=== Lecture 04 (Sep 23rd) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to CampusNet
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/27040/exercises/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points about the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio1|Yeast cell cycle 1 - introduction to data and methods]] - '''Answers:''' [[ExYeastSysBio1_answers|Yeast 1 answers]]

=== Lecture 05 (Sept 30th) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Lars Rønn Olsen
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [[File:Bbr002.pdf]] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [[File:GO_NATURE_GENETICS_2000.pdf]] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On CampusNet

:'''Exercise:''' [[ExGeneOntology_Yeast1.5|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast1.5_answers|wiki answers]]

=== Lecture 06 (Oct 7th) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([http://www.cbs.dtu.dk/~raz/teaching/IntroToMicroArrays_2013.pdf PDF]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on CampusNet

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics1|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' ([https://learn.inside.dtu.dk/d2l/le/content/81276/viewContent/340241/View])

=== Lecture 07 (Oct 14th) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson. (Recorded lecture).
:'''Readings:''' [[Media:Cyclebase1_2008.pdf|Cyclebase paper]] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on CampusNet

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2|Mapping temporal expression data onto networks]] '''Answers:''' ([https://learn.inside.dtu.dk/d2l/le/content/81276/viewContent/354860/View])

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Rasmus Wernersson, Lars Rønn Olsen
----

=== Lecture 08 (Oct 28th) - Systems Biology in Biomedical Research (Heart diseases) 1 ===

:'''Lecture:''' ''Systems Biology of Heart Disease'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/Furlong_Cell2012.pdf PDF])
:* Concentrate on: '''Figure 1''' and '''Box 3'''
:'''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:'''Slides:''' To appear on CampusNet


:'''Exercise:''' [[ExHumanSysbio1|Working with "Virtual Pulldowns"]] - '''Answers:''' [[ExHumanSysbio1_answers|"Virtual Pulldown" answers]]

=== Lecture 09 (Nov 4th) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Clustering, multiple testing correction & text mining'' - Lars Rønn Olsen and Rasmus Wernersson.
:'''Readings:'''
:* Read the cartoon to the right - it nicely illustrates the multiple testing problem (source: XKCD.org)
:* ''How does multiple testing correction work?'', Noble 2009 ([https://www.nature.com/articles/nbt1209-1135 Article link])
:* Make sure you understand how '''Bonferroni adjustment''' works.
:'''Extra:''' (not curriculum)
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Slides:''' To appear on CampusNet

:'''Exercise:'''
:# [[ExTM+MulTest|Multiple testing (text mining example)]] - '''Answers:''' [[ExTM+MulTest_answers|Multiple testing answers]]
:# [[ExHumanSysbio2|Clustering]] - '''Answers:''' ([https://learn.inside.dtu.dk/d2l/le/content/81276/viewContent/354865/View])

=== Lecture 10 (Nov 11th) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Functional cancer phenotyping: follicular thyroid cancer case'' - Lars Rønn Olsen.
:'''Readings:'''
:* ''Hallmarks of Cancer: The Next Generation'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' ([https://learn.inside.dtu.dk/d2l/le/content/81276/viewContent/346362/View]) '''Answers''' ([https://learn.inside.dtu.dk/d2l/le/content/81276/viewContent/348132/View])

=== Lecture 11 (Nov 18th) - Systems Biology in Biomedical Research (inBio Discover framework, drug targets) 3 ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': '''https://www.intomics.com/covid19/'''
:* It's a web-page with a written explanation of the COVID-19 analysis + links to interactive networks.
:* Please read all of it - ''including methods''. It is written for a non-technical audience, and should be easy to understand.
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' [[ExHumanSysbio3|Exploring human networks]] - '''Answers:''' [[ExHumanSysbio3_answers|Exploring human networks ANSWERS]] (Link fixed)
:'''Link to inBio Discover:''' '''https://inbio-discover.intomics.com/'''
:*'''Please register''' your email with inBio Discover before the exercise.

=== Lecture 12 (Nov 25th) - A (cancer) case of isoform switches, functional class scoring and topology analysis ===
:'''Lecture: TBD''' - Kristoffer Vitting-Seerup.
:'''Readings: TBD'''
:'''Slides:''' To appear on CampusNet
:'''Exercise: TBD'''

=== Lecture 13 (Dec 2nd) - Advanced topics, Wrap-up, "Spørgetime" (Q&A session) ===

:'''Lecture:''' ''TBA''

:'''Slides:''' To appear on CampusNet

<blockquote style="background-color: lavender; border: solid thin grey;">
Lars Rønn will present a new cancer disease case + an exercise analyzing multiple lines of evidence in the context of networks.
</blockquote>





= Old exam sets =

* On DTU Inside (file sharing)

= Links & curriculum summary =
* [[Links and curriculum|Links and curriculum index]]

= Exam =
* '''When & where:''' December 7th 2021; 15:00 - 19:00 (some have extra time), '''TBD'''

22140/22141 - Introduction to Systems Biology

2024-03-06T08:56:18Z

WikiSysop: /* Lecture plan */

File:Cytoscape icon.png

2024-03-05T16:12:49Z

WikiSysop:

Autumn2022

2024-03-05T16:11:56Z

WikiSysop: Created page

= Course 22140 (previously 36040) - plan for autumn 2022 =

'''Teachers:'''
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* [http://www.cbs.dtu.dk/~raz/ Rasmus Wernersson] (external lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Giorgia Moranzoni (teaching assistant) - '''contact:''' [mailto:gimo@dtu.dk gimo@dtu.dk]
* Lars Rønn Olsen (course organizer) is on parental leave and will therefore not respond. Contact Kristoffer instead.



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/126026 Course 22140, Autumn 2022 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge about basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 22111 or one of its variants). If you need to read up on some bioinformatics topics, please use the links below.
* [https://teaching.healthtech.dtu.dk/22111/ Course 22111] - ''Introduction to Bioinformatics''
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= Cytoscape =
[[Image:Cytoscape_icon.png|right]]
For many of the computer exercises we will be using Cytoscape for inspecting and analyzing the biological networks. Cytoscape is Open Source and freely available for Windows, Mac and Linux. Make sure to have Cytoscape installed on your laptop prior to the course: http://www.cytoscape.org/



= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise, the reports are handed in using the Learn system. We collect the reports and give general feedback to the entire class the following week.

'''Important:''' The reports are not mandatory as such, but handing them in is HIGHLY recommended, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.



= Lecture plan, autumn 2022 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>


== Block #1: Introduction ==
'''Responsible for this block:''' Rasmus Wernersson, Kristoffer Vitting-Seerup
----
=== Lecture 01 (Sep 1st) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' To appear on DTU Learn
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W01_Lazebnik_CancerCell2002.pdf PDF]) NOTE: Review paper - easy to read

:'''Exercise:''' [[ExCytoscapeIntro_v2|Introduction to Cytoscape and working with networks]] - '''Answers:''' [[ExCytoscapeIntro_Answers|Exercise #1 answers]]

=== Lecture 02 (Sep 8th) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Rasmus Wernersson

:'''Slides:''' To appear on DTU Learn
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Cell 2011 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* Building protein-protein interaction networks from experimental data ([[file:W02_exercises_v7_corrected.pdf|PDF]])
:* Note taking sheet for help with ex. 5,7,8,9 ([https://teaching.healthtech.dtu.dk/27040/exercises/Exercise_help_sheet.pdf PDF]) - PRINT OUT and take notes.
:* Computer-exercise: [[ExPpiDataVisualization|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[ExPpiDataVisualization_Answers|Ex. 10+11 answers]]

=== Lecture 03 (Sep 15th) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Kristoffer Vitting-Seerup
:'''Slides:''' On DTU Learn.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Cytoscape, topology/statistics/modules [[ExTopology1|Network topology and statistics]] '''Answers''': See slides on DTU Learn (week 3). [[ExTopology1_answers|Answers to Cytoscape exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson, Kristoffer Vitting-Seerup
----

=== Lecture 04 (Sep 22nd) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/27040/exercises/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points about the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio1|Yeast cell cycle 1 - introduction to data and methods]] '''Answers:''' [[ExYeastSysBio1_answers|Yeast 1 answers]]

=== Lecture 05 (Sept 29th) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Rasmus Wernersson
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [[File:Bbr002.pdf]] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [[File:GO_NATURE_GENETICS_2000.pdf]] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On DTU Learn

:'''Exercise:''' [[ExGeneOntology_Yeast1.5|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast1.5_answers|wiki answers]]

=== Lecture 06 (Oct 6th) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/510349/View DTU Learn]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics1|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' ([https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479103/View])

=== Lecture 07 (Oct 13th) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson.
:'''Readings:''' [[Media:Cyclebase1_2008.pdf|Cyclebase paper]] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2|Mapping temporal expression data onto networks]] '''Answers:''' ([https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479132/View])

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Rasmus Wernersson, Kristoffer Vitting-Seerup
----

=== Lecture 08 (Oct 27th) - Systems Biology in Biomedical Research (Heart diseases) 1 ===

:'''Lecture:''' ''Systems Biology of Heart Disease'' - Rasmus Wernersson
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/Furlong_Cell2012.pdf PDF])
:* Concentrate on: '''Figure 1''' and '''Box 3'''
:'''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExHumanSysbio1|Working with "Virtual Pulldowns"]] '''Answers:''' [[ExHumanSysbio1_answers|"Virtual Pulldown" answers]]

=== Lecture 09 (Nov 3rd) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Clustering, multiple testing correction & text mining'' - Kristoffer Vitting-Seerup and Rasmus Wernersson.
:'''Readings:'''
:* Read the cartoon to the right - it nicely illustrates the multiple testing problem (source: XKCD.org)
:* ''How does multiple testing correction work?'', Noble 2009 ([https://www.nature.com/articles/nbt1209-1135 Article link])
:* Make sure you understand how '''Bonferroni adjustment''' works.
:'''Extra:'''
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Slides:''' To appear on DTU Learn

:'''Exercise:'''
:# [[ExTM+MulTest|Multiple testing (text mining example)]] - '''Answers:''' [[ExTM+MulTest_answers|Multiple testing answers]]
:# [[ExHumanSysbio2|Clustering]] - '''Answers:''' ([https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479133/View])

=== Lecture 10 (Nov 10th) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Functional cancer phenotyping: follicular thyroid cancer case'' - Kristoffer Vitting-Seerup
:'''Readings:'''
:* ''Hallmarks of Cancer: The Next Generation'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' ([https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479122/View]) - '''Answers''' ([https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479141/View])

=== Lecture 11 (Nov 17th) - Systems Biology in Biomedical Research (inBio Discover framework, drug targets) 3 ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': '''https://www.intomics.com/covid19/'''
:* It's a web-page with a written explanation of the COVID-19 analysis + links to interactive networks.
:* Please read all of it - ''including methods''. It is written for a non-technical audience and should be easy to understand.
:'''Slides:''' To appear on DTU Learn
:'''Exercise:'''
:* [[ExInBioDiscover_v2|Exploring drug target and disease networks]] - '''Answers:''' [[ExInBioDiscover_v2_answers|ANSWERS]]
:* OPTIONAL: [[ExHumanSysbio3|Exploring human networks]] (If time allows) - '''Answers:''' [[ExHumanSysbio3_answers|Exploring human networks ANSWERS]]
:'''Link to inBio Discover:''' '''https://inbio-discover.com/'''
:*'''Please register''' your email with inBio Discover before the exercise.

<br>

=== Lecture 12 (Nov 24th) - Isoform Switches in Cancer - Part 1 ===
:'''Lecture:''' Kristoffer Vitting-Seerup

:'''Readings:''' [https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/526969/View Isoform Switches in Cancer]

:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/531621/View On DTU Learn]. Note: slides with solutions will be available after the lecture.

:'''Exercise:''' [https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479148/View Isoform exercise]

<br>

=== Lecture 13 (Dec 1st) - Isoform Switches in Cancer - Part 2, QnA, Course summary ===
:'''Content''':
:* Isoform switches in cancer - part 2
:* Exam info and QnA
:* Course summary

:'''Lecture:''' Kristoffer Vitting-Seerup

:'''Readings:''' [https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/526969/View Isoform Switches in Cancer]

:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/531621/View On DTU Learn]. Note: slides with solutions will be available after the lecture.

:'''Exercise:''' [https://learn.inside.dtu.dk/d2l/le/content/126026/viewContent/479148/View Isoform exercise]

<br>

= Old exam sets =

* On DTU Inside (file sharing)



= Exam =
* '''Date:''' 7/12 2022
* '''Time:''' 09:00-13:00
* '''Where:''' (will be) available via https://eksamensplan.dtu.dk/

File:Community2.png

2024-03-05T16:09:56Z

WikiSysop:

DiscoNet2 answers

2024-03-05T16:09:05Z

WikiSysop: Created page

= Multiomics data integration exercise - Answers =
'''Answers written by:''' Lars Rønn Olsen

First, load the data for today's exercise:

<pre>
load("/home/projects/22140/exercise10.Rdata")
</pre>

Then, randomly pick a patient number. Let's use R to make sure it is random:

<pre>
sample(1:10, 1)
</pre>

For that patient, calculate the log2 fold change of gene expression in cancer vs normal: log2(cancer expression / normal expression). Consider genes with a log2 fold change either smaller than -1 (down regulated) or larger than 1 (up regulated) to be dysregulated. We also consider mutated genes (those with a 1 in the mutation column, i.e. carrying a nonsynonymous mutation) to be aberrated, so you should keep these as well. Save a list of dysregulated and/or aberrated genes.

<pre>
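patient <- sample(1:10, 1)  # store the patient number so it can be reported
pt <- as.data.frame(data[[patient]])
pt$log2fc <- log2(pt$tumor/pt$normal)  # log2 fold change, tumor vs normal
pt$gene <- rownames(pt)
# seed genes: dysregulated (|log2FC| > 1) and/or mutated
seed <- rownames(pt[abs(pt$log2fc) > 1 | pt$mutation == 1,])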
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''
* What patient did you pick?
* How many genes are up or down regulated?
<pre>
sum(abs(pt$log2fc) > 1)  # number of genes with |log2FC| > 1
</pre>

''Depends on your sample. Somewhere between 4 and 100.''

* Is the number of dysregulated genes what you would expect from a disease like cancer?

''Given that cancer arises from normal cells, you would expect the cancer cells to be somewhat similar to their cognate, normal cells. However, as we learned from reading the Hallmarks paper, cancer dysregulates a number of cellular functions, and for this I would probably guess between 100 and 1000 genes to be dysregulated.''

* How many genes harbor nonsynonymous mutations (Note: we only report nonsynonymous mutations in this data. Synonymous mutations have already been removed)?

''Depends on your sample. Somewhere between 1 and 32.''

* Theoretical question: Discuss the biological connection, or lack thereof, between nonsynonymous mutations and dysregulated expression for a) mutation and expression of the same gene and b) mutations in one gene and the expression of other genes. Use a maximum of 75 words.

''Somatic mutations arise from stochastic events followed by selective pressure. Generally speaking, the expression of a gene and its mutation status do not necessarily correlate. However, in theory a loss-of-function mutation may activate a feedback loop that increases expression of the gene, and likewise a gain-of-function mutation may decrease expression through feedback. Lastly, mutations in some genes may affect the expression of others. This is known as "expression quantitative trait loci".''

For the patient sample you picked, make a node attribute table containing the following columns: “Gene”, “log2FC”, “somatic_mutations” (either 0 or 1) for ''all'' genes.

== Virtual pulldown ==
Load the DiscoNet package and prepare the inweb database:

<pre>
library(DiscoNet)

### Load translated database
load(file='/home/projects/22140/inweb.Rdata') # db
</pre>

Then, use the virtual_pulldown() function that you used last week, to perform a virtual pulldown with all dysregulated ''and/or'' mutated genes.

<pre>
network <- virtual_pulldown(seed_nodes = seed, database = db, id_type = "hgnc", zs_confidence_score = 0.156)
interactions <- data.frame(network$network)
node_attributes <- data.frame(network$node_attributes)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #2:'''
* How does a virtual pulldown work?

''It identifies proteins interacting with your input proteins, and…''

* What experiment does it simulate?

''…Simulates a complex pulldown experiment''

* What database does it query?

''The InWeb database of protein-protein interactions''

* What does the confidence score mean?

''The experimental confidence we have in the interaction''

* Is the default cutoff of 0.156 reasonable? Why/why not?

''In this context, it would appear so, as we don’t get massive hairballs, nor do we get empty networks. In a real-life research project, one would try multiple thresholds and assess whether the member nodes of each cluster share functional features to a degree where they can be considered a protein complex, or whether there are glaring outliers in terms of function, which should be filtered out.''
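
One way to compare thresholds, as suggested in the answer above, is to count the nodes and edges that survive each cutoff. The sketch below does this on a made-up scored edge list using igraph. The data, the object names, and the use of >= are assumptions for illustration; they are not InWeb data or a description of how DiscoNet applies the cutoff.

<pre>
library(igraph)

# Hypothetical scored interactions (not taken from InWeb)
edges_toy <- data.frame(
  from  = c("A", "A", "B", "C", "D"),
  to    = c("B", "C", "C", "D", "E"),
  score = c(0.90, 0.40, 0.20, 0.10, 0.05)
)

# Count the nodes and edges that survive each confidence cutoff
cutoffs <- c(0.05, 0.156, 0.5)
sizes <- t(sapply(cutoffs, function(cutoff) {
  kept  <- edges_toy[edges_toy$score >= cutoff, ]
  g_toy <- graph_from_data_frame(kept, directed = FALSE)
  c(cutoff = cutoff, nodes = vcount(g_toy), edges = ecount(g_toy))
}))
sizes
</pre>

A table like this makes it easy to spot the point at which the network collapses or explodes.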

We changed the package based on your excellent feedback last week. Now, the virtual_pulldown() function produces an interaction table (in the $network object), with confidence scores as edge attributes and a node attribute table (in the $node_attributes object) with a seed indicator and a topological score for each node.

Use the merge() function to add log2FC and mutation status to the node attribute table. Here's a hint so you don't spend too much time on this:

<pre>
node_attributes <- merge(x = node_attributes, y = pt, by.x = "nodes", by.y = "gene", all.x = TRUE)
</pre>

Make sure you understand what goes on in the function above.
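
If the effect of all.x = TRUE is unclear, a small sketch on made-up tables may help. The table contents below are hypothetical; only the column names "nodes" and "gene" mirror the hint above.

<pre>
# Hypothetical node table (as returned by a pulldown) and patient table
nodes_toy <- data.frame(nodes = c("GENE_A", "GENE_B", "GENE_X"),
                        seed  = c(1, 1, 0))
pt_toy <- data.frame(gene     = c("GENE_A", "GENE_B", "GENE_C"),
                     log2fc   = c(2, -2, 0),
                     mutation = c(0, 0, 1))

# all.x = TRUE makes this a left join: every row of x is kept, and nodes
# without a match in y (here GENE_X) get NA in the columns coming from y
merged <- merge(x = nodes_toy, y = pt_toy,
                by.x = "nodes", by.y = "gene", all.x = TRUE)
merged
</pre>

GENE_C is absent from the result because it is not a node in the network, while GENE_X is kept with NA for log2fc and mutation.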

Use the function graph_from_data_frame() to make an igraph object. Remember to make it undirected and add node attributes.

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
</pre>
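
As a quick check of what graph_from_data_frame() does with the two tables, here is a sketch on made-up data. The tables and attribute names are hypothetical; the first column of the vertex table is used as the node name, and the remaining columns become node attributes.

<pre>
library(igraph)

# Hypothetical edge and node tables
edges_toy <- data.frame(from = c("A", "A"), to = c("B", "C"),
                        confidence = c(0.9, 0.3))
nodes_toy <- data.frame(nodes = c("A", "B", "C"), log2fc = c(2, -2, NA))

g_toy <- graph_from_data_frame(edges_toy, directed = FALSE, vertices = nodes_toy)

is_directed(g_toy)    # FALSE: the graph is undirected
V(g_toy)$log2fc       # node attribute taken from nodes_toy
E(g_toy)$confidence   # edge attribute taken from edges_toy
</pre>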

Use the relevance_filtering() function to make 3 different versions of the pulldown, with cutoffs "0", "0.5", "0.8".

<pre>
g_all <- relevance_filtering(g, 0.0)
g_mid <- relevance_filtering(g, 0.5)
g1 <- relevance_filtering(g, 0.8)  # the 0.8 network is used in the rest of the exercise
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #3:'''
* How many nodes/edges are in your three pulldown networks?

''Cutoff 0: 1974 nodes / 63033 edges''

''Cutoff 0.5: 592 nodes / 18079 edges''

''Cutoff 0.8: 307 nodes / 6764 edges''

* Plot the 0.8 filtered pulldown network, colored by log2FC and mutated genes highlighted with a shape, and seed nodes highlighted in some other way. Make sure the plots added to the report are easy to read.

<pre>
ggraph(g1) +
geom_edge_link() +
geom_node_point(aes(color = log2fc, shape = as.factor(mutation)), size = 5) +
scale_color_gradient2(low = "red", mid = "white", high = "blue")
</pre>

* Do the networks look as you would expect in terms of which genes are up or down regulated or mutated? I.e. is the mutation status generally related to expression status? What would you expect?

'''For the following exercises use the network with a cutoff of 0.8'''

== Community detection / protein complex inference ==
Run the community_detection() function with the MCODE algorithm with parameters D = 0.05, haircut = TRUE, fluff = FALSE. The resulting communities may represent protein complexes that could be causative or indicative of the disease.

<pre>
communities <- community_detection(g1, algorithm = "mcode", D = 0.05, haircut = TRUE, fluff = FALSE, fdt = 0.8, loops = FALSE)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* How many nodes/edges are in your top 25 scoring communities? (Note: you may not have this many, in which case just report nodes/edges for all of them.)

<pre>
lapply(communities[[1]], function(x) paste(vcount(x), ecount(x)))
</pre>

* How is the community score from MCODE calculated and what does it mean?

* Have a quick look [https://en.wikipedia.org/wiki/Protein_complex here]. What do you expect a protein complex to look like? How many proteins? What clustering coefficient (how densely clustered)? Why?

''A protein complex is a group of proteins forming a quaternary structure to perform a given function. As such, one would not expect, e.g. five proteins linked together by single edges to form a long string to constitute a protein complex. Similarly, 200 proteins in a hairball probably wouldn’t be a protein complex either.''

* Based on node/edge counts alone, which of your communities could be complexes?
''Probably all but the top one with 81 nodes and 3242 edges. Let's proceed with community 3''

<pre>
ggraph(communities$communities[[3]]) +
geom_edge_link() +
geom_node_point(aes(color = log2fc, shape = as.factor(mutation)), size = 5) +
geom_node_text(aes(label = name), size = 10, repel = TRUE) +
scale_color_gradient2(low = "red", mid = "white", high = "blue")
</pre>
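
The intuition about what a protein complex should look like (densely connected, neither a long string nor a hairball) can be put into numbers with edge density and the clustering coefficient. The sketch below compares two made-up five-node communities using igraph; the graphs are hypothetical and not taken from the exercise data.

<pre>
library(igraph)

# Two hypothetical 5-node communities
chain  <- make_ring(5, circular = FALSE)  # five proteins linked in a string
clique <- make_full_graph(5)              # every protein touches every other

# Edge density: observed edges / possible edges
edge_density(chain)
edge_density(clique)

# Global clustering coefficient: fraction of connected triples that form triangles
transitivity(chain, type = "global")
transitivity(clique, type = "global")
</pre>

The chain has low density and no triangles, while the clique scores 1 on both measures. A plausible complex should lie much closer to the clique.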

== Gene ontology over-representation analysis ==
Now use the fora() function from fgsea to perform an over-representation analysis on all the detected communities. Use the full list of genes from the exercise data as your background, and the biological process ontology.

<pre>
library(fgsea)
library(msigdbr)
# First, fetch gene sets
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)
# repeat the analysis below with the communities you deem to be potential protein complexes
fora(pathways = BP_list, genes = V(communities$communities[[1]])$name, universe = rownames(pt))
</pre>

'''Note''': If you get a timeout error just run the function again.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Why are we asking you to use the "full list of genes" from the microarray as the background, and not ''all'' human genes? (Note that microarrays, while comprehensive, do not have probes for every single human gene).

''Because we don't know what the result would be for genes we did not measure''

* What does an over-representation test do?

''It tests whether certain GO terms are found at a higher frequency in a target list of genes than in the background of all genes.''

* What is “over-represented” and where?

''GO terms in the target genes''
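
For a single GO term, the over-representation question above can be answered with a one-sided hypergeometric test, the kind of test typically used for this type of analysis. The sketch below uses base R only. All counts are made up for illustration.

<pre>
# Hypothetical counts for one GO term
universe_size <- 10000  # genes measured on the array (the background)
term_size     <- 100    # background genes annotated with the GO term
community_n   <- 20     # genes in the community being tested
overlap       <- 5      # community genes annotated with the GO term

# Overlap expected if the community genes were drawn at random
expected <- community_n * term_size / universe_size

# P(overlap >= observed) under the hypergeometric distribution
p_value <- phyper(overlap - 1, term_size, universe_size - term_size,
                  community_n, lower.tail = FALSE)
expected
p_value
</pre>

Observing 5 annotated genes where 0.2 are expected by chance gives a very small p-value. Because many GO terms are tested at once, the p-values must be corrected for multiple testing, which is what the padj column in the fora() output reports.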

== Interpreting the results of the over-representation analysis ==
Relate functions of the significantly over-represented GO terms in each of the top 25 protein complexes to the hallmarks of cancer. Keep in mind that neither all complexes, nor all GO terms are necessarily related to cancer.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6:'''

* 6.1) Do any of your complexes relate to one or more of the hallmarks of cancer? How? (You will probably need to search Google here – try [GO term]+cancer or even [GO term]+”follicular thyroid cancer”).

''This depends on your sample, your clusters, and how thorough you are at checking what your GO terms mean. ''

* 6.2) If you cannot reasonably relate some of the GO terms to a cancer hallmark (and it is not unlikely that you cannot!), why do you think these non-cancer related ontologies are over-represented?

''Given that we are looking at thyroid tissue either way (healthy or sick), we expect some thyroid functions to turn up in the list.''

* 6.3) Plot the three communities you find most interesting in the context of follicular thyroid cancer. Highlight log2FC, mutation status, and seed nodes
* 6.4) How do you think these dysregulated protein complexes may drive cancer pathogenesis?
* 6.5) For the largest of your three chosen protein complexes, pick the protein best suited as a drug target and explain why based on what you have learned about network topology.

''Again, the answers to the three questions above depend on which sample you have been working on. I chose patient 1 and a relevance score cutoff of 0.8. My community number 2 looks like this:''

[[File:Community2.png|500px]]

''Running an over-representation analysis of the complex reveals multiple GO terms related to cell cycle and division. Looking closer at the dysregulated genes in the complex tells me that CDC27 might be the culprit here. I then decided to Google the gene and discovered that "CDC27 Facilitates Gastric Cancer Cell Proliferation, Invasion and Metastasis via Twist-Induced Epithelial-Mesenchymal Transition" (https://pubmed.ncbi.nlm.nih.gov/30308498/).''

''In terms of targeting, CDC27 seems like an obvious choice, since this is a small complex and all nodes have the same degree. Had the complex been larger, it could have made sense to look at the degree and betweenness centrality of the nodes, as targeting hubs is generally a good way to disrupt a network.''
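
For a larger complex, degree and betweenness centrality can be computed directly with igraph. The sketch below uses a made-up complex with one obvious hub. The node names and topology are hypothetical.

<pre>
library(igraph)

# Hypothetical complex: a hub H bound to four proteins, plus one extra edge
g_toy <- graph_from_literal(H - P1, H - P2, H - P3, H - P4, P1 - P2)

centrality <- data.frame(
  degree      = degree(g_toy),
  betweenness = betweenness(g_toy)
)

# Rank the candidate targets by degree
centrality[order(-centrality$degree), ]
</pre>

Here H has the highest degree and lies on every shortest path between proteins that are not directly connected, so removing it would break the complex apart.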

DiscoNet2

2024-03-05T16:08:26Z

WikiSysop: Created page with "= Multiomics data integration exercise = '''Exercise written by:''' Lars Rønn Olsen '''Learning objectives:''' * Overall objective: learn how to extract meaningful networks from human PPI data *# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins" *# Finding tightly connected clusters in larger networks *# Using the DiscoNet package in R == Introduction == Today’s exercise requires the use of many of the methods you learned throughout this..."

= Multiomics data integration exercise =
'''Exercise written by:''' Lars Rønn Olsen

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
Today’s exercise requires the use of many of the methods you learned throughout this course. The objective is to identify aberrant, cancer-related protein complexes in follicular thyroid cancer, by analyzing gene expression and somatic mutation data from 10 patients. You will do this using the following workflow of skills you have learned in this course:
* Calculating log2FC from healthy to cancer tissue
* Performing a virtual pulldown with the DiscoNet package
* Performing protein complex detection using MCODE
* Performing over representation analysis of GO terms using the fgsea package

== Processing data and extracting dysregulated or mutated genes ==

'''First, you need to restart your R session! "Session" -> "Restart R"!'''

Next, load the data for today's exercise:

<pre>
load("/home/projects/22140/exercise10.Rdata")
</pre>

Then, randomly pick a patient number. Let's use R to make sure it is random:

<pre>
sample(1:10, 1)
</pre>

For that patient, calculate the log2 fold change of gene expression in cancer vs. normal tissue: log2(cancer expression / normal expression). Consider genes with a log2 fold change smaller than -1 (downregulated) or larger than 1 (upregulated) to be dysregulated. We also consider mutated genes (those with a 1 in the mutation column, i.e. carrying a nonsynonymous mutation) to be aberrated, so you should keep these as well. Save a list of dysregulated and/or aberrated genes.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''
* What patient did you pick?
* How many genes are up or down regulated?
* Is the number of dysregulated genes what you would expect from a disease like cancer?
* How many genes harbor nonsynonymous mutations (Note: we only report nonsynonymous mutations in this data. Synonymous mutations have already been removed)?
* Theoretical question: Discuss the biological connection, or lack thereof, between nonsynonymous mutations and dysregulated expression for a) ''mutation and expression of the same gene'' and b) ''mutations in one gene and the expression of other genes''. Use a maximum of 75 words.

For the patient sample you picked, make a node attribute table containing the following columns: “Gene”, “log2FC”, “somatic_mutations” (either 0 or 1) for ''all'' genes.
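
The shape of such a node attribute table can be sketched on made-up data as follows. The gene names and values are hypothetical, and the real table should of course be built from the patient data you loaded.

<pre>
# Toy expression and mutation data (hypothetical values)
toy <- data.frame(
  tumor    = c(8, 1, 5),
  normal   = c(2, 4, 5),
  mutation = c(0, 0, 1),
  row.names = c("GENE_A", "GENE_B", "GENE_C")
)

# One row per gene, with the three requested columns
node_table <- data.frame(
  Gene              = rownames(toy),
  log2FC            = log2(toy$tumor / toy$normal),
  somatic_mutations = toy$mutation
)
node_table
</pre>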

== Virtual pulldown ==
Load the DiscoNet package and prepare the inweb database:

<pre>
library(DiscoNet)

### Load translated database
load(file='/home/projects/22140/inweb.Rdata') # db
</pre>

Then, use the virtual_pulldown() function that you used last week, to perform a virtual pulldown with all dysregulated ''and/or'' mutated genes.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #2:'''
* How does a virtual pulldown work?
* What experiment does it simulate?
* What database does it query?
* What does the confidence score mean?
* Is the default cutoff of 0.156 reasonable? Why/why not?

We changed the package based on your excellent feedback last week. Now, the virtual_pulldown() function produces an interaction table (in the $network object), with confidence scores as edge attributes and a node attribute table (in the $node_attributes object) with a seed indicator and a topological score for each node.

Use the merge() function to add log2FC and mutation status to the node attribute table. Here's a hint so you don't spend too much time on this:

<pre>
node_attributes <- merge(x = node_attributes, y = pt, by.x = "nodes", by.y = "gene", all.x = TRUE)
</pre>

Make sure you understand what goes on in the function above.

Use the function graph_from_data_frame() to make an igraph object. Remember to make it undirected and add node attributes.

Use the relevance_filtering() function to make 3 different versions of the pulldown, with cutoffs "0", "0.5", "0.8".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #3:'''
* How many nodes/edges are in your three pulldown networks?
* Plot the 0.8 filtered pulldown network, colored by log2FC and mutated genes highlighted with a shape, and seed nodes highlighted in some other way. Make sure the plots added to the report are easy to read.
* Do the networks look as you would expect in terms of which genes are up or down regulated or mutated? I.e. is the mutation status generally related to expression status? What would you expect?

'''For the following exercises use the network with a cutoff of 0.8'''

== Community detection / protein complex inference ==
Run the community_detection() function with the MCODE algorithm with parameters D = 0.05, haircut = TRUE, fluff = FALSE. The resulting communities may represent protein complexes that could be causative or indicative of the disease.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* How many nodes/edges are in your top 25 scoring communities? (Note: you may not have this many, in which case just report nodes/edges for all of them.)
* How is the community score from MCODE calculated and what does it mean?
* Have a quick look [https://en.wikipedia.org/wiki/Protein_complex here]. What do you expect a protein complex to look like? How many proteins? What clustering coefficient (how densely clustered)? Why?
* Based on node/edge counts alone, which of your communities could be complexes?

== Gene ontology over-representation analysis ==
Now use the fora() function from fgsea to perform an over-representation analysis on all the detected communities. Use the full list of genes from the exercise data as your background, and the biological process ontology.

<pre>
library(fgsea)
library(msigdbr)
# First, fetch gene sets
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)
# repeat the analysis below with the communities you deem to be potential protein complexes
fora(pathways = BP_list, genes = V(communities$communities[[1]])$name, universe = rownames(pt))
</pre>

'''Note''': If you get a timeout error just run the function again.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Why are we asking you to use the "full list of genes" from the microarray as the background, and not ''all'' human genes? (Note that microarrays, while comprehensive, do not have probes for every single human gene).
* What does an over-representation test do?
* What is “over-represented” and where?

== Interpreting the results of the over-representation analysis ==
Relate functions of the significantly over-represented GO terms in each of the top 25 protein complexes to the hallmarks of cancer. Keep in mind that neither all complexes, nor all GO terms are necessarily related to cancer.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6:'''

* 6.1) Do any of your complexes relate to one or more of the hallmarks of cancer? How? (You will probably need to search Google here – try [GO term]+cancer or even [GO term]+”follicular thyroid cancer”).
* 6.2) If you cannot reasonably relate some of the GO terms to a cancer hallmark (and it is not unlikely that you cannot!), why do you think these non-cancer related ontologies are over-represented?
* 6.3) Plot the three communities you find most interesting in the context of follicular thyroid cancer. Highlight log2FC, mutation status, and seed nodes
* 6.4) How do you think these dysregulated protein complexes may drive cancer pathogenesis?
* 6.5) For the largest of your three chosen protein complexes, pick the protein best suited as a drug target and explain why based on what you have learned about network topology.

File:Community2 w9.png

2024-03-05T16:05:24Z

WikiSysop:

File:Community1.png

2024-03-05T16:04:56Z

WikiSysop:

File:G3.png

2024-03-05T16:04:21Z

WikiSysop:

File:G2.png

2024-03-05T16:03:49Z

WikiSysop:

File:G1.png

2024-03-05T16:03:19Z

WikiSysop:

DiscoNet answers

2024-03-05T16:02:38Z

WikiSysop: Created page with "= Human diseases / virtual pulldown exercise = '''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, and Rasmus Wernersson left '''TASK/REPORT QUESTION #1:''' # Load the packages <pre> library(DiscoNet) library(msigdbr) library(fgsea) The PPI database we will use is InWeb: db <- translate_database("inweb") </pre> # Run DiscoNet with this list of proteins with the following parameters: <pre> network_ex2 <- virtu..."

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, and Rasmus Wernersson

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''
# Load the packages
<pre>
library(DiscoNet)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
db <- translate_database("inweb")

</pre>

# Run DiscoNet with this list of proteins with the following parameters:

<pre>
# Virtual pulldown seeded with the AACM proteins
network_ex2 <- virtual_pulldown(seed_nodes = seed_nodes_ex2, database = db, id_type = "hgnc", zs_confidence_score = 0.156)
# Edge table (interactions) and node table (seed indicator, relevance score)
interactions <- data.frame(network_ex2$network)
node_attributes <- data.frame(network_ex2$node_attributes)
</pre>

# Convert network into igraph object with the following relevance score cutoffs: 0, 0.5, 1

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
g1 <- relevance_filtering(g, 0)
g2 <- relevance_filtering(g, 0.5)
g3 <- relevance_filtering(g, 1)
</pre>

# Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the relevance score cutoff is raised

* ''Relevance score cutoff 0 (no filtering): 452 nodes, 8806 edges''
* ''Relevance score cutoff 0.5: 77 nodes, 649 edges''
* ''Relevance score cutoff 1: 19 nodes, 11 edges''

# How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

''Relevance score cutoff 0.2: 242 nodes, 4766 edges''

''We observe approximately half the number of nodes and edges with a cutoff of 0.2. This means that only half the nodes had at least 20% of their edges within the network; the other half had fewer. It's unlikely that half the proteins in the unfiltered network were sticky proteins, but they probably had more to do outside the network than inside, so filtering them out could be a good idea.''
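
The reasoning in the answer above can be illustrated with a small sketch. It assumes that a node's relevance score is the fraction of its known interactions that fall inside the pulldown network. That is an interpretation for illustration only, not DiscoNet's actual implementation, and all counts are made up.

<pre>
# Hypothetical interaction counts for four proteins
toy_nodes <- data.frame(
  node             = c("P1", "P2", "P3", "STICKY"),
  edges_in_network = c(4, 3, 2, 5),     # interactions inside the pulldown
  edges_in_db      = c(5, 6, 20, 500)   # interactions in the whole database
)

# Fraction of each protein's interactions that fall inside the pulldown
toy_nodes$relevance <- toy_nodes$edges_in_network / toy_nodes$edges_in_db

# Proteins kept at a cutoff of 0.2
kept <- toy_nodes$node[toy_nodes$relevance >= 0.2]
kept
</pre>

The "sticky" protein has the most edges inside the network, yet it is removed because those edges are a tiny share of everything it interacts with.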

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

<pre>
ggraph(g1, layout = "kk") +
geom_edge_link() +
geom_node_point(size = 5)

ggraph(g2, layout = "kk") +
geom_edge_link() +
geom_node_point(size = 5)

ggraph(g3, layout = "kk") +
geom_edge_link() +
geom_node_point(size = 5)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

[[Image:G1.png|500px]]

[[Image:G2.png|500px]]

[[Image:G3.png|500px]]

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes. This can be done with the "community_detection" function of DiscoNet:

<pre>
mcode_network <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which ones do you think may be molecular complexes, and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

''MCODE produces the following communities:''

<pre>
lapply(communities[[1]], function(x) paste(vcount(x), ecount(x)))

[[1]]
[1] "364 8311"

[[2]]
[1] "5 8"

[[3]]
[1] "3 3"

[[4]]
[1] "3 3"
</pre>

''Based on what we have learned, community 1 is definitely too large to be a protein complex (protein complexes should have no more than maybe 30-40 proteins, and most likely fewer than that). The rest could be good candidates, so let's visualize community 1 (bad example) and 2 (good example):''

<pre>
ggraph(communities[[1]][[1]], layout = "kk") +
geom_edge_link() +
geom_node_point(size = 5)

ggraph(communities[[1]][[2]], layout = "kk") +
geom_edge_link() +
geom_node_point(size = 5)
</pre>

''Which produces''

[[Image:Community1.png|500px]]

[[Image:Community2_w9.png|500px]]

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

''As we saw in the previous question, communities 2, 3, and 4 could be potential complexes.''

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

head(fora(pathways = BP_list, genes = V(communities$communities[[2]])$name, universe = all_gene_ids))
head(fora(pathways = BP_list, genes = V(communities$communities[[3]])$name, universe = all_gene_ids))
head(fora(pathways = BP_list, genes = V(communities$communities[[4]])$name, universe = all_gene_ids))
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

''It's immediately clear that complex 2 is involved in cardiac development:''

<pre>
   pathway pval padj
1: GOBP_CARDIAC_VENTRICLE_MORPHOGENESIS 6.443594e-16 4.934504e-12
2: GOBP_CARDIAC_CHAMBER_MORPHOGENESIS 1.047694e-14 2.674414e-11
3: GOBP_CARDIAC_VENTRICLE_DEVELOPMENT 1.047694e-14 2.674414e-11
4: GOBP_CARDIAC_CHAMBER_DEVELOPMENT 4.371267e-14 8.368790e-11
5: GOBP_CELL_SURFACE_RECEPTOR_SIGNALING_PATHWAY_INVOLVED_IN_HEART_DEVELOPMENT 1.045901e-13 1.601902e-10
6: GOBP_HEART_MORPHOGENESIS 3.705408e-13 4.729336e-10
</pre>

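''Complex 3 also returns heart-related terms, but note that none of them are significant after multiple testing correction (adjusted p-value of 0.69), so this result should be interpreted with caution:''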

<pre>
   pathway pval padj overlap size
1: GOBP_GERM_CELL_MIGRATION 0.0002322131 0.6914851 1 6
2: GOBP_CARDIAC_MUSCLE_CELL_CARDIAC_MUSCLE_CELL_ADHESION 0.0002709118 0.6914851 1 7
3: GOBP_PROTEIN_MODIFICATION_BY_SMALL_PROTEIN_CONJUGATION 0.0003764297 0.6914851 2 872
4: GOBP_AV_NODE_CELL_TO_BUNDLE_OF_HIS_CELL_SIGNALING 0.0004256966 0.6914851 1 11
5: GOBP_PROTEIN_MODIFICATION_BY_SMALL_PROTEIN_CONJUGATION_OR_REMOVAL 0.0005195136 0.6914851 2 1025
6: GOBP_AV_NODE_CELL_TO_BUNDLE_OF_HIS_CELL_COMMUNICATION 0.0005417747 0.6914851 1 14
</pre>

File:Heart+networks.png

2024-03-05T16:00:58Z

WikiSysop:

Heart Disease and Virtual Pulldown

2024-03-05T15:56:26Z

WikiSysop: Created page with "= Human diseases / virtual pulldown exercise = '''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, and Rasmus Wernersson '''Learning objectives:''' * Overall objective: learn how to extract meaningful networks from human PPI data *# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins" *# Finding tightly connected clusters in larger networks *# Using the DiscoNet package in R == Introduction == Image:Heart+networks.png|600px|right..."

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, and Rasmus Wernersson

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, we see the situation shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.
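
The figure of 200 million follows from counting unordered pairs of proteins, n(n - 1)/2. A quick check in R, assuming exactly 20,000 proteins and ignoring self-interactions:

<pre>
n_proteins <- 20000

# Number of distinct protein pairs: n * (n - 1) / 2
max_interactions <- choose(n_proteins, 2)
max_interactions
</pre>

This gives 199,990,000 possible pairs, i.e. about 200 million.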

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
<br style="clear: both" />

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented
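
Steps 2 and 3 above (collecting 1st order interaction partners and building the combined network) can be sketched with plain igraph functions. This is only an illustration on a made-up interactome, not how DiscoNet is implemented. The node names and the helper objects are hypothetical.

<pre>
library(igraph)

# Hypothetical interactome with two seed proteins, S1 and S2
interactome <- graph_from_literal(S1 - A, S1 - B, S2 - B, S2 - C, C - D, D - E)
seeds <- c("S1", "S2")

# Seeds plus their direct (1st order) interaction partners
hood <- unique(unlist(lapply(ego(interactome, order = 1, nodes = seeds), as_ids)))

# Combined network: all interactions among those proteins
pulldown_toy <- induced_subgraph(interactome, hood)

vcount(pulldown_toy)
ecount(pulldown_toy)
</pre>

Proteins D and E are two or more steps away from every seed, so they are left out of the combined network.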

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seed_nodes_ex2 <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''
# Load the packages
<pre>
library(DiscoNet)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb.Rdata')
</pre>

# Run DiscoNet with this list of proteins with the following parameters:

<pre>
network_ex2 <- virtual_pulldown(seed_nodes = seed_nodes_ex2, database = db, id_type = "hgnc", zs_confidence_score = 0.156)
interactions <- data.frame(network_ex2$network)
node_attributes <- data.frame(network_ex2$node_attributes)
</pre>

# Convert network into igraph object with the following relevance score cutoffs: 0, 0.5, 1

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
g1 <- relevance_filtering(g, 0)
g2 <- relevance_filtering(g, 0.5)
g3 <- relevance_filtering(g, 1)
</pre>

# Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the relevance score cutoff is raised
# How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which ones do you think may be molecular complexes, and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>
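To make the idea of over-representation concrete, here is a minimal sketch of the hypergeometric test that this kind of analysis is based on, written in base R. All numbers are made up for illustration; the variable names are hypothetical and do not come from the exercise data.

```r
# Hypothetical numbers for one gene set and one community
universe_size  <- 20000  # genes in the background (the "universe")
pathway_size   <- 100    # genes in the universe annotated to the GO term
community_size <- 10     # genes in the community being tested
overlap        <- 4      # community genes that carry the GO term

# P(overlap >= 4) when drawing community_size genes at random from the universe
p_value <- phyper(overlap - 1,
                  pathway_size,
                  universe_size - pathway_size,
                  community_size,
                  lower.tail = FALSE)
p_value
```

With these toy numbers the expected overlap by chance is only 0.05 genes, so an observed overlap of 4 gives a very small p-value. In a real analysis many gene sets are tested at once, so the adjusted p-values in the fora output are the ones to interpret.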

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Autumn2023

2024-03-05T15:55:40Z

WikiSysop: /* Lecture 09 (Nov 2) - Systems Biology in Biomedical Research (Heart diseases) 2 */

= Course 22140 - plan for autumn 2023 =

'''Teachers:'''
* Lars Rønn Olsen (course organizer) - '''contact:''' [mailto:lronn@dtu.dk lronn@dtu.dk]
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* Rasmus Wernersson (external lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Hanxi Li (teaching assistant) - '''contact:''' [mailto:hanxli@dtu.dk hanxli@dtu.dk]



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/167355 Course 22140, Autumn 2022 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge about basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 22111 or one of its variants). If you need to read up on some bioinformatics topics, please use the links below.
* [https://teaching.healthtech.dtu.dk/22111/ Course 22111] - ''Introduction to Bioinformatics''
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= R =
For the computer exercises we will be using R to process data, analyze, and visualize the biological networks. R is Open Source and freely available for Windows, Mac and Linux. We will be using an RStudio Server cloud solution to make sure that everyone uses the same version of R and the needed packages. You can log in with your DTU credentials [https://teaching.healthtech.dtu.dk/22140/rstudio.php here].

'''NOTE''': In order to produce plots with RStudio Server, you need to have the appropriate graphics device activated. If you have X11 installed, this should work without any further action. If you do not, you will get an error whenever you try to plot anything. To fix this, open RStudio Server, go to "Tools" (menu bar at the top of the screen), select "Global options" from the drop-down menu, select the "Graphics" tab, and change "Backend" to "Cairo".
<br><br>

= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise the reports will be handed in using the peer grade system. We will assign your report to three co-students to provide you with feedback.

'''Important:''' The reports are not as such mandatory, but it is HIGHLY recommended to turn them in, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.

= Lecture plan, autumn 2023 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>

== Block #1: Introduction ==
'''Responsible for this block:''' Lars Rønn Olsen and Rasmus Wernersson
----
=== Lecture 01 (August 31) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/material/22140/W01_Lazebnik_CancerCell2002.pdf PDF])
:'''Exercise:''' [[igraphIntro_Ex_v1|Introduction to working with networks in R]] - '''Answers:''' [[igraphIntro_Answers_v1|Exercise #1 answers]]

=== Lecture 02 (Sep 7) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Lars Rønn Olsen

:'''Slides:''' To appear on DTU Learn
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Nature 2011 ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/material/22140/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/material/22140/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* [[Media:W02_exercises_v7_corrected.pdf|Building protein-protein interaction networks from experimental data]] (solutions now on DTU Learn)
:* [[Media:Exercise_help_sheet.pdf|Note taking sheet for help with ex. 5,7,8,9]] - Consider printing this for taking notes
:* [[Ex_handouts_igraph|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[Ex_handouts_igraph_solution|Exercise #2 answers]]

=== Lecture 03 (Sep 14) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Lars Rønn Olsen
:'''Slides:''' On DTU Learn.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/material/22140/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/material/22140/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Topology/statistics/modules [[ExTopology1_igraph|Network topology and statistics]] - '''Answers''': [[ExTopology1_igraph_solutions|Answers to igraph exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson and Kristoffer Vitting-Seerup
----

=== Lecture 04 (Sep 21) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/material/22140/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points about the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio_R|Yeast cell cycle 1 - introduction to data and methods]] - '''Answers:''' [[ExYeastSysBio_R_answers|Yeast 1 answers]]

=== Lecture 05 (Sept 28) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Rasmus Wernersson
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [https://teaching.healthtech.dtu.dk/material/22140/Bbr002.pdf PDF] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [https://teaching.healthtech.dtu.dk/material/22140/GO_NATURE_GENETICS_2000.pdf PDF] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On DTU Learn

:'''Exercise:''' [[ExGeneOntology_Yeast_R|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast_R_answers|answers]]

=== Lecture 06 (Oct 5) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/691406/View DTU Learn]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/material/22140/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics_R|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' [[ExYeastCellCycle_answers|Answers]]

=== Lecture 07 (Oct 12) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson.
:'''Readings:''' [https://teaching.healthtech.dtu.dk/material/22140/Cyclebase1_2008.pdf PDF Cyclebase paper] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2_R|Mapping temporal expression data onto networks]] '''Answers:''' [[ExYeastCellCycleTranscriptomics2_R_answers|answers]]

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Lars Rønn Olsen, Rasmus Wernersson, and Kristoffer Vitting-Seerup
----

=== Lecture 08 (Oct 26) - Systems Biology in Biomedical Research (Heart diseases) 1 - CANCELLED ===



=== Lecture 09 (Nov 2) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Virtual pulldown and protein complex detection'' - Lars Rønn Olsen and Giorgia Moranzoni
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/material/22140/Furlong_Cell2012.pdf PDF]) - Concentrate on: '''Figure 1''' and '''Box 3'''
:* '''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Exercises:''' [[DiscoNet|DiscoNet]] - '''Answers:''' [[DiscoNet_answers|DiscoNet answers]]



=== Lecture 10 (Nov 9) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Systems Biology in Cancer'' - Kristoffer Vitting-Seerup
:'''Readings:'''
:* ''The Hallmarks of Cancer'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/704041/View On DTU Learn]
:'''Exercises:''' [[DiscoNet2|Multiomics data integration]] '''Answers:''' [[DiscoNet2_answers|answers]]



=== Lecture 11 (Nov 16) - Essential R functions + exam exercise ===

:'''Lecture:''' Week 10 (and to some extent week 9) exercise walk-through
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' Old exam set (adapted to R)

<br>

=== Lecture 12 (Nov 23) - QnA / AMA ===

:'''Lecture:''' Kristoffer Vitting-Seerup
:'''Topics:''' Anything you would like a refresher on

<br>

=== Lecture 13 (Nov 30) - Systems Biology in Biomedical Research 3 (ZS Revelen framework, drug targets) ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': ''' Network analysis of COVID19 – Intomics''' (PDF on DTU Learn)
:* It's a document with a written explanation of the COVID-19 analysis. (PDF of Intomics web-page, apologies for the sub-optimal formatting)
:* Please read all of it - ''including methods''. It is written for a non-technical audience and should be easy to understand.

:'''Slides:''' To appear on DTU Learn
:'''Exercise:'''
:* PDF on Learn
:'''Link to ZS Revelen:''' '''https://zs-revelen.com/'''
:*'''Please register''' your email with ZS Revelen before the exercise. '''Please use your DTU email.'''
<br>

= Old exam sets =

* On Learn



= Exam =
* '''Date:''' 6/12 2022
* '''Time:''' 15:00-19:00
* '''Where:''' (will be) available via https://eksamensplan.dtu.dk/

File:XKCD significant.png

2024-03-05T15:54:31Z

WikiSysop:

ExYeastCellCycleTranscriptomics2 R answers

2024-03-05T15:51:09Z

WikiSysop: Created page with "===Question 1=== <pre> library(igraph) library(ggraph) load("home/projects/22140/exercise4.Rdata") load("home/projects/22140/exercise6.Rdata") expr$log2fc <- log2(expr$GSM287992/expr$GSM287991) node_attributes_updated <- merge(x = node_attributes, y = expr[!duplicated(expr$SysName),c(1,6)], by.x = "ID", by.y = "SysName", all.x = TRUE) g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes_updated) ggraph(g) + geom_edge_link() + g..."

===Question 1===

<pre>
library(igraph)
library(ggraph)

load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise6.Rdata")

expr$log2fc <- log2(expr$GSM287992/expr$GSM287991)

node_attributes_updated <- merge(x = node_attributes, y = expr[!duplicated(expr$SysName),c(1,6)], by.x = "ID", by.y = "SysName", all.x = TRUE)

g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes_updated)

ggraph(g) +
geom_edge_link() +
geom_node_point(aes(color = log2fc, shape = abs(log2fc)>2), size = 3) +
scale_color_gradient2(low = "blue", mid = "gray", midpoint = 0, high = "red")

expr[grepl(pattern = "kar", x = expr$PopName, ignore.case = TRUE),]
</pre>

The KAR genes are involved in the karyogamy process (also discussed in the answers for last week's exercise), and it makes good sense that they are overexpressed in the alpha-factor arrested cells. (Alpha-factor triggers the mating response, which in turn prepares for a fusion of the nuclei of the A- and alpha-cells – this is the process known as karyogamy).

===Question 2===

The title of the publication is found directly in the GEO database under "Citation(s)".

The number of measurements (50) can be found looking at the number of “samples” being associated with this database entry.

===Question 3 + 4===

<pre>
# all in one
df <- alpha30_38[alpha30_38$gene %in% c("YPL153C", "YMR199W", "YBL002W", "YGR108W", "YKL185W") & alpha30_38$experiment == "alpha30",]
ggplot(df, aes(x = timepoint, y = log2fc, color = gene)) +
geom_point() +
geom_line() +
ggtitle("alpha30")

# or one at a time if you prefer
ggplot(df, aes(x = timepoint, y = log2fc, group = gene)) +
geom_point() +
geom_line() +
facet_wrap(~gene) +
ggtitle("alpha30")

df <- alpha30_38[alpha30_38$gene %in% c("YPL153C", "YMR199W", "YBL002W", "YGR108W", "YKL185W") & alpha30_38$experiment == "alpha38",]
ggplot(df, aes(x = timepoint, y = log2fc, color = gene)) +
geom_point() +
geom_line() +
ggtitle("alpha38")

ggplot(df, aes(x = timepoint, y = log2fc, color = gene)) +
geom_point() +
geom_line() +
facet_wrap(~gene) +
ggtitle("alpha38")

# Notice that this is the dye swap experiment (in essence the sign will be swapped for the log2 ratios compared to the alpha30 plot).
</pre>

From looking at the plots I estimate the following intervals between the peaks (or between the low points if you prefer). Ignore small bumps on the graphs (measurement uncertainty) and look for the larger trends. The first list below is for alpha30 (question 3), the second for alpha38 (question 4).

RAD53/YPL153C: 55 min

CLN1/YMR199W: 50 min

HTB2/YBL002W: 65 – 70 min

CLB1/YGR108W: 60 min

ASH1/YKL185W: 65 - 70 min

RAD53/YPL153C: 60 min

CLN1/YMR199W: 65 min

HTB2/YBL002W: 60 min

CLB1/YGR108W: 60 min

ASH1/YKL185W: 65 min

All in all the “true” interdivision time (looking across all 5 genes and both experiments) appears to be ~60 min.

===Question 5===

Since only a low percentage of all yeast genes are expected to be cell cycle regulated (most are needed for other stuff like basic metabolism) we should expect a random sample of genes to contain few or no cyclic patterns.

<pre>
df <- alpha30_38[alpha30_38$gene %in% sample(unique(alpha30_38$gene), 5) & alpha30_38$experiment == "alpha30",]
ggplot(df, aes(x = timepoint, y = log2fc, color = gene)) +
geom_point() +
geom_line() +
ggtitle("alpha30")
</pre>

===Question 6===

<pre>
df <- alpha30_38[alpha30_38$gene %in% "YBL002W",]
ggplot(df, aes(x = timepoint, y = log2fc, color = experiment)) +
geom_point() +
geom_line() +
ggtitle("HTB2 in alpha30 and alpha38")
</pre>

As expected, the two graphs are (almost) mirror images of each other. Notice that the actual mRNA/cDNA is the same in both cases, but that the labeling has been reversed, and independent hybridizations against CONTROL have been performed for each timepoint. The slight variation between the graphs is due to the small fluctuations there will always be between independent measurements (“technical variance”).

===Question 7===

The trick here is to work with the information we were given about where the 5 genes are supposed to peak, and then start translating the time in minutes into phases.

As stated in the exercise manual, we assume that each phase is 25% of the interdivision time.

<pre style="overflow:auto;">
RAD53 [YPL153C] G1 (mid) DNA repair/cell cycle arrest
CLN1 [YMR199W] G1/S1 G1-cyclin - controls entry to the S-phase
HTB2 [YBL002W] S (mid) Histone H2B - histones are needed for the new chromosomes
CLB1 [YGR108W] G2 (mid) B-type cyclin
ASH1 [YKL185W] M (late) Transcriptional regulator (during anaphase)
</pre>

Here we are going to use both the graphs from question 3+4 and the estimated interdivision time (~60 min). For example, in the alpha30 graph HTB2 appears to peak at 65 minutes, which places 65 min (and, one cycle earlier or later, 5 and 125 min) in the middle of the S phase. Likewise, the time points 30 min before and after (35 and 95 min) lie directly opposite on the “phase wheel” and can be assigned to the middle of the M phase.

By going over the graphs and working out the phases from the 5 genes, it will in the end be possible to come up with some good estimates of where the phases are found.

{| class="wikitable" style="margin:auto"
|-
! Cell cycle phase !! Time in minutes (approximate)
|-
| G1 || 0, 50-60, 110-120
|-
| S || 5-15, 65-75
|-
| G2 || 20-30, 80-90
|-
| M || 35-45, 95-105
|}
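The reasoning above can also be sketched as a small calculation. This is an idealised illustration, not the method used to make the table: it assumes an interdivision time of exactly 60 min, four phases of exactly 25% each starting with G1, and HTB2 peaking at 65 min in the exact middle of S. Because of these assumptions, the phase borders it produces differ by a few minutes from the hand-made table.

```r
interdivision_time <- 60     # minutes, estimate from question 3+4
anchor_time        <- 65     # minutes, HTB2 peak in alpha30
anchor_fraction    <- 0.375  # middle of S if G1, S, G2, M each take 25%

timepoint <- seq(0, 120, by = 5)

# Position in the cell cycle as a fraction between 0 and 1 (wraps around)
cycle_fraction <- ((timepoint - anchor_time) / interdivision_time + anchor_fraction) %% 1

# Translate the fraction into a phase label
phase <- cut(cycle_fraction,
             breaks = c(0, 0.25, 0.5, 0.75, 1),
             labels = c("G1", "S", "G2", "M"),
             right = FALSE, include.lowest = TRUE)

phase_table <- data.frame(timepoint, phase)
phase_table
```

The modulo operator <code>%%</code> is what makes time points one full cycle apart (for example 5 and 65 min) end up in the same phase.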

===Question 8===

Coming soon!

===Question 9===

Peak time (as percentage of cell cycle) from cyclebase.org:

RAD53/YPL153C: 17

CLN1/YMR199W: 25

HTB2/YBL002W: 40

CLB1/YGR108W: 63

ASH1/YKL185W: 97

===Question 10===

<pre>
peaktime[abs(peaktime$peaktime - peaktime[peaktime$gene == "HTB2",]$peaktime)<=5,]$gene
</pre>
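One caveat with the plain absolute difference: peak time is a percentage of a cycle, where 0 and 100 are the same point (the M/G1 transition). For HTB2 at 40% this makes no difference, but for a gene peaking close to 0 or 100 the window should wrap around. A minimal sketch of a wrap-around distance is shown below. The toy data frame reuses the five peak times from question 9 plus one made-up gene ("GENE_X"); the function name is hypothetical.

```r
# Toy stand-in for the "peaktime" data frame; GENE_X is invented for illustration
peaktime_toy <- data.frame(
  gene     = c("RAD53", "CLN1", "HTB2", "CLB1", "ASH1", "GENE_X"),
  peaktime = c(17, 25, 40, 63, 97, 2),
  stringsAsFactors = FALSE
)

# Distance on a circle of circumference 100 (percent of cell cycle)
circular_distance <- function(a, b) {
  d <- abs(a - b) %% 100
  pmin(d, 100 - d)
}

ref <- peaktime_toy$peaktime[peaktime_toy$gene == "ASH1"]

# Genes peaking within 5 percentage points of ASH1, across the 100/0 boundary
near_ash1 <- peaktime_toy$gene[circular_distance(peaktime_toy$peaktime, ref) <= 5]
near_ash1
```

With the plain difference, GENE_X (2%) would be 95 points away from ASH1 (97%) and would be missed, although the two peak only 5% of a cycle apart.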

===Question 11===

Coming soon!

File:YBL002W.cdc15.png

2024-03-05T15:49:33Z

WikiSysop:

File:YBL002W.30.png

2024-03-05T15:49:04Z

WikiSysop:

File:HTB2-peaktime.png

2024-03-05T15:48:33Z

WikiSysop:

File:YKL185W.30.png

2024-03-05T15:46:36Z

WikiSysop:

ExYeastCellCycleTranscriptomics2 R

2024-03-05T15:45:50Z

WikiSysop: Created page with "= Yeast cell cycle / transcriptomics exercise #2 = '''Exercise written by:''' Rasmus Wernersson and Lars Rønn Olsen = PART 1: Network analysis of the Alpha Factor Arrest data = left IMPORTANT: We continue working with the data set from week 4. == Network analysis == The final part of the analysis of the alpha-factor arrest data set we started in the previous exercise, is to map it onto the Yeast protein-protein interaction netwo..."

= Yeast cell cycle / transcriptomics exercise #2 =
'''Exercise written by:''' Rasmus Wernersson and Lars Rønn Olsen

= PART 1: Network analysis of the Alpha Factor Arrest data =
[[Image:Emblem-important_tiny.png|left]] IMPORTANT: We continue working with the data set from week 4.

== Network analysis ==
The final part of the analysis of the alpha-factor arrest data set that we started in the previous exercise is to map it onto the yeast protein-protein interaction network we worked with in week 4.

'''TASK: reload the base session and add the expression data'''

<pre>
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise6.Rdata")
# Calculate log2fc from expression
expr$log2fc <- log2(expr$GSM287992/expr$GSM287991)
# Add log2fc to your node attribute table
node_attributes_updated <- merge(x = node_attributes, y = expr[!duplicated(expr$SysName),c(1,6)], by.x = "ID", by.y = "SysName", all.x = TRUE)
# If you haven't worked with the "merge" command, take a moment to understand the line above. It's a super useful command
</pre>

'''TASK: Visualize the Log2(FC)'''
* Plot the network using ggraph with the following mappings:
* Color nodes by log2FC
* Add the "scale_color_gradient2" scale to the ggraph plot to color down regulated nodes blue, (mid color gray), and up regulated red
<pre>
scale_color_gradient2(low = "blue", mid = "gray", midpoint = 0, high = "red")
</pre>
* Set node shape based on whether the log2fc is lower than -2 or higher than 2. Hint:
<pre>
aes(shape = abs(log2fc) > 2)
</pre>

The next task is to have a look at the network and try to interpret the results. Here it should be noted that the cells have been arrested in what may be a somewhat "boring" part of the cell cycle (late G1), but we can still make a few interesting observations.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1: Inspect the network'''
* Can you find any clusters (even small ones) with several genes being regulated in the same direction?
* Discuss the biological meaning of this within the group (note that not all clusters are super-easy to interpret).
* Which of the KAR genes are regulated?
* Include a screenshot of the network in your report

= PART 2: Arrest and Release time-series experiment =

== The "alpha-30/alpha-38" arrest and release experiments ==
[[Image:YKL185W.30.png|right|400px|thumb|'''ASH1''' expression as a function of time after release from alpha-factor arrest (from Pramila et al, 2006)]]

For this part of the exercise we'll be working with an alpha-factor arrest-and-release experiment (the "'''alpha-30/alpha-38'''" experiment from the Breeden lab). Briefly, the experimental set-up is as follows:
* '''ARREST:''' The culture was arrested using alpha-factor (as we have seen before).
* '''RELEASE:''' When most of the cells had been arrested in cell cycle, the cells were spun down, and re-suspended in fresh media (thus removing the alpha-factor).
* '''SAMPLING:''' small samples were collected from the culture at 5-minute intervals following release (and experimental tricks were used to quickly kill the cells and protect the RNA).
* '''ARRAY ANALYSIS:''' for each timepoint the synchronized cells were compared to an asynchronous culture on a '''two color array''' (competitive hybridization) using Cy3 and Cy5 labeling.
** '''DYE SWAP:''' The experiment was carried out twice, once with CASE (Cy3) vs. CONTROL (Cy5) and once with CASE (Cy5) vs. CONTROL (Cy3) - this is done to eliminate technical biases in the labeling process.

=== Understanding the data ===
'''TASK: find the "alpha-30" experiment in GEO'''
* Search for the accession ID: GSE4987

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Find the title of the publication describing the experiment.
* How many arrays (measurements) are associated with the experiment?

'''CLEANED UP DATA:'''
* From the data available for download at GEO (the MATRIX file mentioned above), we have prepared an extract (and very slight reformatting) of the data we need for this exercise.
<pre>
load("/home/projects/22140/exercise7.Rdata")
</pre>
* The data frame "alpha30_38" contains the data in long format. Take a minute to explore the data frame. The log2fc column is the log2 fold change between the two conditions, case and control. In the alpha30 experiment the ratio is case (Cy3) / control (Cy5). In alpha38 the dyes were swapped, so the ratio is control (Cy3) / case (Cy5).
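To see what the dye swap does to the numbers, here is a tiny worked example with made-up intensities. It assumes, as described above, that the ratio is case / control in alpha30 and control / case in alpha38; the variable names and values are hypothetical.

```r
# Hypothetical signal intensities for one gene at one time point
case_signal    <- 800
control_signal <- 200

log2fc_alpha30 <- log2(case_signal / control_signal)  # case / control
log2fc_alpha38 <- log2(control_signal / case_signal)  # control / case (dye swap)

log2fc_alpha30
log2fc_alpha38
```

Swapping numerator and denominator only flips the sign of the log2 ratio, which is why the two experiments are expected to look like mirror images of each other around zero.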

=== Estimating the inter-division time ===
Before we move on with the analysis of the biology described by the data, we need to have a better understanding of how the '''time series''' relates to the cell cycle. We'll start out by estimating the '''inter-division time''' (the number of minutes it takes for a full cell cycle).

This can be done by simply plotting the '''expression profiles''' of a few genes that we expect to follow a '''cyclic pattern''' in the data set, for example:

<pre style="overflow:auto;">
RAD53 [YPL153C] G1 (mid) DNA repair/cell cycle arrest
CLN1 [YMR199W] G1/S1 G1-cyclin - controls entry to the S-phase
HTB2 [YBL002W] S (mid) Histone H2B - histones are needed for the new chromosomes
CLB1 [YGR108W] G2 (mid) B-type cyclin
ASH1 [YKL185W] M (late) Transcriptional regulator (during anaphase)
</pre>

'''TASK:''' below is a hint on how to use ggplot to plot x = timepoint and y = log2fc

<pre>
# all in one
df <- alpha30_38[alpha30_38$gene %in% c("YPL153C", "YMR199W", "YBL002W", "YGR108W", "YKL185W") & alpha30_38$experiment == "alpha30",]
ggplot(df, aes(x = timepoint, y = log2fc, color = gene)) +
geom_point() +
geom_line() +
ggtitle("alpha30")

# or one at a time if you prefer
ggplot(df, aes(x = timepoint, y = log2fc, group = gene)) +
geom_point() +
geom_line() +
facet_wrap(~gene) +
ggtitle("alpha30")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3: Estimate inter-division times'''
* Start out by plotting the graphs using the code above for the 5 genes for the CASE vs. CONTROL part of the data ("Cy3/Cy5" - "Alpha30")
* Estimate the distance between the peaks (just look at them), and report the results in '''minutes''':
** '''RAD53/YPL153C:'''
** '''CLN1/YMR199W:'''
** '''HTB2/YBL002W:'''
** '''CLB1/YGR108W:'''
** '''ASH1/YKL185W:'''
** Include the plot in your report

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4: Estimate inter-division times again''' - this time from the CONTROL vs. CASE data
* Plot the graph for the five genes using the '''dye swap''' data ("Alpha38"), and report the estimated inter-division times in minutes:
** '''RAD53/YPL153C:'''
** '''CLN1/YMR199W:'''
** '''HTB2/YBL002W:'''
** '''CLB1/YGR108W:'''
** '''ASH1/YKL185W:'''
** Include the plot in your report
* '''CONCLUDE:'''
** Is there good agreement about the inter-division time?
** Make a combined estimate about the "true" inter-division time
** How many cell divisions does the time series cover?
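If you want to cross-check your visual estimate, the peak-to-peak distance can also be found programmatically. The sketch below uses a synthetic, noise-free profile with a known 60-minute period rather than the real data, so it only illustrates the principle. Real profiles are noisy and will produce extra small "peaks" that need to be ignored, just as when reading the plot by eye.

```r
# Synthetic expression profile: period 60 min, sampled every 5 min (not real data)
timepoint <- seq(0, 120, by = 5)
log2fc    <- cos(2 * pi * (timepoint - 20) / 60)

# A peak is a point where the slope changes from positive to negative
slope_sign <- sign(diff(log2fc))
peak_idx   <- which(diff(slope_sign) == -2) + 1
peak_times <- timepoint[peak_idx]

# Estimated inter-division time and number of cycles covered by the series
peak_interval  <- diff(peak_times)
cycles_covered <- diff(range(timepoint)) / peak_interval

peak_times
peak_interval
cycles_covered
```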

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5: cyclic genes - what should be expected?'''
* Think about this: if you pick a handful of '''random''' genes from the big data matrix (that is, across the entire genome), would you expect them to follow a cyclic pattern with the inter-division time you have just estimated above?
* To back up your argument you are welcome to make an expression plot of 10 randomly selected genes.

=== A brief look at the dye-swap experiment ===
[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #6:'''
* Select 1-2 of the cyclic genes from the table above, and make '''one plot for each gene''' showing the expression data for both "normal" and "dye-swap".
* Do the two expression profiles (for each gene) follow a pattern you would expect?
* Include your plot(s) in the report.

=== Mapping the cell cycle phases onto the time points ===
[[Image:HTB2-peaktime.png|frame|right|'''HTB2''' peaks roughly halfway through the S-phase. Assuming each phase is 25% of the cycle, HTB2 will be mapped into the S-phase as shown here]]
From your observations above (the phases of the 5 genes listed in the table), it should be possible to do a '''rough''' mapping from time in '''minutes''' to cell cycle phases.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #7:'''
* Make a table (e.g. Excel, Word, text-based) where you map the time in minutes (0-120) to an '''estimate''' of the corresponding point in cell cycle.
* Hint: start by mapping out the known peaks, and fill in the rest from there.
* Include the table in your report.
* '''SANITY CHECK:''' Is your table in alignment with the fact that Alpha-factor arrest is linked to the G1/S phase transition?

=== Network analysis of the time-series data ===
'''TASK:'''
* Add the log2fc and timepoint from alpha30 experiment to your node attribute table using the "merge" function.

[[Image:Yeast_network_navigation_guide_CS_3.5.1.png|right|200px|border]]
'''TASK:'''
* Reload a graph object with the updated node attribute table
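A minimal sketch of the merge step for a single time point is shown below, using toy data frames. The toy tables and their values are made up; only the column names ("ID", "gene", "experiment", "timepoint", "log2fc") follow the ones used elsewhere in this exercise, and the real node attribute table has more columns.

```r
# Toy node attribute table and toy long-format expression data (invented values)
node_attributes_toy <- data.frame(ID = c("YBL002W", "YKL185W", "YAL001C"),
                                  stringsAsFactors = FALSE)
alpha30_38_toy <- data.frame(
  gene       = c("YBL002W", "YBL002W", "YKL185W", "YKL185W"),
  experiment = "alpha30",
  timepoint  = c(35, 65, 35, 65),
  log2fc     = c(-1.2, 1.8, 1.1, -0.9),
  stringsAsFactors = FALSE
)

# Pick one experiment and one time point, keep only the columns we need
one_tp <- alpha30_38_toy[alpha30_38_toy$experiment == "alpha30" &
                           alpha30_38_toy$timepoint == 65,
                         c("gene", "timepoint", "log2fc")]

# Left join: keep every node, fill in NA where no expression value exists
merged <- merge(x = node_attributes_toy, y = one_tp,
                by.x = "ID", by.y = "gene", all.x = TRUE)
merged
```

Because <code>all.x = TRUE</code> is set, nodes without a measurement (here YAL001C) stay in the table with NA values, so the network keeps all its nodes. The merged table has the ID column first, so it can be passed as the <code>vertices</code> argument of <code>graph_from_data_frame()</code>.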

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #8:'''
* Select one time point in the '''S-phase''' and one in the '''M-phase''' based on your work above.
* Report the selected time points in minutes.
* For '''each''' of the 2 time points, plot the network with nodes colored by log2fc:
* Which of the previously defined clusters 1-8 appear to be up/down regulated in this cell cycle phase?
* Document your finding with a few screenshots of selected clusters
* Is this in good agreement with our previous functional analysis of the clusters?
<br>

= PART 3: Combined arrest-and-release experiments and peak-time =
[[Image:YBL002W.30.png|right|300px|thumb|'''HTB2''' expression as a function of time after release from alpha-factor arrest (data from Pramila et al, 2006)]]
[[Image:YBL002W.cdc15.png|right|300px|thumb|'''HTB2''' expression as a function of time after release from CDC15 arrest (data from Spellman et al, 1998)]]

As the final part of the exercise, we investigate what we can learn from an '''integrative analysis''' of '''entire''' cell cycle data sets. As we have discussed in today's lecture the idea is to perform the following analysis:
# Use a mathematical model to determine '''which''' of the genes are periodically expressed.
# As part of this analysis estimate the '''peak time''' of all the periodically expressed genes.

== The "data alignment" problem ==
In order to use data from multiple different experiments we need to overcome a few difficulties - most importantly:
# The growth conditions may be different (e.g. medium and temperature) leading to different inter-division times
# Different arrest methods halt the cell division in different stages (meaning that "time-zero" is not in the same phase).
# The experiments may last a different number of cell divisions (typically 1.5 - 2.5).

[[Image:Cogs_brain.png|50px]]
'''QUESTION/DISCUSSION POINT:''' - discuss the following in the group (you don't need to put it in your report)
* What steps would be needed in order to make two experiments comparable?
** Hint: Use the curves of HTB2 shown to the right as the basis of the discussion.

<br style="clear: both" />

== Introducing Cyclebase.org ==
Here we will use the online resource [http://cyclebase.org cyclebase.org], which is dedicated to cell cycle analysis and contains a lot of easy-to-browse information about which yeast genes are periodically expressed. Note that from the circle plot you can get extra information by hovering your mouse over each section for 2-3 seconds.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:'''
* Go to [https://cyclebase.org/ cyclebase.org] and look up the PEAK TIME (in percent of cell cycle) for the 5 genes we have already worked with:
** '''RAD53/YPL153C:'''
** '''CLN1/YMR199W:'''
** '''HTB2/YBL002W:'''
** '''CLB1/YGR108W:'''
** '''ASH1/YKL185W:'''

As is evident from the very detailed pages for each gene at cyclebase.org, quite a lot of advanced analysis went into boiling down the information contained in the experiments into a few key numbers. Here, we only need to concern ourselves with the PEAK TIME (and then we, for now, trust that the authors did a decent job at finding the periodically expressed genes).

[[Image:Document-save.png|left|25px]]
'''TASK: download data'''
* Most data from CycleBase is available for download at: [https://cyclebase.org/Downloads CycleBase download] as TEXT files that are pretty easy to work with.
* However, in order to save some time, we have prepared a data frame, "peaktime", which contains the most important information for the '''periodic genes''' (you will find it in the data for exercise 7)

<blockquote style="background-color: lavender; border: solid thin grey;overflow:auto;">
'''Quoting from CycleBase:'''<br><br>
The '''peaktime''' describes when in the cell cycle a gene is maximally expressed. Peaktime is calculated as a percent, with both 0 and 100 representing the M/G1 transition in the cell cycle. These percents are displayed as discrete phases or transitions of the cell cycle.<br><br>

A peaktime for a single expression profile first requires that a sine wave be fitted to the profile. The algorithm scans through all possible offsets and selects the sine wave that has the best correlation with the observed expression profile. The peaktime is then computed as the peak of this sine wave.<br><br>

To compute a peaktime for a single gene across all available experiments, the time scale was 'shifted' such that time was represented as a fraction of the cell cycle. In this scale, both 0 and 100 correspond to the M/G1 transition. As experiments with not very periodic profiles produce poor peaktimes, the combined peaktime was weighted to take this into account.<br>
</blockquote>
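
The offset scan described in the quote can be illustrated with a few lines of base R. This is only a toy sketch on simulated data, not the actual Cyclebase implementation: the function name <code>estimate_peaktime</code>, the 1% offset grid, and the assumption that the cell cycle period is already known and that time zero sits at 0% are all ours.

<pre style="overflow:auto;">
# Toy sketch: estimate a peak time by scanning sine-wave offsets.
# ASSUMPTIONS (not from Cyclebase): known period, 1% offset grid,
# and time zero corresponding to 0% of the cell cycle.
estimate_peaktime <- function(time, expr, period) {
  offsets <- 0:99                                # candidate peak positions (% of cycle)
  phase <- (time %% period) / period * 100       # time expressed as % of the cell cycle
  cors <- sapply(offsets, function(o) {
    model <- cos(2 * pi * (phase - o) / 100)     # wave that peaks at offset 'o'
    cor(expr, model)                             # agreement with the observed profile
  })
  offsets[which.max(cors)]                       # offset with the best correlation
}

# Simulated profile: two cell cycles of 60 minutes, true peak at 30% of the cycle
set.seed(1)
time <- seq(0, 120, by = 5)
period <- 60
expr <- cos(2 * pi * (time / period - 0.30)) + rnorm(length(time), sd = 0.1)

estimate_peaktime(time, expr, period)            # should land close to 30
</pre>

Because the profile is noisy, the estimate will be near, but not necessarily exactly at, the true peak position.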

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:'''
* Extract all the genes with a peaktime within +/-5 of the gene HTB2
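
One detail worth keeping in mind is that peaktime is a circular scale (0 and 100 are the same point), so a window around a gene peaking late in the cycle should wrap around. Below is a minimal sketch of one way to handle this. The data frame <code>peaktime_toy</code>, its column names, and all the numbers in it are invented for illustration - check the actual column names in the "peaktime" data frame before adapting it.

<pre style="overflow:auto;">
# Toy stand-in for the "peaktime" data frame.
# ASSUMPTION: column names and ALL values are made up, not Cyclebase data.
peaktime_toy <- data.frame(
  gene     = c("HTB2", "GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  peaktime = c(98, 2, 94, 50, 10),
  stringsAsFactors = FALSE
)

ref <- peaktime_toy$peaktime[peaktime_toy$gene == "HTB2"]

# Circular distance on a 0-100 scale: 98 and 2 are only 4 apart
d <- abs(peaktime_toy$peaktime - ref)
d <- pmin(d, 100 - d)

# Keep genes within +/-5 of the reference gene (excluding the gene itself)
near_htb2 <- peaktime_toy[d <= 5 & peaktime_toy$gene != "HTB2", ]
near_htb2$gene
</pre>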

== Network analysis of peak-time data ==
[[Image:Cogs_brain.png|50px]]
'''OPEN ASSIGNMENT:''' this requires some work on your own and dialogue within the group
* STEP 1: Map the peak-time data into the Yeast PPI network (create a new work session), and make a discrete color-code showing the peak-time:
** G1-phase: 1-25
** S-phase: 26-50
** G2-phase: 51-75
** M-phase: 76-100
** (find a good neutral color for nodes with no data)

* STEP 2: find regulated clusters
** '''Party hubs:''' Any clusters with a clear peak-time signal? Are all members periodically expressed?
** '''Date hubs:''' Any clusters with key proteins interacting with different proteins throughout the cell cycle?
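
The discrete color-code from STEP 1 can be prepared in base R before it is mapped onto the network. The sketch below is one possible approach, using a made-up vector of peak times; the variable names and the chosen colors are our own assumptions.

<pre style="overflow:auto;">
# Made-up peak times (percent of cell cycle); NA = gene without peak-time data
peak <- c(3, 25, 26, 60, 99, NA)

# Bin into the four phases: (0,25], (25,50], (50,75], (75,100]
phase <- cut(peak, breaks = c(0, 25, 50, 75, 100),
             labels = c("G1", "S", "G2", "M"))

# Give nodes without data their own category so they can get a neutral color
phase <- as.character(phase)
phase[is.na(phase)] <- "no data"
phase

# One color per category (arbitrary choice of colors)
phase_colors <- c("G1" = "#1b9e77", "S" = "#d95f02", "G2" = "#7570b3",
                  "M" = "#e7298a", "no data" = "grey80")
</pre>

Note that with these breaks a peak time of exactly 0 would end up as "no data", so you may want to check whether such values occur in your data. A named color vector like <code>phase_colors</code> can be passed to a manual color scale in ggplot2/ggraph.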

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #11:'''
* Document your findings as best as you can - include figures where needed (and remember to explain the color-coding).

Autumn2023

2024-03-05T15:43:53Z

WikiSysop: /* Lecture 02 (Sep 7) - Intro 2 */

= Course 22140 - plan for autumn 2023 =

'''Teachers:'''
* Lars Rønn Olsen (course organizer) - '''contact:''' [mailto:lronn@dtu.dk lronn@dtu.dk]
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* Rasmus Wernersson (external lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Hanxi Li (teaching assistant) - '''contact:''' [mailto:hanxli@dtu.dk hanxli@dtu.dk]



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/167355 Course 22140, Autumn 2022 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge about basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 22111 or one of its variants). If you need to read up on some bioinformatics topics, please use the links below.
* [https://teaching.healthtech.dtu.dk/22111/ Course 22111] - ''Introduction to Bioinformatics''
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= R =
For the computer exercises we will be using R to process data, analyze, and visualize the biological networks. R is Open Source and freely available for Windows, Mac and Linux. We will be utilizing an RStudio Server cloud solution to make sure that everyone uses the same version of R and the needed packages. You can log in with your DTU credentials [https://teaching.healthtech.dtu.dk/22140/rstudio.php here].

'''NOTE''': In order to produce plots with RStudio Server, you need to have the appropriate graphics device activated. If you have X11 installed, this should work without any further actions. If you do not, you will get an error whenever you try to plot anything. To fix this, open RStudio Server, go to "Tools" (options bar at the top of the screen), select "Global options" from the drop-down menu, select the "Graphics" tab, and change "Backend" to "Cairo".
<br><br>

= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise the reports will be handed in using the peer grade system. We will assign your report to three co-students to provide you with feedback.

'''Important:''' The reports are not as such mandatory, but it is HIGHLY recommended to turn them in, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.

= Lecture plan, autumn 2023 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>

== Block #1: Introduction ==
'''Responsible for this block:''' Lars Rønn Olsen and Rasmus Wernersson
----
=== Lecture 01 (August 31) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/material/22140/W01_Lazebnik_CancerCell2002.pdf PDF])
:'''Exercise:''' [[igraphIntro_Ex_v1|Introduction to working with networks in R]] - '''Answers:''' [[igraphIntro_Answers_v1|Exercise #1 answers]]

=== Lecture 02 (Sep 7) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Lars Rønn Olsen

:'''Slides:''' To appear on DTU Learn
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Nature 2011 ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/material/22140/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/material/22140/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* [[Media:W02_exercises_v7_corrected.pdf|Building protein-protein interaction networks from experimental data]] (solutions now on DTU Learn)
:* [[Media:Exercise_help_sheet.pdf|Note taking sheet for help with ex. 5,7,8,9]] - Consider printing this for taking notes
:* [[Ex_handouts_igraph|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[Ex_handouts_igraph_solution|Exercise #2 answers]]

=== Lecture 03 (Sep 14) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Lars Rønn Olsen
:'''Slides:''' On DTU Learn.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/material/22140/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/material/22140/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Topology/statistics/modules [[ExTopology1_igraph|Network topology and statistics]] - '''Answers''': [[ExTopology1_igraph_solutions|Answers to igraph exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson and Kristoffer Vitting-Seerup
----

=== Lecture 04 (Sep 21) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/material/22140/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points about the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio_R|Yeast cell cycle 1 - introduction to data and methods]] - '''Answers:''' [[ExYeastSysBio_R_answers|Yeast 1 answers]]

=== Lecture 05 (Sept 28) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Rasmus Wernersson
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [https://teaching.healthtech.dtu.dk/material/22140/Bbr002.pdf PDF] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [https://teaching.healthtech.dtu.dk/material/22140/GO_NATURE_GENETICS_2000.pdf PDF] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On DTU Learn

:'''Exercise:''' [[ExGeneOntology_Yeast_R|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast_R_answers|answers]]

=== Lecture 06 (Oct 5) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/691406/View DTU Learn]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/material/22140/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics_R|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' [[ExYeastCellCycle_answers|Answers]]

=== Lecture 07 (Oct 12) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson.
:'''Readings:''' [https://teaching.healthtech.dtu.dk/material/22140/Cyclebase1_2008.pdf PDF Cyclebase paper] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2_R|Mapping temporal expression data onto networks]] '''Answers:''' [[ExYeastCellCycleTranscriptomics2_R_answers|answers]]

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Lars Rønn Olsen, Rasmus Wernersson, and Kristoffer Vitting-Seerup
----

=== Lecture 08 (Oct 26) - Systems Biology in Biomedical Research (Heart diseases) 1 - CANCELLED ===



=== Lecture 09 (Nov 2) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Virtual pulldown and protein complex detection'' - Lars Rønn Olsen and Giorgia Moranzoni
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/Furlong_Cell2012.pdf PDF]) - Concentrate on: '''Figure 1''' and '''Box 3'''
:* '''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Exercises:''' [[DiscoNet|DiscoNet]] - '''Answers:''' [[DiscoNet_answers|DiscoNet answers]]



=== Lecture 10 (Nov 9) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Systems Biology in Cancer'' - Kristoffer Vitting-Seerup
:'''Readings:'''
:* ''The Hallmarks of Cancer'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/704041/View On DTU Learn]
:'''Exercises:''' [[DiscoNet2|Multiomics data integration]] '''Answers:''' [[DiscoNet2_answers|answers]]



=== Lecture 11 (Nov 16) - Essential R functions + exam exercise ===

:'''Lecture:''' Week 10 (and to some extent week 9) exercise walk-through
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' Old exam set (adapted to R)

<br>

=== Lecture 12 (Nov 23) - QnA / AMA ===

:'''Lecture:''' Kristoffer Vitting-Seerup
:'''Topics:''' Anything you would like a refresher on

<br>

=== Lecture 13 (Nov 30) - Systems Biology in Biomedical Research 3 (ZS Revelen framework, drug targets) ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': ''' Network analysis of COVID19 – Intomics''' (PDF on DTU Learn)
:* It's a document with a written explanation of the COVID-19 analysis. (PDF of Intomics web-page, apologies for the sub-optimal formatting)
:* Please read all of it - ''including methods''. It is written for a non-technical audience and should be easy to understand.

:'''Slides:''' To appear on DTU Learn
:'''Exercise:'''
:* PDF on Learn
:'''Link to ZS Revelen:''' '''https://zs-revelen.com/'''
:*'''Please register''' your email with ZS Revelen before the exercise. '''Please use your DTU email.'''
<br>

= Old exam sets =

* On Learn



= Exam =
* '''Date:''' 6/12 2022
* '''Time:''' 15:00-19:00
* '''Where:''' (will be) available via https://eksamensplan.dtu.dk/

IgraphIntro Ex v1

2024-03-05T15:39:36Z

WikiSysop: /* Collecting node attributes */

= Introduction to working with networks in R =

'''Exercise written by:''' Rasmus Wernersson and Lars Rønn Olsen

The purpose is to give a general introduction to the R packages [https://r.igraph.org/ '''igraph'''] and [http://users.dimi.uniud.it/~massimo.franceschet/ns/syllabus/make/ggraph/ggraph.html '''ggraph'''], which we will be using for a large part of the course for:
# Building and storing networks in R
# Visualization / inspection of biological networks
# Data integration of networks and supporting information

== Working with TEXT files ==
As was the case in the prerequisite course ''Introduction to Bioinformatics'' (27611/27622) we will be working a lot with PLAIN TEXT files in this course, to import data into R. For this purpose you'll need a good TEXT EDITOR that can save a file without a lot of formatting information. You can either use the built-in editor in RStudio or something like [https://www.sublimetext.com/ Sublime Text] or [http://www.jedit.org/ jEdit].

[[Image:jEdit_screenshot.png|center]]
If you need a reminder on how to use text editors, you can briefly run through the old [[ExJEdit|jEdit exercise]].

= Example: protein complexes =
In the Systems Biology course we will be working a lot with '''protein-protein interaction''' data (''physical'' interactions between proteins), and we'll start this exercise with a look at how we can represent a simple well-known protein complex in R, and how we can expand our analysis from here.

[[image:Hemoglobin_structure_200px.png|right|frame|Structure of horse hemoglobin (from PDB) - the structure is a TETRAMER consisting of two ALPHA globins and two BETA globins.|link="http://www.pdb.org/pdb/101/motm.do?momID=142"]]
One of the simplest formats for storing graphs is a two-column data frame with a pair of connected proteins in each row (the column names do not matter). For example, the physical interaction between ALPHA and BETA GLOBIN in the HEMOGLOBIN complex could be stated as:

<pre style="overflow:auto;">
# One row per interaction: the first two columns hold the interacting proteins
hemoglobin <- data.frame(from = c("ALPHA_GLOBIN", "ALPHA_GLOBIN", "BETA_GLOBIN"),
                         to   = c("ALPHA_GLOBIN", "BETA_GLOBIN", "BETA_GLOBIN"))
</pre>

Each of the ALPHA and the BETA globins also physically interacts with itself (see the structure for explanation).

== igraph ==
igraph offers many ways to create a graph. The simplest one is the function [https://r.igraph.org/articles/igraph.html#creating-a-graph make_empty_graph], but graphs can also be imported from and exported to a [https://r.igraph.org/articles/igraph.html#igraph-and-the-outside-world variety of file formats]. The [https://r.igraph.org/articles/igraph.html r.igraph website] is a great introductory resource that you are encouraged to explore. The igraph package can do loads more than what is listed on their introductory website, and you are encouraged to use Google to find functions and examples for specialized tasks.

'''TASK: Make simple network in igraph'''
# Login to the RStudio server.
# Load the igraph package
# Make an igraph object from the hemoglobin data frame using the [https://igraph.org/r/doc/graph_from_data_frame.html graph_from_data_frame] function (set "directed = FALSE" - we will explain why you should do this in detail throughout the course)
# Plot the graph object using the base plot function (plot())
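
If you get stuck, the steps above can be sketched roughly as follows. This is a minimal example rather than the official solution; the object name <code>g</code> is our own choice.

<pre style="overflow:auto;">
library(igraph)

# Interaction table: one row per edge
hemoglobin <- data.frame(from = c("ALPHA_GLOBIN", "ALPHA_GLOBIN", "BETA_GLOBIN"),
                         to   = c("ALPHA_GLOBIN", "BETA_GLOBIN", "BETA_GLOBIN"))

# Build an undirected graph; the first two columns are read as node names
g <- graph_from_data_frame(hemoglobin, directed = FALSE)

vcount(g)       # number of nodes
ecount(g)       # number of edges (including self-interactions)
which_loop(g)   # TRUE for edges that connect a node to itself

plot(g)         # base plot method for igraph objects
</pre>

The self-interactions show up as loops on the two nodes.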

When you're done you should have a network that looks similar to the screenshot below:

[[Image:igraph_hemoglobin.png|400px|border]]

* Make sure you understand what the NODES (the circles) and the EDGES (the lines) represent: what is the BIOLOGICAL interpretation of the network?

== DNA Polymerase Delta ==
[[Image:1000px-DNA_replication_en.svg.png|center|thumb|800px|Schematic overview of the eukaryotic replication machinery - notice Polymerase Delta working on the lower DNA strand. Source: Wikipedia]]

Before we move on to the more advanced visualization features of ggraph, we'll introduce a slightly more complex network which we can expand upon as we go along: '''DNA Polymerase Delta (Pol δ)'''. Pol δ has "proofreading" functionality (3'→5' exonuclease activity) and consists of the "''proliferating cell nuclear antigen''" (PCNA), a multi-subunit complex named "''replication factor C''" and the ''polymerase subunit'' itself, which consists of four proteins: POLD1, POLD2, POLD3 and POLD4.

=== Pol δ network ===
We will start out by having a look at the polymerase sub-unit. Since we want to expand the network and add in more information as we go along, we choose to map the proteins to actual '''UniProt identifiers''', which will make it easy to look up additional information later:

<pre style="overflow:auto;">
Gene   Protein
-----  -----------
POLD1  DPOD1_HUMAN
POLD2  DPOD2_HUMAN
POLD3  DPOD3_HUMAN
POLD4  DPOD4_HUMAN
</pre>

UniProt links (for optional browsing):
* http://www.uniprot.org/uniprot/DPOD1_HUMAN
* http://www.uniprot.org/uniprot/DPOD2_HUMAN
* http://www.uniprot.org/uniprot/DPOD3_HUMAN
* http://www.uniprot.org/uniprot/DPOD4_HUMAN

[[Image:igraph_pol_delta.png|300px|right|border]]
'''TASK: create data frame for the polymerase sub-unit interactions:'''
* The subunit is a tetramer consisting of one of each protein.
* Each protein interacts with all other proteins.
* Your igraph network should look similar to (i.e. have the same topology as) the network shown here.
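
Since every protein interacts with every other protein, the edge list is simply all pairwise combinations of the four identifiers. One possible way to generate it, instead of typing each pair by hand, is sketched below (the object names are our own):

<pre style="overflow:auto;">
library(igraph)

proteins <- c("DPOD1_HUMAN", "DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN")

# combn() returns all unordered pairs as columns; transpose to one pair per row
pairs <- t(combn(proteins, 2))
poldelta <- data.frame(from = pairs[, 1], to = pairs[, 2])

g <- graph_from_data_frame(poldelta, directed = FALSE)

ecount(g)    # 4 proteins give 4 * 3 / 2 = 6 interactions
degree(g)    # each protein is connected to the 3 others
</pre>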

===Pol δ node attributes===
In order to visualize graphs in a way that helps us understand and communicate their properties, we can add attributes to both the nodes and the edges of the graph.

For example if you follow the '''UniProt''' links above, you can read a wealth of information about the names, descriptions, biological function and much more for each of the proteins.

'''IMPORTANT NOTE''': in graph theory, nodes can also be referred to as "vertices" (singular: vertex) and this is the convention in igraph.

====Adding attributes to igraph object====
To add node attributes to an igraph object, note that the [https://igraph.org/r/doc/graph_from_data_frame.html graph_from_data_frame function] has an argument "vertices", which allows you to supply node attributes in the form of a data frame when you build the igraph object. Once the igraph object has been made, attributes can be adjusted and retrieved using the functions V() (for vertex attributes) and E() (for edge attributes). Please see the [https://cran.r-project.org/web/packages/igraph/vignettes/igraph.html#setting-and-retrieving-attributes igraph documentation] for information on how this works.

In other words, the simplest way to add node attributes is to create a node attribute data frame. The first column of the data frame is assumed to contain symbolic vertex names; this will be added to the graph as the ‘name’ vertex attribute. Other columns will be added as additional vertex attributes.

For example:

<pre style="overflow:auto;">
UniProtId    GeneID  Catalytic  Description                             AA
DPOD1_HUMAN  PolD1   yes        DNA polymerase delta catalytic subunit  1009
DPOD2_HUMAN  PolD2   no         DNA polymerase delta subunit 2          469
DPOD3_HUMAN  PolD3   no         DNA polymerase delta subunit 3          466
DPOD4_HUMAN  PolD4   no         DNA polymerase delta subunit 4          107
</pre>

As can be seen from the example, 4 categories of information have been added, and each line contains information related to a single protein. The node attributes will be assigned the column names, such that for example "Description" can be retrieved or edited using V(g)$Description. "AA" refers to the length of the amino acid sequence of the protein.
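
As a rough illustration of how the "vertices" argument works, the sketch below builds the node attribute table as a data frame and attaches it when the graph is created. The object names (<code>nodes</code>, <code>edges</code>, <code>g</code>) are our own, and the edge list is generated as all pairwise combinations of the four proteins.

<pre style="overflow:auto;">
library(igraph)

# Node attribute table: the first column must hold the node names
nodes <- data.frame(
  UniProtId   = c("DPOD1_HUMAN", "DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN"),
  GeneID      = c("PolD1", "PolD2", "PolD3", "PolD4"),
  Catalytic   = c("yes", "no", "no", "no"),
  Description = c("DNA polymerase delta catalytic subunit",
                  "DNA polymerase delta subunit 2",
                  "DNA polymerase delta subunit 3",
                  "DNA polymerase delta subunit 4"),
  AA          = c(1009, 469, 466, 107),
  stringsAsFactors = FALSE
)

# All pairwise interactions between the four proteins (column names do not matter)
edges <- as.data.frame(t(combn(nodes$UniProtId, 2)), stringsAsFactors = FALSE)

g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

vertex_attr_names(g)   # "name" plus one attribute per extra column
V(g)$Description       # retrieve an attribute
V(g)$AA                # sequence lengths, in the same order as V(g)$name
</pre>

Attributes can be edited in the same way, by assigning a new vector to e.g. <code>V(g)$GeneID</code>.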

'''TASK: Import the node attribute table into a new pol delta igraph object'''
# Make a data frame of the node attribute table above (do this manually for now - we will later learn how to upload and read data into RStudio)
# Make a new igraph object of the pol delta network and include the node attributes
# Check that this worked using the function "V()"

=== Visualizing the graph using ggraph ===
ggraph (pronounced "g-giraph") is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees. If you have not yet worked with ggplot2, or feel like you need a reminder of how it works, [https://www.rforecology.com/post/a-simple-introduction-to-ggplot2/ here is a good primer]. [https://www.data-imaginist.com/2017/ggraph-introduction-layouts/ This web site] gives an excellent overview of the functionalities of ggraph and serves as a great reference. Take a moment to browse ggraph's functionalities.

'''TASK - use the node annotations for customizing visualization'''
# Plot the pol delta complex using ggraph and the default layout
#* Is this a good way to visualize protein-protein interactions?
# '''Label:'''
#* Modify it to show the GeneIDs (you will have to find out how yourself - HINT: try to search google for "ggraph node labels")
# '''Fill Color:'''
#* Color the nodes based on the "Catalytic" variable in the node attribute table
# '''Node size:'''
#* Make the size of the nodes correspond to the length of the amino acid sequence
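
As a starting point, the three mappings above (label, fill color, and node size) could be combined roughly as in the sketch below. It is one possible solution among many, not the official answer; the object names are our own, and only the attributes needed for the plot are included in the node table.

<pre style="overflow:auto;">
library(igraph)
library(ggraph)   # ggraph builds on ggplot2, which is attached along with it

nodes <- data.frame(
  UniProtId = c("DPOD1_HUMAN", "DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN"),
  GeneID    = c("PolD1", "PolD2", "PolD3", "PolD4"),
  Catalytic = c("yes", "no", "no", "no"),
  AA        = c(1009, 469, 466, 107),
  stringsAsFactors = FALSE
)
edges <- as.data.frame(t(combn(nodes$UniProtId, 2)), stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

p <- ggraph(g, layout = "stress") +
  geom_edge_link(colour = "grey60") +                      # edges as straight lines
  geom_node_point(aes(fill = Catalytic, size = AA),        # node attributes mapped to aesthetics
                  shape = 21) +                            # shape 21 = circle with a fill color
  geom_node_text(aes(label = GeneID), repel = TRUE) +      # label nodes by GeneID
  theme_void()
p
</pre>

Node attributes stored in the igraph object can be referred to directly by name inside <code>aes()</code>, which is what makes the node attribute table so convenient.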

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q1: Paste a screenshot of your nicely decorated network into your report

=== Extended Pol δ network ===
For the final part of the exercise we'll be working with a set of '''experimental data''' centered around the Pol δ complex. Later in the course we will learn a lot of details about how such experimental data is generated, what strengths and weaknesses the different methods have, and how we can address the noise in the data.

For now it is sufficient to note the following:
* The experiment has detected proteins that physically interact with the PolD1-PolD4 complex we have just worked with.
* Both '''stable''' and '''transient''' interactions have been identified.
* The experiment shows '''some''' of the most likely interactions - additional experiments may find more.
* The data '''may''' contain false positives (proteins indicated to interact, although they do not under real biological conditions).

==== Network and layout ====
[[Image:Poldelta_extended_stress_layout.png|right|thumb|350px|ggraph visualization with default (stress) layout. Could this be improved?]]
<pre style="overflow:auto;">
poldelta_extented_interactions <- data.frame(from = c("DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD3_HUMAN", "DPOD3_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN", "DPOD4_HUMAN", "DPOD4_HUMAN", "DPOD4_HUMAN", "PRI1_HUMAN", "WRIP1_HUMAN"), to = c("DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN", "S7A6O_HUMAN", "TREX2_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "WRIP1_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN", "PDIP2_HUMAN", "BACD1_HUMAN", "WRIP1_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "DPOD4_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "WRIP1_HUMAN", "PRI2_HUMAN", "WRIP1_HUMAN"))
</pre>

'''TASK - import the network'''

'''TASK - find a good network layout'''
# Play around with a few other layouts. Think about whether they are good or bad for visualizing protein-protein interactions (for example, do you think that the "linear" layout would be a good visualization for this topology?)

==== Collecting node attributes ====
The next step in the network analysis is to gather a set of useful information about the proteins in the network that can help guide our understanding of the biology behind the network. This information gathering will have two goals:

* To collect NODE ATTRIBUTE information useful for visualization of the network.
* To get an initial understanding of what the roles of the individual proteins may be:
** E.g. biological process, cellular compartment, description, notes about function etc.

[[Image:PolDelta_network_node_info_sheet_2020.PNG|center|thumb|900px|Node attributes in the process of being collected in an Excel sheet]]
There are a number of (semi-)automatic ways to gather such data, but since we're working with a small network here, it's feasible to manually gather the data from a well-respected data source such as UniProt and keep track of it in a spreadsheet that you can then load into R. If you prefer, you can record the info directly into a data frame instead.

'''TASK - gather protein information'''
* We have prepared a partially filled out Excel sheet (see the screenshot above) which will form the basis for the data gathering.
** Download the Excel sheet from HERE: [https://teaching.healthtech.dtu.dk/material/22140/PolDelta_Extended_NodeAttribute_Worksheet.xlsx PolDelta_Extended_NodeAttribute_Worksheet_2020.xlsx] 
* Use the UniProt links below to find the following information (ask the instructor for help if you get stuck):
** Description
** Gene name
** Is the protein known to bind DNA? (+/-)
** Which cellular compartments is the protein known to be located in?
* For the proteins marked as UNCERTAIN with regard to role in replication:
** Can you find any additional information that indicates that they are actually working together with the other proteins in the network?
** (Note: some of them may be in the network due to experimental error.)
* Bonus question: WRIP1_HUMAN has an interaction with itself. Is there a good explanation for this?

'''UniProt links:'''
* http://www.uniprot.org/uniprot/DPOD1_HUMAN
* http://www.uniprot.org/uniprot/DPOD2_HUMAN
* http://www.uniprot.org/uniprot/DPOD3_HUMAN
* http://www.uniprot.org/uniprot/DPOD4_HUMAN
* http://www.uniprot.org/uniprot/BACD1_HUMAN
* http://www.uniprot.org/uniprot/DNA2_HUMAN
* http://www.uniprot.org/uniprot/PDIP2_HUMAN
* http://www.uniprot.org/uniprot/PRI1_HUMAN
* http://www.uniprot.org/uniprot/PRI2_HUMAN
* http://www.uniprot.org/uniprot/S7A6O_HUMAN
* http://www.uniprot.org/uniprot/TREX2_HUMAN
* http://www.uniprot.org/uniprot/WRIP1_HUMAN

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q2: paste a screenshot of the final data frame into your report.

==== Visualizing node attributes ====
The next step is to make an igraph object of the interactions and the node attributes you collected.

'''TASK - visualize the Node attributes'''
* Node label - use GeneID
* Node color - color based on "Role in replication" (invent your own coloring scheme)
* Node shape - pick two shapes to represent whether the protein is known to bind to DNA

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:'''
Q3: Answer the following question: Does it make sense that some of the proteins are not annotated to bind DNA yet are supposed to have a role in DNA replication? (For example DPOD3_HUMAN and DPOD4_HUMAN)

=== Edge attributes ===
As the final part of the exercise we'll include edge attributes to the igraph object. This can be done simply by including additional columns in the interaction data frame. By default, the graph_from_data_frame() function reads the first two columns as node names, and all following columns as attributes of the edges between the nodes.

Each edge in the Pol δ network represents a '''protein-protein''' interaction determined '''experimentally'''. A number of different pieces of information could potentially be associated with each interaction:
* Experimental method used.
* Whether the interaction is stable or transient.
* How much experimental support there is for the interaction (e.g. a single experiment, 3 experiments or 100+ experiments).

Below are the confidence scores of all the interactions in the extended pol delta network. Simply add the vector below to the poldelta_extented_interactions data frame.

<pre style="overflow:auto;">
c(1.00, 1.00, 1.00, 0.18, 1.00, 1.00, 1.00, 1.00, 0.52, 1.00, 1.00, 1.00, 1.00, 0.54, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.57, 1.00, 0.65)
</pre>

You may consider the following interpretation of the score:

* 0.0 - 0.3 : poor experimental support
* 0.3 - 0.9 : "good enough" experimental support
* 0.9 - 1.0 : excellent experimental support

'''TASK - import and visualize network'''
* Make an igraph object with the interactions, edge attributes, and node attributes
* Visualize the network with ggraph, adding some formatting of the edges continuously by the confidence score (color, width, or transparency are good options)
* Make a discrete vector based on the three categories above, add it to the interaction data frame, and rebuild the igraph object. Use three different colors, widths, line types, or whatever else you can come up with to make a visually pleasing and informative visualization.

Which do you prefer - continuous or discrete visualization of line colors?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q4: Document your network(s) by pasting screenshots into your report.

'''FINAL QUESTION - Re-evaluate the three "uncertain" proteins (BACD1_HUMAN, PDIP2_HUMAN, S7A6O_HUMAN)''':
* Consider the following points and make a conclusion based on the ''combined evidence'' on which of the three proteins are likely to be true interaction partners:
** The proteins they are interacting with.
** The experimental support for the interactions.
** Any biological information (any hints, basically) you may have picked up from skimming through the UniProt pages for each of the three proteins.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q5: Briefly describe your considerations, findings, and potential updates.

IgraphIntro Ex v1

2024-03-05T15:37:56Z

WikiSysop: /* Visualizing the graph using ggraph */

= Introduction to working with networks in R =

'''Exercise written by:''' Rasmus Wernersson and Lars Rønn Olsen

The purpose is to give a general introduction to the R packages [https://r.igraph.org/ '''igraph'''] and [http://users.dimi.uniud.it/~massimo.franceschet/ns/syllabus/make/ggraph/ggraph.html '''ggraph'''], which we will be using for a large part of the course for:
# Building and storing networks in R
# Visualization / inspection of biological networks
# Data integration of networks and supporting information

== Working with TEXT files ==
As was the case in the prerequisite course ''Introduction to Bioinformatics'' (27611/27622), we will be working a lot with PLAIN TEXT files in this course to import data into R. For this purpose you'll need a good TEXT EDITOR that can save a file without a lot of formatting information. You can either use the built-in editor in RStudio or something like [https://www.sublimetext.com/ Sublime Text] or [http://www.jedit.org/ jEdit].

[[Image:jEdit_screenshot.png|center]]
If you need a reminder on how to use text editors, you can briefly run through the old [[ExJEdit|jEdit exercise]].

= Example: protein complexes =
In the Systems Biology course we will be working a lot with '''protein-protein interaction''' data (''physical'' interactions between proteins), and we'll start this exercise with a look at how we can represent a simple well-known protein complex in R, and how we can expand our analysis from here.

[[image:Hemoglobin_structure_200px.png|right|frame|Structure of horse hemoglobin (from PDB) - the structure is a TETRAMER consisting of two ALPHA globins and two BETA globins.|link="http://www.pdb.org/pdb/101/motm.do?momID=142"]]
One of the simplest formats for storing graphs is a two-column data frame with a pair of connected proteins in each row (the column names do not matter). For example, the physical interaction between ALPHA and BETA GLOBIN in the HEMOGLOBIN complex could be stated as:

<pre style="overflow:auto;">
hemoglobin <- data.frame(from = c("ALPHA_GLOBIN", "ALPHA_GLOBIN", "BETA_GLOBIN"), to = c("ALPHA_GLOBIN", "BETA_GLOBIN", "BETA_GLOBIN"))
</pre>

Each of the ALPHA and the BETA globins also physically interacts with itself (see the structure for explanation).

== igraph ==
igraph offers many ways to create a graph. The simplest one is the function [https://r.igraph.org/articles/igraph.html#creating-a-graph make_empty_graph], but graphs can also be imported from and exported to a [https://r.igraph.org/articles/igraph.html#igraph-and-the-outside-world variety of file formats]. The [https://r.igraph.org/articles/igraph.html r.igraph website] is a great introductory resource that you are encouraged to explore. The igraph package can do loads more than what is listed on their introductory website, and you are encouraged to use Google to find functions and examples for specialized tasks.

'''TASK: Make a simple network in igraph'''
# Log in to the RStudio server.
# Load the igraph package
# Make an igraph object from the hemoglobin data frame using the [https://igraph.org/r/doc/graph_from_data_frame.html graph_from_data_frame] function (set "directed = FALSE" - we will explain why you should do this in detail throughout the course)
# Plot the graph object using the base plot function (plot())

When you're done you should have a network that looks similar to the screenshot below:

[[Image:igraph_hemoglobin.png|400px|border]]

* Make sure you understand what the NODES (the circles) and the EDGES (the lines) represent: what is the BIOLOGICAL interpretation of the network?
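The steps in the task above can be sketched on a toy example. This is only a minimal illustration, assuming igraph is installed; the protein names "A" and "B" are hypothetical placeholders and not part of the exercise data.

```r
library(igraph)

# Hypothetical toy complex: two proteins that bind each other,
# and each of them also binds a copy of itself (self-loops)
toy <- data.frame(from = c("A", "A", "B"),
                  to   = c("A", "B", "B"),
                  stringsAsFactors = FALSE)

# directed = FALSE: a physical interaction has no direction
g <- graph_from_data_frame(toy, directed = FALSE)

vcount(g)  # number of nodes
ecount(g)  # number of edges, self-loops included

# Base plot of the graph object
plot(g)
```

Each row of the data frame becomes one edge, and every distinct name in the first two columns becomes one node.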

== DNA Polymerase Delta ==
[[Image:1000px-DNA_replication_en.svg.png|center|thumb|800px|Schematic overview of the eukaryotic replication machinery - notice Polymerase Delta working on the lower DNA strand. Source: Wikipedia]]

Before we move on to the more advanced visualization features of ggraph, we'll introduce a slightly more complex network which we can expand upon as we go along: '''DNA Polymerase Delta (Pol δ)'''. Pol δ has "proofreading" functionality (3'→5' exonuclease activity) and consists of the "''proliferating cell nuclear antigen''" (PCNA), a multi-subunit complex named "''replication factor C''" and the ''polymerase subunit'' itself, which consists of four proteins: POLD1, POLD2, POLD3 and POLD4.

=== Pol δ network ===
We will start out with having a look at the polymerase sub-unit. Since we want to expand the network and add in more information as we go along, we choose to map the proteins to actual '''UniProt identifiers''', which will make it easy to look up additional information as we go along:

 Gene   Protein
 ----   -------
 POLD1  DPOD1_HUMAN
 POLD2  DPOD2_HUMAN
 POLD3  DPOD3_HUMAN
 POLD4  DPOD4_HUMAN

UniProt links (for optional browsing):
* http://www.uniprot.org/uniprot/DPOD1_HUMAN
* http://www.uniprot.org/uniprot/DPOD2_HUMAN
* http://www.uniprot.org/uniprot/DPOD3_HUMAN
* http://www.uniprot.org/uniprot/DPOD4_HUMAN

[[Image:igraph_pol_delta.png|300px|right|border]]
'''TASK: create data frame for the polymerase sub-unit interactions:'''
* The subunit is a tetramer consisting of one of each protein.
* Each protein interacts with all other proteins.
* Your igraph network should look similar to (have the same topology as) the network shown here.

===Pol δ node attributes===
To visualize graphs in a way that helps us understand and communicate their properties, we can add attributes to both the nodes and the edges of the graph.

For example if you follow the '''UniProt''' links above, you can read a wealth of information about the names, descriptions, biological function and much more for each of the proteins.

'''IMPORTANT NOTE''': in graph theory, nodes can also be referred to as "vertices" (singular: vertex) and this is the convention in igraph.

====Adding attributes to igraph object====
To add node attributes to an igraph object, note that the [https://igraph.org/r/doc/graph_from_data_frame.html graph_from_data_frame function] has an argument "vertices", which allows you to supply node attributes in the form of a data frame when you build the igraph object. Once the igraph object is made, attributes can be retrieved and adjusted using the functions V() (for vertex attributes) and E() (for edge attributes). Please see the [https://cran.r-project.org/web/packages/igraph/vignettes/igraph.html#setting-and-retrieving-attributes igraph documentation] for information on how this works.

In other words, the simplest way to add node attributes is to create a node attribute data frame. The first column of the data frame is assumed to contain symbolic vertex names, which are added to the graph as the ‘name’ vertex attribute. The other columns are added as additional vertex attributes.

For example:

<pre style="overflow:auto;">
UniProtId    GeneID  Catalytic  Description                             AA
DPOD1_HUMAN  PolD1   yes        DNA polymerase delta catalytic subunit  1009
DPOD2_HUMAN  PolD2   no         DNA polymerase delta subunit 2          469
DPOD3_HUMAN  PolD3   no         DNA polymerase delta subunit 3          466
DPOD4_HUMAN  PolD4   no         DNA polymerase delta subunit 4          107
</pre>

As can be seen from the example, 4 categories of information have been added, and each line contains information related to a single protein. The node attributes will be assigned the column names, such that for example "Description" can be retrieved or edited using V(g)$Description. "AA" refers to the length of the amino acid sequence of the protein.
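The mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions: the node names "P1" to "P3" and the attribute columns "kind" and "aa" are hypothetical and only serve to show how the "vertices" argument and V() work together.

```r
library(igraph)

# Hypothetical interactions
edges <- data.frame(from = c("P1", "P1"),
                    to   = c("P2", "P3"),
                    stringsAsFactors = FALSE)

# Hypothetical node attribute table: the first column must hold the node names
nodes <- data.frame(id   = c("P1", "P2", "P3"),
                    kind = c("enzyme", "scaffold", "scaffold"),
                    aa   = c(300L, 120L, 95L),
                    stringsAsFactors = FALSE)

# The node table is passed through the "vertices" argument
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

vertex_attr_names(g)  # "name" plus the remaining columns of the node table
V(g)$kind             # retrieve one attribute for all nodes

# Edit an attribute for a single node after the graph has been built
V(g)$aa[V(g)$name == "P3"] <- 96L
```

Note that the column holding the node names ends up as the attribute "name" regardless of what it was called in the data frame.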

'''TASK: Import the node attribute table into a new pol delta igraph object'''
# Make a data frame of the node attribute table above (do this manually for now - we will later learn how to upload and read data into RStudio)
# Make a new igraph object of the pol delta network and include the node attributes
# Check that this worked using the function "V()"

=== Visualizing the graph using ggraph ===
ggraph (pronounced "g-giraph") is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees. If you have not yet worked with ggplot2, or feel like you need a reminder of how it works, [https://www.rforecology.com/post/a-simple-introduction-to-ggplot2/ here is a good primer]. [https://www.data-imaginist.com/2017/ggraph-introduction-layouts/ This web site] gives an excellent overview of the functionalities of ggraph and serves as a great reference. Take a moment to browse ggraph's functionalities.
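As a rough sketch of how a ggraph plot is layered, the example below draws a small hypothetical network. It assumes the igraph and ggraph packages are installed; the node names and the attributes "kind" and "aa" are made up for illustration and are not the exercise data.

```r
library(igraph)
library(ggraph)

# Hypothetical toy network with two node attributes
edges <- data.frame(from = c("P1", "P1", "P2"),
                    to   = c("P2", "P3", "P3"),
                    stringsAsFactors = FALSE)
nodes <- data.frame(id   = c("P1", "P2", "P3"),
                    kind = c("enzyme", "scaffold", "scaffold"),
                    aa   = c(300L, 120L, 95L),
                    stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Layers are added with "+", as in ggplot2
p <- ggraph(g, layout = "stress") +
  geom_edge_link() +                                # edges as straight lines
  geom_node_point(aes(colour = kind, size = aa)) +  # node attributes mapped to aesthetics
  geom_node_text(aes(label = name), vjust = -1)     # label placed above each node
p
```

Node attributes stored in the igraph object can be referred to by name inside aes(), in the same way as columns of a data frame in ggplot2.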

'''TASK - use the node annotations for customizing visualization'''
# Plot the pol delta complex using ggraph and the default layout
#* Is this a good way to visualize protein-protein interactions?
# '''Label:'''
#* Modify it to show the GeneIDs (you will have to find out how yourself - HINT: try searching Google for "ggraph node labels")
# '''Fill Color:'''
#* Color the nodes based on the "Catalytic" variable in the node attributes
# '''Node size:'''
#* Make the size of the nodes correspond to the length of the amino acid sequence

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q1: Paste a screenshot of your nicely decorated network into your report.

=== Extended Pol δ network ===
For the final part of the exercise we'll be working with a set of '''experimental data''' centered around the Pol δ complex. Later in the course we will learn a lot of details about how such experimental data is generated, what strengths and weaknesses the different methods have, and how we can address the noise in the data.

For now it is sufficient to note the following:
* The experiment has detected proteins that physically interact with the PolD1-PolD4 complex we have just worked with.
* Both '''stable''' and '''transient''' interactions have been identified.
* The experiment shows '''some''' of the most likely interactions - additional experiments may find more.
* The data '''may''' contain false positives (proteins indicated to interact, while that is not true under real biological conditions).

==== Network and layout ====
[[Image:Poldelta_extended_stress_layout.png|right|thumb|350px|ggraph visualization with default (stress) layout. Could this be improved?]]
<pre style="overflow:auto;">
poldelta_extented_interactions <- data.frame(from = c("DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD1_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD3_HUMAN", "DPOD3_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN", "DPOD4_HUMAN", "DPOD4_HUMAN", "DPOD4_HUMAN", "PRI1_HUMAN", "WRIP1_HUMAN"), to = c("DPOD2_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN", "S7A6O_HUMAN", "TREX2_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "WRIP1_HUMAN", "DPOD3_HUMAN", "DPOD4_HUMAN", "PDIP2_HUMAN", "BACD1_HUMAN", "WRIP1_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "DPOD4_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "DNA2L_HUMAN", "PRI1_HUMAN", "PRI2_HUMAN", "WRIP1_HUMAN", "PRI2_HUMAN", "WRIP1_HUMAN"))
</pre>

'''TASK - import the network'''

'''TASK - find a good network layout'''
# Play around with a few other layouts. Think about whether they are good or bad for visualizing protein-protein interactions (for example, do you think that the "linear" layout would be a good visualization for this topology?)
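One possible way to compare layouts is to build the same plot once per layout name. The sketch below assumes igraph and ggraph are installed and uses a small hypothetical network; the layout names "stress", "kk", "circle" and "linear" are only a selection of those ggraph accepts.

```r
library(igraph)
library(ggraph)

# Hypothetical toy network
edges <- data.frame(from = c("P1", "P1", "P2", "P3"),
                    to   = c("P2", "P3", "P3", "P4"),
                    stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = FALSE)

# A selection of layout names to try
layouts <- c("stress", "kk", "circle", "linear")

# One plot per layout, titled with the layout name
plots <- lapply(layouts, function(l) {
  ggraph(g, layout = l) +
    geom_edge_link() +
    geom_node_point(size = 4) +
    geom_node_text(aes(label = name), vjust = -1) +
    ggtitle(l)
})

plots[[3]]  # show the "circle" version
```

Only the layout argument changes between the plots, which makes it easy to judge the effect of the node placement alone.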

==== Collecting node attributes ====
The next step in the network analysis is to gather a set of useful information about the proteins in the network that can help guide our understanding of the biology behind the network. This information gathering will have two goals:

* To collect NODE ATTRIBUTE information useful for visualization of the network.
* To get an initial understanding of what the roles of the individual proteins may be:
** E.g. biological process, cellular compartment, description, notes about function etc.

[[Image:PolDelta_network_node_info_sheet_2020.PNG|center|thumb|900px|Node attributes in the process of being collected in an Excel sheet]]
There are a number of (semi-)automatic ways to gather such data, but since we're working with a small network here, it's feasible to gather the data manually from a well-respected data source such as UniProt and keep track of it in a spreadsheet that you can then load into R. If you prefer, you can also record the information directly in a data frame.

'''TASK - gather protein information'''
* We have prepared a partially filled out Excel sheet (see the screenshot above) which will form the basis for the data gathering.
** Download the Excel sheet from HERE: [https://teaching.healthtech.dtu.dk/27040/exercises/PolDelta_Extended_NodeAttribute_Worksheet.xlsx PolDelta_Extended_NodeAttribute_Worksheet_2020.xlsx] 
* Use the UniProt links below to find the following information (ask the instructor for help if you get stuck):
** Description
** Gene name
** Is the protein known to bind DNA? (+/-)
** Which cellular compartments is the protein known to be located in?
* For the proteins marked as UNCERTAIN with regard to role in replication:
** Can you find any additional information that indicates that they are actually working together with the other proteins in the network?
** (Note: some of them may be in the network due to experimental error.)
* Bonus question: WRIP1_HUMAN has an interaction with itself. Is there a good explanation for this?

'''UniProt links:'''
* http://www.uniprot.org/uniprot/DPOD1_HUMAN
* http://www.uniprot.org/uniprot/DPOD2_HUMAN
* http://www.uniprot.org/uniprot/DPOD3_HUMAN
* http://www.uniprot.org/uniprot/DPOD4_HUMAN
* http://www.uniprot.org/uniprot/BACD1_HUMAN
* http://www.uniprot.org/uniprot/DNA2_HUMAN
* http://www.uniprot.org/uniprot/PDIP2_HUMAN
* http://www.uniprot.org/uniprot/PRI1_HUMAN
* http://www.uniprot.org/uniprot/PRI2_HUMAN
* http://www.uniprot.org/uniprot/S7A6O_HUMAN
* http://www.uniprot.org/uniprot/TREX2_HUMAN
* http://www.uniprot.org/uniprot/WRIP1_HUMAN

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q2: Paste a screenshot of the final data frame into your report.

==== Visualizing node attributes ====
The next step is to make an igraph object of the interactions and the node attributes you collected.

'''TASK - visualize the Node attributes'''
* Node label - use GeneID
* Node color - color based on "Role in replication" (invent your own coloring scheme)
* Node shape - pick two shapes to represent whether the protein is known to bind to DNA

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:'''
Q3: Answer the following question: Does it make sense that some of the proteins are not annotated to bind DNA yet are supposed to have a role in DNA replication? (For example DPOD3_HUMAN and DPOD4_HUMAN)

=== Edge attributes ===
As the final part of the exercise we'll add edge attributes to the igraph object. This can be done simply by including additional columns in the interaction data frame. By default, the graph_from_data_frame() function reads the first two columns as node names, and all following columns as attributes of the edges between the nodes.
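The behaviour described above can be sketched on a toy data frame. This is a minimal illustration assuming igraph is installed; the node names and the columns "method" and "n_experiments" are hypothetical examples of edge information.

```r
library(igraph)

# Hypothetical interactions: columns 1-2 are node names,
# every further column becomes an edge attribute
edges <- data.frame(from          = c("P1", "P1", "P2"),
                    to            = c("P2", "P3", "P3"),
                    method        = c("Y2H", "AP-MS", "Y2H"),
                    n_experiments = c(1L, 3L, 12L),
                    stringsAsFactors = FALSE)

g <- graph_from_data_frame(edges, directed = FALSE)

edge_attr_names(g)   # names of the edge attributes
E(g)$method          # one value per edge, in the row order of the data frame
E(g)$n_experiments
```

A new column added to the data frame (for example with edges$confidence <- ...) only shows up in the graph after the igraph object has been built again.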

Each edge in the Pol δ network represents a '''protein-protein''' interaction determined '''experimentally'''. A number of different pieces of information could potentially be associated with each interaction:
* Experimental method used.
* Whether the interaction is stable or transient.
* How much experimental support there is for the interaction (e.g. a single experiment, 3 experiments or 100+ experiments).

Below are the confidence scores of all the interactions in the extended pol delta network. Simply add the vector below to the poldelta_extented_interactions data frame.

<pre style="overflow:auto;">
c(1.00, 1.00, 1.00, 0.18, 1.00, 1.00, 1.00, 1.00, 0.52, 1.00, 1.00, 1.00, 1.00, 0.54, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.57, 1.00, 0.65)
</pre>

You may consider the following interpretation of the score:

* 0.0 - 0.3 : poor experimental support
* 0.3 - 0.9 : "good enough" experimental support
* 0.9 - 1.0 : excellent experimental support
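One way to turn continuous scores into the three categories is the base R function cut(). The sketch below uses a short hypothetical score vector, not the exercise data. Since the listed ranges share their boundary values, the code assumes that a score exactly on a boundary (0.3 or 0.9) belongs to the lower category; this is an assumption, not something stated in the exercise.

```r
# Hypothetical confidence scores
scores <- c(0.18, 0.52, 1.00, 0.30, 0.95)

# Intervals are [0, 0.3], (0.3, 0.9] and (0.9, 1]:
# a boundary value is assigned to the lower category (assumption)
support <- cut(scores,
               breaks = c(0, 0.3, 0.9, 1),
               labels = c("poor", "good enough", "excellent"),
               include.lowest = TRUE)

as.character(support)
table(support)
```

The result is a factor with one category per score, which can be added as a new column to an interaction data frame.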

'''TASK - import and visualize network'''
* Make an igraph object with the interactions, edge attributes, and node attributes
* Visualize the network with ggraph, adding some formatting of the edges continuously by the confidence score (color, width, or transparency are good options)
* Make a discrete vector based on the three categories above, add it to the interaction data frame, and rebuild the igraph object. Use three different colors, widths, line types, or whatever else you can come up with to make a visually pleasing and informative visualization.
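The two styles of edge formatting mentioned in the task can be sketched as below. This assumes igraph and ggraph are installed and uses a hypothetical four-protein network with made-up scores; the column names "confidence" and "support" are illustrative choices, and boundary scores are assigned to the lower category as an assumption.

```r
library(igraph)
library(ggraph)

# Hypothetical interactions with a continuous confidence score
edges <- data.frame(from       = c("P1", "P1", "P2", "P3"),
                    to         = c("P2", "P3", "P3", "P4"),
                    confidence = c(1.00, 0.52, 0.18, 0.95),
                    stringsAsFactors = FALSE)

# Discrete version of the score, stored as text
edges$support <- as.character(cut(edges$confidence,
                                  breaks = c(0, 0.3, 0.9, 1),
                                  labels = c("poor", "good enough", "excellent"),
                                  include.lowest = TRUE))

g <- graph_from_data_frame(edges, directed = FALSE)

# Continuous mapping: transparency and width follow the score
p_continuous <- ggraph(g, layout = "stress") +
  geom_edge_link(aes(edge_alpha = confidence, edge_width = confidence)) +
  geom_node_point(size = 4) +
  geom_node_text(aes(label = name), vjust = -1)

# Discrete mapping: colour and line type follow the category
p_discrete <- ggraph(g, layout = "stress") +
  geom_edge_link(aes(edge_colour = support, edge_linetype = support)) +
  geom_node_point(size = 4) +
  geom_node_text(aes(label = name), vjust = -1)

p_continuous
p_discrete
```

Edge attributes are mapped inside geom_edge_link(), while node attributes are mapped inside the node geoms.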

Which do you prefer - continuous or discrete visualization of line colors?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q4: Document your network(s) by pasting screenshots into your report.

'''FINAL QUESTION - Re-evaluate the three "uncertain" proteins (BACD1_HUMAN, PDIP2_HUMAN, S7A6O_HUMAN)''':
* Consider the following points and make a conclusion based on the ''combined evidence'' on which of the three proteins are likely to be true interaction partners:
** The proteins they are interacting with.
** The experimental support for the interactions.
** Any biological information (any hints, basically) you may have picked up from skimming through the UniProt pages for each of the three proteins.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Report:''' Q5: Briefly describe your considerations, findings, and potential updates.

Autumn2023

2024-03-05T15:35:21Z

WikiSysop: /* Lecture 01 (August 31) - Intro 1 */

= Course 22140 - plan for autumn 2023 =

'''Teachers:'''
* Lars Rønn Olsen (course organizer) - '''contact:''' [mailto:lronn@dtu.dk lronn@dtu.dk]
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* Rasmus Wernersson (external lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Hanxi Li (teaching assistant) - '''contact:''' [mailto:hanxli@dtu.dk hanxli@dtu.dk]



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/167355 Course 22140, Autumn 2022 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge about basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 22111 or one of its variants). If you need to read up on some bioinformatics topics, please use the links below.
* [https://teaching.healthtech.dtu.dk/22111/ Course 22111] - ''Introduction to Bioinformatics''
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= R =
For the computer exercises we will be using R to process, analyze, and visualize the biological networks. R is Open Source and freely available for Windows, Mac and Linux. We will be using an RStudio Server cloud solution to make sure that everyone uses the same version of R and the needed packages. You can log in with your DTU credentials [https://teaching.healthtech.dtu.dk/22140/rstudio.php here].

'''NOTE''': In order to produce plots with RStudio Server, you need to have the appropriate graphics device activated. If you have X11 installed, this should work without any further action. If you do not, you will get an error whenever you try to plot anything. To fix this, open RStudio Server, go to "Tools" (options bar at the top of the screen), select "Global options" from the drop-down menu, select the "Graphics" tab, and change "Backend" to "Cairo".
<br><br>

= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise the reports will be handed in using the peer grade system. We will assign your report to three co-students to provide you with feedback.

'''Important:''' The reports are not as such mandatory, but it is HIGHLY recommended to turn them in, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.

= Lecture plan, autumn 2023 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>

== Block #1: Introduction ==
'''Responsible for this block:''' Lars Rønn Olsen and Rasmus Wernersson
----
=== Lecture 01 (August 31) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/material/22140/W01_Lazebnik_CancerCell2002.pdf PDF])
:'''Exercise:''' [[igraphIntro_Ex_v1|Introduction to working with networks in R]] - '''Answers:''' [[igraphIntro_Answers_v1|Exercise #1 answers]]

=== Lecture 02 (Sep 7) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Lars Rønn Olsen

:'''Slides:''' To appear on DTU Learn
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Cell 2011 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* [[Media:W02_exercises_v7_corrected.pdf|Building protein-protein interaction networks from experimental data]] (solutions now on DTU Learn)
:* [[Media:Exercise_help_sheet.pdf|Note taking sheet for help with ex. 5,7,8,9]] - Consider printing this for taking notes
:* [[Ex_handouts_igraph|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[Ex_handouts_igraph_solution|Exercise #2 answers]]

=== Lecture 03 (Sep 14) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Lars Rønn Olsen
:'''Slides:''' On DTU Learn.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/material/22140/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/material/22140/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Topology/statistics/modules [[ExTopology1_igraph|Network topology and statistics]] - '''Answers''': [[ExTopology1_igraph_solutions|Answers to igraph exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson and Kristoffer Vitting-Seerup
----

=== Lecture 04 (Sep 21) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/material/22140/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points about the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio_R|Yeast cell cycle 1 - introduction to data and methods]] - '''Answers:''' [[ExYeastSysBio_R_answers|Yeast 1 answers]]

=== Lecture 05 (Sept 28) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Rasmus Wernersson
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [https://teaching.healthtech.dtu.dk/material/22140/Bbr002.pdf PDF] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [https://teaching.healthtech.dtu.dk/material/22140/GO_NATURE_GENETICS_2000.pdf PDF] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On DTU Learn

:'''Exercise:''' [[ExGeneOntology_Yeast_R|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast_R_answers|answers]]

=== Lecture 06 (Oct 5) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/691406/View DTU Learn]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/material/22140/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics_R|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' [[ExYeastCellCycle_answers|Answers]]

=== Lecture 07 (Oct 12) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson.
:'''Readings:''' [https://teaching.healthtech.dtu.dk/material/22140/Cyclebase1_2008.pdf PDF Cyclebase paper] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2_R|Mapping temporal expression data onto networks]] '''Answers:''' [[ExYeastCellCycleTranscriptomics2_R_answers|answers]]

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Lars Rønn Olsen, Rasmus Wernersson, and Kristoffer Vitting-Seerup
----

=== Lecture 08 (Oct 26) - Systems Biology in Biomedical Research (Heart diseases) 1 - CANCELLED ===



=== Lecture 09 (Nov 2) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Virtual pulldown and protein complex detection'' - Lars Rønn Olsen and Giorgia Moranzoni
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/Furlong_Cell2012.pdf PDF]) - Concentrate on: '''Figure 1''' and '''Box 3'''
:* '''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Exercises:''' [[DiscoNet|DiscoNet]] - '''Answers:''' [[DiscoNet_answers|DiscoNet answers]]



=== Lecture 10 (Nov 9) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Systems Biology in Cancer'' - Kristoffer Vitting-Seerup
:'''Readings:'''
:* ''The Hallmarks of Cancer'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/704041/View On DTU Learn]
:'''Exercises:''' [[DiscoNet2|Multiomics data integration]] '''Answers:''' [[DiscoNet2_answers|answers]]



=== Lecture 11 (Nov 16) - Essential R functions + exam exercise ===

:'''Lecture:''' Week 10 (and to some extent week 9) exercise walk-through
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' Old exam set (adapted to R)

<br>

=== Lecture 12 (Nov 23) - QnA / AMA ===

:'''Lecture:''' Kristoffer Vitting-Seerup
:'''Topics:''' Anything you would like a refresher on

<br>

=== Lecture 13 (Nov 30) - Systems Biology in Biomedical Research 3 (ZS Revelen framework, drug targets) ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': ''' Network analysis of COVID19 – Intomics''' (PDF on DTU Learn)
:* It's a document with a written explanation of the COVID-19 analysis. (PDF of Intomics web-page, apologies for the sub-optimal formatting)
:* Please read all of it - ''including methods''. It is written for a non-technical audience and should be easy to understand.

:'''Slides:''' To appear on DTU Learn
:'''Exercise:'''
:* PDF on Learn
:'''Link to ZS Revelen:''' '''https://zs-revelen.com/'''
:*'''Please register''' your email with ZS Revelen before the exercise. '''Please use your DTU email.'''
<br>

= Old exam sets =

* On Learn



= Exam =
* '''Date:''' 6/12 2022
* '''Time:''' 15:00-19:00
* '''Where:''' (will be) available via https://eksamensplan.dtu.dk/

Autumn2023

2024-03-05T15:33:41Z

WikiSysop: /* Lecture 07 (Oct 12) - Yeast Systems Biology 4 */

= Course 22140 - plan for autumn 2023 =

'''Teachers:'''
* Lars Rønn Olsen (course organizer) - '''contact:''' [mailto:lronn@dtu.dk lronn@dtu.dk]
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* Rasmus Wernersson (external lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Hanxi Li (teaching assistant) - '''contact:''' [mailto:hanxli@dtu.dk hanxli@dtu.dk]



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/167355 Course 22140, Autumn 2022 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge about basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 22111 or one of its variants). If you need to read up on some bioinformatics topics, please use the links below.
* [https://teaching.healthtech.dtu.dk/22111/ Course 22111] - ''Introduction to Bioinformatics''
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= R =
For the computer exercises we will be using R to process, analyze, and visualize the biological networks. R is Open Source and freely available for Windows, Mac and Linux. We will be using an RStudio Server cloud solution to make sure that everyone uses the same version of R and the needed packages. You can log in with your DTU credentials [https://teaching.healthtech.dtu.dk/22140/rstudio.php here].

'''NOTE''': In order to produce plots with RStudio Server, you need to have the appropriate graphics device activated. If you have X11 installed, this should work without any further action. If you do not, you will get an error whenever you try to plot anything. To fix this, open RStudio Server, go to "Tools" (menu bar at the top of the screen), select "Global Options" from the drop-down menu, select the "Graphics" tab, and change "Backend" to "Cairo".
<br><br>

= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise, the reports will be handed in using the peer grade system. We will assign your report to three fellow students, who will provide you with feedback.

'''Important:''' The reports are not strictly mandatory, but it is HIGHLY recommended to turn them in, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.

= Lecture plan, autumn 2023 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>

== Block #1: Introduction ==
'''Responsible for this block:''' Lars Rønn Olsen and Rasmus Wernersson
----
=== Lecture 01 (August 31) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W01_Lazebnik_CancerCell2002.pdf PDF])
:'''Exercise:''' [[igraphIntro_Ex_v1|Introduction to working with networks in R]] - '''Answers:''' [[igraphIntro_Answers_v1|Exercise #1 answers]]

=== Lecture 02 (Sep 7) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Lars Rønn Olsen

:'''Slides:''' To appear on DTU Learn
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Cell 2011 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* [[Media:W02_exercises_v7_corrected.pdf|Building protein-protein interaction networks from experimental data]] (solutions now on DTU Learn)
:* [[Media:Exercise_help_sheet.pdf|Note taking sheet for help with ex. 5,7,8,9]] - Consider printing this for taking notes
:* [[Ex_handouts_igraph|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[Ex_handouts_igraph_solution|Exercise #2 answers]]

=== Lecture 03 (Sep 14) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Lars Rønn Olsen
:'''Slides:''' On DTU Learn.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/material/22140/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/material/22140/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Topology/statistics/modules [[ExTopology1_igraph|Network topology and statistics]] - '''Answers''': [[ExTopology1_igraph_solutions|Answers to igraph exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson and Kristoffer Vitting-Seerup
----

=== Lecture 04 (Sep 21) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/material/22140/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points of the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio_R|Yeast cell cycle 1 - introduction to data and methods]] - '''Answers:''' [[ExYeastSysBio_R_answers|Yeast 1 answers]]

=== Lecture 05 (Sep 28) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Rasmus Wernersson
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [https://teaching.healthtech.dtu.dk/material/22140/Bbr002.pdf PDF] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [https://teaching.healthtech.dtu.dk/material/22140/GO_NATURE_GENETICS_2000.pdf PDF] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On DTU Learn

:'''Exercise:''' [[ExGeneOntology_Yeast_R|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast_R_answers|answers]]

=== Lecture 06 (Oct 5) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/691406/View DTU Learn]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/material/22140/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics_R|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' [[ExYeastCellCycle_answers|Answers]]

=== Lecture 07 (Oct 12) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson.
:'''Readings:''' [https://teaching.healthtech.dtu.dk/material/22140/Cyclebase1_2008.pdf PDF Cyclebase paper] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2_R|Mapping temporal expression data onto networks]] '''Answers:''' [[ExYeastCellCycleTranscriptomics2_R_answers|answers]]

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Lars Rønn Olsen, Rasmus Wernersson, and Kristoffer Vitting-Seerup
----

=== Lecture 08 (Oct 26) - Systems Biology in Biomedical Research (Heart diseases) 1 - CANCELLED ===



=== Lecture 09 (Nov 2) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Virtual pulldown and protein complex detection'' - Lars Rønn Olsen and Giorgia Moranzoni
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/Furlong_Cell2012.pdf PDF]) - Concentrate on: '''Figure 1''' and '''Box 3'''
:* '''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Exercises:''' [[DiscoNet|DiscoNet]] - '''Answers:''' [[DiscoNet_answers|DiscoNet answers]]



=== Lecture 10 (Nov 9) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Systems Biology in Cancer'' - Kristoffer Vitting-Seerup
:'''Readings:'''
:* ''The Hallmarks of Cancer'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/704041/View On DTU Learn]
:'''Exercises:''' [[DiscoNet2|Multiomics data integration]] '''Answers:''' [[DiscoNet2_answers|answers]]



=== Lecture 11 (Nov 16) - Essential R functions + exam exercise ===

:'''Lecture:''' Week 10 (and to some extent week 9) exercise walk-through
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' Old exam set (adapted to R)

<br>

=== Lecture 12 (Nov 23) - QnA / AMA ===

:'''Lecture:''' Kristoffer Vitting-Seerup
:'''Topics:''' Anything you would like a refresher on

<br>

=== Lecture 13 (Nov 30) - Systems Biology in Biomedical Research 3 (ZS Revelen framework, drug targets) ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': ''' Network analysis of COVID19 – Intomics''' (PDF on DTU Learn)
:* It's a document with a written explanation of the COVID-19 analysis. (PDF of Intomics web-page, apologies for the sub-optimal formatting)
:* Please read all of it - ''including methods''. It is written for a non-technical audience and should be easy to understand.

:'''Slides:''' To appear on DTU Learn
:'''Exercise:'''
:* PDF on Learn
:'''Link to ZS Revelen:''' '''https://zs-revelen.com/'''
:*'''Please register''' your email with ZS Revelen before the exercise. '''Please use your DTU email.'''
<br>

= Old exam sets =

* On Learn



= Exam =
* '''Date:''' 6/12 2022
* '''Time:''' 15:00-19:00
* '''Where:''' (will be) available via https://eksamensplan.dtu.dk/

ExYeastCellCycle answers

2024-03-05T15:30:39Z

WikiSysop: Created page with "===Question 1=== Q:How much RNA ("total RNA") was used in the experiment? A: 50 ug of total RNA (found on the Protocols page). Q: Which type/brand of array was used? A: Affymetrix GeneChip Yeast Genome 2.0 Array (found on the main page). Q: How many individual arrays were used in the study? A: 2 arrays (found both on the main page and the samples page) Q: IMPORTANT: note down which arrays were used for CONTROL (asynchronous; "mock treated") and which were use for CAS..."

===Question 1===

Q: How much RNA ("total RNA") was used in the experiment?
A: 50 ug of total RNA (found on the Protocols page).

Q: Which type/brand of array was used?
A: Affymetrix GeneChip Yeast Genome 2.0 Array (found on the main page).

Q: How many individual arrays were used in the study?
A: 2 arrays (found both on the main page and the samples page)

Q: IMPORTANT: note down which arrays were used for CONTROL (asynchronous; "mock treated") and which were used for CASE (arrested cells)
A: (answers found on the samples page):
CONTROL = GSM287991: mock-treated / asynchronous
CASE = GSM287992: alpha factor arrested cells

===Question 2===

<pre>
library(ggplot2)
ggplot(expr, aes(x = GSM287992, y = GSM287991)) +
  geom_point()
</pre>

Notice that the dots fall on the diagonal, and that the scales on X and Y are similar. The base assumption is that MOST of the genes do not vary between the two different conditions, and this is what we see here.

It is also this underlying assumption that makes it possible to normalize the data in the first place (to account for technical noise, e.g. slightly different amounts of cDNA on each array, slight array-to-array variation etc.). In conclusion, the data look comparable.

The expression values span 4 orders of magnitude, with the largest value close to 40,000. As most of the values are in the lower range (as indicated on the plot), it is difficult to see much detail in a plot with these dimensions.

===Question 3===

Plot after Log2 transformation of all the expression data:

<pre>
ggplot(expr, aes(x = log2(GSM287992), y = log2(GSM287991))) +
  geom_point()
</pre>

===Question 4===

(Theoretical question)

RATIO = CASE/CONTROL (we only consider values >0 for both CASE and CONTROL)

If CASE < CONTROL the RATIO will fall in the interval ]0;1[
If CASE > CONTROL the RATIO will fall in the interval ]1;inf[

The problem here is that these intervals are very far from being comparable. Up- and down-regulation of a given gene will be on very different scales.

===Question 5===

The trick is simply that we can use the Log2 function to transform the RATIOs.

Notice:
Log2(1) = 0
Log2(x) -> -inf as x -> 0
Log2(x) -> +inf as x -> inf

Compared to the question above, we will now have the following intervals after transformation:

CASE < CONTROL: Log2(RATIO) will fall in the interval ]-inf;0[
CASE > CONTROL: Log2(RATIO) will fall in the interval ]0;inf[

Examples:

raw ratio: 7.1/2.3 = 3.07, log2 ratio: log2(7.1/2.3) = 1.63
raw ratio: 2.3/7.1 = 0.32, log2 ratio: log2(2.3/7.1) = -1.63
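
The symmetry can be checked directly in R. The following is a minimal sketch using the example values above; the variable names are only illustrative and are not part of the exercise data.

<pre>
case    <- 7.1
control <- 2.3

ratio_up   <- case / control   # case > control: falls in ]1;inf[
ratio_down <- control / case   # numbers swapped: falls in ]0;1[

# The raw ratios are not symmetric around 1 ...
ratio_up - 1     # distance above 1
1 - ratio_down   # distance below 1

# ... but the log2 ratios are symmetric around 0
log2(ratio_up)
log2(ratio_down)
all.equal(log2(ratio_up), -log2(ratio_down))
</pre>

Swapping case and control therefore only flips the sign of the log2 ratio, which makes the mistake easy to spot and to correct.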

===Question 6===

<pre>
expr$fc <- expr$GSM287992/expr$GSM287991
expr$log2fc <- log2(expr$GSM287992/expr$GSM287991)
</pre>

===Question 7===

GO overrepresentation analysis of top 100 CONTROL genes.

<pre>
library(msigdbr)
library(fgsea)
load("/home/projects/22140/exercise5.Rdata")  # week 5 data, provides the background gene list
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
top100 <- expr[order(expr$GSM287991, decreasing = TRUE),]$SysName[1:100]  # 100 most expressed CONTROL genes
fora_q7 <- fora(pathways = BP_list, genes = top100, universe = background)
</pre>

Observation: Lots of basic cell "housekeeping" (e.g. rRNA synthesis, metabolic processes). Since this is the CONTROL sample of normally growing cells, it is expected that we should see all the basic processes needed for cell growth.

===Question 8===
The same analysis as above, but now with the entire list of expression values as input.

<pre>
stats <- expr$GSM287991; names(stats) <- expr$SysName
gsea_q8 <- fgsea(BP_list, stats)
</pre>

Same observation as above (but based on much more data, hence the greater depth of the analysis): lots of basic metabolism.

===Question 9===

What happens if we randomize the order of the genes in the list?

<pre>
stats <- expr$GSM287991; names(stats) <- sample(expr$SysName)
gsea_q9 <- fgsea(BP_list, stats)
</pre>

Answer: we lose the signal.
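
Why the signal disappears can be seen in a small toy example. This is only a sketch with made-up gene labels and values (assumptions, not exercise data): shuffling the names keeps the values and the set of genes intact, but breaks the link between each gene and its expression value.

<pre>
set.seed(1)  # only to make the toy example reproducible
toy <- c(geneA = 900, geneB = 500, geneC = 20, geneD = 5)  # hypothetical values

shuffled <- toy
names(shuffled) <- sample(names(toy))

# Same values in the same order, and the same set of names ...
identical(unname(shuffled), unname(toy))
setequal(names(shuffled), names(toy))

# ... but each value is now attached to a randomly chosen gene
rbind(original = names(toy), shuffled = names(shuffled))
</pre>

Since the ranking no longer reflects biology, no GO category is expected to accumulate at either end of the list.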

===Question 10===

Repeat the functional class scoring analysis for the CASE array

<pre>
stats <- expr$GSM287992; names(stats) <- expr$SysName
gsea_q10 <- fgsea(BP_list, stats)
</pre>

Answer: it’s very difficult to tell it apart from the CONTROL sample.
It’s important to understand why: even if the cells are arrested in G1, a lot of other (normal) processes are still going on, e.g. the cell still needs basic metabolism.

===Question 11===

Functional class scoring of genes ranked by Log2 Fold Change

<pre>
stats <- expr$log2fc; names(stats) <- expr$SysName
gsea_q11 <- fgsea(BP_list, stats)
</pre>

Observe the following: by sorting on the fold change we focus the analysis on what is different between CASE and CONTROL. In more practical terms this means that all the normal metabolic processes that were dominating the analysis will disappear, and we can now see what is going on in relation to the G1 arrest.

Answer: the Karyogamy GO terms are overrepresented. This is in good agreement with the fact that the cells were arrested with alpha-factor, which induces the mating response. Karyogamy is the process in which the nuclei of the A- and alpha-cells fuse.

File:Log2 graph.png

2024-03-05T15:28:36Z

WikiSysop:

File:Checkmark from openclipart.png

2024-03-05T15:27:55Z

WikiSysop:

ExYeastCellCycleTranscriptomics R

2024-03-05T15:27:19Z

WikiSysop: Created page with "= Yeast cell cycle / transcriptomics exercise #1 = '''Exercise written by:''' Rasmus Wernersson and Lars Rønn Olsen '''Learning objectives:''' * Introduction to practical work on array CASE/CONTROL studies. * Introduction to ArrayExpress as a way to download DNA microarray experimental data. * Introduction to basic expression data transformation: ** Calculation of Log2 expression values ** Basic evaluation of the comparability of two arrays. ** Calculation of FOLD CHA..."

= Yeast cell cycle / transcriptomics exercise #1 =
'''Exercise written by:''' Rasmus Wernersson and Lars Rønn Olsen

'''Learning objectives:'''
* Introduction to practical work on array CASE/CONTROL studies.
* Introduction to Gene Expression Omnibus (GEO) as a way to download DNA microarray experimental data.
* Introduction to basic expression data transformation:
** Calculation of Log2 expression values
** Basic evaluation of the comparability of two arrays.
** Calculation of FOLD CHANGE and Log2(Fold Change)
* Gene Ontology overrepresentation of RANKED DATA.

= Data set: Alpha Factor arrest =
As mentioned in greater detail in the previous lecture, alpha-factor arrest works by inducing yeast's natural mating behavior: The HAPLOID cells in the vegetative state prepare to undergo cell and nucleus fusion with HAPLOID cells of the opposite mating type (A vs. ALPHA). This is followed by MEIOSIS and spore formation. Therefore, it is important that the fusing cells are not in different stages of the MITOTIC cell cycle, as this would have a disastrous effect (spend a moment thinking about why).

Yeast evolution solved this problem by triggering an ARREST in the cell cycle at the G1/S boundary when the mating hormone of the opposite mating type is detected (a-factor or alpha-factor). Experimentally this can be used to halt virtually all of the cells in a growing culture at the same stage of the cell cycle.

== Downloading data from Gene Expression Omnibus (GEO) ==



'''NOTICE:''' we will cut some corners and provide partly "pre-cooked" data in this part of the exercise, as performing the full array analysis is out of scope for this course. However, we still need to get at least a basic understanding of what the data looks like, where it comes from, and SOME of the steps we need to go through to make it useful for our analysis.

'''GEO:''' https://www.ncbi.nlm.nih.gov/geo/

[[Image:Cogs_brain.png|50px]]'''TASK: Explore GEO'''
* Find experiment '''GSE11412''' at GEO
** Answering the questions below will require a bit of exploring / clicking around on links on your own.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How much RNA ("total RNA") was used in the experiment?
* Which type/brand of array was used?
* How many samples are in the study?
* '''IMPORTANT:''' note down which sample IDs are used for CONTROL (asynchronous; "mock treated") and which sample IDs are used for CASE (arrested cells) - we'll need this information later.

=== Load data ===

There are many ways to load the data into R. You can download either raw data (which needs to be analyzed from scratch) or processed data (which has been quality checked, normalized, and summarized to probe sets) and import it into R. You can also use packages such as "GEOquery" or "geneExpressionFromGEO" to extract data directly from GEO. We have cheated a bit and prepared a data frame for you to work with:

<pre>
load("/home/projects/22140/exercise6.Rdata")
</pre>

[[Image:Checkmark_from_openclipart.png|24px]] '''Check point:''' ''there are '''5''' columns of data in your data frame: Systematic gene name, Popular gene name, ProbeSet, Case (arrayname), Control (arrayname)''

== Getting the data ready for analysis ==
Before we can use the gene expression data for our analysis, we need to go through a few steps to prepare it for use.

=== Verify the data ===
First we'll perform a bit of sanity-checking on the data. Never blindly trust that the data is OK. As a rule of thumb, it is assumed that only a minority of genes are differentially regulated between the case and control samples. If the processed data we download has been normalized correctly, we should expect the points in a scatter plot of case vs. control to fall along the diagonal.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #2: create a scatter plot for case vs. control'''
* Paste the plot into your report.
* Do the two arrays indeed appear to be comparable, or did we goof up the normalization and/or experiments?
* On what order of magnitude is the largest expression value?

As you could (hopefully) see from the plot, the majority of the expression values are in the low end of the scale, and this makes it a bit difficult to see finer details in this part of the plot. Since values of expression data often span several orders of magnitude, a common "trick" is to log-transform the values to bring them on a more comparable scale. In bioinformatics the tradition is to log2 transform all data - we'll get back to why in a bit.
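
The effect of the log transformation can be illustrated with a few made-up numbers. This is a minimal sketch; the values are assumptions chosen to span several orders of magnitude and are not taken from the arrays.

<pre>
x <- c(5, 40, 300, 2500, 40000)  # hypothetical expression values

range(x)         # raw scale: dominated by the largest value
range(log2(x))   # log2 scale: a much narrower range

# One doubling is the same distance everywhere on the log2 scale
diff(log2(c(100, 200)))       # doubling at the low end
diff(log2(c(10000, 20000)))   # doubling at the high end
</pre>

On the raw scale the low values are squeezed into one corner of the plot; on the log2 scale they are spread out as much as the high values.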

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #3: create a scatter plot for log2(case) vs. log2(control)'''
* Include the plot in your report.
* Do the data appear to be comparable in the low end of the scale now?

=== Fold Change ===
[[Image:Log2_graph.png|500px|right|thumb|Log2 function]]
Now we get into the real meat of a case-control study: the ability to quantify the difference between the two conditions.

In this case we only have a single measurement of gene expression for each condition, so more advanced statistical methods are not applicable. However, we can still learn a lot about the trends of the total data set by calculating the change in expression.

In essence this is done calculating the ratio between the two values (e.g. case/control) for each gene. However, there are a few things we need to take into consideration first:

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:'''
* In what interval will the ratio fall if case is smaller than control? (That is, what are the largest and smallest values the fold change can theoretically take, not what you observe in your data.)
* In what interval will the ratio be if case is larger than control?
* Do you foresee any problems comparing the ratios?

Once again log-transformation will come to our rescue - but let's investigate why.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' - if we log2 transform the ratios what will be the result interval in the following cases:
* CASE < CONTROL
* CASE > CONTROL
* Are the values comparable now?
* What will happen if you accidentally swap the case and control numbers? (Calculate an example e.g. case = 7.1 , control = 2.3) - can you see any (dis)advantages with this system?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Calculate ratio and log2(ratio)'''
* Calculate both ratio (case/control) and log2(ratio) for all the expression data. Name the columns '''fc''' and '''log2fc''' (see the box below for why).
** If you use Excel this can be done by only TWO operations per column (ask for help if needed).

[[Image:Checkmark_from_openclipart.png|24px]] '''Check point:''' ''you should have '''7''' columns of data in your table by now: Systematic gene name, Popular gene name, ProbeSet, Case (arrayname), Control (arrayname), fc, log2fc''

<blockquote style="background-color: lavender; border: solid thin grey;">
'''Gene expression lingo: fold change'''<br>

The ratio we have just calculated between the two conditions is referred to in transcriptomics as the '''fold change'''. The rationale is that the fold change (FC) immediately tells you something about the magnitude of regulation (e.g. 10-fold up-regulation). As we have seen, it is very useful to work with log2-transformed values, and one of the reasons for using log2 rather than log10 is that log2 is perfectly suited to counting '''doublings'''. A log2(FC) value of 2 corresponds to two doublings (FC = 4), a value of 3 to three doublings (FC = 8), and so on.
</blockquote>
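
The relation between log2(FC) and doublings can be checked by back-transforming a few values. This is a small sketch with illustrative numbers only:

<pre>
log2fc <- c(-2, -1, 0, 1, 2, 3)  # illustrative log2(FC) values
fc     <- 2^log2fc               # back-transform to plain fold change

data.frame(log2fc, fc)
# log2fc =  3 corresponds to fc = 8     (three doublings)
# log2fc = -2 corresponds to fc = 0.25  (two halvings)
</pre>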

== Gene Ontology Gene Set Enrichment Analysis ==
For this exercise, you will once again need the background of all the yeast genes measured with this particular microarray platform. You can load this from the week 5 exercise data.

=== Control array ===
[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: top 100 CONTROL genes'''
* Make a list of the top 100 most expressed genes in the control study.
* Test for GO over-representation compared to the background list (as we did in the week 5 exercise) for "Biological Process"
* Paste the top 10 most significantly overrepresented biological processes into your report, and comment on the trends you see - is this expected/surprising?

The "fgsea" package we use for our GO over-representation analysis (ORA) can also do a functional class scoring (FCS), instead of the subset vs. background analysis we have been using so far.

For the functional class scoring analysis, you supply a single list of ranked genes (in most cases the entire genome), which has been sorted by some experimental criteria, e.g. gene expression values under a certain condition.

Here, it's important to realize that the list has two purposes:
# To provide a ranking of the genes.
# To provide the background distribution (it will also serve as a list of "all genes")
The over-representation analysis will then provide insight about the types of genes (and thus GO categories) that are "pushed" towards the ends of the list.
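
The two roles of the ranked list can be illustrated with a toy example. This is only a sketch: the data frame, its column names, and the values are assumptions made for illustration, and the real data frame has more columns and many more rows.

<pre>
# Hypothetical toy data frame standing in for the expression data
toy <- data.frame(SysName = c("YAL001C", "YAL002W", "YAL003W"),
                  value   = c(12.5, 830.0, 95.2))

stats <- toy$value
names(stats) <- toy$SysName              # attach gene names to the ranking metric
stats <- sort(stats, decreasing = TRUE)  # rank from highest to lowest

stats          # role 1: the ranking of the genes
names(stats)   # role 2: the list of "all genes" (the background)
</pre>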

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: GO Functional Class Scoring'''
* Sort the gene list according to the expression in the control array.
* Perform a functional class scoring of GO terms (for Biological Process) on the ranked list alone, using the "fgsea" function from the "fgsea" package. For this purpose, you need to make a named vector with the ranking metric (in this case, the expression values) as values and the gene names as names. See "?names".
* Include the top 10 most significantly overrepresented biological processes in your report (use the [https://rdrr.io/bioc/fgsea/man/fgseaMultilevel.html NES] scores to select the upregulated gene sets, i.e. those with a positive NES).
* Do the results make biological sense?

As a check on whether the gene set enrichment algorithm does a decent job with the FCS analysis, we'll investigate what happens if we feed it a randomly shuffled list of genes.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9: random "ranked" list'''
* Make a new named vector of the expression values, but this time randomize the names.
* What result do you expect to get from analyzing a randomly ordered gene list?
* Perform a functional class scoring of GO terms (Biological Process) - do the results match what you expected?

=== CASE array ===
[[Image:Cogs_brain.png|50px]]
'''TASK: rank based analysis of the case array'''
* Rank the gene list according to expression from the case array, and perform a functional class scoring of GO terms (Biological Process).

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #10:'''
* Can you explain the results? (Would you have expected otherwise since this is from the cells arrested in the G1 phase?).

=== Differential Expression - Fold change ===
[[Image:Cogs_brain.png|50px]]
'''TASK: GO over-rep analysis of Fold Change'''
* Rank the gene list on Log2(Fold Change), and perform a functional class scoring of GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #11:'''
* Do the results make biological sense?
* Think about and explain why these results may differ from performing the same analysis on the CASE array alone.

Autumn2023

2024-03-05T15:26:25Z

WikiSysop: /* Lecture 06 (Oct 5) - Yeast Systems Biology 3 */

= Course 22140 - plan for autumn 2023 =

'''Teachers:'''
* Lars Rønn Olsen (course organizer) - '''contact:''' [mailto:lronn@dtu.dk lronn@dtu.dk]
* Kristoffer Vitting-Seerup (course organizer) - '''contact:''' [mailto:krivi@dtu.dk krivi@dtu.dk]
* Rasmus Wernersson (external lecturer) - '''contact:''' [mailto:rawe@dtu.dk rawe@dtu.dk]
* Hanxi Li (teaching assistant) - '''contact:''' [mailto:hanxli@dtu.dk hanxli@dtu.dk]



= DTU Learn =
* Link: [https://learn.inside.dtu.dk/d2l/home/167355 Course 22140, Autumn 2022 @ DTU Learn]
<br>

= Bioinformatics =
Besides knowledge of basic molecular biology and biochemistry, a prerequisite for this course is bioinformatics (usually from course 22111 or one of its variants). If you need to read up on some bioinformatics topics, please use the links below.
* [https://teaching.healthtech.dtu.dk/22111/ Course 22111] - ''Introduction to Bioinformatics''
* [[Exercise:_The_protein_database_UniProt|UniProt exercise]] ([[ExUniProt-answers|answers]]) - This is an important one, as we use UniProt a lot in this course.
<br>

= R =
For the computer exercises we will be using R to process, analyze, and visualize the biological networks. R is Open Source and freely available for Windows, Mac and Linux. We will be using an RStudio Server cloud solution to make sure that everyone uses the same version of R and the needed packages. You can log in with your DTU credentials [https://teaching.healthtech.dtu.dk/22140/rstudio.php here].

'''NOTE''': In order to produce plots with RStudio Server, you need to have the appropriate graphics device activated. If you have X11 installed, this should work without any further action. If you do not, you will get an error whenever you try to plot anything. To fix this, open RStudio Server, go to "Tools" (menu bar at the top of the screen), select "Global Options" from the drop-down menu, select the "Graphics" tab, and change "Backend" to "Cairo".
<br><br>

= Weekly assignments =
[[Image:Office-notes-line_drawing.png|40px|left]]
As part of the computer exercises you (or your group) should keep a "log book" and answer the questions/report observations as you work through the exercise. The parts you need to document will be marked with the small "report icon" also seen here.

Following the exercise, the reports will be handed in using the peer grade system. We will assign your report to three fellow students, who will provide you with feedback.

'''Important:''' The reports are not strictly mandatory, but it is HIGHLY recommended to turn them in, as this is excellent training for the exam.

'''Allowed formats:'''
# Plain text + figures as extra files
# Microsoft Word (*.doc, *.docx)
# PDF: use ANY word-processing software you like (e.g. "Pages") and save/print the result to a PDF.

= Lecture plan, autumn 2023 =

== When and Where ==
* '''When:''' Each '''Thursday''' from '''13:00-17:00'''
* '''Where:''' Building '''303A''' auditorium/group-room '''045'''
<hr>

== Block #1: Introduction ==
'''Responsible for this block:''' Lars Rønn Olsen and Rasmus Wernersson
----
=== Lecture 01 (August 31) - Intro 1 ===

:'''Lecture:''' ''Introduction to Systems Biology and biological networks'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:''' ''Can a Biologist fix a radio?'' - Lazebnik Y., Cancer Cell 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W01_Lazebnik_CancerCell2002.pdf PDF])
:'''Exercise:''' [[igraphIntro_Ex_v1|Introduction to working with networks in R]] - '''Answers:''' [[igraphIntro_Answers_v1|Exercise #1 answers]]

=== Lecture 02 (Sep 7) - Intro 2 ===

:'''Lecture:''' ''Protein-protein interaction networks. Experimental methods and interpretation.'' - Lars Rønn Olsen

:'''Slides:''' To appear on DTU Learn
:'''Hand-outs:''' ''SnapShot: Protein-Protein Interaction Networks'' - Seebacher & Gavin, Nature 2011 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/SnapShot_Cell2011.pdf PDF]) - focus on the EXPERIMENTAL METHODS part for this week.
:'''Readings:'''
<blockquote>
* Lecture note on ''quality scoring of protein-protein interaction data, notes and examples'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_scoring_of_PPI.pdf PDF])
* ''Comparative assessment of large-scale data sets of protein-protein interactions'' - von Mering C, ''et al''. Nature 2002 ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/W02_Comparative_assessment_of_large-scale_data_sets_of_protein-protein.pdf PDF])
</blockquote>

:'''Exercise:'''
:* [[Media:W02_exercises_v7_corrected.pdf|Building protein-protein interaction networks from experimental data]] (solutions now on DTU Learn)
:* [[Media:Exercise_help_sheet.pdf|Note taking sheet for help with ex. 5,7,8,9]] - Consider printing this for taking notes
:* [[Ex_handouts_igraph|Visualization of the networks from the hand-out exercise]] - '''Answers''': [[Ex_handouts_igraph_solution|Exercise #2 answers]]

=== Lecture 03 (Sep 14) - Intro 3 ===

:'''Lecture:''' ''Network topology'' - Lars Rønn Olsen
:'''Slides:''' On DTU Learn.

:'''Hand-outs:''' SnapShot: Protein-Protein Interaction Networks - (SAME AS LAST WEEK) ([https://teaching.healthtech.dtu.dk/material/22140/SnapShot_Cell2011.pdf PDF]) - read the rest for this week.
:'''Readings:''' Global network properties. Barabasi & Oltvai, Nat Rev Genet 2004 ([https://teaching.healthtech.dtu.dk/material/22140/W03_Barabasi_Oltvai_NatRevGen2004.pdf PDF]) - concentrate on '''Box 1''' and '''Box 2'''.

:'''Exercises:'''
<blockquote>
#'''Handout exercise:''' Network topology exercise ([https://teaching.healthtech.dtu.dk/material/22140/W03_Network_topology_exercise_v3.pdf PDF])
#'''Computer exercise:''' Topology/statistics/modules [[ExTopology1_igraph|Network topology and statistics]] - '''Answers''': [[ExTopology1_igraph_solutions|Answers to igraph exercise]]
</blockquote>

== Block #2: Case: Yeast systems biology ==
'''Responsible for this block:''' Rasmus Wernersson and Kristoffer Vitting-Seerup
----

=== Lecture 04 (Sep 21) - Yeast Systems Biology 1 ===

:'''Lecture:''' ''Yeast Cell Cycle introduction'' - Rasmus Wernersson.
:'''Slides:''' Will be uploaded to DTU Learn
:'''Readings:'''
:* Background on budding yeast cell cycle and cell cycle regulation ([https://teaching.healthtech.dtu.dk/material/22140/Budding_Yeast_Cell_Cycle_Model.pdf PDF]).
:* Source: http://mpf.biol.vt.edu/research/budding_yeast_model/pp/index.php (much more information about modelling the yeast cell cycle can be found here) [NOT part of the curriculum].
:* '''Important:''' You don't need to understand all the finer points about the regulation, but make sure you know the '''phases''' of the cell cycle.

:'''Saccharomyces Genome Database:''' http://www.yeastgenome.org/
:'''Exercise:''' [[ExYeastSysBio_R|Yeast cell cycle 1 - introduction to data and methods]] - '''Answers:''' [[ExYeastSysBio_R_answers|Yeast 1 answers]]

=== Lecture 05 (Sept 28) - Yeast Systems Biology 2 ===

:'''Lecture:''' ''Gene Ontology and large scale data analysis'' - Rasmus Wernersson
:'''Readings:''' Two introductory papers to The Gene Ontology (GO). Choose the one you prefer.
:* Intro for bioinformaticians: '''The what, where, how and why of gene ontology - a primer for bioinformaticians''' - [https://teaching.healthtech.dtu.dk/material/22140/Bbr002.pdf PDF] (NEW LINK) (focus on the first three pages).
:**Focuses mostly on the structure of the GO, the evidence behind the annotations and relations of the genes/proteins to the categories.
:* Intro for biologists: '''Gene Ontology: tool for the unification of biology''' - [https://teaching.healthtech.dtu.dk/material/22140/GO_NATURE_GENETICS_2000.pdf PDF] (NEW LINK)
:**Describes more the general idea behind GO and why it is useful.
:'''Slides:''' On DTU Learn

:'''Exercise:''' [[ExGeneOntology_Yeast_R|Gene Ontology - yeast cell cycle examples]] - '''Answers:''' [[ExGeneOntology_Yeast_R_answers|answers]]

=== Lecture 06 (Oct 5) - Yeast Systems Biology 3 ===

:'''Lecture:''' ''Introduction to transcriptomics'' - Kristoffer Vitting-Seerup
:'''Readings:''' ''A brief introduction to DNA micro-arrays'' ([https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/691406/View DTU Learn]) - Rasmus Wernersson
:'''Background:''' (Optional) - If you need a reminder about how the Log2 function works, then have a look at '''Appendix A''' in Thomas Schneider's '' Information Theory Primer'' ([https://teaching.healthtech.dtu.dk/material/22140/informationtheory_primer.pdf PDF])
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics_R|Yeast cell cycle: single point arrest DNA microarray studies]] - '''Answers:''' [[ExYeastCellCycle_answers|Answers]]

=== Lecture 07 (Oct 12) - Yeast Systems Biology 4 ===

:'''Lecture:''' ''How proteins collaborate during the phases of cell division'' - Rasmus Wernersson.
:'''Readings:''' [[Media:Cyclebase1_2008.pdf‎|Cyclebase paper]] - (skim it - make sure to understand '''Fig 1''').
:'''Slides:''' To appear on DTU Learn

:'''Exercise:''' [[ExYeastCellCycleTranscriptomics2_R|Mapping temporal expression data onto networks]] '''Answers:''' [[ExYeastCellCycleTranscriptomics2_R_answers|answers]]

<hr>
<div align="center">
'''Autumn vacation'''
(Week 42)
</div>
<hr>

== Block #3: Case: Human disease biology ==
'''Responsible for this block:''' Lars Rønn Olsen, Rasmus Wernersson, and Kristoffer Vitting-Seerup
----

=== Lecture 08 (Oct 26) - Systems Biology in Biomedical Research (Heart diseases) 1 - CANCELLED ===



=== Lecture 09 (Nov 2) - Systems Biology in Biomedical Research (Heart diseases) 2 ===
[[Image:XKCD_significant.png|80px|thumb|right]]
:'''Lecture:''' ''Virtual pulldown and protein complex detection'' - Lars Rønn Olsen and Giorgia Moranzoni
:'''Readings:'''
:* ''Human diseases through the lens of network biology'' ([https://teaching.healthtech.dtu.dk/27040/teachingmaterials/Furlong_Cell2012.pdf PDF]) - Concentrate on: '''Figure 1''' and '''Box 3'''
:* '''Heart development video:''' https://www.youtube.com/watch?v=5DIUk9IXUaI
:'''Extra:''' (not curriculum)
:* The heart disease paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913399/
:* The MCODE paper: http://www.biomedcentral.com/1471-2105/4/2
:'''Exercises:''' [[DiscoNet|DiscoNet]] - '''Answers:''' [[DiscoNet_answers|DiscoNet answers]]



=== Lecture 10 (Nov 9) - Integrating multiple omics data types for cancer research ===
:'''Lecture:''' ''Systems Biology in Cancer'' - Kristoffer Vitting-Seerup
:'''Readings:'''
:* ''The Hallmarks of Cancer'', Hanahan & Weinberg 2011 ([https://www.cell.com/fulltext/S0092-8674(11)00127-9 Article link])
:'''Slides:''' [https://learn.inside.dtu.dk/d2l/le/content/167355/viewContent/704041/View On DTU Learn]
:'''Exercises:''' [[DiscoNet2|Multiomics data integration]] '''Answers:''' [[DiscoNet2_answers|answers]]



=== Lecture 11 (Nov 16) - Essential R functions + exam exercise ===

:'''Lecture:''' Week 10 (and to some extent week 9) exercise walk-through
:'''Slides:''' To appear on DTU Learn
:'''Exercise:''' Old exam set (adapted to R)

<br>

=== Lecture 12 (Nov 23) - QnA / AMA ===

:'''Lecture:''' Kristoffer Vitting-Seerup
:'''Topics:''' Anything you would like a refresher on

<br>

=== Lecture 13 (Nov 30) - Systems Biology in Biomedical Research 3 (ZS Revelen framework, drug targets) ===
:'''Lecture:''' ''Biomarker and drug target identification'' - Rasmus Wernersson.
:'''Readings:'''
:* ''Systems biology investigation of COVID-19'': ''' Network analysis of COVID19 – Intomics''' (PDF on DTU Learn)
:* It's a document with a written explanation of the COVID-19 analysis. (PDF of Intomics web-page, apologies for the sub-optimal formatting)
:* Please read all of it - ''including methods''. It is written for a non-technical audience and should be easy to understand.

:'''Slides:''' To appear on DTU Learn
:'''Exercise:'''
:* PDF on Learn
:'''Link to ZS Revelen:''' '''https://zs-revelen.com/'''
:*'''Please register''' your email with ZS Revelen before the exercise. '''Please use your DTU email.'''
<br>

= Old exam sets =

* On Learn



= Exam =
* '''Date:''' 6/12 2022
* '''Time:''' 15:00-19:00
* '''Where:''' (will be) available via https://eksamensplan.dtu.dk/

File:Graphpad pvalue 02.png

2024-03-05T15:23:30Z

WikiSysop:

File:Graphpad pvalue 01.png

2024-03-05T15:22:50Z

WikiSysop:

File:DNA pol act tree.png

2024-03-05T15:22:02Z

WikiSysop:

ExGeneOntology Yeast1.5

2024-03-05T15:20:33Z

WikiSysop: /* How to run the analysis */

= Gene Ontology - yeast cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson]

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from Saccharomyces Genome Database (SGD).
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using GOrilla

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, in much the same way as genes have identifiers in SGD (e.g. YDR224C) and proteins have identifiers in UniProt (e.g. DPOD1_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords (see the [[Exercise:_The_protein_database_UniProt|27611 UniProt exercise]] for details): to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** http://cbl-gorilla.cs.technion.ac.il/ - "'''GOrilla''' - Gene Ontology enRIchment anaLysis and visuaLizAtion tool"
''Many, MANY, more Gene Ontology wrappers and analysis tools exist (all based on the same data), but we'll limit ourselves to the ones listed above for the time being.''
== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in the physical partitioning and separation of a cell into daughter cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis excludes nuclear division; in prokaryotes, there is little difference between cytokinesis and cell division. Note that there is no relationship between this term and 'nuclear division ; GO:0000280' because cell division can take place without nuclear division (as in prokaryotes) and vice versa (as in syncytium formation by mitosis without cytokinesis).
|}

Example of the definition of a GO term ("Cell division" ; GO:0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.
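
To make the structure of such an entry concrete, here is a minimal Python sketch of the "Cell division" record from the table above. The <code>GOTerm</code> class and its field names are hypothetical and chosen only for illustration (real GO entries carry many more fields); the one rule it checks is that an accession is written as "GO:" followed by seven digits.

```python
import re
from dataclasses import dataclass

# GO accessions are written as "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical minimal record; real GO entries carry many more fields."""
    accession: str
    ontology: str
    name: str
    definition: str

    def __post_init__(self):
        # Reject malformed accessions such as "0051301" or "GO:51301".
        if not GO_ID_PATTERN.fullmatch(self.accession):
            raise ValueError(f"not a valid GO accession: {self.accession!r}")


# Values copied from the "Term: Cell division" table above.
cell_division = GOTerm(
    accession="GO:0051301",
    ontology="Biological Process",
    name="cell division",
    definition=("The process resulting in the physical partitioning and "
                "separation of a cell into daughter cells."),
)
print(cell_division.accession, "-", cell_division.name)
```

Keeping the accession as a validated string (rather than a bare number) avoids losing the leading zeros, which are part of the identifier.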

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Select Search -> Ontology from the top menu.
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

It's much the same case as we have previously seen with taxonomy in course 27611 - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships. From this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
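
The one-way nature of '''IS A''' can be sketched in a few lines of Python. This is only an illustration: the lineage is copied from the human example above, the function names are made up for this sketch, and each term is given a single parent, whereas real GO terms can have several.

```python
# Each entry points to its direct parent; the chain is copied from the
# abbreviated human lineage shown above (a single-parent simplification).
IS_A = {
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Vertebrata",
    "Vertebrata": "Craniata",
    "Craniata": "Chordata",
    "Chordata": "Metazoa",
    "Metazoa": "Eukaryota",
}


def ancestors(term, parent_of):
    """Return all ancestors of `term`, nearest first, by following IS A links."""
    found = []
    while term in parent_of:
        term = parent_of[term]
        found.append(term)
    return found


def is_a(term, candidate, parent_of):
    """True if `candidate` is reachable from `term` by IS A links only."""
    return candidate in ancestors(term, parent_of)


print(ancestors("Primates", IS_A))
print(is_a("Homo", "Mammalia", IS_A))   # True: every human is a mammal
print(is_a("Mammalia", "Homo", IS_A))   # False: the relationship is one-way
```

The same idea carries over to GO annotations: a gene annotated with a specific term implicitly belongs to all of that term's ancestors, but not the other way round.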

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.
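
Once more than one relationship type is in play, each link has to carry a label. The Python sketch below stores links as (child, relationship, parent) triples and groups the links above a term by relationship type. The three edges are a simplified toy graph based on the examples in this text (nucleolus within nucleus, nucleus IS A organelle, organelle IS A cellular component); it is not a copy of the real GO graph, and the function name is hypothetical.

```python
from collections import defaultdict

# (child, relationship, parent) triples. A simplified toy graph based on the
# examples in the text, NOT a copy of the real GO graph.
EDGES = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "is_a", "cellular component"),
]


def ancestor_edges(term, edges):
    """Collect every edge on a path from `term` up to a root, grouped by
    relationship type. Terms may have several parents."""
    parents = defaultdict(list)
    for child, rel, parent in edges:
        parents[child].append((rel, parent))

    grouped = defaultdict(set)
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already expanded via another path
        seen.add(current)
        for rel, parent in parents[current]:
            grouped[rel].add((current, parent))
            stack.append(parent)
    return dict(grouped)


for rel, found in sorted(ancestor_edges("nucleolus", EDGES).items()):
    print(rel, sorted(found))
```

Counting the entries per relationship type is essentially what you do by eye when reading the ancestor table or graph view for a term in AmiGO.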

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA?
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== Saccharomyces Genome Database ==

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Look up the entry for POL1 (YNL102W) in [http://www.yeastgenome.org SGD]'''
* Notice that all Saccharomyces Genome Database (SGD) entries have an entire section on Gene Ontology annotations; click on the "Gene Ontology" tab for full details. This includes a bit of extra information about the '''evidence''' for annotations.
* What is the Molecular Function for POL1?
* Click on the link for this term to see how SGD describes the GO term, and how the evidence is presented.
** How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?

'''IMPORTANT:''' SGD also offers the possibility to jump from their website to the same GO term inside AmiGO. This is very useful for investigating the hierarchy of GO terms "above" - SGD has limited functionality for this.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: Follow the link to AmiGO'''
* Follow the link to AmiGO for the Molecular Function term found above, and answer the following question:
** Does POL1 have "Transferase Activity"? (Which GO term?)

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
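
The four steps above can be sketched as a small Python function. The function name and the numbers in the example call are made up purely to keep the arithmetic easy to follow; they are not the values for Cluster #1.

```python
def enrichment(pop_with, pop_size, study_with, study_size):
    """Return (frequency, expected, enrichment) following the four steps above.

    pop_with   -- X: genes in the population group with the characteristic
    pop_size   -- Y: size of the population group
    study_with -- observed genes in the study group with the characteristic
    study_size -- n: size of the study group
    """
    if pop_size <= 0 or study_size <= 0 or pop_with <= 0:
        raise ValueError("group sizes and the population count must be positive")
    frequency = pop_with / pop_size      # step 1: FX = X / Y
    expected = frequency * study_size    # step 2: exp = FX * n
    ratio = study_with / expected        # steps 3-4: observed / expected
    return frequency, expected, ratio


# Made-up numbers: 50 of 1000 genes carry the term, 5 of 20 in the study group.
freq, exp, enr = enrichment(pop_with=50, pop_size=1000,
                            study_with=5, study_size=20)
print(f"frequency={freq:.3f} expected={exp:.1f} enrichment={enr:.1f}")
```

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance; whether that difference is statistically significant is a separate question, answered by a p-value.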

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
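
For those curious about what such a calculator does, here is a minimal Python sketch of a one-sided over-representation p-value based on the hypergeometric distribution (the one-sided form of Fisher's exact test). The function name and example numbers are made up for illustration. Note that online calculators may report a two-sided p-value by default, so their numbers can differ from this sketch.

```python
from math import comb


def hypergeom_upper_tail(pop_size, pop_with, study_size, study_with):
    """P(X >= study_with) when drawing `study_size` genes without replacement
    from `pop_size` genes, of which `pop_with` carry the annotation."""
    total = comb(pop_size, study_size)
    upper = min(pop_with, study_size)
    hits = sum(
        # ways to pick k annotated genes and (study_size - k) unannotated ones
        comb(pop_with, k) * comb(pop_size - pop_with, study_size - k)
        for k in range(study_with, upper + 1)
    )
    return hits / total


# Made-up numbers: 50 of 1000 genes carry the term, 5 of 20 in the study group.
p = hypergeom_upper_tail(pop_size=1000, pop_with=50, study_size=20, study_with=5)
print(f"one-sided p-value: {p:.3g}")
```

The sum runs over all outcomes at least as extreme as the observed one, which is why a larger observed count gives a smaller p-value.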



== Automated analysis using GOrilla ==
=== Introducing GOrilla ===
[[Image:Cluster1_biological_process.png|thumb|300px|right|Automated over-representation analysis of Cluster #1 using GOrilla. The color intensity marks significance of the over-representation.]]

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (study group) against a background distribution consisting of the entire yeast genome (population group). The tool we have selected will automatically calculate p-values for ALL Gene Ontology entries within the 3 main trunks of the GO system:

* Biological Process
* Molecular Function
* Cellular Component

The tool is intelligent enough to perform the test on '''nested categories''', and the results are shown both as tables with p-values and as easy-to-interpret color-coded graphs (see the figure to the right). It is also worth mentioning that the tool takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
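
To give a feel for what a multiple testing correction does, the Python sketch below implements the Benjamini-Hochberg procedure, one widely used way of turning a list of p-values into FDR-adjusted values. It is shown only as an illustration of the general idea; check the GOrilla documentation for the exact correction it reports. The function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is capped by
    # the adjusted values of all larger p-values (keeps the result monotone).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four GO terms tested at the same time.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The more terms that are tested, the more each raw p-value is inflated, which is why a term that looks significant on its own may no longer be so after correction.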

Finally GOrilla can be run in two main modes of operation:
* List vs. background (Study group vs. population group).
* Rank based test on a single (sorted) input list.

We'll cover the list vs. background method today, and the rank based test in next week's exercise.

'''LINK:'''
** http://cbl-gorilla.cs.technion.ac.il/ - "'''GOrilla''' - '''G'''ene '''O'''ntology en'''RI'''chment ana'''L'''ysis and visua'''L'''iz'''A'''tion tool"

=== How to run the analysis ===
[[Image:GOrilla_webinterface1+boxes.png|thumb|400px|right|Important options to remember when performing set vs. background analysis]]

First we need to prepare our input data - we'll use the '''Cluster #1''' as example again:

'''Input list:''' ("study group")
<pre style="overflow:auto;">
YMR078C
YPR175W
YBR278W
YBL035C
YNL102W
YNL262W
YOR144C
YPR167C
YIR008C
YKL045W
YOR217W
YJR068W
YNL290W
YOL094C
YBR087W
YCL042W
</pre>

'''Background list:''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, which you can download here:
* [https://teaching.healthtech.dtu.dk/material/22140/yeast_all_sysnames.txt yeast_all_sysnames.txt]

[[Image:Document-save.png|left|25px]]
'''TASK: Download the data file'''. Place it somewhere on your computer where you can easily find it - we'll be using it '''a lot'''.
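
Before uploading, it can be useful to check that every ID in the input list also occurs in the background list. The Python sketch below is one way to do that. The function names are hypothetical, the regular expression is an assumed description of nuclear yeast systematic names such as YNL102W (it will not match, for example, mitochondrial gene names), and the tiny background list stands in for the downloaded file.

```python
import re

# Assumed shape of a nuclear yeast systematic name, e.g. YNL102W: "Y",
# chromosome letter A-P, arm L/R, three digits, strand W/C, optional "-A".
ORF_PATTERN = re.compile(r"Y[A-P][LR]\d{3}[WC](-[A-Z])?")


def parse_ids(text):
    """Split pasted text into upper-case IDs, skipping blank lines."""
    return [line.strip().upper() for line in text.splitlines() if line.strip()]


def check_study_list(study_ids, background_ids):
    """Return (malformed, missing): IDs that do not look like systematic
    names, and IDs that are absent from the background list."""
    background = set(background_ids)
    malformed = [i for i in study_ids if not ORF_PATTERN.fullmatch(i)]
    missing = [i for i in study_ids if i not in background]
    return malformed, missing


study = parse_ids("""
YMR078C
YPR175W
ynl102w
POL1
""")
# Tiny stand-in for the downloaded background file (an assumption for the demo).
background = parse_ids("YMR078C\nYPR175W\nYNL102W\nYBR278W\n")
print(check_study_list(study, background))
```

With the real file, and assuming it contains one ID per line, the background could be loaded with <code>parse_ids(open("yeast_all_sysnames.txt").read())</code>. Here the gene name "POL1" is flagged in both lists, a reminder to paste systematic names rather than gene names.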

'''Running GOrilla:'''
# GOrilla needs to know which '''organism''' the gene IDs come from (it does not have the functionality to autodetect it), in order to load the correct subset of Gene Ontology. Luckily, yeast is among the supported organisms.
# Choose the running mode (two unranked lists)
# Paste in '''input list'''
# Upload '''background list'''
# Select which part of Gene Ontology you want to compare against.
''(See the figure to the right for a summary)''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #11:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the graphs in your report. Do the results fit with what we have previously learned about the function of cluster #1?

== Analyze selected clusters using GOrilla ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. What we'll need here is '''lists of gene names''' for each cluster. The easiest way to do this is to reuse the Excel sheet with functional annotation you made last week.

Re-analyzing all 10 clusters will likely take too long for this exercise, as it takes some manual effort to run GOrilla. Instead select the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

File:Document-save.png

2024-03-05T15:19:12Z

WikiSysop:

File:GOrilla webinterface1+boxes.png

2024-03-05T15:18:07Z

WikiSysop:

File:Cluster1 biological process.png

2024-03-05T15:17:25Z

WikiSysop:

ExGeneOntology Yeast1.5

2024-03-05T15:16:34Z

WikiSysop: Created page with "= Gene Ontology - yeast cell cycle examples = '''Cellular component''' example: the GO term '''mitochondrion''' '''Exercise written by:''' [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] '''Purpose of this exercise:''' * Understand how Gene Ontology terms are defined and organized: ** The relationship between GO terms (IS A, PART OF, etc) ** The three..."

= Gene Ontology - yeast cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson]

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from Saccharomyces Genome Database (SGD).
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using GOrilla

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, in much the same way as genes have identifiers in SGD (e.g. YDR224C) and proteins have identifiers in UniProt (e.g. DPOD1_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords (see the [[Exercise:_The_protein_database_UniProt|27611 UniProt exercise]] for details): a standard set of '''labels''' that can be used to describe gene/protein functionality. This 1) makes it much easier to search gene/protein databases, and 2) makes it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** http://cbl-gorilla.cs.technion.ac.il/ - "'''GOrilla''' - Gene Ontology enRIchment anaLysis and visuaLizAtion tool"
''Many, MANY, more Gene Ontology wrappers and analysis tools exist (all based on the same data), but we'll limit ourselves to the ones listed above for the time being.''
== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in the physical partitioning and separation of a cell into daughter cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis excludes nuclear division; in prokaryotes, there is little difference between cytokinesis and cell division. Note that there is no relationship between this term and 'nuclear division ; GO:0000280' because cell division can take place without nuclear division (as in prokaryotes) and vice versa (as in syncytium formation by mitosis without cytokinesis).
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Select Search -> Ontology from the top menu.
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

It's much the same case as we have previously seen with taxonomy in course 27611 - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In GO terminology these are '''IS A''' relationships. From them we can see that all humans are primates and all primates are mammals, but not all mammals are human.
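
The inherited nature of '''IS A''' relationships can be sketched in a few lines of R. This is only a toy illustration: the hierarchy below is a simplified, hand-typed subset of the lineages above, the function name <code>ancestors</code> is made up for this example, and it assumes every term has a single parent (real GO terms can have several).

```r
# Toy IS A hierarchy: each name is a term, each value is its direct parent.
# Simplified, hand-typed subset for illustration only (one parent per term).
is_a <- c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mus       = "Muroidea",
  Muroidea  = "Mammalia",
  Mammalia  = "Vertebrata"
)

# Walk up the hierarchy and collect every ancestor of a term.
ancestors <- function(term, parents) {
  result <- character(0)
  while (term %in% names(parents)) {
    term <- parents[[term]]
    result <- c(result, term)
  }
  result
}

ancestors("Homo", is_a)
"Primates" %in% ancestors("Homo", is_a)  # TRUE: all humans are primates
"Primates" %in% ancestors("Mus", is_a)   # FALSE: mice are mammals, not primates
```

Anything stated about "Mammalia" is inherited by both ''Homo'' and ''Mus'', because "Mammalia" appears among the ancestors of both.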

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA?
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process GO calls "homologous chromosome pairing at meiosis" - you may have heard this described as "chromosome crossover" as well (technically, crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== Saccharomyces Genome Database ==

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Look up the entry for POL1 (YNL102W) in [http://www.yeastgenome.org SGD]'''
* Notice that all Saccharomyces Genome Database (SGD) entries have an entire section on Gene Ontology annotations; click on the "Gene Ontology" tab for full details. This includes a bit of extra information about the '''evidence''' for annotations.
* What is the Molecular Function for POL1?
* Click on the link for this term to see how SGD describes the GO term, and how the evidence is presented.
** How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?

'''IMPORTANT:''' SGD also offers the possibility to jump from their website to the same GO term inside AmiGO. This is very useful for investigating the hierarchy of GO terms "above" - SGD has limited functionality for this.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: Follow the link to AmiGO'''
* Follow the link to AmiGO for the Molecular Function term found above, and answer the following question:
** Does POL1 have "Transferase Activity"? (Which GO term).

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN (accession: P28340).
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group is then the '''background''' to compare against - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask whether the frequency of proteins annotated with a GO term differs between the study group and the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with the characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
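
The four steps can be sketched in R as follows. All numbers are hypothetical and chosen only to make the arithmetic easy to follow; the variable names are not taken from any package.

```r
# Hypothetical numbers, chosen only to illustrate the four steps.
population_size <- 1000  # Y: genes in the population group
annotated_total <- 50    # X: genes in the population with the characteristic
study_size      <- 20    # n: genes in the study group
observed        <- 5     # genes in the study group with the characteristic

frequency  <- annotated_total / population_size  # step 1: FX = X / Y
expected   <- frequency * study_size             # step 2: exp = FX * n
enrichment <- observed / expected                # steps 3-4: observed / expected

enrichment  # 5
```

Here one gene with the characteristic would be expected by chance, five were observed, and the enrichment is therefore five-fold.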

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php



== Automated analysis using GOrilla ==
=== Introducing GOrilla ===
[[Image:Cluster1_biological_process.png|thumb|300px|right|Automated over-representation analysis of Cluster #1 using GOrilla. The color intensity marks significance of the over-representation.]]

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (study group) against a background distribution consisting of the entire yeast genome (population group). The tool we have selected will automatically calculate p-values for ALL Gene Ontology entries within the 3 main trunks of the GO system:

* Biological Process
* Molecular Function
* Cellular Component

The tool is intelligent enough to perform the test on '''nested categories''', and the results are shown both as tables with p-values and as easy-to-interpret color-coded graphs (see the figure to the right). It is also worth mentioning that the tool takes care of '''multiple testing correction''', an important problem in large scale data analysis, which we will revisit in greater detail in a later exercise.
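
To give a feel for what multiple testing correction does, here is a minimal R sketch using the base function <code>p.adjust</code>. The p-values are made up, and the two methods shown are common choices, not necessarily the exact procedure GOrilla applies.

```r
# Made-up raw p-values, one per tested GO term.
raw_p <- c(0.0001, 0.004, 0.02, 0.03, 0.2)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
bonferroni <- p.adjust(raw_p, method = "bonferroni")

# Benjamini-Hochberg controls the false discovery rate and is less strict.
fdr_bh <- p.adjust(raw_p, method = "BH")

data.frame(raw_p, bonferroni, fdr_bh)
```

When thousands of GO terms are tested at once, some will look significant purely by chance; adjusted p-values compensate for this, so fewer terms pass a given threshold.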

Finally GOrilla can be run in two main modes of operation:
* List vs. background (Study group vs. population group).
* Rank based test on a single (sorted) input list.

We'll cover the list vs. background method today, and the rank based test in next week's exercise.

'''LINK:'''
* http://cbl-gorilla.cs.technion.ac.il/ - "'''GOrilla''' - '''G'''ene '''O'''ntology en'''RI'''chment ana'''L'''ysis and visua'''L'''iz'''A'''tion tool"

=== How to run the analysis ===
[[Image:GOrilla_webinterface1+boxes.png|thumb|400px|right|Important options to remember when performing set vs. background analysis]]

First we need to prepare our input data - we'll use the '''Cluster #1''' as example again:

'''Input list:''' ("study group")
<pre style="overflow:auto;">
YMR078C
YPR175W
YBR278W
YBL035C
YNL102W
YNL262W
YOR144C
YPR167C
YIR008C
YKL045W
YOR217W
YJR068W
YNL290W
YOL094C
YBR087W
YCL042W
</pre>

'''Background list:''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, which you can download here:
* [https://teaching.healthtech.dtu.dk/27040/exercises/yeast_all_sysnames.txt yeast_all_sysnames.txt]

[[Image:Document-save.png|left|25px]]
'''TASK: Download the data file'''. Place it somewhere on your computer where you can easily find it - we'll be using it '''a lot'''.

'''Running GOrilla:'''
# GOrilla needs to know which '''organism''' the gene IDs come from (it does not have the functionality to autodetect it), in order to load the correct subset of Gene Ontology. Luckily, yeast is among the supported organisms.
# Choose the running mode (two unranked lists)
# Paste in '''input list'''
# Upload '''background list'''
# Select which part of Gene Ontology you want to compare against.
''(See the figure to the right for a summary)''
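
Before uploading, it can be useful to confirm that every ID in the input list also occurs in the background list. The R sketch below shows one way to do this; the helper name <code>missing_from_background</code> is made up for this example, and the commented <code>readLines</code> call assumes the downloaded file holds one systematic name per line.

```r
# Return the study-group IDs that are absent from the background list.
missing_from_background <- function(study_ids, background_ids) {
  setdiff(trimws(study_ids), trimws(background_ids))
}

# Toy stand-in for the real background; with the downloaded file you could
# instead use (assumption: one ID per line):
#   background <- readLines("yeast_all_sysnames.txt")
background <- c("YMR078C", "YPR175W", "YBR278W", "YNL102W")
study      <- c("YMR078C", "YNL102W", "YXX999Z")  # last ID is deliberately fake

missing_from_background(study, background)  # "YXX999Z"
```

An empty result means that all input IDs are covered by the background.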

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #11:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the graphs in your report. Do the results fit with what we have previously learned about the function of cluster #1?

== Analyze selected clusters using GOrilla ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. What we'll need here is '''lists of gene names''' for each cluster. The easiest way to do this is to reuse the Excel sheet with functional annotation you made last week.

Re-analyzing all 10 clusters will likely take too long for this exercise, as it takes some manual effort to run GOrilla. Instead select the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Perform the following over-representation analyses and write a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology Yeast R answers

2024-03-05T15:15:09Z

WikiSysop: Created page with "= Answers to the gene ontology exercise = Answers by Aron Eklund, created 25 November 2014.Updated by Rasmus Wernersson, October 2020. Here is a link to the exercise. leftPlease remember that the web servers and their underlying data could be updated any time, so it is possible that your results may not exactly match the results listed here. == Report question #1 == Q: ''How many ancestor terms are def..."

= Answers to the gene ontology exercise =
Answers by Aron Eklund, created 25 November 2014. Updated by Rasmus Wernersson, October 2020.

Here is a [[ExGeneOntology_Yeast1.5|link to the exercise]].

[[Image:Emblem-important_tiny.png|left]]Please remember that the web servers and their underlying data could be updated at any time, so your results may not exactly match the results listed here.

== Report question #1 ==
Q: ''How many ancestor terms are defined? With how many different types of relationships?''

By opening the page for "Cell division", and clicking on the "Graph Views" tab, one can count 2 ancestor terms (4 if the alternative visualization is used). There is only one type of relationship: is_a.

Q: ''How many children terms are defined? With how many different types of relationships?''

There are 20+ children terms. There are 5 types of relationship: is_a, part_of, negatively_regulates, positively_regulates, regulates.

== Report question #2 ==
Q: ''Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?''

On the "Graph Views" tab, one sees that "nucleus" is an "intracellular membrane-bounded organelle", which in turn is a "membrane-bounded organelle". Thus, we infer that the nucleus is a '''membrane-bounded organelle'''.

Q: ''What types of relationships are found?''

We see is_a and part_of relationships in the ancestors. If we look at the children (using the Neighborhood tab) there are many types of relationships.

== Report question #3 ==
Q: ''Can the activities described be directed towards both DNA and RNA?''

Yes. Both DNA polymerase activity and helicase activity have child terms that are specific for DNA and RNA.

Q: ''At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?''

Looking at the "Graph Views" tab, we can see that Helicase activity doesn't branch into a different ontology. However, '''DNA Polymerase Activity''' branches into the "Biological Process" ontology (with a part_of relationship). This makes sense because catalysis of DNA polymerization is essentially a molecular event (and hence is a molecular function), and yet it is also essential for DNA biosynthesis (a biological process).

[[Image:DNA_pol_act_tree.png|thumb|center|800px|Click to zoom]]

== Report question #4 ==
Q: ''How many (if any) cell cycle sub-phases are defined for: G1-phase, S-phase, G2-phase, and M-phase?''

If there were sub-phases, these would be listed as children with a part_of relationship. By looking at the AmiGO pages for the individual mitotic phases, we can see that mitotic M phase has 4 children that are sub-phases, and that G1, G2, and S do not have sub-phases.

Q: ''Which phases are grouped together into the "interphase" term?''

G1, G2, and S. (We can see these as children of "mitotic interphase".)

== Report question #5 ==
Q: ''In which meiotic cell cycle phase does synapsis happen?''

Looking at the AmiGO page for synapsis, on the "Graph Views" tab, one can see that synapsis is part_of meiosis I.

== Report question #6 ==
Q: ''What is the Molecular Function for POL1?''

DNA-directed DNA polymerase activity

Q: ''How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?''

For the ''manually annotated / curated part'': '''59''' unique genes (68 entries in the table)

== Report question #7 ==
Q: ''Does POL1 have "Transferase Activity"? (Which GO term)''

Yes. DNA-directed DNA polymerase activity is_a (inferred) transferase activity (GO:0016740)

== Report question #8 ==
Q: ''How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?''

* '''622''', [http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0045004#term=annotation according to QuickGO].
* '''596''' if you search via the UniProt [https://www.uniprot.org/uniprot/?query=goa%3A%28%22DNA+replication+proofreading+%5B0045004%5D%22%29&sort=score Advanced Search interface]

Q: ''How many of these are human?''

Two proteins (PolD, PolE) - these are most easily found using a taxonomy filter to keep only human entries (taxid: 9606).

== Report question #9 ==
Q: ''Population group size''

5500

Q: ''Study group size''

16

Q: ''Genome wide frequency of each GO term''

* DNA replication: 96 / 5500 = '''0.0175'''
* DNA repair: 259 / 5500 = '''0.0471'''
* Cell cycle: 313 / 5500 = '''0.0569'''

Q: ''Expected number of genes annotated with each term in a random selection of yeast genes of the same size as cluster #1''

* DNA replication: 0.0175 * 16 = '''0.279'''
* DNA repair: 0.0471 * 16 = '''0.753'''
* Cell cycle: 0.0569 * 16 = '''0.911'''

Q: ''The enrichment of observed GO terms compared to expected''

* DNA replication: 14 / 0.279 = '''50.1'''
* DNA repair: 10 / 0.753 = '''13.3'''
* Cell cycle: 8 / 0.911 = '''8.79'''
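
The same arithmetic can be done for all three terms at once in R. This is just a sketch: the counts are the ones given in the exercise, while the data frame and its column names are chosen for this example.

```r
population_size <- 5500  # assumed number of annotated yeast genes
study_size      <- 16    # proteins in cluster #1

go <- data.frame(
  term     = c("DNA replication", "DNA repair", "Cell cycle"),
  genome   = c(96, 259, 313),  # genes annotated genome-wide
  observed = c(14, 10, 8)      # genes annotated in cluster #1
)

go$frequency  <- go$genome / population_size
go$expected   <- go$frequency * study_size
go$enrichment <- go$observed / go$expected

signif(go$enrichment, 3)  # 50.1, 13.3 and 8.79
```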

Q: ''The p-value for each GO term''

1. First, we create a contingency table. Start by filling out the cells we already know:

{| border="1" style="border-collapse:collapse"
|+ DNA replication
|-----
! !! In study group !! Not in study group !! Total
|-----
! Has annotation
| 14 || ? || 96
|-----
! Does not have annotation
| ? || ? || ?
|-----
! Total
| 16 || ? || 5500
|}

2. Next, fill out the remaining values by arithmetic:

{| border="1" style="border-collapse:collapse"
|+ DNA replication
|-----
! !! In study group !! Not in study group !! Total
|-----
! Has annotation
| 14 || 82 || 96
|-----
! Does not have annotation
| 2 || 5402 || 5404
|-----
! Total
| 16 || 5484 || 5500
|}

3. Finally, use the 4 non-total cells to calculate a P value using Fisher's exact test. Here are two example ways to do this:

3A. In R :

<code>
<pre>
> m <- matrix(c(14, 2, 82, 5402), nrow = 2)
> m
     [,1] [,2]
[1,]   14   82
[2,]    2 5402
>
> fisher.test(m)

	Fisher's Exact Test for Count Data

data:  m
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  102.259 4677.895
sample estimates:
odds ratio 
  457.4204 
</pre>
</code>

3B. Using the web site we suggested ([http://graphpad.com/quickcalcs/contingency1/]), we enter the data like this:

[[File: Graphpad_pvalue_01.png| border]]

… and we get results like this:

[[File: Graphpad_pvalue_02.png| border]]

Either way, the printed output does not show the exact P value for these data (R reports only that it is below 2.2e-16), and we must be satisfied by knowing that it is very small. In other cases, especially when the enrichment is weaker, these methods will report an exact P value.
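
If a numeric value is needed in R, the object returned by <code>fisher.test</code> stores the p-value even though the printed summary truncates it. The sketch below also computes the one-sided (over-representation) p-value and, as a cross-check, the same probability from the hypergeometric distribution with <code>phyper</code>. The variable names are chosen for this example.

```r
tab <- matrix(c(14, 2, 82, 5402), nrow = 2)

# The printed summary truncates very small values, but the result object
# stores the number itself.
p_two_sided <- fisher.test(tab)$p.value

# One-sided test: is the annotation over-represented in the study group?
p_greater <- fisher.test(tab, alternative = "greater")$p.value

# Same one-sided probability from the hypergeometric distribution:
# P(at least 14 annotated genes when drawing 16 from 96 annotated + 5404 others).
p_hyper <- phyper(14 - 1, 96, 5404, 16, lower.tail = FALSE)

format(c(two_sided = p_two_sided, greater = p_greater, hypergeometric = p_hyper),
       digits = 3)
```

The one-sided Fisher p-value and the hypergeometric tail probability should agree, since Fisher's exact test is based on the hypergeometric distribution.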

== R exercise solutions ==

'''TASK/REPORT QUESTION #11''': Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

''If you read the documentation for the msigdbr and fora functions you will find that msigdbr produces a data frame, while fora expects a list. You therefore need to use the split function to make the conversion (this will not be a topic on the exam; it is just a reminder to always read the documentation of the R packages you use).''
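
The conversion can be tried on a small made-up table before running it on the real data. In the sketch below the column names <code>gs_name</code> and <code>ensembl_gene</code> mirror the ones used in the solution code further down, while the set names and their contents are hypothetical.

```r
# Hypothetical long-format table: one row per (gene set, gene) pair.
gene_sets_df <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("YNL102W", "YBL035C", "YOR217W", "YJR068W", "YNL290W")
)

# split() groups the gene IDs by set name, giving a named list with one
# vector of gene IDs per gene set.
gene_sets_list <- split(x = gene_sets_df$ensembl_gene, f = gene_sets_df$gs_name)

str(gene_sets_list)
```

The result is a named list with one element per gene set, which is the shape of input described for fora above.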

'''TASK/REPORT QUESTION #12''': Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

''The answer depends on which clusters you picked, but generally speaking, molecular functions are finer-grained functions that make up biological processes. Cellular component makes immediate sense for processes like DNA replication, which takes place in the nucleus.''

<pre>
library(msigdbr)
library(fgsea)

BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)

MF_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "MF")
MF_list = split(x = MF_df$ensembl_gene, f = MF_df$gs_name)

CC_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "CC")
CC_list = split(x = CC_df$ensembl_gene, f = CC_df$gs_name)

load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")

target <- node_attributes[node_attributes$cluster %in% "cluster1",]$ID
BP_gsea <- fora(BP_list, target, background)
MF_gsea <- fora(MF_list, target, background)
CC_gsea <- fora(CC_list, target, background)

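# A function for calculating enrichment: the fraction of each gene set found in
# the target list, divided by the fraction of the universe taken up by the target list
compute_enrichment <- function(ora, n_genes, n_universe) {
  ora$relative_risk <-
    (ora$overlap / ora$size) /
    (n_genes / n_universe)
  return(ora)
}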

BP_gsea <- compute_enrichment(BP_gsea, length(target), length(background))

# Repeat as needed for the MF and CC ontologies and the other clusters

</pre>

File:Cogs brain.png

2024-03-05T15:13:38Z

WikiSysop:

File:Cluster1 subnetwork.png

2024-03-05T15:12:26Z

WikiSysop:

File:QuickGO negative regulation example.png

2024-03-05T15:11:34Z

WikiSysop:

File:Animal cell structure en.png

2024-03-05T15:11:00Z

WikiSysop:

File:QuickGO CellDivision cropped.png

2024-03-05T15:10:22Z

WikiSysop:

File:QuickGO Mitochondrion.png

2024-03-05T15:09:36Z

WikiSysop:

ExGeneOntology R

2024-03-05T14:48:39Z

= Gene Ontology - yeast cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson]

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from Saccharomyces Genome Database (SGD).
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, in much the same way as genes have identifiers in SGD (e.g. YDR224C) and proteins have identifiers in UniProt (e.g. DPOD1_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords (see the [[Exercise:_The_protein_database_UniProt|27611 UniProt exercise]] for details): a standard set of '''labels''' that can be used to describe gene/protein functionality. This 1) makes it much easier to search gene/protein databases, and 2) makes it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package "fgsea" for over-representation analysis, and the R package "msigdbr" for retrieving gene sets

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in the physical partitioning and separation of a cell into daughter cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis excludes nuclear division; in prokaryotes, there is little difference between cytokinesis and cell division. Note that there is no relationship between this term and 'nuclear division ; GO:0000280' because cell division can take place without nuclear division (as in prokaryotes) and vice versa (as in syncytium formation by mitosis without cytokinesis).
|}

Example of the definition of a GO term ("Cell division" ; GO:0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Select Search -> Ontology from the top menu.
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

It's much the same case as we have previously seen with taxonomy in course 27611 - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In GO terminology these are '''IS A''' relationships. From this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
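
The idea of inherited properties can be sketched in a few lines of R. The toy hierarchy below is a simplified, hypothetical version of the two lineages above (each term has exactly one parent, whereas real GO terms can have several), and the "ancestors" helper is made up for this illustration:

<pre style="overflow:auto;">
# Each name is a term, each value is its direct parent (toy data, one parent per term).
parent = c(Homo = "Hominidae", Hominidae = "Primates", Primates = "Mammalia",
           Mus = "Muroidea", Muroidea = "Eutheria", Eutheria = "Mammalia",
           Mammalia = "Vertebrata")

# Walk up the hierarchy and collect every ancestor of a term.
ancestors = function(term, parent) {
  found = character(0)
  while (term %in% names(parent)) {
    term = parent[[term]]
    found = c(found, term)
  }
  found
}

ancestors("Homo", parent)
"Primates" %in% ancestors("Homo", parent)   # all members of Homo are primates
"Primates" %in% ancestors("Mus", parent)    # members of Mus are mammals, but not primates
</pre>

Anything annotated to a term is implicitly also annotated to all of its ancestors. This is why the gene counts later in this exercise are reported "including subgroups".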

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound or non-membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== Saccharomyces Genome Database ==

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Look up the entry for POL1 (YNL102W) in [http://www.yeastgenome.org SGD]'''
* Notice that all Saccharomyces Genome Database (SGD) entries have an entire section on Gene Ontology annotations; click on the "Gene Ontology" tab for full details. This actually includes a bit of extra information about the '''evidence''' for annotations.
* What is the Molecular Function for POL1?
* Click on the link for this term to see how SGD describes the GO term, and how the evidence is presented.
** How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?

'''IMPORTANT:''' SGD also offers the possibility to jump from their website to the same GO term inside AmiGO. This is very useful for investigating the hierarchy of GO terms "above" - SGD has limited functionality for this.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: Follow the link to AmiGO'''
* Follow the link to AmiGO for the Molecular Function term found above, and answer the following question:
** Does POL1 have "Transferase Activity"? (Which GO term?)

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN (accession: P28340).
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on overrepresentation analysis using GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similarly sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X number of genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number in the study group.
# The '''enrichment''' is then calculated as the ratio observed/expected.
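
As an illustration, the four steps can be sketched in R. All numbers below are hypothetical and chosen only to keep the arithmetic simple; they are not the values for Cluster #1:

<pre style="overflow:auto;">
# Hypothetical numbers (not the exercise data)
population_size = 1000   # Y: genes in the population group
population_hits = 50     # X: genes in the population group with the characteristic
study_size      = 20     # n: genes in the study group
study_hits      = 5      # observed genes with the characteristic in the study group

frequency  = population_hits / population_size   # FX = X/Y   -> 0.05
expected   = frequency * study_size              # exp = FX*n -> 1
enrichment = study_hits / expected               # observed/expected -> 5

c(frequency = frequency, expected = expected, enrichment = enrichment)
</pre>

An enrichment above 1 means the characteristic is more common in the study group than expected by chance, but it says nothing about statistical significance on its own.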

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate and report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
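
If you prefer to stay in R, the same kind of test can be sketched with the built-in "fisher.test" function. The counts below are hypothetical (not the Cluster #1 values), and the one-sided alternative is an assumption made for this sketch; online calculators often report a two-sided p-value, so the numbers may differ slightly:

<pre style="overflow:auto;">
# Hypothetical numbers (not the exercise data)
population_size = 1000   # genes in the population group
population_hits = 50     # genes in the population group annotated with the term
study_size      = 20     # genes in the study group
study_hits      = 5      # genes in the study group annotated with the term

# 2x2 contingency table: study group vs. the rest of the population
counts = matrix(c(study_hits,
                  study_size - study_hits,
                  population_hits - study_hits,
                  (population_size - study_size) - (population_hits - study_hits)),
                nrow = 2,
                dimnames = list(c("annotated", "not annotated"),
                                c("study group", "rest of population")))

# One-sided Fisher's exact test: is the term overrepresented in the study group?
p_fisher = fisher.test(counts, alternative = "greater")$p.value

# The same p-value written as a hypergeometric tail probability
p_hyper = phyper(study_hits - 1, population_hits,
                 population_size - population_hits, study_size,
                 lower.tail = FALSE)

c(fisher = p_fisher, hypergeometric = p_hyper)
</pre>

The two values should agree, since the one-sided Fisher's exact test is the probability of drawing at least the observed number of annotated genes when sampling the study group at random from the population.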



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
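
To get a feeling for what multiple testing correction does, here is a minimal sketch using the base R function "p.adjust" on a handful of hypothetical p-values. The Benjamini-Hochberg ("BH") method is used here because, to our knowledge, it is the correction "fora" applies to its adjusted p-values; treat that as an assumption and check the fgsea documentation:

<pre style="overflow:auto;">
# Hypothetical raw p-values from five independent tests
pvals = c(0.001, 0.01, 0.02, 0.045, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
padj = p.adjust(pvals, method = "BH")

data.frame(pval = pvals, padj = padj)

sum(pvals < 0.05)   # number of tests significant before correction
sum(padj < 0.05)    # number of tests significant after correction
</pre>

Adjusted p-values are never smaller than the raw ones, so a borderline result can lose its significance once the number of tests is taken into account.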

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
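library(msigdbr)
# Species and (sub)category names depend on the msigdbr version; if this call fails or
# returns no rows, check msigdbr_species() and msigdbr_collections() for valid values.
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")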
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
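
One possible way to do this is sketched below on a small mock table. The table only mimics the "overlap" and "size" columns of a "fora" result, and all numbers, as well as the variable names "n_target" and "n_background", are hypothetical. The sketch assumes that "size" counts only genes present in the background list and that every target gene is part of the background:

<pre style="overflow:auto;">
# Mock results table with the two columns needed for the calculation
res = data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(5, 2),      # target genes found in the gene set (observed)
  size    = c(50, 400)    # genes in the gene set
)

n_target     = 20     # number of genes in the target list
n_background = 1000   # number of genes in the background list

# Expected overlap if the target genes were drawn at random from the background
res$expected = res$size / n_background * n_target

# Enrichment = observed / expected
res$enrichment = res$overlap / res$expected

res
</pre>

In this mock example SET_A is enriched (5 observed vs. 1 expected), while SET_B is depleted (2 observed vs. 8 expected).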

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' For each of the three clusters, perform an over-representation analysis for the following ontologies and write a short report documenting your findings:
* Biological Process
* Molecular Function
* Cellular Component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?