<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk:443/22118/index.php?action=history&amp;feed=atom&amp;title=Text_mining_MEDLINE_abstracts</id>
	<title>Text mining MEDLINE abstracts - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk:443/22118/index.php?action=history&amp;feed=atom&amp;title=Text_mining_MEDLINE_abstracts"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk:443/22118/index.php?title=Text_mining_MEDLINE_abstracts&amp;action=history"/>
	<updated>2026-04-14T05:47:11Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk:443/22118/index.php?title=Text_mining_MEDLINE_abstracts&amp;diff=24&amp;oldid=prev</id>
		<title>WikiSysop: /* Input and output */</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk:443/22118/index.php?title=Text_mining_MEDLINE_abstracts&amp;diff=24&amp;oldid=prev"/>
		<updated>2025-09-26T11:23:13Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;Input and output&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 13:23, 26 September 2025&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l14&quot;&gt;Line 14:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 14:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===Input and output===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;===Input and output===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is expected that 2-4 programs will be written in the project. The programs that work with the MEDLINE abstracts should as input take a file of MEDLINE accessions and download (and parse) the abstracts directly from PubMed.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;It is expected that 2-4 programs will be written in the project. The programs that work with the MEDLINE abstracts should as input take a file of MEDLINE accessions and download (and parse) the abstracts directly from PubMed.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The programs could instead work directly with the MEDLINE abstracts in 1 or more files, like this [https://teaching.healthtech.dtu.dk/material/&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;22113&lt;/del&gt;/pubmed_result.txt.gz gzipped file from PubMed], which can be used in the project - or create your own data set from PubMed.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;The programs could instead work directly with the MEDLINE abstracts in 1 or more files, like this [https://teaching.healthtech.dtu.dk/material/&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;22118&lt;/ins&gt;/pubmed_result.txt.gz gzipped file from PubMed], which can be used in the project - or create your own data set from PubMed.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There is also output and input of a blacklist and occurrence tables. These are your own formats.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;There is also output and input of a blacklist and occurrence tables. These are your own formats.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk:443/22118/index.php?title=Text_mining_MEDLINE_abstracts&amp;diff=19&amp;oldid=prev</id>
		<title>WikiSysop: Created page with &quot;__NOTOC__ === Description === The purpose is to mine MEDLINE abstracts for words which are associated with each other. This is done by finding &#039;&#039;informative&#039;&#039; words that co-occur with each other, i.e. the words would be in the same abstract. The process consists of a number of steps, where the first step is to find the non-informative words in the abstracts. Second step will be parsing the abstracts again disregarding the non-informative words, and create some occurrence...&quot;</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk:443/22118/index.php?title=Text_mining_MEDLINE_abstracts&amp;diff=19&amp;oldid=prev"/>
		<updated>2025-09-26T11:17:10Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;__NOTOC__ === Description === The purpose is to mine MEDLINE abstracts for words which are associated with each other. This is done by finding &amp;#039;&amp;#039;informative&amp;#039;&amp;#039; words that co-occur with each other, i.e. the words would be in the same abstract. The process consists of a number of steps, where the first step is to find the non-informative words in the abstracts. Second step will be parsing the abstracts again disregarding the non-informative words, and create some occurrence...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;__NOTOC__&lt;br /&gt;
=== Description ===&lt;br /&gt;
The purpose is to mine MEDLINE abstracts for words that are associated with each other. This is done by finding &amp;#039;&amp;#039;informative&amp;#039;&amp;#039; words that co-occur, i.e. words that appear in the same abstract. The process consists of three steps. The first step is to find the non-informative words in the abstracts. The second step is to parse the abstracts again, disregarding the non-informative words, and to create occurrence and co-occurrence tables from what is left: the informative words. The third step is to use the tables to find word associations.&lt;br /&gt;
&lt;br /&gt;
Word stemming can be added to the process, so that e.g. &amp;quot;protein&amp;quot; and &amp;quot;proteins&amp;quot; are treated as the same word. Letter case can also be normalized, so that e.g. &amp;quot;protein&amp;quot; and &amp;quot;Protein&amp;quot; are treated as the same word.&lt;br /&gt;
&lt;br /&gt;
===Method===&lt;br /&gt;
An informative word is a word that does not occur too frequently in the MEDLINE abstracts. Words whose average occurrence per abstract (or possibly average occurrence per word) is too high are therefore identified in a first pass over the abstracts. You need to set and justify these limits. The first pass results in a word blacklist. A random sample of 10% of the abstracts can be good enough to create the blacklist.&lt;br /&gt;
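A minimal sketch of this first pass, assuming the abstracts are already available as plain strings. The function name build_blacklist, the letters-only tokenizer and the default limit of 0.5 occurrences per abstract are illustrative assumptions, not part of the project description; the limit still has to be chosen and justified.&lt;br /&gt;

```python
import random
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters (an illustrative choice only).
WORD_RE = re.compile(r"[a-z]+")

def build_blacklist(abstracts, max_avg_per_abstract=0.5, sample_fraction=1.0, seed=0):
    """Return the words whose average occurrence per abstract exceeds the limit."""
    rng = random.Random(seed)
    # Optionally work on a random sample of the abstracts, e.g. sample_fraction=0.1.
    sample_size = min(len(abstracts), max(1, round(len(abstracts) * sample_fraction)))
    sample = rng.sample(abstracts, sample_size)
    counts = Counter()
    for text in sample:
        # Every occurrence counts here, including repeats within one abstract.
        counts.update(WORD_RE.findall(text.lower()))
    # Comparing totals avoids dividing every count by the sample size.
    limit = max_avg_per_abstract * len(sample)
    return {word for word, n in counts.items() if n > limit}
```

Calling build_blacklist(abstracts, sample_fraction=0.1) would build the blacklist from a 10% random sample, as suggested above.&lt;br /&gt;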
&lt;br /&gt;
Parsing the abstracts again, disregarding blacklisted words and non-words (there is a lot of noise), gives the informative words. If a word occurs more than once in an abstract, it is collapsed to one occurrence, since it is the word association that matters, not the number of times the same word occurs in an abstract. Compute the frequencies of single informative words and of co-occurring informative word pairs, along with any other data you need for answering questions.&lt;br /&gt;
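The second pass could be sketched as follows. The function name count_occurrences and the letters-only definition of a word are assumptions made for illustration; a real solution needs its own rule for what counts as a non-word.&lt;br /&gt;

```python
import re
from collections import Counter
from itertools import combinations

# Assumed definition of a word: a run of letters, which drops digits and symbols.
WORD_RE = re.compile(r"[a-z]+")

def count_occurrences(abstracts, blacklist=frozenset()):
    """Count, per word and per word pair, the number of abstracts they appear in."""
    single = Counter()
    pairs = Counter()
    for text in abstracts:
        # A set collapses repeated words to one occurrence per abstract.
        words = set(WORD_RE.findall(text.lower())) - set(blacklist)
        single.update(words)
        # Sorting gives each unordered pair one canonical key, e.g. ("kinase", "protein").
        pairs.update(combinations(sorted(words), 2))
    return single, pairs
```

Note that the number of pairs grows quadratically with the number of informative words in an abstract, which is one reason the blacklist matters.&lt;br /&gt;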
&lt;br /&gt;
Given the frequencies of 2 words and the number of words in an abstract, and assuming independence, you can calculate the expected number of times these 2 words will co-occur in an abstract. The log of the ratio of observed to expected co-occurrence gives a log-likelihood (LLH) score for the word pair. An LLH &amp;gt; 0 means the word pair is over-represented and therefore associated.&lt;br /&gt;
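One possible reading of this score, as a sketch: since occurrences are collapsed per abstract, the example below works with abstract-level counts and takes the expected co-occurrence of words a and b to be n_a * n_b / N, where N is the number of abstracts. That simplification, the function name llh_score and its parameter names are assumptions, not the prescribed method.&lt;br /&gt;

```python
import math

def llh_score(n_ab, n_a, n_b, n_abstracts):
    """Log of observed over expected co-occurrence, assuming independent words.

    n_ab: abstracts containing both words; n_a, n_b: abstracts containing each word.
    """
    if n_ab == 0:
        return float("-inf")  # the pair was never observed together
    # Under independence, P(a and b) = P(a) * P(b), so the expected count is n_a * n_b / N.
    expected = n_a * n_b / n_abstracts
    return math.log(n_ab / expected)
```

With 1000 abstracts, two words found in 20 and 50 abstracts are expected to co-occur in 1 abstract; if they are observed together in 10, the score is log(10), which is positive and indicates association.&lt;br /&gt;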
&lt;br /&gt;
===Input and output===&lt;br /&gt;
It is expected that 2-4 programs will be written in the project. The programs that work with the MEDLINE abstracts should take a file of MEDLINE accessions as input and download (and parse) the abstracts directly from PubMed.&lt;br /&gt;
Alternatively, the programs could work directly with MEDLINE abstracts stored in 1 or more files, like this [https://teaching.healthtech.dtu.dk/material/22113/pubmed_result.txt.gz gzipped file from PubMed], which can be used in the project. You can also create your own data set from PubMed.&lt;br /&gt;
&lt;br /&gt;
There is also output and input of a blacklist and occurrence tables. These are your own formats.&lt;br /&gt;
&lt;br /&gt;
One of the programs must be able to use the occurrence tables to answer questions like: Which words are strongly associated with a given word? How strongly associated are these two words?&lt;br /&gt;
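A sketch of how the first question might be answered. It assumes the tables are a mapping from word to abstract count and a mapping from sorted word pairs to co-occurrence count, with only pairs seen at least once stored; these formats and the name top_associations are hypothetical, since the table formats are your own.&lt;br /&gt;

```python
import math

def top_associations(word, single, pairs, n_abstracts, top=5):
    """Rank the words co-occurring with `word` by log(observed / expected)."""
    scored = []
    for (a, b), n_ab in pairs.items():
        if word not in (a, b):
            continue
        other = b if a == word else a
        # Expected co-occurrence under independence, using abstract-level counts.
        expected = single[a] * single[b] / n_abstracts
        scored.append((math.log(n_ab / expected), other))
    # Highest score first.
    return sorted(scored, reverse=True)[:top]
```

The second question, how strongly two given words are associated, is the same calculation applied to a single pair looked up in the table.&lt;br /&gt;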
&lt;br /&gt;
As an example, I could easily imagine a pipeline of 4 programs, with options to control the names of input and output files, like this:&lt;br /&gt;
# From a list of MEDLINE accessions, download the abstracts and create a file of abstracts.&lt;br /&gt;
# From a file of abstracts, create a file of blacklisted words.&lt;br /&gt;
# From a blacklist file and an abstract file, create occurrence tables/data.&lt;br /&gt;
# From the occurrence tables/data, answer questions about informative words.&lt;/div&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
</feed>