Text mining MEDLINE abstracts


Description

The purpose is to mine MEDLINE abstracts for words that are associated with each other. This is done by finding informative words that co-occur, i.e. words that appear in the same abstract. The process consists of three steps. The first step is to find the non-informative words in the abstracts. The second step is to parse the abstracts again, disregarding the non-informative words, and to create occurrence and co-occurrence tables from what is left - the informative words. The third step is to use the tables to find word associations.

Word stemming can be added to the process, so that e.g. "protein" and "proteins" are treated as the same word. Letter case can be handled in the same way, so that e.g. "protein" and "Protein" are treated as the same word.
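A minimal sketch of such normalization is given here. The function names `normalize` and `tokenize` are made up for illustration, and the plural rule is a deliberately crude stand-in for a real stemmer (it will mangle words like "virus"), so treat it as a starting point only.

```python
import re


def normalize(word: str) -> str:
    """Lower-case a token and strip a trailing plural 's' (crude stemming)."""
    word = word.lower()
    # Naive plural handling: "proteins" -> "protein". Short words and words
    # ending in "ss" (e.g. "class") are left alone. This is an assumption,
    # not a proper stemming algorithm.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    """Split text into alphabetic tokens and normalize each one."""
    return [normalize(token) for token in re.findall(r"[A-Za-z]+", text)]
```

With this sketch, `tokenize("Proteins bind the protein.")` yields the same token for both "Proteins" and "protein".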

Method

An informative word is a word that does not occur too frequently in the MEDLINE abstracts. Words with too high an average occurrence per abstract (or possibly per word) are therefore identified in a first pass over the abstracts. You need to set and justify these limits. The first pass results in a word blacklist. A random sample of 10% of the abstracts can be good enough to create the blacklist.
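The first pass could be sketched roughly as follows. The function name `build_blacklist`, the threshold `max_avg_per_abstract=0.5` and the simple alphabetic tokenizer are all assumptions made for the example; choosing and justifying the actual limit is part of the project.

```python
import random
import re
from collections import Counter


def build_blacklist(abstracts, max_avg_per_abstract=0.5, sample_fraction=0.1, seed=0):
    """Return the set of words whose average count per sampled abstract
    exceeds max_avg_per_abstract. `abstracts` is a list of strings."""
    if not abstracts:
        return set()
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_size = max(1, round(len(abstracts) * sample_fraction))
    sample = rng.sample(abstracts, sample_size)

    counts = Counter()
    for text in sample:
        # Count every occurrence here (no collapsing), since the measure
        # is the average occurrence per abstract.
        counts.update(re.findall(r"[a-z]+", text.lower()))

    return {word for word, n in counts.items() if n / len(sample) > max_avg_per_abstract}
```

Setting `sample_fraction=1.0` uses every abstract, which is convenient when checking the threshold on a small test set.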

Parsing the abstracts again, disregarding blacklisted words and non-words (there is a lot of noise), gives the informative words. If a word occurs more than once in an abstract, it is collapsed to one occurrence, since it is the word association that matters, not the number of times the same word occurs in an abstract. Compute the single-word and co-occurring word frequencies of the informative words, and any other data you need for answering questions.
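A sketch of this second pass, under some assumptions: `count_occurrences` is a hypothetical name, "non-words" are approximated by keeping only purely alphabetic tokens of at least two letters, and word pairs are stored as alphabetically sorted tuples so that each pair has a single key.

```python
import re
from collections import Counter
from itertools import combinations


def count_occurrences(abstracts, blacklist):
    """Return (number of abstracts, per-word abstract counts, per-pair abstract counts).

    `abstracts` is an iterable of strings, `blacklist` a set of words."""
    word_counts = Counter()
    pair_counts = Counter()
    n_abstracts = 0
    for text in abstracts:
        n_abstracts += 1
        # Using a set collapses repeated words to one occurrence per abstract.
        words = set(re.findall(r"[a-z]{2,}", text.lower())) - blacklist
        word_counts.update(words)
        # Every unordered pair of distinct words in the abstract co-occurs once.
        pair_counts.update(combinations(sorted(words), 2))
    return n_abstracts, word_counts, pair_counts
```

Note that the number of pairs grows quadratically with the number of informative words per abstract, so on the full data set the pair table can become large.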

Given the frequency of 2 words and the number of words in an abstract, assuming independence, you can calculate an expectation for the number of times these 2 words will co-occur in an abstract. If you take the log of the ratio of observed co-occurrence to expected co-occurrence, you have a log-likelihood (LLH) score for the word pair. An LLH > 0 means the word pair is over-represented, i.e. it co-occurs more often than expected by chance, and is therefore associated.
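One way to compute such a score is sketched here. It rests on an assumption about the counts: since repeated words are collapsed per abstract, the sketch works with the number of abstracts containing each word, so the expected number of co-occurrences under independence is n_a * n_b / N for N abstracts. The function name `llh_score` and the use of the natural log are choices made for the example.

```python
import math


def llh_score(n_ab, n_a, n_b, n_abstracts):
    """Log of observed over expected co-occurrence for a word pair.

    n_ab: abstracts containing both words
    n_a, n_b: abstracts containing each word
    n_abstracts: total number of abstracts
    """
    if n_ab == 0:
        # The pair was never observed together; the log ratio is undefined,
        # so report minus infinity (an assumption of this sketch).
        return float("-inf")
    expected = n_a * n_b / n_abstracts  # expectation under independence
    return math.log(n_ab / expected)
```

For example, two words that each occur in 2 of 3 abstracts and co-occur in both give an observed/expected ratio of 2 / (4/3) = 1.5, hence a positive score.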

Input and output

It is expected that 2-4 programs will be written in the project. The programs that work with the MEDLINE abstracts should take a file of MEDLINE accessions as input and download (and parse) the abstracts directly from PubMed. Alternatively, the programs can work directly with MEDLINE abstracts stored in 1 or more files, like this gzipped file from PubMed, which can be used in the project - or you can create your own data set from PubMed.
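As a rough sketch of the download step, the example here uses the NCBI E-utilities `efetch` endpoint with only the Python standard library. The accession file format (one PubMed ID per line) and the function names are assumptions; for real use you should also batch the IDs and follow NCBI's usage guidelines on request rates.

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(path):
    """Read PubMed IDs from a file, assuming one ID per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def efetch_url(pmids):
    """Build an efetch URL that returns plain-text abstracts for the given IDs."""
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def fetch_abstracts(pmids):
    """Download the abstracts as one text blob (requires network access)."""
    with urllib.request.urlopen(efetch_url(pmids)) as response:
        return response.read().decode("utf-8")
```

The returned text still has to be split into individual abstracts and parsed, which depends on the format you choose to request.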

The blacklist and the occurrence tables are also written to and read from files. You define these formats yourself.

One of the programs must be able to answer questions like these from the occurrence tables: Which words are strongly associated with a given word? How strongly associated are these two words?
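Both questions could be answered along these lines. The sketch assumes the tables are held in memory as a dict of per-word abstract counts and a dict of per-pair abstract counts keyed by alphabetically sorted tuples; those structures and the names `pair_score` and `associated_words` are illustrative choices, not a prescribed format.

```python
import math


def pair_score(a, b, n_abstracts, word_counts, pair_counts):
    """How strongly are words a and b associated (log observed/expected)?"""
    n_ab = pair_counts.get(tuple(sorted((a, b))), 0)
    if n_ab == 0:
        return float("-inf")  # never seen together
    expected = word_counts[a] * word_counts[b] / n_abstracts
    return math.log(n_ab / expected)


def associated_words(word, n_abstracts, word_counts, pair_counts, top=10):
    """Which words are most strongly associated with `word`? Best first."""
    scores = {}
    for a, b in pair_counts:
        if word == a:
            other = b
        elif word == b:
            other = a
        else:
            continue
        scores[other] = pair_score(word, other, n_abstracts, word_counts, pair_counts)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]
```

Scanning every pair is simple but slow for large tables; an index from each word to its partners would make lookups faster.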

As an example, I could easily imagine a pipeline of 4 programs, with options to control the names of input and output files, like this:

  1. From a list of MEDLINE accessions, download and create a file of abstracts.
  2. From a file of abstracts, create a file of blacklisted words.
  3. From a blacklist file and abstract file create occurrence tables/data.
  4. From occurrence tables/data, answer questions about informative words.
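The options for such a pipeline could be sketched with `argparse`. For brevity the example puts the four steps behind subcommands of a single hypothetical program called `medmine` rather than in four separate programs; the program name, subcommand names and option names are all invented for illustration.

```python
import argparse


def build_parser():
    """Command-line interface with one subcommand per pipeline step."""
    parser = argparse.ArgumentParser(prog="medmine", description="Mine MEDLINE abstracts")
    steps = parser.add_subparsers(dest="command", required=True)

    fetch = steps.add_parser("fetch", help="download abstracts for a list of accessions")
    fetch.add_argument("--accessions", required=True, help="input file of MEDLINE accessions")
    fetch.add_argument("--abstracts", required=True, help="output file of abstracts")

    blacklist = steps.add_parser("blacklist", help="create a blacklist from abstracts")
    blacklist.add_argument("--abstracts", required=True, help="input file of abstracts")
    blacklist.add_argument("--blacklist", required=True, help="output blacklist file")

    tables = steps.add_parser("tables", help="create occurrence tables")
    tables.add_argument("--abstracts", required=True, help="input file of abstracts")
    tables.add_argument("--blacklist", required=True, help="input blacklist file")
    tables.add_argument("--tables", required=True, help="output occurrence tables")

    query = steps.add_parser("query", help="answer questions about informative words")
    query.add_argument("--tables", required=True, help="input occurrence tables")
    query.add_argument("words", nargs="+", help="one word (list associates) or two (score the pair)")

    return parser
```

A call such as `build_parser().parse_args(["query", "--tables", "t.dat", "insulin"])` then gives a namespace with `command`, `tables` and `words` attributes that the program can dispatch on.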