Data mining in NCBI databases
Description
Mine NCBI databases for networks of genes that are connected by being mentioned in the same PubMed article. This project is a good example of how research is done in real life, and it gives you a high degree of freedom in how to proceed. Part of the problem is to understand and subsequently parse the NCBI databases, which are flat files. The information found could be used for pathway analysis and construction, disease gene finding and many other purposes where the underlying problem is to find connections between (novel) genes.
The project consists of at least two main steps: preprocessing and filtering/selecting for networks. These are best implemented as two separate programs.
Input and output
The databases can be found at https://ftp.ncbi.nih.gov/gene/DATA/. The files of interest are gene2pubmed.gz, gene_info.gz and README. README simply describes the files in the directory. They can be given to the program in any way you think is sensible.
The output should be the important networks of genes, displayed in such a way that it is clear why the network is important, which genes are part of the network and how strongly they are connected. It should be possible to generate a network representation with a third-party tool from (parts of) the output. The third-party tool could be Cytoscape.
You need to decide on an intermediate file format for the result of the preprocessing step. This file is the input to the filtering/selection program, which in turn finds appropriate networks and produces Cytoscape files for visualization. When visualizing with Cytoscape, it is important that the gene names are displayed. Nobody knows which gene a bare gene number represents, so networks labeled only with numbers are much less interesting.
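As an illustration of output with gene names, here is a minimal sketch that writes a tab-separated edge table, which Cytoscape can import as a network from a delimited text file. The function name, the column names (source, target, weight) and the shape of the two dictionaries are assumptions made for this example, not requirements of the project. The article counts in the toy data are made up.

```python
import csv
import io


def write_edge_table(edges, symbols, handle):
    """Write a tab-separated edge table with gene symbols instead of bare GeneIDs.

    edges:   dict mapping (gene_id_a, gene_id_b) -> number of shared articles
    symbols: dict mapping gene_id -> gene symbol (assumed to be built from gene_info)
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    # Strongest connections first, so the most important edges are at the top.
    for (a, b), weight in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
        # Fall back to the numeric ID if no symbol is known for a gene.
        writer.writerow([symbols.get(a, str(a)), symbols.get(b, str(b)), weight])


# Toy data: the weights are invented for illustration only.
edges = {(672, 7157): 3, (672, 675): 5}
symbols = {672: "BRCA1", 675: "BRCA2", 7157: "TP53"}
out = io.StringIO()
write_edge_table(edges, symbols, out)
table_text = out.getvalue()
```

The symbols dictionary would typically come from the GeneID and Symbol columns of gene_info; check the README for the exact column layout before relying on it.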
Details
The information that the programs are supposed to create/mine can be considered a graph, where the nodes are genes and the edges between nodes are links between the genes. Two genes are linked if they are mentioned in/connected to the same article. The weight of an edge is the number of articles that link both genes. The greater the weight of the edge, the more important the relationship between the genes. Each line in gene2pubmed is basically a connection between one gene and one PubMed article. From that information you can generate the graph.
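The graph construction described above can be sketched as follows. This is one possible approach, not the required one: group genes by article, then count every unordered gene pair once per article. The function name build_edges and the toy pairs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(gene_pubmed_pairs):
    """Turn (gene_id, pubmed_id) pairs into weighted co-mention edges."""
    # Step 1: collect the set of genes linked to each article.
    genes_by_article = defaultdict(set)
    for gene_id, pubmed_id in gene_pubmed_pairs:
        genes_by_article[pubmed_id].add(gene_id)

    # Step 2: every unordered pair of genes in one article shares that article.
    edges = defaultdict(int)
    for genes in genes_by_article.values():
        for a, b in combinations(sorted(genes), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Toy input: genes 1, 2, 3 share article 100; genes 1 and 2 also share article 200.
pairs = [(1, 100), (2, 100), (3, 100), (1, 200), (2, 200), (4, 300)]
edges = build_edges(pairs)
```

Note that an article linked to k genes contributes k(k-1)/2 pairs, so articles linked to very many genes can dominate both running time and memory.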
In the preprocessing step, the organism of interest is chosen by its taxonomy ID (tax_id), given as a parameter on the command line. The data is then mined for the selected organism and an appropriate data structure is written to a file. There are two reasons for doing this: 1) the organism of interest can be targeted, and 2) the mining step takes a while, so it is good to separate it from the filtering/selection step; the mining then does not have to be repeated each time a different network is investigated. Performance is king. Also, some organisms, like Homo sapiens (tax_id 9606), have quite a lot of data, which can be a challenge to process.
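A minimal sketch of the organism selection, assuming gene2pubmed is tab-separated with the columns tax_id, GeneID and PubMed_ID in that order and a header line starting with "#" (verify this against the README). The file is streamed line by line so that it never has to fit in memory. The function names and the tiny stand-in file are made up for the example.

```python
import argparse
import gzip
import os
import tempfile


def parse_args(argv):
    """Read the tax_id and the input file name from the command line."""
    parser = argparse.ArgumentParser(description="Preprocess gene2pubmed")
    parser.add_argument("tax_id", type=int, help="taxonomy ID, e.g. 9606")
    parser.add_argument("gene2pubmed", help="path to gene2pubmed.gz")
    return parser.parse_args(argv)


def read_gene_pubmed(path, tax_id):
    """Yield (gene_id, pubmed_id) pairs for one organism from a gzipped file."""
    wanted = str(tax_id)
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip the header line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3 or fields[0] != wanted:
                continue  # malformed line or another organism
            yield int(fields[1]), int(fields[2])


# Tiny stand-in for gene2pubmed.gz so the sketch can be run without downloading.
sample = "#tax_id\tGeneID\tPubMed_ID\n9606\t1\t100\n9606\t2\t100\n10090\t3\t100\n"
sample_path = os.path.join(tempfile.mkdtemp(), "gene2pubmed_sample.gz")
with gzip.open(sample_path, "wt") as handle:
    handle.write(sample)

args = parse_args(["9606", sample_path])
human_pairs = list(read_gene_pubmed(args.gene2pubmed, args.tax_id))
```

In a real program the argument list would be sys.argv[1:] rather than a hard-coded list.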
Tip: It can be worthwhile to experiment with different file formats/data structures/ideas for passing data between the two steps. Some choices can really make a difference to both the speed of each step and the memory needed.
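One simple candidate for the intermediate format is a plain edge list with one "gene, gene, weight" line per edge. The sketch below is only a starting point for such experiments; the function names and the three-column layout are assumptions, and other formats may well turn out faster or smaller.

```python
import io


def save_edges(edges, handle):
    """Write one tab-separated 'gene_a, gene_b, weight' line per edge."""
    for (a, b), weight in sorted(edges.items()):
        handle.write(f"{a}\t{b}\t{weight}\n")


def load_edges(handle):
    """Read the edge list back into a dict keyed by (gene_a, gene_b)."""
    edges = {}
    for line in handle:
        a, b, weight = line.split("\t")
        edges[(int(a), int(b))] = int(weight)
    return edges


# Round trip through an in-memory buffer; a real program would use a file.
original = {(1, 2): 2, (1, 3): 1}
buffer = io.StringIO()
save_edges(original, buffer)
buffer.seek(0)
restored = load_edges(buffer)
```

A text format like this is easy to inspect by eye and to process with standard Unix tools, which helps while developing on a subset of the data.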
The filtering/selection step takes the output from the preprocessing step and creates the Cytoscape files. It needs to generate appropriate networks from the mined data. Cytoscape (and your eyes/mind) cannot handle networks that are too large. Hence, some filtering and/or selection of nodes/genes is needed.
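One straightforward way to keep a network small enough to view is to drop weak edges and cap the number of edges. The sketch below shows that idea only; the function name and the parameters max_edges and min_weight are hypothetical, and sensible values depend on the organism.

```python
import heapq


def strongest_edges(edges, max_edges, min_weight=1):
    """Keep at most max_edges edges, preferring those with the highest weight."""
    candidates = (item for item in edges.items() if item[1] >= min_weight)
    # nlargest avoids sorting the full edge list when only a few edges are kept.
    top = heapq.nlargest(max_edges, candidates, key=lambda item: item[1])
    return dict(top)


# Toy edge weights, invented for illustration.
edges = {(1, 2): 5, (2, 3): 1, (3, 4): 3, (4, 5): 2}
small = strongest_edges(edges, max_edges=2)
filtered = strongest_edges(edges, max_edges=10, min_weight=3)
```

Both calls keep the same two edges here: the first through the cap on the number of edges, the second through the weight threshold.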
What kinds of interesting and informative networks can be created by filtering or selection?
- A network with many connected nodes = many genes.
- Networks where the sum of the edge weights is high = many co-mentioning articles.
- High ratio of edge-weight sum to number of nodes = high importance of the network, many articles.
- Some nodes in the graph do not have any connecting edges = virgin territory or maybe uninteresting.
- Networks that consist of only a few nodes where the connecting edges have low weights = not much research has been done.
- Networks that connect to a specific gene = an overview of an interaction network, maybe a biological process.
- Networks that have a specific gene as the center and connections to the n'th degree (star shaped).
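Two of the selections listed above can be sketched with a breadth-first search: the network around a specific gene up to the n'th degree, and connected networks ranked by edge-weight sum per node. This is one possible implementation under the assumption that the edges are held in a dict keyed by gene pairs; all function names and the toy weights are hypothetical.

```python
from collections import defaultdict, deque


def adjacency(edges):
    """Build a symmetric neighbour lookup from a dict of weighted edges."""
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    return neighbours


def neighbourhood(neighbours, center, degree):
    """Return the genes reachable from center in at most `degree` steps."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        gene = queue.popleft()
        if distance[gene] == degree:
            continue  # do not expand beyond the requested degree
        for other in neighbours.get(gene, {}):
            if other not in distance:
                distance[other] = distance[gene] + 1
                queue.append(other)
    return set(distance)


def ranked_components(neighbours):
    """Return (edge_sum / node_count, genes) per connected network, best first."""
    unvisited = set(neighbours)
    found = []
    while unvisited:
        start = next(iter(unvisited))
        genes = neighbourhood(neighbours, start, float("inf"))
        unvisited -= genes
        # Each edge is stored in both directions, hence the division by two.
        edge_sum = sum(sum(neighbours[g].values()) for g in genes) // 2
        found.append((edge_sum / len(genes), sorted(genes)))
    return sorted(found, reverse=True)


# Toy graph: a chain 1-2-3-4 and a separate, strongly linked pair 5-6.
graph = adjacency({(1, 2): 4, (2, 3): 2, (3, 4): 1, (5, 6): 6})
first_degree = neighbourhood(graph, 2, 1)
ranking = ranked_components(graph)
```

Genes without any edge never appear in an edge list, so finding those isolated nodes requires keeping the full set of genes for the organism separately.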
The project illustrates a known issue in biology, and indeed in most fields: the explosive growth of data. Some of the difficulty in the project lies in handling all the data. You can run into out-of-memory issues, or simply have a very long running time. Consider working with a subset of the data while developing, and running the "final" version of your program(s) on the DTU Unix servers.
Drowning your results in data is also a possibility, so it is important to find some way of sorting/selecting your results. Some ways have been mentioned above, but you are free to think of other ways of analyzing your data critically.
Information on the DTU servers:
http://gbar.dtu.dk/faq/53-ssh
https://www.hpc.dtu.dk/?page_id=2501