Artificial neural networks

In today's exercise you shall implement the feedforward algorithm for artificial neural network prediction, and the backpropagation algorithm for artificial neural network training.

First you shall download the program templates.

Download the file ANN.tar.gz

Open the file (using tar -xzvf ANN.tar.gz) and place the created ANN directory in the "Algo/code" directory. The ANN directory should now contain two Jupyter notebook program files

ANN_forward.ipynb
ANN_train.ipynb

The two files are templates for the "Feedforward" and "Backpropagation" methods, used for making predictions with and for training artificial neural networks.


Next you must download some data files.

First remove the "ANN" directory in the course data directory (this directory was created on the first day of the course, and contains some outdated files - A2403_training, A2403_evaluation, trainpred.out, and testpred.out). Next download the file

ANN_data.tar.gz file

Place the file in the course data directory, and open the file (using tar -xzvf ANN_data.tar.gz). The created ANN directory should now contain the following files

A2403_training
A2403_evaluation
A0201_training
A0201_evaluation
A2403_sp.syn
A2403_bl.syn
A0201_sp.syn
A0201_bl.syn

The first four files are peptide files for two different HLA alleles, used for training and evaluation. The last four are network parameter files (synaps files), generated by training ANNs on these data using either Blosum (bl) or Sparse (sp) sequence encoding.
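
To fix ideas before opening the notebooks, here is a minimal sketch of Sparse encoding; Blosum encoding simply replaces the one-hot row with the corresponding row of the Blosum substitution matrix. The alphabet ordering and the encoding values below are illustrative assumptions and may differ from the course templates.

import numpy as np

# Sparse (one-hot) encoding: each residue becomes a 20-dimensional vector
# with a single non-zero entry, so a 9-mer peptide gives a 180-dimensional
# input vector. The alphabet order and the value 1.0 are assumptions; the
# templates may use a different convention.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def encode_sparse(peptide):
    enc = np.zeros((len(peptide), len(ALPHABET)))
    for pos, aa in enumerate(peptide):
        enc[pos, ALPHABET.index(aa)] = 1.0
    return enc.flatten()

print(encode_sparse("ALAKAAAAM").shape)  # (180,)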

Now we are ready to code.


Implementing the algorithms

ANN_forward.ipynb

This program reads in a synaps file (network parameters) and a file with peptides, encodes the peptides using either Sparse or Blosum encoding, and then makes predictions for each peptide using the feedforward algorithm.
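
In outline, the feedforward pass for a network with a single hidden layer can be sketched as below. The weight-matrix layout (bias weights stored as an extra row) is an assumption and may not match the synaps-file convention used in the templates.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_h, w_o):
    # w_h has shape (n_input + 1, n_hidden) and w_o has shape
    # (n_hidden + 1, 1); the extra row in each holds the bias weights
    x = np.append(x, 1.0)        # append the bias input
    h = sigmoid(x @ w_h)         # hidden-layer activations
    h = np.append(h, 1.0)        # append the bias for the output layer
    return sigmoid(h @ w_o)[0]   # the network output (a single prediction)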

Before completing the code, have a look at one of the synaps files in the data/ANN directory.

Can you understand the number of synaps weights in the files? For example,

cat ../data/ANN/A2403_sp.syn | grep -v TEST | grep -v ":" | wc

The second column of the wc output (the word count) gives the number of weights in the synaps file. Can you make sense of this number (911)?

Now go back to the "Algo/code/ANN/" directory, and open the ANN_forward.ipynb program. Spend some time making sure you understand the structure of the program. Fill in the missing code (the XX's). Test the program with the synaps files in the data/ANN directory, using the matching evaluation files and sequence encoding.

What is the predictive performance of the neural network (in terms of the Pearson's correlation)? And how does this value compare to the value listed in the first line of the synaps file (the T_PCC column)?

head ../data/ANN/A2403_bl.syn
TESTRUNID EPOCH: 97 L_PCC: 0.925606324907 L_ERR: 0.00456609177732 T_PCC: 0.815568933575 T_ERR: 0.0129260695735

When you have a functional version of the code, download the code as a Python program, and add a command line parser with the options below (a minimal parser skeleton is sketched after the list)

optional arguments:
  -h, --help          show this help message and exit
  -e EVALUATION_FILE  File with evaluation data
  -syn SYNFILE_NAME   Name of synaps file
  -bl                 Use Blosum encoding
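
A minimal argparse skeleton matching these options could look like the following; the destination variable names are illustrative and may differ from your own code.

from argparse import ArgumentParser

parser = ArgumentParser(description="Feedforward prediction with an ANN")
parser.add_argument("-e", dest="evaluation_file", help="File with evaluation data")
parser.add_argument("-syn", dest="synfile_name", help="Name of synaps file")
parser.add_argument("-bl", dest="blosum", action="store_true", help="Use Blosum encoding")
args = parser.parse_args()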

Test the code with one of the examples in the data/ANN directory

python ANN_forward.py -syn ../data/ANN/A0201_bl.syn -e ../data/ANN/A0201_evaluation -bl | grep -v "#" | gawk '{print $2,$3}' | xycorr

Remember you might have to give the full path to the xycorr script, and a different path to the synaps and evaluation data files.
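
If you cannot locate the xycorr script, a few lines of Python compute the same Pearson correlation from two whitespace-separated columns on stdin; the file name pcc.py is just a suggestion.

import sys
import numpy as np

# Read two columns (target, prediction) from stdin and print their
# Pearson correlation coefficient - a stand-in for the xycorr script.
data = np.loadtxt(sys.stdin)
print(np.corrcoef(data[:, 0], data[:, 1])[0, 1])

You would then end the pipeline above with "| python pcc.py" instead of "| xycorr".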

Again, test that you get the same performance value as listed in the first line of the synaps file.

You can now test whether combining Blosum and Sparse encoding improves the predictive performance

python ANN_forward.py -syn ../data/ANN/A0201_bl.syn -e ../data/ANN/A0201_evaluation -bl | grep -v "#" > A0201_bl.pred
python ANN_forward.py -syn ../data/ANN/A0201_sp.syn -e ../data/ANN/A0201_evaluation | grep -v "#" > A0201_sp.pred
paste A0201_sp.pred A0201_bl.pred | gawk '{print $2,($3+$6)/2}' | xycorr
What is the predictive performance of the averaged method, and is it higher than that of the two methods individually?

The program ANN_train.ipynb trains a neural network using backpropagation. The program reads a training and an evaluation peptide file, encodes the peptides using either Sparse or Blosum encoding, and then trains the network using backpropagation to minimize the error between the predicted and target values for each data point.
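
In outline (a sketch under the same layout assumptions as the feedforward sketch above, not the template's exact code), one gradient-descent update for a single training example looks like this:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_update(x, t, w_h, w_o, epsilon):
    # Forward pass; shapes as in the feedforward sketch
    x = np.append(x, 1.0)
    h = sigmoid(x @ w_h)
    h_b = np.append(h, 1.0)
    o = sigmoid(h_b @ w_o)[0]

    # Backward pass for the squared error E = 1/2 (o - t)^2, using the
    # sigmoid derivative o * (1 - o)
    delta_o = (o - t) * o * (1.0 - o)
    delta_h = delta_o * w_o[:-1, 0] * h * (1.0 - h)

    # Gradient-descent weight updates with learning rate epsilon
    w_o -= epsilon * delta_o * h_b[:, None]
    w_h -= epsilon * np.outer(x, delta_h)
    return (o - t) ** 2 / 2.0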

Open the file ANN_train.ipynb. Make sure you understand the structure of the program and of each routine, and then fill in the missing code (the XXXX's).

Now train a network on the A2403 training and evaluation data using Sparse encoding. Set the synfile_name to "my_A2403_sp.syn". Leave all other options as they are.

What are the training and evaluation performances of the network (in terms of the Pearson's correlation)? Do the same using Blosum encoding. Are there any striking differences in the course of the training for the two encoding schemes? Can you explain this difference?

Do the same for the A0201 training and evaluation files (this might take a little while, since this training data file is rather large). Do you observe a difference between the Blosum and Sparse encoding here?

Now download the code as a Python program and add a command line parser with the options below (a parser skeleton is sketched after the list)

optional arguments:
  -h, --help            show this help message and exit
  -t TRAINING_FILE      File with training data
  -e EVALUATION_FILE    File with evaluation data
  -epi EPSILON          Epsilon (default 0.05)
  -s SEED               Seed for random numbers (default 1)
  -i EPOCHS             Number of epochs to train (default 100)
  -syn SYNFILE_NAME     Name of synaps file
  -bl                   Use Blosum encoding
  -stop                 Use Early stopping
  -nh HIDDEN_LAYER_DIM  Number of hidden neurons
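
The corresponding argparse skeleton adds the typed numeric options; again the destination names are illustrative.

from argparse import ArgumentParser

parser = ArgumentParser(description="ANN training with backpropagation")
parser.add_argument("-t", dest="training_file", help="File with training data")
parser.add_argument("-e", dest="evaluation_file", help="File with evaluation data")
parser.add_argument("-epi", dest="epsilon", type=float, default=0.05, help="Epsilon (default 0.05)")
parser.add_argument("-s", dest="seed", type=int, default=1, help="Seed for random numbers (default 1)")
parser.add_argument("-i", dest="epochs", type=int, default=100, help="Number of epochs to train (default 100)")
parser.add_argument("-syn", dest="synfile_name", help="Name of synaps file")
parser.add_argument("-bl", dest="blosum", action="store_true", help="Use Blosum encoding")
parser.add_argument("-stop", dest="early_stopping", action="store_true", help="Use Early stopping")
parser.add_argument("-nh", dest="hidden_layer_dim", type=int, help="Number of hidden neurons")
args = parser.parse_args()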

Test the code on one of the examples in the data/ANN directory

python ANN_train.py -bl -syn my_A2403_bl.syn -nh 5 -i 100 -t ../data/ANN/A2403_training -e ../data/ANN/A2403_evaluation -stop > A2403_bl.out 

Compare the output to the files below

A2403_bl.out (A2403 Blosum encoding)
A2403_sp.out (A2403 Sparse encoding)
A0201_sp.out (A0201 Sparse encoding)
A0201_bl.out (A0201 Blosum encoding)

Now you are done. You have made a series of programs similar to the neural network program suite used at Bioinformatics at DTU to produce more than 20 Science and Nature publications, and to attract more than 50 million US $ in funding.

If you have more time, could you modify the code to do error minimization on the function E = 1/4 (O-t)^4?
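
(Hint: only the derivative of the error with respect to the output changes; for E = 1/4 (O-t)^4 it is dE/dO = (O-t)^3, in place of the (O-t) obtained from the standard squared error E = 1/2 (O-t)^2.)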

Remember to upload the two programs ANN_forward.py and ANN_train.py via CampusNet.