In today's exercise you shall implement the feedforward algorithm for artificial neural network prediction, and the backpropagation algorithm for artificial neural network training.
First you shall download the program templates.
Download the file ANN.tar.gz
Open the file (using tar -xzvf ANN.tar.gz) and place the created ANN directory in the "Algo/code" directory. The ANN directory should now contain two Jupyter-notebook program files
ANN_forward.ipynb ANN_train.ipynb
The two files are templates for the "Feedforward" and "Backpropagation" methods, for predicting with and training artificial neural networks.
Next you must download some data files.
First remove the "ANN" directory in the course data directory (this directory was created on the first day of the course, and contains some outdated files - A2403_training, A2403_evaluation, trainpred.out, and testpred.out). Next, download the file ANN_data.tar.gz
Place the file in the course data directory, and open the file (using tar -xzvf ANN_data.tar.gz). The created ANN directory should now contain the following files
A2403_training A2403_evaluation A0201_training A0201_evaluation A2403_sp.syn A2403_bl.syn A0201_sp.syn A0201_bl.syn
The first four files are peptide files for two different HLA alleles used for training and evaluation, and the last four are network parameter files (synaps files) generated by training ANNs on these data using either Blosum (bl) or Sparse (sp) sequence encoding.
Now we are ready to code
This program reads a synaps file (network parameters) and a file with peptides, encodes the peptides using either Sparse or Blosum encoding, and then makes predictions for each peptide using the feedforward algorithm.
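For reference, here is a minimal sketch of what sparse (one-hot) encoding of a peptide looks like; the alphabet ordering and the plain 0/1 values are assumptions, and the template's own encoding routines (and the Blosum counterpart) may use different conventions:

import numpy as np

ALPHABET = "ARNDCQEGHILKMFPSTWYV"  # the 20 amino acids; ordering is an assumption

def sparse_encode(peptide):
    """One-hot encode a peptide: a 9-mer becomes a 9 * 20 = 180 element vector."""
    enc = np.zeros((len(peptide), len(ALPHABET)))
    for i, aa in enumerate(peptide):
        enc[i, ALPHABET.index(aa)] = 1.0
    return enc.flatten()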
Before completing the code, have a look at one of the synaps files in the data/ANN directory.
Can you understand the number of synaps weights in the files, i.e.,

cat ../data/ANN/A2403_sp.syn | grep -v TEST | grep -v ":" | wc

The second column in the output of this command gives the number of weights in the synaps file. Can you make sense of this number (911)?
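As a hint, the count can be reproduced from the network architecture. A back-of-the-envelope sketch, assuming 9-mer peptides, 20 values per position for sparse encoding, 5 hidden neurons (the value used with -nh later in this exercise), and one bias weight per neuron:

n_in = 9 * 20     # 180 inputs for a sparse-encoded 9-mer
n_hidden = 5      # assumed hidden layer size
n_out = 1         # single output neuron

# Each hidden neuron sees all inputs plus a bias; the output neuron
# sees all hidden neurons plus a bias.
n_weights = (n_in + 1) * n_hidden + (n_hidden + 1) * n_out
print(n_weights)  # 905 + 6 = 911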
Now go back to the "Algo/code/ANN/" directory, and open the ANN_forward.ipynb program. Spend some time making sure you understand the structure of the program. Fill in the missing code (XX's). Test the program with the synaps files in the data/ANN directory, with the matching evaluation files and sequence encoding.
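If you get stuck, the following is a minimal sketch of the feedforward pass such a program computes, assuming a single hidden layer, sigmoid activations, and bias terms appended to the input and hidden layers; the variable names are illustrative and not necessarily those of the template:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, w_h, w_o):
    """x: encoded peptide, w_h: hidden-layer weights, w_o: output-layer weights."""
    x = np.append(x, 1.0)            # bias input
    h = sigmoid(w_h @ x)             # hidden-layer activations
    h = np.append(h, 1.0)            # bias for the output layer
    return float(sigmoid(w_o @ h))   # single output neuron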
What is the predictive performance of the neural network (in terms of Pearson's correlation)? And how does this value compare to the value listed in the first line of the synaps file (the T_PCC column)?
head ../data/ANN/A2403_bl.syn
TESTRUNID EPOCH: 97 L_PCC: 0.925606324907 L_ERR: 0.00456609177732 T_PCC: 0.815568933575 T_ERR: 0.0129260695735
When you have a functional version of the code, download it as a Python program, and add a command line parser with the options
optional arguments:
  -h, --help          show this help message and exit
  -e EVALUATION_FILE  File with evaluation data
  -syn SYNFILE_NAME   Name of synaps file
  -bl                 Use Blosum encoding
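One way to build such a parser with Python's argparse module (the dest names are assumptions; only the option strings and help texts are given above):

from argparse import ArgumentParser

parser = ArgumentParser(description="Feedforward prediction with an ANN")
parser.add_argument("-e", dest="evaluation_file", help="File with evaluation data")
parser.add_argument("-syn", dest="synfile_name", help="Name of synaps file")
parser.add_argument("-bl", dest="blosum", action="store_true", help="Use Blosum encoding")
args = parser.parse_args()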
Test the code with one of the examples in the data/ANN directory
python ANN_forward.py -s ../data/ANN/A0201_bl.syn -e ../data/ANN/A0201_evaluation -bl | grep -v "#" | gawk '{print $2,$3}' | xycorr
Remember you might have to give the full path to the xycorr script, and a different path to the synaps and evaluation data files.
Again, test that you get the same performance value as listed in the first line of the synaps file.
You can now test whether combining Blosum and Sparse encoding improves the predictive performance
python ANN_forward.py -s ../data/ANN/A0201_bl.syn -e ../data/ANN/A0201_evaluation -bl | grep -v "#" > A0201_bl.pred
python ANN_forward.py -s ../data/ANN/A0201_sp.syn -e ../data/ANN/A0201_evaluation | grep -v "#" > A0201_sp.pred
paste A0201_sp.pred A0201_bl.pred | gawk '{print $2,($3+$6)/2}' | xycorr

What is the predictive performance of the averaged method, and is it higher than that of the two methods individually?
The program ANN_train.ipynb trains a neural network using back-propagation. The program reads a training and an evaluation peptide file, encodes the peptides using either Sparse or Blosum encoding, and then trains the network using back-propagation to minimize the error between the predicted and target values for each data point.
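For orientation, here is a minimal sketch of a single back-propagation update for one training example, assuming the architecture from the feedforward sketch above and the squared error E = 1/2 (O - t)^2; the template's routines may organize this differently:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, w_h, w_o, epsilon=0.05):
    """One gradient-descent update for a single (encoded peptide, target) pair."""
    # Forward pass, with bias inputs appended
    x = np.append(x, 1.0)
    h = np.append(sigmoid(w_h @ x), 1.0)
    o = sigmoid(w_o @ h)

    # Backward pass for E = 1/2 (o - t)^2 with sigmoid activations
    delta_o = (o - t) * o * (1.0 - o)                        # output error signal
    delta_h = delta_o * w_o[:-1] * h[:-1] * (1.0 - h[:-1])   # hidden error signals

    # Weight updates (learning rate epsilon)
    w_o -= epsilon * delta_o * h
    w_h -= epsilon * np.outer(delta_h, x)
    return w_h, w_o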
Open the file ANN_train.ipynb. Spend some time making sure you understand the structure of the program and of each routine, and then fill in the missing code (XXXX).
Now train a network on the A2403 training and evaluation data using Sparse encoding. Set the synfile_name to "my_A2403_sp.syn". Leave all other options as they are.
What are the training and evaluation performances of the network (in terms of Pearson's correlation)? Do the same using Blosum encoding. Are there any striking differences in the course of the training for the two encoding schemes? Can you explain this difference?
Do the same for the A0201 training and evaluation files (this might take a little while, since this training data file is rather large). Do you observe a difference between the Blosum and Sparse encodings here?
Now download the code as a Python program and add a command line parser with the options
optional arguments:
  -h, --help            show this help message and exit
  -t TRAINING_FILE      File with training data
  -e EVALUATION_FILE    File with evaluation data
  -epi EPSILON          Epsilon (default 0.05)
  -s SEED               Seed for random numbers (default 1)
  -i EPOCHS             Number of epochs to train (default 100)
  -syn SYNFILE_NAME     Name of synaps file
  -bl                   Use Blosum encoding
  -stop                 Use Early stopping
  -nh HIDDEN_LAYER_DIM  Number of hidden neurons
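As with the prediction program, a sketch of the corresponding argparse setup (dest names, types, and defaults are inferred from the option list above; treat them as assumptions):

from argparse import ArgumentParser

parser = ArgumentParser(description="ANN training with back-propagation")
parser.add_argument("-t", dest="training_file", help="File with training data")
parser.add_argument("-e", dest="evaluation_file", help="File with evaluation data")
parser.add_argument("-epi", dest="epsilon", type=float, default=0.05, help="Epsilon (default 0.05)")
parser.add_argument("-s", dest="seed", type=int, default=1, help="Seed for random numbers (default 1)")
parser.add_argument("-i", dest="epochs", type=int, default=100, help="Number of epochs to train (default 100)")
parser.add_argument("-syn", dest="synfile_name", help="Name of synaps file")
parser.add_argument("-bl", dest="blosum", action="store_true", help="Use Blosum encoding")
parser.add_argument("-stop", dest="early_stopping", action="store_true", help="Use Early stopping")
parser.add_argument("-nh", dest="hidden_layer_dim", type=int, help="Number of hidden neurons")
args = parser.parse_args()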
Test the code on one of the examples in the data/ANN directory
python ANN_train.py -bl -syn my_A2403_bl.syn -nh 5 -i 100 -t ../data/ANN/A2403_training -e ../data/ANN/A2403_evaluation -stop > A2403_bl.out
Compare the output to the files below
A2403_bl.out (A2403 Blosum encoding)
A2403_sp.out (A2403 Sparse encoding)
A0201_sp.out (A0201 Sparse encoding)
A0201_bl.out (A0201 Blosum encoding)
Now you are done. You have now made a series of programs similar to the neural network program suite used at Bioinformatics at DTU to produce more than 20 Science and Nature publications, and attract more than US $50 million in funding.
If you have more time, can you modify the code to do error minimization on the function E = 1/4 (O - t)^4 instead?
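Hint: only the derivative of the error with respect to the output changes. Since dE/dO = (O - t)^3 for this error function, in a sketch like the back-propagation one above the output error signal would become:

delta_o = (o - t)**3 * o * (1.0 - o)   # dE/do for E = 1/4 (o - t)^4, times the sigmoid derivative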
Remember to upload the two programs, ANN_forward.py and ANN_train.py, via CampusNet