From hackinar
Jump to: navigation, search

Hackinars in Bioinformatics short description


Hands on course in bioinformatics


Every Thursday from 1-5 PM from September 7 – November 30, 2017 (except September 14 and October 19 = 11 times)


208 006

Short Schedule

  • February 8th: 1-3PM Computerome and UNIX, Erland,, 3-5PM "How to organise a project - The most important talk you never heard!", Leon
  • February 22: Computerome and Queuing system, Erland
  • March 8: Assembly with Velvet and Spades, Philip, Simon
  • March 22: Mapping with bwa and bowtie2, Philip, Simon
  • April 5: Mapping with KMA, Philip
  • Continuation? April 12, 19, 26, May 3 are reserved if there are interest in continuing

Old programe Hackinars_2017_fall

Detailed Programe

February 8th: Computerome and UNIX, How to organise a project Erland, Leon

Computerome and UNIX, Erland

Course material:

Unix basics and usage of Computerome - pages 01..28

Extra material:

Unix exercise from Immunological Bioinformatics course

to try it change the path from /usr/opt/www/pub/CBS/courses/27685.imm/exercise_unix/immu00

to /home/projects/pr_hackinars/data/immu00

How to organize a project. The most important talk you never heard!, Leon Jessen


February 22: Computerome and Queuing system, Erland

Course material:

Unix basics and usage of Computerome - pages 29..50

Ole's Notes

Computerome Wiki

Computerome old Wiki

Computerome new Wiki

Module system

Show available modules

module avail

Find all samtools available

module avail samtools

To be able to see samtools (and many other tools) you first have to load tools

module load tools

Load module

module load name_of_tool_including_version

Find module (-i makes it case insensitive)

module apropos -i python

Display more details about module (for example dependencies)

module display velvet/1.2.10

Unload module

module unload anaconda/2.1.0

Modules may have dependencies i.e. velvet

module display velvet/1.2.10

module-whatis	Velvet 1.2.10 - sequence assembler for very short reads 
conflict	velvet 
conflict	perl/5.8.9 
prereq	perl 
prereq	anaconda 
prepend-path	PATH /services/tools/velvet/1.2.10/bin 
setenv	VELVET_ROOT /services/tools/velvet/1.2.10 

So you may need to load more modules (in addition to tools) to be able to run velvet

module load perl/5.20.1
module load anaconda/2.1.0
module load velvet/1.2.10

Module information

module tab tab
module whitespace tab tab

Remove all modules

module purge

Autoload modules at login

  • example load anaconda/2.1.0 at every login
module initapp anaconda/2.1.0

Remove autoloading of module: module initrm

module initrm anaconda/2.1.0

See which modules are pre loaded

module initlist

Source bashrc to load modules from the login environment again


Load ngs modules (some are not in tools)

module load ngs

load own modules described in home directory (for developing)

module load use.own

Some public tools (like MGmpper) are maintained by tnp Admin make directory structure and make user the owner

List of what I have loaded

module list

Modules I (Ole) often load

module load tools
module load moab torque
module load perl/5.20.1
module load anaconda/2.2.0
module load velvet/1.2.10
module load ncbi-blast/2.2.31+
module load ngs
module load sratoolkit/2.8.1
Queuing system

Load queue modules

module load tools
module load moab torque

Queue submission commands (qsub recommended, but it is a matter of taste which to use)




and the wrapped versions (which may be easier - but, because it is wrapped, also one step further away from the raw code)




Start interactive shell (1core, 1node, 4GB RAM, 86400 seconds (=1day))

qsub -l nodes=1:ppn=1,mem=4gb,walltime=86400 -q batch -I -X 

Interactive way to get an interactive shell


Submit job to queue with msub

msub -W group_list=pr_hackinars -Apr_hackinars -d my_working_directory -l nodes=1:ppn=1,mem=4gb,walltime=86400 -q batch  my_command

or with qxsub (here using 1:00:00:00 to get one day format: d:hh:mm:ss)

xqsub -V -d my_working_directory -l nodes=1:ppn=1,mem=4gb,walltime=1:00:00:00 -q batch -re my_std_error_file -ro my_std_output_file -de my_command

Simple example of program which waits for 60 seconds and then prints the date (output go to STDIN.oJob_ID)

echo 'sleep 60;date'|msub -W group_list=pr_hackinars -Apr_hackinars  -l nodes=1:ppn=1,mem=4gb,walltime=86400

See job status (for user myusername). Here you can also see the Job_ID (which is a unique number each job gets)

showq -umyusername


qstat -umyusername

Stop one job

canceljob Job_ID

Cancel all myusername's jobs in the queue (don't do this if you don't know what you are doing)

showq -umyusername| grep unning| gawk '{print $1}'|xargs canceljob

Estimate starting time for job

showstart Job_ID

Overview of system status

module load pestat

Interact with queuing system graphically

  • Start firefox on computerome

If there is a security warning select:

  • Advanced; Add Exception; Confirm Security Exception

March 8: Assembly with Velvet and Spades, Philip, Simon

Assembly slides


Save the world, if not possible in one go, then proceed with one sequence at a time till you reach the goal.


All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself has to change something there.


Choose some samples to work with from: data/art_data/SE and data/art_data/PE, and save them in your own directory. Both directories contain artificial sequence data from the ResFinder database, both single and paired end. Remember which files you choose, because you are going to use them and some of the results obtain today for the exercises the next three weeks. For all tasks below you should try with both the single and paired end files.

Choose one isolate from: data/ECs/, and save them in your own directory. These are Escherichia coli isolates from the German / French outbreak in 2011. Again, remember which one you choose, as this is going to be used again for the exercises in the coming weeks.


Lets load the needed modules to do the exercises for today:

module load tools
module load perl/5.24.0
module load anaconda2/4.0.0
module load ncbi-blast/2.2.26
module load SPAdes/3.9.0
module load cgepipeline/20170113

Velvet assembly

In order to assemble our reads with Velvet we are going to find the optimal k, with a program that tries a series of different k’s and then chooses the best one. Try to assemble the artificial reads with the following commands.

Paired end:

wrapper_denovo_2.0.py velvet --shortPaired fastq chopDBX_1.fastq.gz chopDBX_2.fastq.gz \
--sample outX_velvet_pe --n 6 --m 16 --add_velvetg "-very_clean yes"

Single end:

wrapper_denovo_2.0.py velvet --short fastq chopDBX.fastq.gz \
--sample outX_velvet_se --n 6 --m 16 --add_velvetg "-very_clean yes"

There are several options available for velvet and the wrapper that calls it, try to check if something may have been done differently.

wrapper_denovo_2.0.py velvet -h

In order to check the quality of the assembly of the artificial reads, you may want to see if the assembly can be found in the ResFinder database. We do that with BLAST.

blastall -p blastn -i contig.fsa -d data/ResFinderFSA/ResFinder.fasta > blast.out

The best hits for each contig is under the line:

"Sequences producing significant alignments:"

Remember the output of the best template from BLAST, as we are going to use this the next weeks.

When you understand the program and the output then qeue your E. coli reads, it might take some time to complete so we will come back to that.

SPAdes assembly

Like Velvet, SPAdes also has a wrapper program that executes the assembly program itself. Like for Velvet several different values of k are used, but in difference to Velvet, several different sizes k are used to construct the final assembly. Start by making an assembly of your artificial sequences like before:

Paired end:

spades.py -t 6 -m 16 -o outX_spades_pe --careful \
--pe1-1 chopDBX_1.fastq.gz --pe1-2 chopDBX_2.fastq.gz --only-assembler

Single end:

spades.py -t 6 -m 16 -o outX_spades_se --careful \
-s chopDBX.fastq.gz --only-assembler

Note that SPAdes has a tendency not to detect the phred-scale, espicially if the reads have been trimed and are Phred scale 33. All the artificial reads are phred-scale 33, therefore you can add the option “--phred-offset 33”, so that SPAdes does not crash here. Make a quality check with BLAST on the artificial reads, like we did with Velvet. Remember to save the best matching templates here as well.

Try to study the output and different options of SPAdes (spdes.py -h), and check if something may have been done smarter. When you understand the different options and the output of SPAdes, you may qeue your E. coli reads for assembly with and without the "--careful" option.

Quality check

You are almost done and deserve a break before you continue, that way the assemblies might be done when you come back.

First of all, check that your assemblies are done and ended successfully.

When assembling bacterial genomes, one of the best quality checks you can do is to do MLST. For standard MLST typing 7 genes are used, and if all of these genes can be found in your assembly, when it is probably not half-bad. To determine the MLST we are going to use MLSTfinder, which is based on BLAST. Note that using BLAST is drastically improved because of the assembly for most cases.

mlst.pl -d /home/databases/cgepipeline/mlst/ -i SRR3415X.fasta -o MLST -s ecoli

Try to see if it is possible to call MLST for both assemblies, if not which assembly method was best.

Finally, we are going to identify antibiotic resistance genes in the E. coli samples. Whole genomes may contain several resistance genes, or even several genes on the same contig, which are going to complicate the BLAST output. Therefore, we are going to use ResFinder for this. Where ResFinder has some post-processing of the BLAST output, so that only the best matches are reported for each region of each contig.

resfinder.pl -i SRR3415X.fasta -d data/ResFinderFSA/ -a ResFinder -o Xres_out -k 95.00 -l 0.60

Again, remember where you save the output of ResFinder, this too is going to be used in the coming weeks.

March 22: Mapping with bwa and bowtie2, Philip

Mapping slides


All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself have to change something there. Remeber what isolates and sequences you analyzed last time, because we are going to compare the results of different appraoches. Last was assembly + BLAST methods, today we go straight for mapping.


You may want to analyze some extra samples today, as mapping is faster than assembly. Depending on your choice of artificial sequences last time, you might want to choose some extra ones, i can recommend sample "1450". The artificial data is in: data/art_seqs/SE and data/art_seqs/PE.
And the Escherichia coli in: data/ECs.


Lets load the needed modules to do the exercises for today:

module load ngs tools perl/5.20.2
module load bwa/0.7.15
module load bowtie2/2.2.4
module load samtools/1.5
module load bedtools/2.26.0
export PATH=$PATH:/home/projects/pr_hackinars/apps/


For modern mapping methods, such as Bowtie2 and BWA, we need to index the fasta files of our references first. Even though Bowtie2 and BWA are both using the FM-index, they use slightly different variations of them. Which case means that we are going to index the databases twice, one for use with Bowtie2 and one for BWA.

Bowtie2 indexing:

bowtie2-build X.fsa X.index

BWA indexing:

bwa index X.fsa -p X.index

Today we are going to map against resistance genes, therefore we need to index a fasta file containing the wanted genes. For this purpose we are going to use the ResFinder database, as used last time with BLAST. The database fasta file are located in:

Make an indexing of this fasta file, for both Bowtie2 and BWA, and give the obtained indexing some appropiate name.


Mostly when mapping with Bowtie2 or BWA, you will want to use the standard options:

Bowtie2 single end:

bowtie2 -U X.fastq.gz -x X.index

Bowtie2 paired end:

bowtie2 -1 X_1.fastq.gz -2 X_2.fastq.gz -x X.index

BWA single end:

bwa mem X.index X.fastq.gz

BWA paired end:

bwa mem X.index X_1.fastq.gz X_2.fastq.gz

BWA has a feature that causes it to crash on samples downloaded from SRA, like our E. coli samples. In order to circumvent this you can use the following cmd, which converts the paired end reads into interleaved reads.

fq2int.sh X_1.fastq.gz X_2.fastq.gz | bwa mem -p X.index - 

Both of them outputs sam-format to stdout (default), which is a fairly large format. Therefore it is usually good practice to compress them to bam-format (binary sam), this is done easily with samtools through a pipe.

 | samtools view -Sb - > X.bam

If you then want to inspect the bam file afterwards you can use this command.

samtools view X.bam | less

Try to map your reads against the ResFinder database, using the indexing you made before.

The E. coli reads might take a few minutes to complete.

Take a quick look at the mapped reads, is the results as expected, try to compare to the results obtained from last time.
If you are not familiar with the sam-format, then these links might be helpful:
sam flags


The sam-format, as you probably has seen, is not very human-readable. It is however a nice format for a computer to parse, as seen above samtools is desinged for handling these files. We have already used samtools to compress sam files, now we are going to analyze the reads a bit further.
The most important options to remeber here, is the: "-f", "-F" and "-q".
"-f" and "-F" is used to select alignment based on flags specified on them. For example "4" is unmapped and "8" is mate unmapped, meaning that "4+8=12" means that both pairs must not map. By using the "-F" insted of "-f" you get the negation, meaning that "12" is both pairs map. The "-q" option sets the threshold for mapping quality, here a low quality means a poor alignment quality and a high one a better alignment quality. Depending on who you ask, there are many "standards". Usually a threshold below 10 is not used.
A quality of 10 usually means that there are about 10% chance that the read is wrongly mapped, and a quality of 0 means that there are close to 0% chance that the read is mapped correctly.
Even though the sam-format is standard, the columns in it is not. Meaning that the mapping quality is calculated differently with Bowti2 and BWA, which then goes on for different mapping methods.

Now for some selection of results obtained before.

Only accept mapped reads:

samtools view -b -F 4 -q 1 X.bam > X.mapped.bam

Only properly paired reads:

samtools view -b -F 12 -f 2 -q 1 X.bam > X.mapped.bam

Try some different combinations of flags, and try to vary the quality threshold.

An easy check to see how many reads map to what in your sample can be obtained by following command, make sure you undestand it before executing.

samtools view X.bam | cut -f 3 | sort | uniq -c > X.template_count.tab

You can add the flags directly to the samtools command above, and pipe the output to less in order to quickly test different settings of samtools.

If more attributes is needed such as coverage and depth, the following cmd can be used:

samtools view X.bam | samtools sort | bedtools genomecov -ibam stdin | bedToTemplate.pl > X.template_stats.tab

Try to see if you can get the same results on the artificial dataset as you got last time.
Do you see something strange in the results, obtained by either mapping or assembly + BLAST.
Which do you trust more for this example, remember the assumptions of each method.
Try to compare the results with the E. coli from last time too. If the assembly did not finish last time, then the SPAdes assemblies is available here: "/home/projects/pr_hackinars/data/assembly/SPAdes/".

Call SNPs and Infer Phylogeny

A very important use of mapping is within the area of bacterial outbreak detection and monitoring. We will conduct and exercise, where we employ the tool CSI Phylogeny that relies heavily on correct mapping results from BWA.


The data is raw sequencing reads and a couple of assemblies. The data is located in the folder:


The reference sequence is found within the "ref" folder.

Running CSI Phylogeny

It is a good idea to run CSI Phylogeny within a screen in an interactive session with iqsub, as it runs for quite a while (approx. 2 hours). It will wait for all jobs to finish and in the meantime it writes status messages to STDOUT.

# Load general modules for queue etc.
module load tools ngs pestat gcc/6.2.0 perl R torque moab anaconda/4.4.0
# Load SNP pipeline tools
module load bwa/0.7.10 bedtools/2.26.0 samtools/0.1.18 FastTree/2.1.8
# Load newick utils for visualisation
module load newick-utils/1.6
# Load perl environment
export PERL5LIB=/home/projects/pr_hackinars/data/perl_modules/lib
Ks_SNP_tree_pipeline.pl --qbot_dir <dir> --out_dir <dir> --reference <fasta> <arg1> ..<argN> 

You will encounter the following warning:

rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/scripts/*’: No such file or directory
rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/output/*’: No such file or directory
rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/error/*’: No such file or directory

Don't worry about it. CSI Phylogeny is just making sure that no temporary files from a previous run interferes with the current run.

CSI Phylogeny Options
-o <dir>				All output files will be stored here.

--temp_dir | -d <dir>			Location of tmp files. Default is tmp folder inside the qbot_dir.
--filter_depth | -fd <int>		default: 10.
--filter_relative_depth | -frd <int>	default: 10. Meaning 10 pct of avg.
--filter_min_depth | -fmd <int>		default: 10. Only applicable if using rel dp option.
--filter_prune | -fp <int>		default: 10. Meaning 10 SNPs.
--filter_map_quality | -fmq <int>	default: 0. Recommended 25.
--filter_snp_quality | -fsq <int>	default: 0. Recommended 30.
--filter_z | -fz <double>		default: 0. Recommended 1.96.

--as_prune | -ap <int>			default: 0. Recommended 10 SNPs. Pruning for assembled data.

--queue_filter_mem <str>		default: 3900m. Increase only if your filter job gets killed due to memory.
CSI Phylogeny Example
perl /home/projects/pr_hackinars/data/Ks_SNP_tree_pipeline.pl -qb qbot_dir -o CIS_out_dir /
-r /home/projects/pr_hackinars/data/outbreaks/Haiti_Cholerae/ref/Vc_O1_biovar_El_Tor_N16961.fna /
-fmq 25 -fsq 30 -fp 10 -ap 10 -fz 1.96 -frd 10 -fmd 10  /path/to/data/*.gz /path/to/more/data/*.fa 

The output files from CSI Phylogeny are in


You can display the newick file

nw_display snp_tree.main_tree.newick

April 5: Mapping with KMA, Philip


All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself have to change something there. Remeber what isolates and sequences you analyzed last time, because we are going to compare the results of the different appraoches.


Today you might want to analyze all of the Escherichia coli, as KMA is extremely fast. Compared to other methods that is "rapid", KMA is operating at near WARP speed.


Today there is no dependencies to load, we are however going to add KMA to path for convience.

export PATH=$PATH:/home/projects/pr_hackinars/apps/


In order for KMA to map against a reference database, we need to index the databases liek last time. Today we are going to map resistance genes and species, the databases has however been indexed for you in: "/home/projects/pr_hackinars/data/kma_databases"

If you want to index your own database, then this is how you do it:

kma_index -i DB.fsa -o DB.index


When mapping the resistance we would like full alignment of our sequences, while we can do with the identification of the proper reference for the species mapping. Therefore there are two main mappings with KMA, one full alignemnt and one "sparse" where the alignments are skipped.

Full alignemnt with KMA, on resistance genes:

kma -i X_1.fastq.gz X_2.fastq.gz -o out/X -t_db /home/projects/pr_hackinars/data/KMA_databases/ResFinder -ID 70

"-i" gives the input, "-o" the output, "-t_db" is the indexing and "-ID" specifies that we are only interested in templates with at 70% identity. KMA provides you with several output files, try to inspect them and see if you can figure them out.

Sparse mapping of species with KMA.

kma -i X_1.fastq.gz X_2.fastq.gz -o X -t_db /home/projects/pr_hackinars/data/KMA_databases/allNcbi -Sparse -shm -ID 50

"-shm" specifies that the database can be found in shared memory, if the KMA cannot find it, it will exit with non-zero exit-status. Shared memory allows for several parallel calls of KMA where the database is only loaded in one place. Which for our sake today means that we only need to load the index of > 10000 bacterial genomes once one beforehand (I did that for you), and we save time on loading the database afterwards. Shared databases can be set up with kma_shm, it is very (like mega very) important to note that a database in shared memory will keep being in memory until: computer breaks down, the computer is rebooted or till you take it down. Therefore you should always take it down afterwards in order to free the memory it occupies (I'll do that today), especially if you are running jobs through the queuing system as this is some of the black magic in programming, which has a tendency to go under the radar of most traditional computer surveillance systems.

Save the world

Two of your colleagues has been reported sick, and are infected with some pathogen. Luckily they were able to sequence the pathogens before they were hospitalized. It is now your job to save your colleages and analyze the sequencing they have performed. In this case we would like to know the species of the samples, and whether they carry any resistance genes which might lead us in direction for a proper treatment. The sequences are collected in: "/home/projects/pr_hackinars/data/unknownInfections"