Hands on course in bioinformatics
Every Thursday from 1-5 PM from September 7 – November 30, 2017 (except September 14 and October 19 = 11 times)
208 006
Old programe Hackinars_2017_fall
Course material:
Unix basics and usage of Computerome - pages 01..28
Extra material:
Unix exercise from Immunological Bioinformatics course
to try it change the path from /usr/opt/www/pub/CBS/courses/27685.imm/exercise_unix/immu00
to /home/projects/pr_hackinars/data/immu00
Course material:
Unix basics and usage of Computerome - pages 29..50
Show available modules
module avail
Find all samtools available
module avail samtools
To be able to see samtools (and many other tools) you first have to load tools
module load tools
Load module
module load name_of_tool_including_version
Find module (-i makes it case insensitive)
module apropos -i python
Display more details about module (for example dependencies)
module display velvet/1.2.10
Unload module
module unload anaconda/2.1.0
Modules may have dependencies i.e. velvet
module display velvet/1.2.10 ------------------------------------------------------------------- /services/tools/modulefiles/velvet/1.2.10: module-whatis Velvet 1.2.10 - sequence assembler for very short reads conflict velvet conflict perl/5.8.9 prereq perl prereq anaconda prepend-path PATH /services/tools/velvet/1.2.10/bin setenv VELVET_ROOT /services/tools/velvet/1.2.10 -------------------------------------------------------------------
So you may need to load more modules (in addition to tools) to be able to run velvet
module load perl/5.20.1 module load anaconda/2.1.0 module load velvet/1.2.10
Module information
module tab tab module whitespace tab tab
Remove all modules
module purge
Autoload modules at login
module initapp anaconda/2.1.0
Remove autoloading of module: module initrm
module initrm anaconda/2.1.0
See which modules are pre loaded
module initlist
Source bashrc to load modules from the login environment again
.bashrc
Load ngs modules (some are not in tools)
module load ngs
load own modules described in home directory (for developing)
module load use.own
Some public tools (like MGmpper) are maintained by tnp Admin make directory structure and make user the owner
List of what I have loaded
module list
Modules I (Ole) often load
module load tools module load moab torque module load perl/5.20.1 module load anaconda/2.2.0 module load velvet/1.2.10 module load ncbi-blast/2.2.31+ module load ngs module load sratoolkit/2.8.1
Load queue modules
module load tools module load moab torque
Queue submission commands (qsub recommended, but it is a matter of taste which to use)
qsub
or
msub
and the wrapped versions (which may be easier - but, because it is wrapped, also one step further away from the raw code)
xqsub
or
xmsub
Start interactive shell (1core, 1node, 4GB RAM, 86400 seconds (=1day))
qsub -l nodes=1:ppn=1,mem=4gb,walltime=86400 -q batch -I -X
Interactive way to get an interactive shell
iqsub
Submit job to queue with msub
msub -W group_list=pr_hackinars -Apr_hackinars -d my_working_directory -l nodes=1:ppn=1,mem=4gb,walltime=86400 -q batch my_command
or with qxsub (here using 1:00:00:00 to get one day format: d:hh:mm:ss)
xqsub -V -d my_working_directory -l nodes=1:ppn=1,mem=4gb,walltime=1:00:00:00 -q batch -re my_std_error_file -ro my_std_output_file -de my_command
Simple example of program which waits for 60 seconds and then prints the date (output go to STDIN.oJob_ID)
echo 'sleep 60;date'|msub -W group_list=pr_hackinars -Apr_hackinars -l nodes=1:ppn=1,mem=4gb,walltime=86400
See job status (for user myusername). Here you can also see the Job_ID (which is a unique number each job gets)
showq -umyusername
or
qstat -umyusername
Stop one job
canceljob Job_ID
Cancel all myusername's jobs in the queue (don't do this if you don't know what you are doing)
showq -umyusername| grep unning| gawk '{print $1}'|xargs canceljob
Estimate starting time for job
showstart Job_ID
Overview of system status
module load pestat pestat
Interact with queuing system graphically
firefox
If there is a security warning select:
Save the world, if not possible in one go, then proceed with one sequence at a time till you reach the goal.
All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself has to change something there.
Choose some samples to work with from: data/art_data/SE and data/art_data/PE, and save them in your own directory. Both directories contain artificial sequence data from the ResFinder database, both single and paired end. Remember which files you choose, because you are going to use them and some of the results obtain today for the exercises the next three weeks. For all tasks below you should try with both the single and paired end files.
Choose one isolate from: data/ECs/, and save them in your own directory. These are Escherichia coli isolates from the German / French outbreak in 2011. Again, remember which one you choose, as this is going to be used again for the exercises in the coming weeks.
Lets load the needed modules to do the exercises for today:
module load tools module load perl/5.24.0 module load anaconda2/4.0.0 module load ncbi-blast/2.2.26 module load SPAdes/3.9.0 module load cgepipeline/20170113
In order to assemble our reads with Velvet we are going to find the optimal k, with a program that tries a series of different k’s and then chooses the best one. Try to assemble the artificial reads with the following commands.
Paired end:
wrapper_denovo_2.0.py velvet --shortPaired fastq chopDBX_1.fastq.gz chopDBX_2.fastq.gz \ --sample outX_velvet_pe --n 6 --m 16 --add_velvetg "-very_clean yes"
Single end:
wrapper_denovo_2.0.py velvet --short fastq chopDBX.fastq.gz \ --sample outX_velvet_se --n 6 --m 16 --add_velvetg "-very_clean yes"
There are several options available for velvet and the wrapper that calls it, try to check if something may have been done differently.
wrapper_denovo_2.0.py velvet -h
In order to check the quality of the assembly of the artificial reads, you may want to see if the assembly can be found in the ResFinder database. We do that with BLAST.
blastall -p blastn -i contig.fsa -d data/ResFinderFSA/ResFinder.fasta > blast.out
The best hits for each contig is under the line:
"Sequences producing significant alignments:"
Remember the output of the best template from BLAST, as we are going to use this the next weeks.
When you understand the program and the output then qeue your E. coli reads, it might take some time to complete so we will come back to that.
Like Velvet, SPAdes also has a wrapper program that executes the assembly program itself. Like for Velvet several different values of k are used, but in difference to Velvet, several different sizes k are used to construct the final assembly. Start by making an assembly of your artificial sequences like before:
Paired end:
spades.py -t 6 -m 16 -o outX_spades_pe --careful \ --pe1-1 chopDBX_1.fastq.gz --pe1-2 chopDBX_2.fastq.gz --only-assembler
Single end:
spades.py -t 6 -m 16 -o outX_spades_se --careful \ -s chopDBX.fastq.gz --only-assembler
Note that SPAdes has a tendency not to detect the phred-scale, espicially if the reads have been trimed and are Phred scale 33. All the artificial reads are phred-scale 33, therefore you can add the option “--phred-offset 33”, so that SPAdes does not crash here. Make a quality check with BLAST on the artificial reads, like we did with Velvet. Remember to save the best matching templates here as well.
Try to study the output and different options of SPAdes (spdes.py -h), and check if something may have been done smarter. When you understand the different options and the output of SPAdes, you may qeue your E. coli reads for assembly with and without the "--careful" option.
You are almost done and deserve a break before you continue, that way the assemblies might be done when you come back.
First of all, check that your assemblies are done and ended successfully.
When assembling bacterial genomes, one of the best quality checks you can do is to do MLST. For standard MLST typing 7 genes are used, and if all of these genes can be found in your assembly, when it is probably not half-bad. To determine the MLST we are going to use MLSTfinder, which is based on BLAST. Note that using BLAST is drastically improved because of the assembly for most cases.
mlst.pl -d /home/databases/cgepipeline/mlst/ -i SRR3415X.fasta -o MLST -s ecoli
Try to see if it is possible to call MLST for both assemblies, if not which assembly method was best.
Finally, we are going to identify antibiotic resistance genes in the E. coli samples. Whole genomes may contain several resistance genes, or even several genes on the same contig, which are going to complicate the BLAST output. Therefore, we are going to use ResFinder for this. Where ResFinder has some post-processing of the BLAST output, so that only the best matches are reported for each region of each contig.
resfinder.pl -i SRR3415X.fasta -d data/ResFinderFSA/ -a ResFinder -o Xres_out -k 95.00 -l 0.60
Again, remember where you save the output of ResFinder, this too is going to be used in the coming weeks.
All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths.
If an “X” (capital x) occurs in the commands below, it means that you yourself have to change something there.
Remeber what isolates and sequences you analyzed last time, because we are going to compare the results of different appraoches.
Last was assembly + BLAST methods, today we go straight for mapping.
You may want to analyze some extra samples today, as mapping is faster than assembly.
Depending on your choice of artificial sequences last time, you might want to choose some extra ones, i can recommend sample "1450".
The artificial data is in: data/art_seqs/SE and data/art_seqs/PE.
And the Escherichia coli in: data/ECs.
Lets load the needed modules to do the exercises for today:
module load ngs tools perl/5.20.2 module load bwa/0.7.15 module load bowtie2/2.2.4 module load samtools/1.5 module load bedtools/2.26.0 export PATH=$PATH:/home/projects/pr_hackinars/apps/
For modern mapping methods, such as Bowtie2 and BWA, we need to index the fasta files of our references first. Even though Bowtie2 and BWA are both using the FM-index, they use slightly different variations of them. Which case means that we are going to index the databases twice, one for use with Bowtie2 and one for BWA.
Bowtie2 indexing:
bowtie2-build X.fsa X.index
BWA indexing:
bwa index X.fsa -p X.index
Today we are going to map against resistance genes, therefore we need to index a fasta file containing the wanted genes.
For this purpose we are going to use the ResFinder database, as used last time with BLAST.
The database fasta file are located in:
"data/ResFinderFSA/ResFinder.fsa"
Make an indexing of this fasta file, for both Bowtie2 and BWA, and give the obtained indexing some appropiate name.
Mostly when mapping with Bowtie2 or BWA, you will want to use the standard options:
Bowtie2 single end:
bowtie2 -U X.fastq.gz -x X.index
Bowtie2 paired end:
bowtie2 -1 X_1.fastq.gz -2 X_2.fastq.gz -x X.index
BWA single end:
bwa mem X.index X.fastq.gz
BWA paired end:
bwa mem X.index X_1.fastq.gz X_2.fastq.gz
BWA has a feature that causes it to crash on samples downloaded from SRA, like our E. coli samples. In order to circumvent this you can use the following cmd, which converts the paired end reads into interleaved reads.
fq2int.sh X_1.fastq.gz X_2.fastq.gz | bwa mem -p X.index -
Both of them outputs sam-format to stdout (default), which is a fairly large format. Therefore it is usually good practice to compress them to bam-format (binary sam), this is done easily with samtools through a pipe.
| samtools view -Sb - > X.bam
If you then want to inspect the bam file afterwards you can use this command.
samtools view X.bam | less
Try to map your reads against the ResFinder database, using the indexing you made before.
The E. coli reads might take a few minutes to complete.
Take a quick look at the mapped reads, is the results as expected, try to compare to the results obtained from last time.
If you are not familiar with the sam-format, then these links might be helpful:
SAM-format
sam flags
The sam-format, as you probably has seen, is not very human-readable.
It is however a nice format for a computer to parse, as seen above samtools is desinged for handling these files.
We have already used samtools to compress sam files, now we are going to analyze the reads a bit further.
The most important options to remeber here, is the: "-f", "-F" and "-q".
"-f" and "-F" is used to select alignment based on flags specified on them. For example "4" is unmapped and "8" is mate unmapped, meaning that "4+8=12" means that both pairs must not map.
By using the "-F" insted of "-f" you get the negation, meaning that "12" is both pairs map.
The "-q" option sets the threshold for mapping quality, here a low quality means a poor alignment quality and a high one a better alignment quality.
Depending on who you ask, there are many "standards". Usually a threshold below 10 is not used.
A quality of 10 usually means that there are about 10% chance that the read is wrongly mapped, and a quality of 0 means that there are close to 0% chance that the read is mapped correctly.
Even though the sam-format is standard, the columns in it is not. Meaning that the mapping quality is calculated differently with Bowti2 and BWA, which then goes on for different mapping methods.
Now for some selection of results obtained before.
Only accept mapped reads:
samtools view -b -F 4 -q 1 X.bam > X.mapped.bam
Only properly paired reads:
samtools view -b -F 12 -f 2 -q 1 X.bam > X.mapped.bam
Try some different combinations of flags, and try to vary the quality threshold.
An easy check to see how many reads map to what in your sample can be obtained by following command, make sure you undestand it before executing.
samtools view X.bam | cut -f 3 | sort | uniq -c > X.template_count.tab
You can add the flags directly to the samtools command above, and pipe the output to less in order to quickly test different settings of samtools.
If more attributes is needed such as coverage and depth, the following cmd can be used:
samtools view X.bam | samtools sort | bedtools genomecov -ibam stdin | bedToTemplate.pl > X.template_stats.tab
Try to see if you can get the same results on the artificial dataset as you got last time.
Do you see something strange in the results, obtained by either mapping or assembly + BLAST.
Which do you trust more for this example, remember the assumptions of each method.
Try to compare the results with the E. coli from last time too. If the assembly did not finish last time, then the SPAdes assemblies is available here: "/home/projects/pr_hackinars/data/assembly/SPAdes/".
A very important use of mapping is within the area of bacterial outbreak detection and monitoring. We will conduct and exercise, where we employ the tool CSI Phylogeny that relies heavily on correct mapping results from BWA.
The data is raw sequencing reads and a couple of assemblies. The data is located in the folder:
/home/projects/pr_hackinars/data/outbreaks/Haiti_Cholerae/
The reference sequence is found within the "ref" folder.
It is a good idea to run CSI Phylogeny within a screen in an interactive session with iqsub, as it runs for quite a while (approx. 2 hours). It will wait for all jobs to finish and in the meantime it writes status messages to STDOUT.
# Load general modules for queue etc. module load tools ngs pestat gcc/6.2.0 perl R torque moab anaconda/4.4.0 # Load SNP pipeline tools module load bwa/0.7.10 bedtools/2.26.0 samtools/0.1.18 FastTree/2.1.8 # Load newick utils for visualisation module load newick-utils/1.6 # Load perl environment export PERL5LIB=/home/projects/pr_hackinars/data/perl_modules/lib
Ks_SNP_tree_pipeline.pl --qbot_dir <dir> --out_dir <dir> --reference <fasta> <arg1> ..<argN>
You will encounter the following warning:
rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/scripts/*’: No such file or directory rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/output/*’: No such file or directory rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/error/*’: No such file or directory
Don't worry about it. CSI Phylogeny is just making sure that no temporary files from a previous run interferes with the current run.
-o <dir> All output files will be stored here. --temp_dir | -d <dir> Location of tmp files. Default is tmp folder inside the qbot_dir. --filter_depth | -fd <int> default: 10. --filter_relative_depth | -frd <int> default: 10. Meaning 10 pct of avg. --filter_min_depth | -fmd <int> default: 10. Only applicable if using rel dp option. --filter_prune | -fp <int> default: 10. Meaning 10 SNPs. --filter_map_quality | -fmq <int> default: 0. Recommended 25. --filter_snp_quality | -fsq <int> default: 0. Recommended 30. --filter_z | -fz <double> default: 0. Recommended 1.96. --as_prune | -ap <int> default: 0. Recommended 10 SNPs. Pruning for assembled data. --queue_filter_mem <str> default: 3900m. Increase only if your filter job gets killed due to memory.
perl /home/projects/pr_hackinars/data/Ks_SNP_tree_pipeline.pl -qb qbot_dir -o CIS_out_dir / -r /home/projects/pr_hackinars/data/outbreaks/Haiti_Cholerae/ref/Vc_O1_biovar_El_Tor_N16961.fna / -fmq 25 -fsq 30 -fp 10 -ap 10 -fz 1.96 -frd 10 -fmd 10 /path/to/data/*.gz /path/to/more/data/*.fa
The output files from CSI Phylogeny are in
/home/projects/pr_hackinars/data/CSI_phylo_out/CIS_out_dir
You can display the newick file
nw_display snp_tree.main_tree.newick
All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths.
If an “X” (capital x) occurs in the commands below, it means that you yourself have to change something there.
Remeber what isolates and sequences you analyzed last time, because we are going to compare the results of the different appraoches.
Today you might want to analyze all of the Escherichia coli, as KMA is extremely fast. Compared to other methods that is "rapid", KMA is operating at near WARP speed.
Today there is no dependencies to load, we are however going to add KMA to path for convience.
export PATH=$PATH:/home/projects/pr_hackinars/apps/
In order for KMA to map against a reference database, we need to index the databases liek last time. Today we are going to map resistance genes and species, the databases has however been indexed for you in: "/home/projects/pr_hackinars/data/kma_databases"
If you want to index your own database, then this is how you do it:
kma_index -i DB.fsa -o DB.index
When mapping the resistance we would like full alignment of our sequences, while we can do with the identification of the proper reference for the species mapping. Therefore there are two main mappings with KMA, one full alignemnt and one "sparse" where the alignments are skipped.
Full alignemnt with KMA, on resistance genes:
kma -i X_1.fastq.gz X_2.fastq.gz -o out/X -t_db /home/projects/pr_hackinars/data/KMA_databases/ResFinder -ID 70
"-i" gives the input, "-o" the output, "-t_db" is the indexing and "-ID" specifies that we are only interested in templates with at 70% identity. KMA provides you with several output files, try to inspect them and see if you can figure them out.
Sparse mapping of species with KMA.
kma -i X_1.fastq.gz X_2.fastq.gz -o X -t_db /home/projects/pr_hackinars/data/KMA_databases/allNcbi -Sparse -shm -ID 50
"-shm" specifies that the database can be found in shared memory, if the KMA cannot find it, it will exit with non-zero exit-status. Shared memory allows for several parallel calls of KMA where the database is only loaded in one place. Which for our sake today means that we only need to load the index of > 10000 bacterial genomes once one beforehand (I did that for you), and we save time on loading the database afterwards. Shared databases can be set up with kma_shm, it is very (like mega very) important to note that a database in shared memory will keep being in memory until: computer breaks down, the computer is rebooted or till you take it down. Therefore you should always take it down afterwards in order to free the memory it occupies (I'll do that today), especially if you are running jobs through the queuing system as this is some of the black magic in programming, which has a tendency to go under the radar of most traditional computer surveillance systems.
Two of your colleagues has been reported sick, and are infected with some pathogen.
Luckily they were able to sequence the pathogens before they were hospitalized.
It is now your job to save your colleages and analyze the sequencing they have performed.
In this case we would like to know the species of the samples, and whether they carry any resistance genes which might lead us in direction for a proper treatment.
The sequences are collected in: "/home/projects/pr_hackinars/data/unknownInfections"