Hackinars 2017 fall

Contents

Hackinars in Bioinformatics short description

What

Hands on course in bioinformatics

When

Every Thursday from 1-5 PM from September 7 – November 30, 2017 (except September 14 and October 19 = 11 times)

Where

208 006

Short Schedule

Detailed Programe

September 7: Computerome and UNIX, Erland

Course material:

Unix basics and usage of Computerome - pages 01..28

Extra material:

Unix exercise from Immunological Bioinformatics course

to try it change the path from /usr/opt/www/pub/CBS/courses/27685.imm/exercise_unix/immu00

to /home/projects/pr_hackinars/data/immu00

September 14: No teaching

September 21: Computerome and Queuing system, Erland

Course material:

Unix basics and usage of Computerome - pages 29..50

Ole's Notes

Module system

Show available modules

module avail

Find all samtools available

module avail samtools

To be able to see samtools (and many other tools) you first have to load tools

module load tools

Load module

module load name_of_tool_including_version

Find module (-i makes it case insensitive)

module apropos -i python

Display more details about module (for example dependencies)

module display velvet/1.2.10

Unload module

module unload anaconda/2.1.0

Modules may have dependencies i.e. velvet

module display velvet/1.2.10
-------------------------------------------------------------------
/services/tools/modulefiles/velvet/1.2.10:

module-whatis	Velvet 1.2.10 - sequence assembler for very short reads 
conflict	velvet 
conflict	perl/5.8.9 
prereq	perl 
prereq	anaconda 
prepend-path	PATH /services/tools/velvet/1.2.10/bin 
setenv	VELVET_ROOT /services/tools/velvet/1.2.10 
-------------------------------------------------------------------

So you may need to load more modules (in addition to tools) to be able to run velvet

module load perl/5.20.1
module load anaconda/2.1.0
module load velvet/1.2.10

Module information

module tab tab
module whitespace tab tab

Remove all modules

module purge

Autoload modules at login

module initapp anaconda/2.1.0

Remove autoloading of module: module initrm

module initrm anaconda/2.1.0

See which modules are pre loaded

module initlist

Source bashrc to load modules from the login environment again

.bashrc

Load ngs modules (some are not in tools)

module load ngs

load own modules described in home directory (for developing)

module load use.own

Some public tools (like MGmpper) are maintained by tnp Admin make directory structure and make user the owner

List of what I have loaded

module list

Modules I (Ole) often load

module load tools
module load moab torque
module load perl/5.20.1
module load anaconda/2.2.0
module load velvet/1.2.10
module load ncbi-blast/2.2.31+
module load ngs
module load sratoolkit/2.8.1
Queuing system

Load queue modules

module load tools
module load moab torque

Queue submission commands (qsub recommended, but it is a matter of taste which to use)

qsub

or

msub

and the wrapped versions (which may be easier - but, because it is wrapped, also one step further away from the raw code)

xqsub

or

xmsub

Start interactive shell (1core, 1node, 4GB RAM, 86400 seconds (=1day))

qsub -l nodes=1:ppn=1,mem=4gb,walltime=86400 -q batch -I -X 

Interactive way to get an interactive shell

iqsub


Submit job to queue with msub

msub -W group_list=pr_hackinars -Apr_hackinars -d my_working_directory -l nodes=1:ppn=1,mem=4gb,walltime=86400 -q batch  my_command

or with qxsub (here using 1:00:00:00 to get one day format: d:hh:mm:ss)

xqsub -V -d my_working_directory -l nodes=1:ppn=1,mem=4gb,walltime=1:00:00:00 -q batch -re my_std_error_file -ro my_std_output_file -de my_command

Simple example of program which waits for 60 seconds and then prints the date (output go to STDIN.oJob_ID)

echo 'sleep 60;date'|msub -W group_list=pr_hackinars -Apr_hackinars  -l nodes=1:ppn=1,mem=4gb,walltime=86400

See job status (for user myusername). Here you can also see the Job_ID (which is a unique number each job gets)

showq -umyusername

or

qstat -umyusername

Stop one job

canceljob Job_ID

Cancel all myusername's jobs in the queue (don't do this if you don't know what you are doing)

showq -umyusername| grep unning| gawk '{print $1}'|xargs canceljob

Estimate starting time for job

showstart Job_ID

Overview of system status

module load pestat
pestat

Interact with queuing system graphically

firefox

If there is a security warning select:

September 28: Assembly with Velvet and Spades, Philip

Assembly slides

Purpose

Save the world, if not possible in one go, then proceed with one sequence at a time till you reach the goal.

Practical

All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself has to change something there.

Data

Choose some samples to work with from: data/art_seqs/SE and data/art_seqs/PE, and save them in your own directory. Both directories contain artificial sequence data from the ResFinder database, both single and paired end. Remember which files you choose, because you are going to use them and some of the results obtain today for the exercises the next three weeks. For all tasks below you should try with both the single and paired end files.

Choose one isolate from: data/ECs/, and save them in your own directory. These are Escherichia coli isolates from the German / French outbreak in 2011. Again, remember which one you choose, as this is going to be used again for the exercises in the coming weeks.

Dependencies

Lets load the needed modules to do the exercises for today:


module load tools
module load perl/5.24.0
module load anaconda2/4.0.0
module load ncbi-blast/2.2.26
module load SPAdes/3.9.0
module load cgepipeline/20170113

Velvet assembly

In order to assemble our reads with Velvet we are going to find the optimal k, with a program that tries a series of different k’s and then chooses the best one. Try to assemble the artificial reads with the following commands.

Paired end:

wrapper_denovo_2.0.py velvet --shortPaired fastq chopDBX_1.fastq.gz chopDBX_2.fastq.gz \
--sample outX_velvet_pe --n 6 --m 16 --add_velvetg "-very_clean yes"

Single end:

wrapper_denovo_2.0.py velvet --short fastq chopDBX.fastq.gz \
--sample outX_velvet_se --n 6 --m 16 --add_velvetg "-very_clean yes"

There are several options available for velvet and the wrapper that calls it, try to check if something may have been done differently.

wrapper_denovo_2.0.py velvet -h

In order to check the quality of the assembly of the artificial reads, you may want to see if the assembly can be found in the ResFinder database. We do that with BLAST.

blastall -p blastn -i contig.fsa -d data/ResFinderFSA/ResFinder.fsa > blast.out

The best hits for each contig is under the line:

"Sequences producing significant alignments:"

Remember the output of the best template from BLAST, as we are going to use this the next weeks.

When you understand the program and the output then qeue your E. coli reads, it might take some time to complete so we will come back to that.

SPAdes assembly

Like Velvet, SPAdes also has a wrapper program that executes the assembly program itself. Like for Velvet several different values of k are used, but in difference to Velvet, several different sizes k are used to construct the final assembly. Start by making an assembly of your artificial sequences like before:

Paired end:

spades.py -t 6 -m 16 -o outX_spades_pe --careful \
--pe1-1 chopDBX_1.fastq.gz --pe1-2 chopDBX_2.fastq.gz

Single end:

spades.py -t 6 -m 16 -o outX_spades_se --careful \
-s chopDBX.fastq.gz

Note that SPAdes has a tendency not to detect the phred-scale, espicially if the reads have been trimed and are Phred scale 33. All the artificial reads are phred-scale 33, therefore you can add the option “--phred-offset 33”, so that SPAdes does not crash here. Make a quality check with BLAST on the artificial reads, like we did with Velvet. Remember to save the best matching templates here as well.

Try to study the output and different options of SPAdes (spdes.py -h), and check if something may have been done smarter. When you understand the different options and the output of SPAdes, you may qeue your E. coli reads for assembly (without the "--careful" option).

Quality check

You are almost done and deserve a break before you continue, that way the assemblies might be done when you come back.

First of all, check that your assemblies are done and ended successfully.

When assembling bacterial genomes, one of the best quality checks you can do is to do MLST. For standard MLST typing 7 genes are used, and if all of these genes can be found in your assembly, when it is probably not half-bad. To determine the MLST we are going to use MLSTfinder, which is based on BLAST. Note that using BLAST is drastically improved because of the assembly for most cases.

mlst.pl -d /home/databases/cgepipeline/mlst/ -i SRR3415X.fasta -o MLST -s ecoli

Try to see if it is possible to call MLST for both assemblies, if not which assembly method was best.

Finally, we are going to identify antibiotic resistance genes in the E. coli samples. Whole genomes may contain several resistance genes, or even several genes on the same contig, which are going to complicate the BLAST output. Therefore, we are going to use ResFinder for this. Where ResFinder has some post-processing of the BLAST output, so that only the best matches are reported for each region of each contig.

resfinder.pl -i SRR3415X.fasta -d data/ResFinderFSA/ -a ResFinder -o Xres_out -k 95.00 -l 0.60

Again, remember where you save the output of ResFinder, this too is going to be used in the coming weeks.

October 5: Mapping with bwa and bowtie2, Philip+Rolf

Mapping slides

Practical

All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself have to change something there. Remeber what isolates and sequences you analyzed last time, because we are going to compare the results of different appraoches. Last was assembly + BLAST methods, today we go straight for mapping.

Data

You may want to analyze some extra samples today, as mapping is faster than assembly. Depending on your choice of artificial sequences last time, you might want to choose some extra ones, i can recommend sample "1450". The artificial data is in: data/art_seqs/SE and data/art_seqs/PE.
And the Escherichia coli in: data/ECs.

Dependencies

Lets load the needed modules to do the exercises for today:


module load ngs tools perl/5.20.2
module load bwa/0.7.15
module load bowtie2/2.3.2
module load samtools

Indexing

For modern mapping methods, such as Bowtie2 and BWA, we need to index the fasta files of our references first. Even though Bowtie2 and BWA are both using the FM-index, they use slightly different variations of them. Which case means that we are going to index the databases twice, one for use with Bowtie2 and one for BWA.

Bowtie2 indexing:


bowtie2-build X.fsa X.index

BWA indexing:


bwa index X.fsa -p X.index

Today we are going to map against resistance genes, therefore we need to index a fasta file containing the wanted genes. For this purpose we are going to use the ResFinder database, as used last time with BLAST. The database fasta file are located in:
"data/ResFinderFSA/ResFinder.fsa"

Make an indexing of this fasta file, for both Bowtie2 and BWA, and give the obtained indexing some appropiate name.

Mapping

Mostly when mapping with Bowtie2 or BWA, you will want to use the standard options:

Bowtie2 single end:


bowtie2 -U X.fastq.gz -x X.index

Bowtie2 paired end:


bowtie2 -1 X_1.fastq.gz -2 X_2.fastq.gz -x X.index

BWA single end:


bwa mem X.index X.fastq.gz

BWA paired end, (might need the "-p" option):


bwa mem X.index X_1.fastq.gz X_2.fastq.gz


Both of them outputs sam-format to stdout (default), which is a fairly large format. Therefore it is usually good practice to compress them to bam-format (binary sam), this is done easily with samtools through a pipe.


 | samtools view -Sb - > X.bam

If you then want to inspect the bam file afterwards you can use this command.


samtools view X.bam | less

Try to map your reads against the ResFinder database, using the indexing you made before.

The E. coli reads might take a few minutes to complete.

Take a quick look at the mapped reads, is the results as expected, try to compare to the results obtained from last time.
If you are not familiar with the sam-format, then these links might be helpful:
SAM-format
sam flags

Samtools

The sam-format, as you probably has seen, is not very human-readable. It is however a nice format for a computer to parse, as seen above samtools is desinged for handling these files. We have already used samtools to compress sam files, now we are going to analyze the reads a bit further.
The most important options to remeber here, is the: "-f", "-F" and "-q".
"-f" and "-F" is used to select alignment based on flags specified on them. For example "4" is unmapped and "8" is mate unmapped, meaning that "4+8=12" means that both pairs must not map. By using the "-F" insted of "-f" you get the negation, meaning that "12" is both pairs map. The "-q" option sets the threshold for mapping quality, here a low quality means a poor alignment quality and a high one a better alignment quality. Depending on who you ask, there are many "standards". Usually a threshold below 10 is not used.
A quality of 10 usually means that there are about 10% chance that the read is wrongly mapped, and a quality of 0 means that there are close to 0% chance that the read is mapped correctly.
Even though the sam-format is standard, the columns in it is not. Meaning that the mapping quality is calculated differently with Bowti2 and BWA, which then goes on for different mapping methods.

Now for some selection of results obtained before.

Only accept mapped reads:


samtools view -b -F 4 -q 1 X.bam > X.mapped.bam

Only properly paired reads:


samtools view -b -F 12 -f 2 -q 1 X.bam > X.mapped.bam

Try some different combinations of flags, and try to vary the quality threshold.

An easy check to see how many reads map to what in your sample can be obtained by following command, make shure you undestand it before executing.


samtools view X.bam | cut -f 3 | sort | uniq -c > X.template_count.tab

You can add the flags directly to the samtools command above, and pipe the output to less in order to quickly test different settings of samtools.

Try to see if you can get the same results on the artificial dataset as you got last time.
Do you see something strange in the results, obtained by either mapping or assembly + BLAST.
Which do you trust more for this example, remember the assumptions of each method.
Try to compare the results with the E. coli from last time too. If the assembly did not finish last time, then the SPAdes assemblies is available here: "/home/projects/pr_hackinars/data/assembly/SPAdes/".

Call SNPs and Infer Phylogeny

A very important use of mapping is within the area of bacterial outbreak detection and monitoring. We will conduct and exercise, where we employ the tool CSI Phylogeny that relies heavily on correct mapping results from BWA.

Prerequisites
# Load general modules for queue etc.
module load tools ngs pestat gcc/6.2.0 perl R torque moab anaconda/4.4.0
# Load SNP pipeline tools
module load bwa/0.7.10 bedtools/2.26.0 samtools/0.1.18 FastTree/2.1.8
# Load perl environment
export PERL5LIB=/home/projects/pr_hackinars/data/perl_modules/lib
Data

The data is raw sequencing reads and a couple of assemblies. The data is located in the folder:

/home/projects/pr_hackinars/data/outbreaks/Haiti_Cholerae/

The reference sequence is found within the "ref" folder.

Running CSI Phylogeny

It is a good idea to run CSI Phylogeny within a screen session, as it runs for quite a while and it will wait for all jobs to finish and in the meantime it writes status messages to STDOUT.

Ks_SNP_tree_pipeline.pl --qbot_dir <dir> --out_dir <dir> --reference <fasta> <arg1> ..<argN> 

You will encounter the following warning:

rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/scripts/*’: No such file or directory
rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/output/*’: No such file or directory
rm: cannot remove ‘/home/projects/pr_hackinars/people/user/test_execise/qbot_dir/error/*’: No such file or directory

Don't worry about it. CSI Phylogeny is just making sure that no temporary files from a previous run interferes with the current run.

CSI Phylogeny Options
-o <dir>				All output files will be stored here.

--temp_dir | -d <dir>			Location of tmp files. Default is tmp folder inside the qbot_dir.
--filter_depth | -fd <int>		default: 10.
--filter_relative_depth | -frd <int>	default: 10. Meaning 10 pct of avg.
--filter_min_depth | -fmd <int>		default: 10. Only applicable if using rel dp option.
--filter_prune | -fp <int>		default: 10. Meaning 10 SNPs.
--filter_map_quality | -fmq <int>	default: 0. Recommended 25.
--filter_snp_quality | -fsq <int>	default: 0. Recommended 30.
--filter_z | -fz <double>		default: 0. Recommended 1.96.

--as_prune | -ap <int>			default: 0. Recommended 10 SNPs. Pruning for assembled data.

--queue_filter_mem <str>		default: 3900m. Increase only if your filter job gets killed due to memory.
CSI Phylogeny Example
perl /home/projects/pr_hackinars/data/Ks_SNP_tree_pipeline.pl -qb qbot_dir -o CIS_out_dir -r NC_000913.3.fsa -fmq 25 -fsq 30 /
-fp 10 -ap 10 -fz 1.96 -frd 10 -fmd 10 /path/to/data/*.gz /path/to/more/data/*.fa 

October 12: Mapping with KMA, Philip

Practical

All paths mentioned in the following are relative to the “/home/projects/pr_hackinars/” directory or absolute paths. If an “X” (capital x) occurs in the commands below, it means that you yourself have to change something there. Remeber what isolates and sequences you analyzed last time, because we are going to compare the results of the different appraoches.


Data

Today you might want to analyze all of the Escherichia coli, as KMA is extremely fast. Compared to other methods that is "rapid", KMA is operating at near WARP speed.

Dependencies

Today there is no dependencies to load, we are however going to add KMA to path for convience.


export PATH=$PATH:/home/projects/pr_hackinars/apps/


Indexing

In order for KMA to map against a reference database, we need to index the databases liek last time. Today we are going to map resistance genes and species, the databases has however been indexed for you in: "/home/projects/pr_hackinars/data/kma_databases"

If you want to index your own database, then this is how you do it:


kma_index -i DB.fsa -o DB.index

Mapping

When mapping the resistance we would like full alignment of our sequences, while we can do with the identification of the proper reference for the species mapping. Therefore there are two main mappings with KMA, one full alignemnt and one "sparse" where the alignments are skipped.

Full alignemnt with KMA, on resistance genes:


kma -i X_1.fastq.gz X_2.fastq.gz -o X -t_db /home/projects/pr_hackinars/data/kma_databases/ResFinder -SW -ID 80

"-i" gives the input, "-o" the output, "-t_db" is the indexing and "-SW" specifies that we want to use the Smith Waterman algorithm between the seeds.

Sparse mapping of species with KMA.


kma -i X_1.fastq.gz X_2.fastq.gz -o X -t_db /home/projects/pr_hackinars/data/kma_databases/bacterial_compl_genomes_hq99_k13_ATG -Sparse -shm -ID 50

Where "-shm" specifies that the database has been set up for sharing, which allows for several parallel calls of KMA where the database is only loaded in one place.


Save the world

Two of your colleagues has been reported sick, and are infected with some pathogen. Luckily they were able to sequence the pathogens before they were hospitalized. It is now your job to save your colleages and analyze the sequencing they have performed. In this case we would like to know the species of the samples, and whether they carry any resistance genes which might lead us in direction for a proper treatment. The sequences are collected in: "/home/projects/pr_hackinars/data/unknownInfections"

October 19: No teaching (Autumn “potato picking” break)

October 26: Working with GitHub/Bitbucket, Lukasz

The tutorial is on github:

Files making the basis of the tutorial: tutorial

The live version which changes during the tutorial: tutorial-livel

November 2: Working with Docker, Lukasz

November 9: Work flow managers, Lukasz

November 16: Writing a thesis, Ole

How to write a paper slides

November 23: Phylogeny, Judit

Phylogenetics slides

Practical

The commands given below assume that you are in your own directory. Most of the tools produce multiple output files, therefore it is practical to create the sub-folders in your directory, when running them.

Data

We will investigate an outbreak today. It's 2015 and patients in a US hospital acquire listeriosis... apparently from drinking milkshakes!? We have 10 samples we want to test, to figure out what is the source of the outbreak. They were taken at different times and sequenced on an Illumina platform. Source

You can find all the data in

/home/projects/pr_hackinars/data/outbreaks/ice_cream_outbreak

Dependencies

The following modules are needed for today's exercise:

module load tools
module load perl/5.24.0
module load anaconda2/4.0.0
module load mafft/7.310
module load phyml/3.1
module load newick-utils/1.6

Start an interactive session:

qsub -W group_list=pr_hackinars -A pr_hackinars -l nodes=1:ppn=28,walltime=4:00:00,mem=120Gb -I

Quick look

The samples have been assembled using Spades, and 16S rRNA sequences were extracted from them with RNAmmer 1.2. This sequence is present in all bacterial species, thus can be used to infer phylogeny for even distantly related isolates. Let's concatenate the fasta files together and align them using MAFFT.

mkdir ssu_tree
cd ssu_tree
cat /home/projects/pr_hackinars/data/outbreaks/ice_cream_outbreak/ssu/SRR* > outbreak_ssu.fasta
mafft --maxiterate 1000 --localpair outbreak_ssu.fasta > outbreak_ssu_aligned.fna

The --localpair option specifies that mafft produces a local (and not global) alignment.

Next, we have to convert the fasta format into interlaced phylip, as that's the input format of PhyML, the maximum likelihood based program we want to use for analyzing our (few and short) sequences.

/home/projects/pr_hackinars/src/fasta2phy.py outbreak_ssu_aligned.fna outbreak_ssu_aligned.phy

Check out the PhyML options:

PhyML -h

We specify that we want the Jukes-Cantor model and estimated base frequencies (and not counted), with 100 rounds of bootstrapping:

PhyML -i outbreak_ssu_aligned.phy -d nt -m JC69 -s BEST -f m -b 100

The output quickly reveals, how many different positions are in the alignment. Is it enough to make informed decisions about the outbreak? Let's look at the consensus tree and the clade support count values (the numbers at the base of the clades) with newick-utils.

nw_display outbreak_ssu_aligned.phy_phyml_tree.txt

The difference between the isolates is not great enough to rely just on one short sequence. A whole-genome approach is required.

SNP based phylogeny

Someone has already predicted subtypes for the isolates. The results can be seen in

/home/projects/pr_hackinars/data/outbreaks/ice_cream_outbreak/hackinars_samples.tmpl

As we already suspected, all of the samples were predicted to have the same closely matching reference genome. Thus we can use NDtree, which is a SNP based phylogeny pipeline, that evaluates each base-calling to have a minimum confidence (otherwise it's N). It calculates the genetic distance based on these "true" SNPs, then clusters the sequences with UPGMA and NJ.

cd ..
mkdir ndtree_r
cd ndtree_r
/home/projects/pr_hackinars/src/NDtree/NDtree_p.py \
-f /home/projects/pr_hackinars/data/outbreaks/ice_cream_outbreak/ice-cream.list \
-t /home/projects/pr_hackinars/data/outbreaks/ice_cream_outbreak/pt_Listeria_monocytogenes_J0161_NC_017545_1.fa \
-n 2 -z 1.96 -k 17 -a

This is the parallel version of NDtree. -f is the file containing the list of raw reads, -n is the number of reads for one isolate (paired 2, single 1), -t is the template, -k is the k-mer size, -z is the Z-score threshold for base-calling (corresponding to a p-value of 0.05), and -a means that only those positions are used in the distance calculation, where none of the sequences have unknown bases. It runs for approx. 15 minutes.

nw_display tree.upgma.newick
nw_display tree.nj.newick

Do the two trees agree with regards to the relatedness of the sequences? (Note: the floating 1-s are just part of the tip labels)

Maximum likelihood

We can continue with a maximum likelihood based analysis, as we can easily construct a multiple alignment from the consensus sequences. Then we have two possible approaches. First, let's extract those columns from the alignment that have polymorphisms and feed that mini alignment to PhyML.

cat *.assimpler.con > ice_cream_wgs_align.fsa
mkdir phyml_r
/home/projects/pr_hackinars/src/NDtree/stripali_pos.py \
-i $PWD/ice_cream_wgs_align.fsa \
-o $PWD/phyml_r/ice_cream_snp.fsa

cd phyml_r
cat ice_cream_snp.fsa
/home/projects/pr_hackinars/src/fasta2phy.py ice_cream_snp.fsa ice_cream_snp.phy
PhyML -i ice_cream_snp.phy -d nt -m JC69 -s BEST -f m -b 100

How many polymorphic sites are in these whole-genome sequences? Very few, but still more than just in the 16S rRNA. Change the sequence identifiers back to the original ones before taking a look at the tree:

/home/projects/pr_hackinars/src/NDtree/chnam.py \
-i ice_cream_snp.phy_phyml_tree.txt -s ../names \
-o ice_cream_snp.phy_phyml_tree_names.nw -t2

nw_display ice_cream_snp.phy_phyml_tree_names.nw

Did the tree topology change? Are the clades more resolved compared to the distance based trees? The clade support (count) values are low, why?

Next, we use IQ-tree for inferring a phylogeny from the whole-genome alignments. IQ-tree is an approximately maximum likelihood based inference program, which means that it cuts some corners during the exploration of the tree-space, in order to finish in a reasonable time.

cd ..
mkdir iqtree_r
/home/projects/pr_hackinars/src/iqtree-1.6.beta4-Linux/bin/iqtree  \
-s ice_cream_wgs_align.fsa \
-st DNA -m TEST -nt AUTO -b 100

-s is the input alignment, -st is the data type (nucleotide sequences), -m TEST option will perform a model test on the dataset, determining the substitution model, -b is bootstrapping (1000 would be better, but that runs for approx. 30 mins). IQ-tree is capable of parallelizing some steps, -nt AUTO will determine the optimal number of threads to use.

mv ice_cream_wgs_align.fsa.* iqtree_r
cd iqtree_r
/home/projects/pr_hackinars/src/NDtree/chnam.py \
-i ice_cream_wgs_align.fsa.treefile -s ../names \
-o ice_cream_wgs_align.fsa.treefile_names.nw -t2

nw_display ice_cream_wgs_align.fsa.treefile_names.nw

What substitution model did you use? Did the tree topology change? Are the clades more resolved now? The clade support percentage values are very high, why?

Some additional information for the samples:

Sampling time Source Run ID
November, 2015 Ice cream/Hospital X SRR3052035
November, 2015 Ice cream/Hospital X SRR3053137
November, 2015 Ice cream/Hospital X SRR3086932
November, 2015 Environmental/Hospital X SRR3086935
November, 2015 Environmental/Hospital X SRR3086936
November, 2015 Clinical SRR2994642
March, 2015 Environmental/Company A SRR1974103
December, 2014 Ice cream/Company A SRR3091405
December, 2014 Environmental/Company A SRR3130409
April, 2015 Environmental/Company B SRR2035442

Is the infection truly hospital-acquired? Which company is more likely to be the one where the infecting strain originates from?

If you have more time, you can also run CSI Phylogeny on these isolates.

November 30, Statistics and Figures with R, Lars

December 7: Machine Learning of microbial phenotypes

Tensorflow visualization