Exercise 3

Phylogeny

The purpose of this exercise is to generate phylogenetic trees for a subset of the MRSA isolates from the paper by Harris et al 2013, using two different web services: NDtree and CSIPhylogeny.

Data

Exercise3/ERR070035-not-outbreak.fsa

Exercise3/ERR070041-not-outbreak.fsa

Exercise3/P11.fsa

Exercise3/P12.fsa

Exercise3/P15.fsa

Exercise3/P22.fsa

Exercise3/P24.fsa

Exercise3/P26.fsa

Exercise3/HW1.fsa

Exercise3/HW2.fsa

 

Analysis

The first step in making your phylogenetic trees is to find a suitable reference genome. This is most easily done using KmerFinder (https://cge.cbs.dtu.dk/services/KmerFinder/). A good choice of reference is the genome in KmerFinder's database that is the top hit, i.e. shares the highest number of k-mers, for most of the isolates.
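To make the selection rule concrete, the following minimal Python sketch counts shared k-mers between each isolate and each candidate reference, then picks the reference that is the top hit for most isolates. It only illustrates the idea and is not KmerFinder's actual implementation; the function names, the toy sequences and the value of k are assumptions made for this example.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_reference(isolates, references, k):
    """Return the reference name that is the top hit for most isolates.

    isolates and references map a name to a nucleotide string.
    """
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    votes = Counter()
    for iso_seq in isolates.values():
        iso_kmers = kmers(iso_seq, k)
        # Top hit = reference sharing the largest number of k-mers.
        top = max(ref_kmers, key=lambda name: len(iso_kmers & ref_kmers[name]))
        votes[top] += 1
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration (real genomes are millions of bases long).
references = {
    "refA": "ACGTACGTTGCAAGCT",
    "refB": "TTTTGGGGCCCCAAAA",
}
isolates = {
    "iso1": "ACGTACGTTGCA",
    "iso2": "CGTTGCAAGCT",
    "iso3": "GGGGCCCCAA",
}
print(best_reference(isolates, references, k=4))
```

In this toy example two of the three isolates share most k-mers with refA, so refA would be chosen as the reference.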

1. Which strain do you select as the reference genome?

Note that in the output table from KmerFinder, each database hit is accompanied by a link marked "get sequence". Clicking this link leads you to an FTP site where the genome sequence can be downloaded. The file you are interested in, which contains the nucleotide sequence in FASTA format, has the extension .fna. If there are several .fna files, they represent the chromosome and the individual plasmids of the isolate. For this exercise, we will only use the chromosome as a reference. The first (largest) .fna file contains the chromosome sequence.
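If the download contains several .fna files, the chromosome can be identified by comparing total sequence lengths. The following is a minimal sketch, assuming each file is plain FASTA; the file names and sequences are hypothetical stand-ins, not the actual files on the FTP site.

```python
def sequence_length(fasta_lines):
    """Return the total number of sequence characters in a FASTA file.

    Header lines (starting with ">") and blank lines are not counted.
    """
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def pick_chromosome(fna_files):
    """Return the name of the file holding the longest total sequence.

    fna_files maps a file name to an iterable of its lines.
    """
    lengths = {name: sequence_length(lines) for name, lines in fna_files.items()}
    return max(lengths, key=lengths.get)


# Hypothetical miniature files standing in for a chromosome and a plasmid.
demo_files = {
    "chromosome.fna": [">chr", "ACGTACGTACGT", "ACGTACGT"],
    "plasmid.fna": [">p1", "ACGTAC"],
}
print(pick_chromosome(demo_files))
```

With real files, each value could be an open file handle, for example `open("some_file.fna")`, since the functions only iterate over lines.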

IMPORTANT: KmerFinder has very recently been updated, which means that the reference genome you will find today using KmerFinder is not the same as the one I found a few weeks ago and used to generate the phylogenetic trees linked from the results page. If you want to make trees identical to those on the results page, do not use the reference genome you just found, but the one you can download from here: NC_017763_ref_mrsa.

Once you have downloaded the reference genome as a FASTA file, you are ready to make the trees using the two different tools.

NDtree

Go to NDtree. Note the links to "Instructions" and "Output", where you can find guidance for the submission step and interpretation of results.

NDtree is very easy to use; you just need to know that it uses the term "template genome" instead of "reference genome". First, point to the reference (template) genome you just found via the "Browse..." button. If you do not specify a reference genome, the service will automatically pick one among a number of available reference (template) genomes (but don't use this feature for this exercise). Next, use the button marked "Isolate File" to select the files for which you want to make a phylogenetic tree (NDtree can take multiple files at once). Leave all other settings as they are and press "Upload".

Note: If the quality of one or more of your draft genomes is very poor, you should omit them, as they will otherwise distort your tree.

Note: If you have raw sequence reads available, they will give a more precise result, but to save time, we will only use assembled genomes.
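One rough way to spot a very poor draft genome (first note above) is to look at how fragmented the assembly is. The sketch below computes the number of contigs, the total length and the N50 from FASTA lines. N50 is a commonly used summary statistic, but using it here is an assumption about what a useful check looks like, not a criterion defined by this exercise; what counts as "very poor" is left to your judgement, and the function names are hypothetical.

```python
def contig_lengths(fasta_lines):
    """Return the length of each record (contig) in a FASTA file."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)  # a header starts a new contig
        elif line and lengths:
            lengths[-1] += len(line)
    return lengths


def assembly_stats(lengths):
    """Return contig count, total length and N50 for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken
    from the longest contig downwards, first reaches half the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50 = 0
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(ordered), "total_length": total, "n50": n50}


# Made-up contig lengths for illustration.
print(assembly_stats([100, 80, 50, 30, 20, 10, 10]))
```

An assembly with many more contigs, or a much lower N50, than the other isolates of the same species is a candidate for closer inspection before it is included.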

Have a look at the tree generated by NDtree (pre-run results can as usual be found HERE). You should note that the two non-outbreak strains are clear outliers and that, as a consequence, the resolution among the remaining isolates is very low. Therefore, re-run NDtree, this time without the two non-outbreak strains.

After running NDtree with only the outbreak strains, locate "Downloads" and download the files "distance.txt.gz" and "tree.nj.newick.gz". The file "distance.txt.gz" is a gzip-compressed text file containing the distance matrix.
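The distance matrix can also be read programmatically without unpacking it by hand. The sketch below uses Python's gzip module and assumes a PHYLIP-style layout: an optional first line with the number of isolates, then one row per isolate consisting of a name followed by whitespace-separated distances. That layout is an assumption, so check it against your own downloaded file; the function names are hypothetical.

```python
import gzip


def parse_distance_matrix(lines):
    """Parse a PHYLIP-style distance matrix (square or lower-triangular).

    Returns (names, dist) where dist maps (name_a, name_b) to a float.
    Assumes isolate names contain no whitespace.
    """
    rows = [line.split() for line in lines if line.strip()]
    if rows and len(rows[0]) == 1 and rows[0][0].isdigit():
        rows = rows[1:]  # drop the leading isolate-count line
    names = [row[0] for row in rows]
    dist = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row[1:]):
            # Store both directions so lookups work in either order.
            dist[(names[i], names[j])] = float(value)
            dist[(names[j], names[i])] = float(value)
    return names, dist


def read_gzipped_matrix(path):
    """Read a gzip-compressed distance matrix directly from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_distance_matrix(handle)


# Made-up matrix for illustration; not real NDtree output.
demo = ["3", "A 0 2 5", "B 2 0 4", "C 5 4 0"]
names, dist = parse_distance_matrix(demo)
print(names, dist[("A", "C")])
```

For the downloaded file, a call such as `read_gzipped_matrix("distance.txt.gz")` would return the same structure, provided the layout assumption holds.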

The .newick file can be opened in FigTree, a tree viewing program. If you have not installed FigTree yet, it can be downloaded from here: FigTree.

Unpack tree.nj.newick.gz and open the resulting tree.nj.newick file in FigTree.

The "Radial tree layout" is similar to the layout shown on the NDtree output page. Try selecting the "Rectangular tree layout" and adding "Branch Labels".

Now answer the following questions:

2. How many SNPs are there between P11 and P12?

3. Which isolate differs the most from HW1?

Take a look at the distance matrix.

4. Do the SNPs reported in the distance matrix correspond to the branch length numbers in the tree?

5. Is it likely that all isolates are part of the same outbreak according to the species-specific definitions of SNP thresholds as listed in the paper by Schürch et al 2018?
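For question 4, the distance implied by the tree between two isolates is the sum of the branch lengths on the path connecting them. The following sketch computes these path lengths from a Newick string so they can be set beside the values in the distance matrix. It uses a deliberately small parser that assumes plain unquoted names and no comments, and the example tree is made up, not NDtree output.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def path_lengths(tree):
    """Return a dict mapping (leaf_a, leaf_b) to the summed branch length between them."""
    dist = {}

    def walk(node):
        name, length, children = node
        if not children:
            return {name: length}
        below = {}  # leaf -> distance from the leaf up to this node
        for child in children:
            child_leaves = walk(child)
            for a, da in below.items():
                for b, db in child_leaves.items():
                    dist[(a, b)] = dist[(b, a)] = da + db
            below.update(child_leaves)
        # Add this node's own branch so the parent sees distances to itself.
        return {leaf: d + length for leaf, d in below.items()}

    walk(tree)
    return dist


# Made-up tree for illustration.
demo_tree = parse_newick("((A:1,B:2):0.5,C:3);")
print(path_lengths(demo_tree)[("A", "C")])
```

In the toy tree, the path from A to C runs over branches of length 1, 0.5 and 3. Comparing such sums with the matrix entries for the same pairs shows how closely the tree reproduces the pairwise distances.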

CSI Phylogeny

Go to CSIPhylogeny. Note again the links to "Instructions" and "Output".

Create a tree using the default settings and the same reference (template) genome that you used for NDtree, and upload all ten files (including the two non-outbreak files).

When the analysis is done, you should again note that the two non-outbreak strains are clear outliers. Before rerunning CSIPhylogeny without the outliers, have a look at the distance matrix (Download matrix as [TXT]).

6. Looking at the SNP distances listed in the distance matrix, can you confirm that the two non-outbreak strains are indeed not part of the outbreak?

Rerun CSIPhylogeny without the two non-outbreak strains.

When the new analysis is done, download the distance matrix again. Now, answer the following:

7. Are the SNP distances between each pair of isolates the same in this new distance matrix as in the distance matrix generated when the two non-outbreak strains were also included? If not, how can that be explained?

Download the tree in .newick format and open it in FigTree.

8. Does the tree look like the NJ tree created by NDtree? If not, what could be the explanation?

9. Compare the trees made by NDtree and CSIPhylogeny with the tree in the figure on page 132 (panel F) in Harris et al 2013. Are the trees similar?