The purpose of this exercise is to generate phylogenetic trees of a subset of the MRSA isolates from the paper by Harris et al. 2013 using two different web services: NDtree and CSIPhylogeny.
The first step in making your phylogenetic trees is to find a suitable reference genome. This is most easily done using KmerFinder (https://cge.cbs.dtu.dk/services/KmerFinder/). A good choice of reference is the genome in KmerFinder's database with which most of the isolates share the highest number of k-mers.
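If you note down the top KmerFinder hit for each isolate, the selection rule above can be sketched in a few lines of Python. This is only an illustration: the isolate names and hit names below are made-up placeholders, not real KmerFinder output, and the function name `pick_reference` is hypothetical.

```python
from collections import Counter

def pick_reference(top_hits):
    """Return the genome that is the top hit for the most isolates,
    together with the number of isolates supporting it."""
    counts = Counter(top_hits.values())
    best, support = counts.most_common(1)[0]
    return best, support

# Placeholder data: isolate -> top hit noted down by hand from KmerFinder.
top_hits = {
    "isolate_A": "ref_X",
    "isolate_B": "ref_X",
    "isolate_C": "ref_Y",
}

reference, support = pick_reference(top_hits)
print(f"{reference} is the top hit for {support} of {len(top_hits)} isolates")
```

If two genomes tie, `most_common` returns the one that was seen first, so in that case it is worth looking at the actual k-mer counts by eye.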
1. Which strain do you select as the reference genome?
I find that the easiest way to retrieve the sequence of the reference genome is to search Google for "ncbi complete genome" followed by the accession number of the reference genome. Clicking the top hit takes you to the NCBI entry of the reference genome in GenBank format. Start by selecting "FASTA" as the format in the drop-down menu in the top right. Next, in the drop-down menu labelled "Send to", select "File". The reference will be saved to your downloads folder with the file name sequence.fasta. I suggest you rename the file so that you can remember what it contains.
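Renaming can also be done with a small script. The sketch below assumes that the first line of the downloaded file is a FASTA header of the form ">ACCESSION description", and it names the file after the accession. The function name `rename_by_accession` is hypothetical, and the accession shown in the comment is a placeholder.

```python
from pathlib import Path

def rename_by_accession(fasta_path):
    """Rename a FASTA file to <accession>.fasta using its header line.

    Assumes the first line looks like '>ACCESSION some description'.
    """
    path = Path(fasta_path)
    with path.open() as handle:
        header = handle.readline().strip()
    tokens = header[1:].split()
    if not header.startswith(">") or not tokens:
        raise ValueError(f"{path} does not start with a FASTA header")
    target = path.with_name(f"{tokens[0]}.fasta")
    path.rename(target)
    return target

# Example call (placeholder accession):
# a file starting with '>XX000000.1 ...' would be renamed to XX000000.1.fasta
# rename_by_accession("sequence.fasta")
```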
Once you have downloaded the reference genome as a FASTA file, you are ready to make the trees using the two different tools.
Go to NDtree. Note the links to "Instructions" and "Output", where you can find guidance for the submission step and interpretation of results.
NDtree is very easy to use; you just need to know that it uses the term "template genome" instead of "reference genome". All you have to do is point to the reference (template) genome you just downloaded via the "Browse..." button. If you do not specify a reference genome, the service will automatically pick one from a number of available reference (template) genomes (but don't use this feature for this exercise). Next, select the files for which you want to make a phylogenetic tree using the button marked "Isolate File" (NDtree can take multiple files at once). Leave all other settings as they are and press "Upload".
Note: If the quality of one or more of your draft genomes is very poor, you should omit them, as they will otherwise distort your tree.
Note: If you have raw sequence reads available, they will give a more precise result, but to save time, we will only use assembled genomes.
Have a look at the tree generated by NDtree (pre-run results can as usual be found HERE). You should note that the two non-outbreak strains are clear outliers and that, as a consequence, the resolution among the remaining isolates is very low. Accordingly, re-run NDtree, but this time without the two non-outbreak strains.
After having run NDtree with only the outbreak strains, locate "Downloads" and download the files "distance.txt.gz" and "tree.nj.newick.gz". The file "distance.txt.gz" is a gzip-compressed text file that contains the distance matrix.
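If your browser or operating system does not unpack the .gz files for you, they can be decompressed with Python's standard library. This is a minimal sketch; the helper name `gunzip` is hypothetical, and the file names are the ones from the step above, assumed to be in the current folder.

```python
import gzip
import shutil
from pathlib import Path

def gunzip(gz_path):
    """Decompress 'name.gz' to 'name' in the same folder and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops the trailing '.gz'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

# Decompress the two NDtree downloads if they are present in this folder.
for name in ("distance.txt.gz", "tree.nj.newick.gz"):
    if Path(name).exists():
        print("wrote", gunzip(name))
```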
The .newick file can be opened in FigTree, a tree viewing program. If you have not installed FigTree yet, it can be downloaded from here: figtree.
Open the tree.nj.newick file in FigTree (decompress tree.nj.newick.gz first if your browser has not already done so).
The "radial tree layout" is similar to the layout that was shown on the NDtree output page. Try selecting the "Rectangular tree layout" and adding "Branch Labels".
Now answer the following questions:
2. How many SNPs are there between P11 and P12?
3. Which isolate differs the most from HW1?
Take a look at the distance matrix.
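The distance matrix can also be read programmatically. The sketch below assumes a PHYLIP-style square matrix (an optional first line with the number of isolates, then one line per isolate with its name followed by the distances); check this against your own file, as the actual layout may differ. The isolate names and numbers in the example are invented and are not results from this exercise.

```python
def read_distance_matrix(text):
    """Parse a square, whitespace-separated distance matrix.

    Returns (names, dist), where dist[(a, b)] is the distance between a and b.
    Assumes every row holds a name followed by one value per isolate.
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    if len(rows[0]) == 1 and rows[0][0].isdigit():
        rows = rows[1:]  # skip the optional isolate-count line
    names = [row[0] for row in rows]
    dist = {}
    for row in rows:
        for other, value in zip(names, row[1:]):
            dist[(row[0], other)] = float(value)
    return names, dist

def farthest_from(names, dist, query):
    """Return the isolate with the largest distance to `query`."""
    return max((n for n in names if n != query), key=lambda n: dist[(query, n)])

# Invented example data.
example = """3
iso1 0 4 9
iso2 4 0 7
iso3 9 7 0
"""
names, dist = read_distance_matrix(example)
print(dist[("iso1", "iso2")], farthest_from(names, dist, "iso1"))
```

To use it on your own data, pass the contents of the decompressed distance.txt file as `text`.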
4. Do the SNPs reported in the distance matrix correspond to the branch length numbers in the tree?
5. Is it likely that all isolates are part of the same outbreak according to the species-specific definitions of SNP thresholds as listed in the paper by Schürch et al 2018?
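For question 4, it may help to remember that the distance between two isolates in a tree is the sum of the branch lengths along the path connecting them. The sketch below computes that sum from a Newick string. It is a deliberately small parser written for this illustration: it assumes plain labels of the form name:length without quotes or comments, and the example tree is invented.

```python
import re

def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", text))
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1
            children.append(node())
            while tokens[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip the closing ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in {"(", ")", ",", ";"}:
            label = tokens[pos]
            pos += 1
        name, _, length = label.partition(":")
        return {"name": name, "length": float(length) if length else 0.0,
                "children": children}

    return node()

def root_paths(node, prefix=()):
    """Map each leaf name to its path from the root as (node id, branch length) pairs."""
    here = prefix + ((id(node), node["length"]),)
    if not node["children"]:
        return {node["name"]: here}
    paths = {}
    for child in node["children"]:
        paths.update(root_paths(child, here))
    return paths

def tree_distance(tree, a, b):
    """Sum the branch lengths on the path between leaves a and b."""
    paths = root_paths(tree)
    pa, pb = paths[a], paths[b]
    shared = 0
    while shared < min(len(pa), len(pb)) and pa[shared][0] == pb[shared][0]:
        shared += 1
    return sum(l for _, l in pa[shared:]) + sum(l for _, l in pb[shared:])

# Invented example tree.
tree = parse_newick("((A:1,B:2):0.5,C:3);")
print(tree_distance(tree, "A", "B"), tree_distance(tree, "A", "C"))
```

Comparing these sums with the corresponding entries in the distance matrix shows whether the tree reproduces the pairwise distances exactly or only approximately.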
Go to CSIPhylogeny. Note again the links to "Instructions" and "Output".
Create a tree using the default settings and the same reference (template) genome that you used for NDtree, and upload all ten files (including the two non-outbreak files).
When the analysis is done, you should again note that the two non-outbreak strains are clear outliers. Before rerunning CSIPhylogeny without the outliers, have a look at the distance matrix (Download matrix as [TXT]).
6. Looking at the SNP distances listed in the distance matrix, can you confirm that the two non-outbreak strains are indeed not part of the outbreak?
Rerun CSIPhylogeny without the two non-outbreak strains.
When the new analysis is done, download the distance matrix again. Now, answer the following:
7. Are the SNP distances between each pair of isolates the same in this new distance matrix as in the distance matrix generated when the two non-outbreak strains were also included? If not, how can that be explained?
Download the tree in .newick format and open it in FigTree.
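If you prefer not to compare the two matrices by eye, the comparison can be sketched as below. It assumes each matrix has already been loaded into a dictionary keyed by isolate pair; the function name `changed_pairs` is hypothetical, and the names and numbers are invented for illustration.

```python
def changed_pairs(dist_all, dist_subset):
    """Return {pair: (distance_in_full_run, distance_in_subset_run)}
    for every pair present in both runs whose distance differs."""
    changes = {}
    for pair, d_subset in dist_subset.items():
        d_all = dist_all.get(pair)
        if d_all is not None and d_all != d_subset:
            changes[pair] = (d_all, d_subset)
    return changes

# Invented example: the first run also contains an outlier isolate.
run_with_outliers = {
    ("iso1", "iso2"): 3, ("iso1", "iso3"): 8, ("iso1", "outlier"): 900,
}
run_without_outliers = {
    ("iso1", "iso2"): 4, ("iso1", "iso3"): 8,
}

for pair, (before, after) in changed_pairs(run_with_outliers, run_without_outliers).items():
    print(pair, "changed from", before, "to", after)
```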
8. Does the tree look like the NJ tree created by NDtree? If not, what could be the explanation?
9. Compare the trees made by NDtree and CSIPhylogeny with the tree in the figure on page 132 (panel F) in Harris et al. 2013. Are the trees similar?