Advanced workshop ex7

PURPOSE

To create a phylogenetic tree with five MRSA outbreak strains using the CGE tool NDtree.

DATA

Paired-end reads in fastq format from five MRSA outbreak strains described in the paper from Harris et al 2013:

HW_1.fastq.gz

HW_2.fastq.gz

P11_1.fastq.gz

P11_2.fastq.gz

P12_1.fastq.gz

P12_2.fastq.gz

P24_1.fastq.gz

P24_2.fastq.gz

P26_1.fastq.gz

P26_2.fastq.gz

Reference genome from NCBI in fasta format:

NC_017763_ref_MRSA.fna

Note: Remember from the introductory workshop that an appropriate reference can, for instance, be found by using KmerFinder to identify the NCBI reference genome that resembles the test strains the most. In particular in relation to outbreaks, an alternative is to use a draft assembly made from the reads from one of the test strains.

WORKFLOW

Preparing file with names of input files

When running NDtree command line, you initially have to prepare a file that contains the names of all the input files. Prepare this file in Sublime Text or a similar “pure” text editor (not Words). File names related to the same strain must be listed on the same line separated by a tab (note: there are no empty lines between the lines with file names):

HW_1.fastq.gz        HW_2.fastq.gz

P11_1.fastq.gz       P11_2.fastq.gz

P12_1.fastq.gz       P12_2.fastq.gz

P24_1.fastq.gz      P24_2.fastq.gz

P26_1.fastq.gz      P26_2.fastq.gz

Note: If you want the reference to be part of the tree, you should add an extra line with the name of the file containing the reference genome. We will, however, not do that in this exercise.

Save the file as “inputdata.list” in the same folder as all the input files.

Launching new AWS instance

Start by launching an AWS instance similarly to how we did it in Ex. 2: NOTE: In step 1 select the following Amazon Machine Image (AMI): Ubuntu Server 16.04 LTS (HVM), SSD Volume Type. In step 2 choose the instance type t2.micro (free).

Remember that accessing an AWS Ubuntu machine is in general done like this:

$ ssh -i ~/.ssh/[KeyPair.pem] ubuntu@[IP-Address]

Preparing the AWS instance

Copy/paste each of the below commands:

$ sudo apt update

$ sudo apt upgrade

Note: If a dialog box appears when running the above command, just select the proposed option by pressing Enter.

$ sudo apt install make gcc python2.7 python-pip

Cloning the ndtree bitbucket repos:

$ git clone https://bitbucket.org/mettevoldbylarsen/ndtree

Installing Phylip takes a few steps:

$ wget http://evolution.gs.washington.edu/phylip/download/phylip-3.697.tar.gz

$ tar -zxvf phylip-3.697.tar.gz

$ cd ~/phylip-3.697/src

$ cp Makefile.unx Makefile

$ make install

Installing Numpy:

$ pip install numpy

Note: In my experience, the command prompt is in the middle of the progress bar after installing Numpy. Hit Enter until the command prompt is clear of the progress bar. It doesn’t have a practical purpose, but it looks very confusing otherwise.

Copy the data from your local computer to AWS. First ensure that all data to be used (paired-end reads, reference genome, the file inputdata.list) is in the same folder. Next, type:

$ scp -i ~/.ssh/[KeyPair.pem] -r [data-folder] ubuntu@[IP-Adress]:~

Note: The above command is our usual command for copying data from the local computer to AWS. With the many large files it can be quite slow. An alternative is to download a tarball containing the paired-end reads and reference genome from Dropbox to AWS, and only upload the inputdata.list from your local computer. We can use wget for downloading the tarball from Dropbox.

On AWS type:

$ wget https://www.dropbox.com/s/zj2bfa5l4n41rs9/data_ex7.tar.gz?dl=0

Followed by

$ tar -zxvf ‘data_ex7.tar.gz?dl=0'

Which should create the folder data_ex7 (if you experience problems with this command, it might be because one or both of the single quotes, ‘ , is converted to a backtick, `, during the copy/paste process. If it is, change it back to a single quote before hitting Enter and the extraction should work).

Now, for running NDtree on AWS, first move to the data-folder (cd ~/data). Next, type:

$ ~/ndtree/NDtree.py -k31 -a -f inputdata.list -t NC_017763_ref_MRSA.fna

Where:

-k31: Specfies the length of the k-mers to be 31

-a:  Only positions called in all strains are used

-f inputdata.list: The file that contains the names of the input files

-t NC_017763_ref_MRSA.fna: Specification of the reference strains in fasta format

The command takes quite some time to finish (25-30 min). Meanwhile, you can download all the output files HERE. The distance matrix can be found in the file infile.names, while the tree in newick format can be found in tree.newick. Only the Neighbor Joining tree is created, when NDtree is run command line.

Open the distance matrix to answer the question:

Open the distance matrix to answer the question:

Q1: How many nucleotide differences are there between the HW1 and P26 strains?

Q2. On your local computer, open the tree.newick file in Figtree (one of the programs you were asked to install from home). Under “Layouts” in the upper left side, select the “radial three layout”. Compare the overall tree topology with the tree below, which is from the paper by Harris et al.. Do the trees look similar?

Q3: If you open the file nctree.log, you can see the length of the reference genome (# Length of chromosome 1) and the number of positions used for phylogeny (# Number of positions used for phylogeny in chromosome 1). Ideally, the number of positions used for phylogeny should be at least 80%. Is it?

There are a number of other output files with useful information (unfortunately they are not very user friendly). For instance, the files with the extension *assimpler.out contain in the bottom some interesting statistics, e.g., the average depth of coverage of the reference genome by the reads of the test strains (“# Average cov template no. 1”, Template no. 1 is the reference genome). NDtree, however, renames the inputfiles, so to know which *assimbler.out file corresponds to which inputfile, you must look in the file “names”.

If the number of position used for phylogeny is low, it could be caused by one of the test strains not being part of the outbreak or being of poor quality. The *assimbler.out files also contain information about the fraction of the reference genome covered by reads from each test strain (breadth of coverage, “# frac covered”). It is a bit cumbersome to open all the *assimbler.out files, especially if the analysis includes many test strains. Instead, you can use a small bash-script to extract the relevant information. Any command that you can run from the command line, you can put in a bash script. Download the bash-script “extract_fracs.sh” HERE. Open the file in Sublime text to see the content. Note that a bash script starts with the line “#!/bin/bash”, which tells the shell that this is a script that should be interpreted (run) with bash.

To run the script, first copy it to your AWS instance:

$ scp -i ~/.ssh/[KeyPair.pem] extract_fracs.sh ubuntu@[ID-Adress]:~/

On the AWS instance, while standing in the home directory, make the file executable:

$ chmod u+x extract_fracs.sh

Now run the script from within the folder that contains all the *assimbler.out files (note: if NDtree has not finished running, you have to copy the premade *assimbler.out files to the AWS instance first).

$ ~/extract_fracs.sh

The results from extract_fracs.sh are written to the file “covered_fractions”.

Q4: What is the minimum fraction covered and by which input file?

IMPORTANT: If you have no time left and have copied everything you need from AWS to your local computer, terminate the instance via the AWS control panel. Don’t terminate the instance if you have time left and want to go on with the below exercise 7-extra. You will need the instance for the exercise.

Ex. 7-extra

If you still have time left, you can download all the output files from another run of NDtree that besides the same four test strains also contained the non-outbreak strain, ERR070035. We will download the outputfiles as a tarball from dropbox:

$ wget https://www.dropbox.com/s/pd0o086dqgguu4s/outfiles_incl_outlier.tar.gz?dl=0

$ tar -zxvf ‘outfiles_incl_outlier.tar.gz?dl=0'

The above commands should create the folder outfiles_incl_outlier. Inside the folder, you can see there is an extra *assimpler.out file called I00006.assimpler.out. Looking into the “names” file (use “less names”) you can see this file contains information regarding the non-outbreak file ERR070035. Modify the extract_fracs.sh script so that it will also extract information from this strain and rerun the script.

Q5: What is the fraction of the reference genome that is covered by reads from the ERR070035 strain?

IMPORTANT: When you have copied everything you need from AWS to your local computer, terminate the instance via the AWS control panel.