Advanced workshop ex4

PURPOSE

The purpose of this exercise is to assemble the trimmed reads we generated in Ex. 3. Further, we will assemble the untrimmed reads and compare the quality of the two assemblies using Quast.

DATA

We will be using the untrimmed reads that were also used in Ex. 2 and 3:

SRR4114395_1.fastq.gz

SRR4114395_2.fastq.gz

And the trimmed reads that we generated in Ex. 3 on the basis of the two previous files:

SRR4114395_1_5end_qual_trimmed.fastq.gz

SRR4114395_2_5end_qual_trimmed.fastq.gz

WORKFLOW

Launch new AWS instance

As an initial step, launch a new AWS instance of the type t2.2xlarge (see Ex. 2 if you have forgotten how to launch an AWS instance. Remember to select the previously generated key pair in the final step instead of “Create a new key pair”). The t2.2xlarge instance type has 8 CPUs and 32 GiB RAM. SPAdes requires a lot of RAM and hence the free t2.micro that we used previously, and which only has 1 RAM, is too small.

Note: As a rule of thumb, 32 GiB RAM will usually be enough for assembling a bacterial genome that has been sequenced to a depth of around 50x.

Once the new instance has been initialised, use ssh to access it. Remember to use the IP address related to this new instance, and not to the t2.micro instance you used previously.

Installing SPAdes

We will be installing SPAdes in the home directory of the AWS instance. This is not exactly best-practise for how to organise your folders and files, but since we will be deleting the instance again when the exercise is over, there is no need to be very tidy.

While in your home directory type the following:

$ wget http://cab.spbu.ru/files/release3.12.0/SPAdes-3.12.0-Linux.tar.gz

This will download the SPAdes Linux binaries as a tar-ball. Next, the files that make up the tar-ball should be extracted and unzipped:

$ tar -xzf SPAdes-3.12.0-Linux.tar.gz

This creates a folder called SPAdes-3.12.0-Linux. SPAdes is written in the programming language Python and the executable file, spades.py, is located in  ~/SPAdes-3.12.0-Linux/bin.

Assembling draft genomes with SPAdes

Create a data folder:

$ mkdir ~/data

Copy both trimmed and untrimmed files from your local computer to the AWS instance data folder (remember, this should be done from your local computer, while standing in the directory that contains the files).

General command:

$ scp -i ~/.ssh/[KeyPair.pem] [reads.fastq.gz] ec2-user@[IP-Address]:~/data/.

Repeat the command four times, each time replacing [reads.fastq.gz] with either

SRR4114395_1.fastq.gz

SRR4114395_2.fastq.gz

SRR4114395_1_5end_qual_trimmed.fastq.gz

SRR4114395_2_5end_qual_trimmed.fastq.gz

Make folders for storing the output from the SPAdes assembly:

$ mkdir ~/assembly_untrimmed  ~/assembly_trimmed

This will create two folders in one go, as can be confirmed with “ls -l”.

The below command should be used for running SPAdes:

General command:

$  ./spades.py --careful --pe1-1 [forward_reads.fastq.gz] --pe1-2 [reverse_reads.fastq.gz] -o [output-folder]

We will generate two different assemblies. One based on the untrimmed reads, and one based on the trimmed reads.

Specific commands:

$ ~/SPAdes-3.12.0-Linux/bin/spades.py --careful  --pe1-1 ~/data/SRR4114395_1.fastq.gz --pe1-2 ~/data/SRR4114395_2.fastq.gz -o ~/assembly_untrimmed

$ ~/SPAdes-3.12.0-Linux/bin/spades.py --careful  --pe1-1 ~/data/SRR4114395_1_5end_qual_trimmed.fastq.gz --pe1-2 ~/data/SRR4114395_2_5end_qual_trimmed.fastq.gz -o ~/assembly_trimmed

Where:

--pe1-1: Points to file with forward reads

--pe1-2: Points to file with reverse reads

-o: Points to folder for output files

--careful: Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector, which is a post processing tool that uses BWA.

Each assembly will take 5-10 minutes. It can easily take longer for larger genomes with a higher coverage. 

Once the assemblies have finished, they can be found in the output folders under the filename contigs.fasta. Rename each of the contigs.fasta files so that you can tell them apart later:

$ mv ~/assembly_untrimmed/contigs.fasta ~/assembly_untrimmed/contigs_untrimmed.fasta

$ mv ~/assembly_trimmed/contigs.fasta ~/assembly_trimmed/contigs_trimmed.fasta  

You can copy the 2 folders with all their content to your local machine like this (do it from your local machine): 

$ scp -i ~/.ssh/[KeyPair.pem] -r ec2-user@[IP-Address]:~/assembly_untrimmed .

$ scp -i ~/.ssh/[KeyPair.pem] -r ec2-user@[IP-Address]:~/assembly_trimmed .

Notice there is a space before the final “.”. The commands will create folders called assembly_untrimmed and assembly_trimmed in the folder from which you run the command.

Installing Quast

We will use Quast for assessing the quality of the draft genomes we just generated.

To install Quast, type the following commands while standing in the home directory of your AWS instance:

$ wget https://downloads.sourceforge.net/project/quast/quast-4.6.3.tar.gz

$ tar -xzf quast-4.6.3.tar.gz

The executable, quast.py, can now be found in the directory quast-4.6.3.

Running Quast

The general command for running Quast with two draft assemblies and without a reference assembly is:

$ quast.py [contigs1.fasta] [contigs2.fasta]

Where contigs1.fasta and contigs2.fasta are draft assemblies in fasta format.

Specific command:

$ ~/quast-4.6.3/quast.py ~/assembly_untrimmed/contigs_untrimmed.fasta ~/assembly_trimmed/contigs_trimmed.fasta  

When Quast has finished, you can see the results in the file ~/quast_results/latest/report.txt.

Download the report in html format to your local computer:

$ scp -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]:~/quast_results/latest/report.html .

From your local computer, open the file report.html in a browser, e.g., Chrome or Firefox. This will give you a nice graphical overview of the quality of the draft genomes.

Q1. How many contigs of size >= 500 bp do the two assemblies each contain?

Q2. What is the N50 value for each of the two assemblies?

Q3. The N50 value is defined as the length of the shortest contig in the set of longest contigs that together make up at least half the assembly size (half the “Total length”). How is N75 defined?

Q4. Which assembly would you use in your further analysis?

I don’t expect you to have more time left for this exercise during class, but if you later get to work with PacBio or Oxford Nanopore sequence reads, you can come back and do exercise 4-extra below. We will also walk through it during the recap.

IMPORTANT: If you have no time left and have copied everything you need from AWS to your local computer, terminate the instance via the AWS control panel. Don’t terminate the instance if you have time left and want to go on with the below exercise 4-extra. You will need the instance for the exercise.

Ex. 4-extra

PURPOSE

The purpose of this exercise is to assemble reads generated by PacBio. The same commands will work if your reads are from Oxford Nanopore. We will first assemble the reads on their own using minimap2/miniasm for assembly and racon for polishing. Next we will use SPAdes to make a hybrid assembly based on both the PacBio data and the trimmed Illumina reads we have previously worked with.

DATA

PacBio reads from the isolate Staphylococcus aureus ATCC 25923. There is only 1 file, but it is very big (> 2 GB unzipped, ~900 MB when gzipped).

SRR2104768.fastq.gz

The trimmed Illumina reads that we have previous worked with. They are also from S. aureus ATCC 25923:

SRR4114395_1_5end_qual_trimmed.fastq.gz

SRR4114395_2_5end_qual_trimmed.fastq.gz

Finally, we will be using the reference genome for the same strain:

NZ_CP009361.fasta

WORKFLOW

Preparing the AWS instance

Copy the data files to a folder named data on the AWS instance (the Illumina reads are probably already there).

Installing tools needed for compilation of later programs:

$ sudo yum groupinstall "Development Tools”

Installing minimap2

$ git clone https://github.com/lh3/minimap2

$ cd minimap2

$ make

Installing miniasm

$ cd

$ git clone https://github.com/lh3/miniasm

$ cd miniasm

$ make

Assembling with minimap2/miniasm

Minimap2 finds overlaps between reads with error rates up to 15%. Note: Minimap2 only accepts two FASTQ files and you initially need to map your FASTQ file against itself. So, if you have multiple FASTQ files, you have to concatenate them into a single file prior to running minimap2 (use “cat file1.fastq fast2.fastq file3.fastq > file1_2_3.fastq”).

$ cd

$ minimap2/minimap2 -x ava-pb -t 20 data/SRR2104768.fastq.gz data/SRR2104768.fastq.gz > data/SRR2104768_reads_against_reads_minimap2.paf

Miniasm - in this step the assembly is created, but in GFA format

$ miniasm/miniasm -f data/SRR2104768.fastq.gz data/SRR2104768_reads_against_reads_minimap2.paf > data/SRR2104768_minimap2_miniasm.gfa

Converting from GFA to FASTA:

$ awk '/^S/{print ">"$2"\n"$3}'  data/SRR2104768_minimap2_miniasm.gfa  | fold > data/SRR2104768_minimap2_miniasm.fasta

In principle you could stop here, but as we will see during the recap, this assembly has a very high error rate (mismatches and INDELS). PacBio reads are not usually trimmed before assembly, instead the assembly is “polished” to improve the error rate.

If you only have PacBio reads available, you can use them for polishing. We will use the tool racon for the polishing step.

Installation of racon

Racon installation is a bit tricky and has several dependencies:

gcc (we already installed this with the “sudo yum groupinstall "Development Tools”” command earlier)

cmake (unfortunately cmake installing via “sudo yum install cmake” leads to an old version, so we have to do it it like this):

$ wget https://cmake.org/files/v3.6/cmake-3.6.2.tar.gz

$ tar -zxvf cmake-3.6.2.tar.gz

$ cd cmake-3.6.2

$ sudo ./bootstrap --prefix=/usr/local

$ sudo make

$ sudo make install

Finally, the file .bash_profile (located in your home directory, but hidden as indicated by the leading dot, hence “ls” will not show it, but “ls -a” will) must be modified by adding the line “PATH=/home/ec2-user/cmake-3.6.2/bin:$PATH:$HOME/bin”. I do this by copying the file to my local computer (from the local computer type “scp -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]:~/.bash_profile bash_profile”), modifying it in Sublime Text, and then copying it back (“scp -i ~/.ssh/[KeyPair.pem] bash_profile ec2-user@[IP-Address]:~/.bash_profile”). The modified version of ~.bash_profile should have the following content:

# .bash_profile

# Get the aliases and functions

if [ -f ~/.bashrc ]; then

        . ~/.bashrc

fi

# User specific environment and startup programs

PATH=$PATH:$HOME/.local/bin:$HOME/bin

PATH=/home/ec2-user/cmake-3.6.2/bin:$PATH:$HOME/bin  #This is the new line

export PATH

Log out of AWS and in again for the changes to take effect.

If you now type “cmake --version”, it should say 3.6.2.

Now, for the installation of racon:

$ git clone --recursive https://github.com/isovic/racon.git racon

$ cd racon

$ mkdir build

$ cd build

$ cmake -DCMAKE_BUILD_TYPE=Release ..

$ make

You can now call racon like this:

$ /racon/build/bin/racon

General usage of racon is as following:

$ racon [options ...] <sequences> <overlaps> <target sequences>

Where: 

<sequences>: Input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences used for correction

 <overlaps>: Input file in MHAP/PAF/SAM format (can be compressed with gzip) containing overlaps between sequences and target sequences

<target sequences>: input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences which will be corrected

The <sequences> is the original PacBio fastq file, while <target sequences> is the assembly we just made with minimap2/miniasm. The <overlap> is a new file made by minimap2 on the basis of the sequences and the target sequence. We will start by making that:

$ cd

$ minimap2/minimap2 -t 20 data/SRR2104768_minimap2_miniasm.fasta data/SRR2104768.fastq.gz > data/SRR2104768_reads_to_assembly_minimap2.paf

Now for the polishing with racon

$ racon/build/bin/racon -t 20 data/SRR2104768.fastq.gz data/SRR2104768_reads_to_assembly_minimap2.paf data/SRR2104768_minimap2_miniasm.fasta > data/SRR2104768_racon_corrected.fasta

Note: If you have Illumina reads available, you can map the Illumina reads to the initial assembly (use bwa as in Ex. 5) and use that as the overlaps file for polishing.

You can also use SPAdes for generating a hybrid assembly on the basis of both PacBio and Illumina reads. It is quite easy:

General command:

$ ./spades.py --careful --pe1-1 [forward_reads.fastq.gz] --pe1-2 [reverse_reads.fastq.gz] --pacbio <file_name>  -o [output-folder]

Specific command:

$ ~/SPAdes-3.12.0-Linux/bin/spades.py --careful  --pe1-1 ~/data/SRR4114395_1_5end_qual_trimmed.fastq.gz --pe1-2 ~/data/SRR4114395_2_5end_qual_trimmed.fastq.gz --pacbio ~/data/SRR2104768.fastq.gz -o ~/combined_assembly

The assembly can be found in ~/combined_assembly/contigs.fasta. Let’s rename it so we are sure what the file contains.

$ mv ~/combined_assembly/contigs.fasta ~/combined_assembly/SRR2104768_SRR4114395_hybrid_assembly.fasta

Now, let’s run Quast to compare the assemblies. General command, when also including a reference genome:

$ quast.py contigs_1.fasta contigs_2.fasta -R reference.fasta.gz

Specific command

$ ~/quast-4.6.3/quast.py ~/data/SRR2104768_minimap2_miniasm.fasta ~/data/SRR2104768_racon_corrected.fasta ~/combined_assembly/hybrid_asssembly.fasta -R ~/data/NZ_CP009361.fasta

The results can be seen in ~/quast_results/latest. I have beforehand made a Quast report which also included contigs_untrimmed.fasta and contigs_trimmed.fasta from earlier. It is available HERE.

IMPORTANT: When you have copied everything you need from AWS to your local computer, terminate the instance via the AWS control panel.