PURPOSE
The purpose of this exercise is to assemble the trimmed reads we generated in Ex. 3. Further, we will assemble the untrimmed reads and compare the quality of the two assemblies using Quast.
DATA
We will be using the untrimmed reads that were also used in Ex. 2 and 3:
SRR4114395_1.fastq.gz
SRR4114395_2.fastq.gz
And the trimmed reads that we generated in Ex. 3 on the basis of the two previous files:
SRR4114395_1_5end_qual_trimmed.fastq.gz
SRR4114395_2_5end_qual_trimmed.fastq.gz
WORKFLOW
Launch new AWS instance
As an initial step, launch a new AWS instance of the type t2.2xlarge (see Ex. 2 if you have forgotten how to launch an AWS instance. Remember to select the previously generated key pair in the final step instead of “Create a new key pair”). The t2.2xlarge instance type has 8 CPUs and 32 GiB RAM. SPAdes requires a lot of RAM and hence the free t2.micro that we used previously, and which only has 1 RAM, is too small.
Note: As a rule of thumb, 32 GiB RAM will usually be enough for assembling a bacterial genome that has been sequenced to a depth of around 50x.
Once the new instance has been initialised, use ssh to access it. Remember to use the IP address related to this new instance, and not to the t2.micro instance you used previously.
Installing SPAdes
We will be installing SPAdes in the home directory of the AWS instance. This is not exactly best-practise for how to organise your folders and files, but since we will be deleting the instance again when the exercise is over, there is no need to be very tidy.
While in your home directory type the following:
$ wget http://cab.spbu.ru/files/release3.12.0/SPAdes-3.12.0-Linux.tar.gz
This will download the SPAdes Linux binaries as a tar-ball. Next, the files that make up the tar-ball should be extracted and unzipped:
$ tar -xzf SPAdes-3.12.0-Linux.tar.gz
This creates a folder called SPAdes-3.12.0-Linux. SPAdes is written in the programming language Python and the executable file, spades.py, is located in ~/SPAdes-3.12.0-Linux/bin.
Assembling draft genomes with SPAdes
Create a data folder:
$ mkdir ~/data
Copy both trimmed and untrimmed files from your local computer to the AWS instance data folder (remember, this should be done from your local computer, while standing in the directory that contains the files).
General command:
$ scp -i ~/.ssh/[KeyPair.pem] [reads.fastq.gz] ec2-user@[IP-Address]:~/data/.
Repeat the command four times, each time replacing [reads.fastq.gz] with either
SRR4114395_1.fastq.gz
SRR4114395_2.fastq.gz
SRR4114395_1_5end_qual_trimmed.fastq.gz
SRR4114395_2_5end_qual_trimmed.fastq.gz
Make folders for storing the output from the SPAdes assembly:
$ mkdir ~/assembly_untrimmed ~/assembly_trimmed
This will create two folders in one go, as can be confirmed with “ls -l”.
The below command should be used for running SPAdes:
General command:
$ ./spades.py --careful --pe1-1 [forward_reads.fastq.gz] --pe1-2 [reverse_reads.fastq.gz] -o [output-folder]
We will generate two different assemblies. One based on the untrimmed reads, and one based on the trimmed reads.
Specific commands:
$ ~/SPAdes-3.12.0-Linux/bin/spades.py --careful --pe1-1 ~/data/SRR4114395_1.fastq.gz --pe1-2 ~/data/SRR4114395_2.fastq.gz -o ~/assembly_untrimmed
$ ~/SPAdes-3.12.0-Linux/bin/spades.py --careful --pe1-1 ~/data/SRR4114395_1_5end_qual_trimmed.fastq.gz --pe1-2 ~/data/SRR4114395_2_5end_qual_trimmed.fastq.gz -o ~/assembly_trimmed
Where:
--pe1-1: Points to file with forward reads
--pe1-2: Points to file with reverse reads
-o: Points to folder for output files
--careful: Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector, which is a post processing tool that uses BWA.
Each assembly will take 5-10 minutes. It can easily take longer for larger genomes with a higher coverage.
Once the assemblies have finished, they can be found in the output folders under the filename contigs.fasta. Rename each of the contigs.fasta files so that you can tell them apart later:
$ mv ~/assembly_untrimmed/contigs.fasta ~/assembly_untrimmed/contigs_untrimmed.fasta
$ mv ~/assembly_trimmed/contigs.fasta ~/assembly_trimmed/contigs_trimmed.fasta
You can copy the 2 folders with all their content to your local machine like this (do it from your local machine):
$ scp -i ~/.ssh/[KeyPair.pem] -r ec2-user@[IP-Address]:~/assembly_untrimmed .
$ scp -i ~/.ssh/[KeyPair.pem] -r ec2-user@[IP-Address]:~/assembly_trimmed .
Notice there is a space before the final “.”. The commands will create folders called assembly_untrimmed and assembly_trimmed in the folder from which you run the command.
Installing Quast
We will use Quast for assessing the quality of the draft genomes we just generated.
To install Quast, type the following commands while standing in the home directory of your AWS instance:
$ wget https://downloads.sourceforge.net/project/quast/quast-4.6.3.tar.gz
$ tar -xzf quast-4.6.3.tar.gz
The executable, quast.py, can now be found in the directory quast-4.6.3.
Running Quast
The general command for running Quast with two draft assemblies and without a reference assembly is:
$ quast.py [contigs1.fasta] [contigs2.fasta]
Where contigs1.fasta and contigs2.fasta are draft assemblies in fasta format.
Specific command:
$ ~/quast-4.6.3/quast.py ~/assembly_untrimmed/contigs_untrimmed.fasta ~/assembly_trimmed/contigs_trimmed.fasta
When Quast has finished, you can see the results in the file ~/quast_results/latest/report.txt.
Download the report in html format to your local computer:
$ scp -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]:~/quast_results/latest/report.html .
From your local computer, open the file report.html in a browser, e.g., Chrome or Firefox. This will give you a nice graphical overview of the quality of the draft genomes.
Q1. How many contigs of size >= 500 bp do the two assemblies each contain?
Q2. What is the N50 value for each of the two assemblies?
Q3. The N50 value is defined as the length of the shortest contig in the set of longest contigs that together make up at least half the assembly size (half the “Total length”). How is N75 defined?
Q4. Which assembly would you use in your further analysis?
I don’t expect you to have more time left for this exercise during class, but if you later get to work with PacBio or Oxford Nanopore sequence reads, you can come back and do exercise 4-extra below. We will also walk through it during the recap.
IMPORTANT: If you have no time left and have copied everything you need from AWS to your local computer, terminate the instance via the AWS control panel. Don’t terminate the instance if you have time left and want to go on with the below exercise 4-extra. You will need the instance for the exercise.
Ex. 4-extra
PURPOSE
The purpose of this exercise is to assemble reads generated by PacBio. The same commands will work if your reads are from Oxford Nanopore. We will first assemble the reads on their own using minimap2/miniasm for assembly and racon for polishing. Next we will use SPAdes to make a hybrid assembly based on both the PacBio data and the trimmed Illumina reads we have previously worked with.
DATA
PacBio reads from the isolate Staphylococcus aureus ATCC 25923. There is only 1 file, but it is very big (> 2 GB unzipped, ~900 MB when gzipped).
SRR2104768.fastq.gz
The trimmed Illumina reads that we have previous worked with. They are also from S. aureus ATCC 25923:
SRR4114395_1_5end_qual_trimmed.fastq.gz
SRR4114395_2_5end_qual_trimmed.fastq.gz
Finally, we will be using the reference genome for the same strain:
NZ_CP009361.fasta
WORKFLOW
Preparing the AWS instance
Copy the data files to a folder named data on the AWS instance (the Illumina reads are probably already there).
Installing tools needed for compilation of later programs:
$ sudo yum groupinstall "Development Tools”
Installing minimap2
$ git clone https://github.com/lh3/minimap2
$ cd minimap2
$ make
Installing miniasm
$ cd
$ git clone https://github.com/lh3/miniasm
$ cd miniasm
$ make
Assembling with minimap2/miniasm
Minimap2 finds overlaps between reads with error rates up to 15%. Note: Minimap2 only accepts two FASTQ files and you initially need to map your FASTQ file against itself. So, if you have multiple FASTQ files, you have to concatenate them into a single file prior to running minimap2 (use “cat file1.fastq fast2.fastq file3.fastq > file1_2_3.fastq”).
$ cd
$ minimap2/minimap2 -x ava-pb -t 20 data/SRR2104768.fastq.gz data/SRR2104768.fastq.gz > data/SRR2104768_reads_against_reads_minimap2.paf
Miniasm - in this step the assembly is created, but in GFA format
$ miniasm/miniasm -f data/SRR2104768.fastq.gz data/SRR2104768_reads_against_reads_minimap2.paf > data/SRR2104768_minimap2_miniasm.gfa
Converting from GFA to FASTA:
$ awk '/^S/{print ">"$2"\n"$3}' data/SRR2104768_minimap2_miniasm.gfa | fold > data/SRR2104768_minimap2_miniasm.fasta
In principle you could stop here, but as we will see during the recap, this assembly has a very high error rate (mismatches and INDELS). PacBio reads are not usually trimmed before assembly, instead the assembly is “polished” to improve the error rate.
If you only have PacBio reads available, you can use them for polishing. We will use the tool racon for the polishing step.
Installation of racon
Racon installation is a bit tricky and has several dependencies:
gcc (we already installed this with the “sudo yum groupinstall "Development Tools”” command earlier)
cmake (unfortunately cmake installing via “sudo yum install cmake” leads to an old version, so we have to do it it like this):
$ wget https://cmake.org/files/v3.6/cmake-3.6.2.tar.gz
$ tar -zxvf cmake-3.6.2.tar.gz
$ cd cmake-3.6.2
$ sudo ./bootstrap --prefix=/usr/local
$ sudo make
$ sudo make install
Finally, the file .bash_profile (located in your home directory, but hidden as indicated by the leading dot, hence “ls” will not show it, but “ls -a” will) must be modified by adding the line “PATH=/home/ec2-user/cmake-3.6.2/bin:$PATH:$HOME/bin”. I do this by copying the file to my local computer (from the local computer type “scp -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]:~/.bash_profile bash_profile”), modifying it in Sublime Text, and then copying it back (“scp -i ~/.ssh/[KeyPair.pem] bash_profile ec2-user@[IP-Address]:~/.bash_profile”). The modified version of ~.bash_profile should have the following content:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin
PATH=/home/ec2-user/cmake-3.6.2/bin:$PATH:$HOME/bin #This is the new line
export PATH
Log out of AWS and in again for the changes to take effect.
If you now type “cmake --version”, it should say 3.6.2.
Now, for the installation of racon:
$ git clone --recursive https://github.com/isovic/racon.git racon
$ cd racon
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make
You can now call racon like this:
$ /racon/build/bin/racon
General usage of racon is as following:
$ racon [options ...] <sequences> <overlaps> <target sequences>
Where:
<sequences>: Input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences used for correction
<overlaps>: Input file in MHAP/PAF/SAM format (can be compressed with gzip) containing overlaps between sequences and target sequences
<target sequences>: input file in FASTA/FASTQ format (can be compressed with gzip) containing sequences which will be corrected
The <sequences> is the original PacBio fastq file, while <target sequences> is the assembly we just made with minimap2/miniasm. The <overlap> is a new file made by minimap2 on the basis of the sequences and the target sequence. We will start by making that:
$ cd
$ minimap2/minimap2 -t 20 data/SRR2104768_minimap2_miniasm.fasta data/SRR2104768.fastq.gz > data/SRR2104768_reads_to_assembly_minimap2.paf
Now for the polishing with racon
$ racon/build/bin/racon -t 20 data/SRR2104768.fastq.gz data/SRR2104768_reads_to_assembly_minimap2.paf data/SRR2104768_minimap2_miniasm.fasta > data/SRR2104768_racon_corrected.fasta
Note: If you have Illumina reads available, you can map the Illumina reads to the initial assembly (use bwa as in Ex. 5) and use that as the overlaps file for polishing.
You can also use SPAdes for generating a hybrid assembly on the basis of both PacBio and Illumina reads. It is quite easy:
General command:
$ ./spades.py --careful --pe1-1 [forward_reads.fastq.gz] --pe1-2 [reverse_reads.fastq.gz] --pacbio <file_name> -o [output-folder]
Specific command:
$ ~/SPAdes-3.12.0-Linux/bin/spades.py --careful --pe1-1 ~/data/SRR4114395_1_5end_qual_trimmed.fastq.gz --pe1-2 ~/data/SRR4114395_2_5end_qual_trimmed.fastq.gz --pacbio ~/data/SRR2104768.fastq.gz -o ~/combined_assembly
The assembly can be found in ~/combined_assembly/contigs.fasta. Let’s rename it so we are sure what the file contains.
$ mv ~/combined_assembly/contigs.fasta ~/combined_assembly/SRR2104768_SRR4114395_hybrid_assembly.fasta
Now, let’s run Quast to compare the assemblies. General command, when also including a reference genome:
$ quast.py contigs_1.fasta contigs_2.fasta -R reference.fasta.gz
Specific command
$ ~/quast-4.6.3/quast.py ~/data/SRR2104768_minimap2_miniasm.fasta ~/data/SRR2104768_racon_corrected.fasta ~/combined_assembly/hybrid_asssembly.fasta -R ~/data/NZ_CP009361.fasta
The results can be seen in ~/quast_results/latest. I have beforehand made a Quast report which also included contigs_untrimmed.fasta and contigs_trimmed.fasta from earlier. It is available HERE.