In this exercise, we will assess the quality of a set of Illumina paired-end reads using FastQC. Based on the results we will then trim the reads using Cutadapt. Finally, we will retrieve reads statistics using prinseq.
A set of paired-end reads generated by Illumina MiSeq using the Nextera XT library preparation and 2X300 cycle MiSeq sequencing kit. The sequenced isolate is Staphylococcus aureus ATCC 25923.
File 1: SRR4114395_1.fastq.gz
File 2: SRR4114395_2.fastq.gz
It is the same files that we used in Ex. 2.
Quality Control using FastQC
Work on you local computer.
Open FastQC (one of the programs you were asked to install from home).
To analyse a fastq file with FastQC do the following:
File > Open > Select SRR4114395_1.fastq.gz
File > Open > Select SRR4114395_2.fastq.gz
Note that you can toggle between the two resulting FastQC reports via the tabs at the top of the FastQC window.
Save both reports:
File > Save report…
Note that the reports will be saved in HTML format, which can be opened in any browser, e.g., Chrome or Firefox.
Q1. How many reads (sequences) do each of the two files contain?
Q2. Which FastQC metrics should you be most concerned about? (Which metrics are flagged by FastQC as highly problematic via a red circle with a white “x” in it).
Now work on the AWS instance you created before lunch (with t2.micro instance type).
We will use Cutadapt to trim the reads. Installing Cutadapt is very easy, just type:
$ pip install --user --upgrade cutadapt
You can now call the program like this:
$ ~/.local/bin/cutadapt --help
If the above command outputs a description of how to run Cutadapt, the installation was successful.
Read Trimming using Cutadapt
Confirm that the two files you copied to the instance during Ex. 2 are still present in the folder “data”.
Due to a small bug in Cutadapt, the 5’ end of file 2 in a pair have to be removed separately in an initial step:
$ ~/.local/bin/cutadapt --cut 20 -o [output_file] [input_file]
$ ~/.local/bin/cutadapt --cut 20 -o SRR4114395_2.only.5end.trimmed.fastq.gz SRR4114395_2.fastq.gz
--cut 20 means you trim off the first 20 bases from the 5’ end of the reads.
Remember to add the path to both output file and input file unless you are in the directory where the output file should be stored and the input file is located.
The above command should have generated a new file, SRR4114395_2.only.5end.trimmed.fastq.gz. Copy the file back to your local computer. This can be done by running the below command in the terminal of your local computer (not on the AWS instance):
$ scp -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]:~/data/SRR4114395_2.only.5end.trimmed.fastq.gz .
The “.” at the end of the command means that the file should be stored with the same name as the original file in the folder you are currently standing in.
Q3: Use FastQC to validate that the reads in SRR4114395_2.only.5end.trimmed.fastq.gz are now 20 nucleotides shorter. Do the 5’ends according to the “Per base sequence content” look better?
Now we are ready to complete the trimming.
Work on the AWS instance:
Use the below command to also trim off 20 bases from the 5’ end of the reads in file 1 and further trim of bases from the 3’ end until a quality score (Phred score) of minimally 30 for both reads. Only keep reads that are at least 30 bp long after trimming:
$ ~/.local/bin/cutadapt --cut 20 -q 30 -m 30 -A XXX -o [output_file_1] -p [output_file_2] --pair-filter=any [input_file_1] [input_file_2]
Specific command (will work if you are in the folder where the input files are located):
$ ~/.local/bin/cutadapt --cut 20 -q 30 -m 30 -A XXX -o SRR4114395_1_5end_qual_trimmed.fastq.gz -p SRR4114395_2_5end_qual_trimmed.fastq.gz --pair-filter=any SRR4114395_1.fastq.gz SRR4114395_2.only.5end.trimmed.fastq.gz
--cut 20: You trim off the first 20 bases from the 5’ end of the reads in file 1
-q 30: You trim of bases with a quality-score below 30 from the 3’ ends of the reads in both files
-m 30: Only reads that are at least 30 bp after trimming are kept
-A XXX: A hack (work-around) that must be added since reads of file 2 would otherwise not be quality-trimmed from the 3’ end
output_file_1: Call the output file for file 1 SRR4114395_1_5end_qual_trimmed.fastq.gz
output_file_2: Call the output file for file 2 SRR4114395_2_5end_qual_trimmed.fastq.gz
--pair-filter=any: Both reads in a pair are removed if one of the reads are filtered away (removed due to being too short)
The trimming will take a few minutes. When it finishes, open the trimmed files with FastQC (first copy them to the local computer) to confirm that the quality of the reads have improved.
Getting read statistics
We will be using prinseq_lite to get statistics on the reads. Install prinseq_lite on the AWS instance like this:
$ wget https://sourceforge.net/projects/prinseq/files/standalone/prinseq-lite-0.20.4.tar.gz
This will download the prinseq_lite Linux binaries as a so called tar-ball (a computer file format that combines and compresses multiple files). Next, the files that make up the tar-ball should be extracted and unzipped:
$ tar -xvzf prinseq-lite-0.20.4.tar.gz
This will create a folder called ~/prinseq-lite-0.20.4. Prinseq_lite is written in the programming language Perl and the executable, prinseq-lite.pl can now be found in the directory ~/prinseq-lite-0.20.4. It should initially be made executable:
$ chmod u+x ~/prinseq-lite-0.20.4/prinseq-lite.pl
u means user
x means executable
Retrieving read statistics for the trimmed forward reads:
$ gzip -dc [input_file_fastq.gz] | ~/prinseq-lite-0.20.4/prinseq-lite.pl -fastq stdin -stats_len -stats_info > [input_file.stats]
gzip -dc: Initially we are decompressing (-d) the input file and writing it to standard output (-c), but then redirecting the output (the uncompressed content of the file) via the pipe, “|”, to the prinseq_lite program
input_file_fastq.gz: A gzipped file with raw reads in fastq format
-fastq: Specifying that the input file format is fastq
-stats_len: Outputs minimum (min), maximum (max), range (range), mean (mean), standard deviation (stddev), mode (mode) and mode value (modeval), and median (median) for read length.
-stats_info: Outputs basic information such as number of reads (reads) and total bases (bases).
input_file.stats: The name of the output file that contains the read statistics
Specific commands (one for the trimmed forward reads and one for the trimmed reverse reads):
$ gzip -dc ~/data/SRR4114395_1_5end_qual_trimmed.fastq.gz | ~/prinseq-lite-0.20.4/prinseq-lite.pl -fastq stdin -stats_len -stats_info > SRR4114395_1_trim.stats
$ gzip -dc ~/data/SRR4114395_2_5end_qual_trimmed.fastq.gz | ~/prinseq-lite-0.20.4/prinseq-lite.pl -fastq stdin -stats_len -stats_info > SRR4114395_2_trim.stats
Use less to examine the content of the two output file.
Q4. How many reads are there in each of the two files and what is the mean read length after trimming?
Q5. Make a rough calculation of the depth of coverage based on the read statistics:
Depth = (no. of reads * av. read length) / genome size
You can use an approximate genome size of 3.000.000 bp for S. aureus.
Terminating your AWS instance
You can now terminate the AWS instance, as we will not be using this instance type again today.
NOTE: Once you have terminated your instance, you will not be able to access it again and all data will be lost so make sure you have copied all you need to your local computer.
Since you are charged by the hour when running AWS instances in this way, it is important to remember to terminate the instances:
Go to the AWS control panel > Services > EC2 > Running Instances.
Tick the box in front of the line describing the instance. The box should turn blue.
Click the “Actions” button > Instance State > Terminate
For a description of the difference between stopping and terminating an instance see: https://docs.rightscale.com/faq/clouds/aws/Whats_the_difference_between_Terminating_and_Stopping_an_EC2_Instance.html
When the status of the instance has changed to “Terminated”, you will no longer be charged for instance usage.
If you have time left, take a look at the below supplementary questions, which highlights additional options you might want to add to the Cutadapt command for read trimming.
SQ1. Most Illumina fastq files will be using the Phred+33 scheme for encoding the quality scores. Rarely, you may encounter some old files that use the Phred+64 scheme. To ensure your files use Phred+33, you can take a look at the content of the files to see if any of the characters encoded by ASCII values 33-63 are present in the lines with the quality score. If they are, your files are for sure encoded using Phred+33.
The figure below shows a section of a fastq file. Has Phred+33 or Phred+64 been used to encode the quality scores?
Cutadapt’s default behaviour is to assume your files use Phred+33, but if they use Phred+64, add the following to the cutadapt command:
SQ2. Now, let’s pretend that FastQC found an adapter sequence in you reads:
Look into Cutadapt’s documentation to figure out what you should additionally add to the command to remove the adapter sequence if it is found in either the 5’ or 3’ end.