The objective is to map the trimmed reads that we worked with during previous exercises to the reference genome of the same strain using BWA, in order to identify contaminating reads and SNPs.
The trimmed reads that were generated in Ex. 3:
The reference genome for the same strain:
Launching and accessing the AWS instance
Start by launching an AWS instance similarly to how we did it in Ex. 2: NOTE: In step 1 select the following Amazon Machine Image (AMI): Ubuntu Server 16.04 LTS (HVM), SSD Volume Type. In step 2 choose the instance type t2.large (2 CPUs, 8 GiB RAM, $0.0928 per Hour).
Ubuntu is a different flavour of Linux than we have previously worked with. We choose it this time, because it makes installing some of the software we need for this exercise very easy. Some of the software we use during this exercise (e.g., BWA) is written in C and needs to be compiled before it can be used. This is very easy to do using APT. APT is short for “Advanced Package Tool” and is a set of tools that handles the installation and removal of software on Ubuntu machines.
Accessing an AWS Ubuntu machine is in general done like this:
$ ssh -i ~/.ssh/[KeyPair.pem] ubuntu@[IP-Address]
Note: While the user (the name you write before the “@“) on the Linux machines was “ec2-user”, it is “ubuntu” on Ubuntu machines (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstancesConnecting.html).
Preparing the AWS instance
On the AWS instance, create a folder called “data” (try to remember yourself how it is done) and copy the data files relevant to this exercise to the folder using “scp” as previously (remember this is done from your local computer standing in the folder that contains the data files):
$ scp -i ~/.ssh/[KeyPair.pem] [data file] ubuntu@[IP-Address]:~/data/.
We will install all the software needed for the exercise before we start. In the following, always type “Y” for yes if asked whether you are sure you want to continue an installation.
Start by updating APT’s lists of available software packages and dependencies. This will not install anything, but ensure that you later install the newest versions.
$ sudo apt update
We add “sudo”, which is short for “superuser do”, before the commands to run them as administrator (root user). This is necessary to be allowed to run some of the commands.
Now for the actual installation of BWA using APT:
$ sudo apt install bwa
We can also install the two tools bcftools and vcftools using APT:
$ sudo apt install bcftools vcftools
Unfortunately, the version of Samtools that is installed via APT is not the one we need. Hence, we have to compile the program ourselves, which means we first have to install a number of tools needed for the compilation. Just copy and paste each of the 5 commands below:
$ sudo apt install build-essential
$ sudo apt install libz-dev
$ sudo apt install libncurses5-dev libncursesw5-dev
$ sudo apt install libbz2-dev
$ sudo apt install -y liblzma-dev
Now, for installing Samtools, execute the following commands:
$ wget https://sourceforge.net/projects/samtools/files/latest/download/samtools-1.8.tar.bz2
$ tar -xvjf samtools-1.8.tar.bz2
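If you want to see what the tar flags do, you can try them on a throwaway archive (the demo files below are made up for illustration and are not part of the exercise data):

```shell
# Create a small directory and pack it into a bzip2-compressed archive
mkdir -p demo && echo "hello" > demo/a.txt
tar -cjf demo.tar.bz2 demo        # -c create, -j bzip2 compression, -f archive file name
rm -r demo                        # remove the original so the extraction below is visible
tar -xvjf demo.tar.bz2            # -x extract, -v verbose (list files), -j bzip2, -f file name
cat demo/a.txt                    # the extracted copy of the file is back
```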
$ cd samtools-1.8/
The below step is the actual compilation:
$ make
The below command just ensures that we can call the command without having to specify which directory the Samtools executable is located in:
$ sudo cp samtools /usr/bin/.
Confirm that Samtools has been installed correctly by typing “samtools”. This should print a list of the commands you can run with Samtools.
The final software that needs to be installed is Picard.
Picard needs Java; however, it doesn't work with the pre-installed version. Accordingly, we first have to uninstall that version:
$ sudo apt purge openjdk-\*
And install the right version:
$ sudo apt install openjdk-8-jdk
Now, move to your home directory:
$ cd ~
First, the source code for Picard is downloaded from GitHub:
$ git clone https://github.com/broadinstitute/picard.git
Note that when the source code is on GitHub, we can use the command “git clone” instead of “wget” for downloading it.
Move to the picard folder:
$ cd picard/
To build a fully-packaged, runnable Picard jar with all dependencies included, run:
$ ./gradlew shadowJar
A successful installation will end with “BUILD SUCCESSFUL”.
Now we are finally ready for the actual work.
Mapping the reads to the reference genome with bwa
On the AWS Ubuntu instance, move to the data folder.
$ cd ~/data
In general, bwa is used like this. First the reference is prepared:
$ bwa index [reference.fasta]
$ bwa index NZ_CP009361.fasta
Next, the reads are mapped (aligned) to the reference:
$ bwa mem [reference.fasta] [reads_1.fastq.gz] [reads_2.fastq.gz] > [aln.sam]
It is good practice to immediately compress the SAM file into BAM format to save space. Hence, the last command can be written like this:
$ bwa mem [reference.fasta] [reads_1.fastq.gz] [reads_2.fastq.gz] | samtools view -Sb - > aln.bam
$ bwa mem NZ_CP009361.fasta SRR4114395_1_5end_qual_trimmed.fastq.gz SRR4114395_2_5end_qual_trimmed.fastq.gz | samtools view -Sb - > NZ_CP009361_SRR4114395.aln.bam
Note the “|“, the pipe, which redirects the output from one command to another command for further processing. In the above, the first command is “bwa mem NZ_CP009361.fasta SRR4114395_1_5end_qual_trimmed.fastq.gz SRR4114395_2_5end_qual_trimmed.fastq.gz”, which on its own produces alignments of the reads to the reference and outputs the alignments in SAM format. The alignments are then used as input (piped into) to the second command “samtools view -Sb -”, which compresses the SAM format into BAM format and then lastly (“> NZ_CP009361_SRR4114395.aln.bam”) writes the compressed alignments to the file NZ_CP009361_SRR4114395.aln.bam.
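The pipe and redirection mechanics can be tried with everyday commands (the three-line input below is made up, nothing from the exercise data):

```shell
# printf produces three lines; the pipe sends them to wc -l, which counts them;
# ">" redirects the count into a file instead of printing it to the screen
printf 'line1\nline2\nline3\n' | wc -l > count.txt
cat count.txt
```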
Let's take a look at the alignments we just made. The SAM format consists of two sections: header lines starting with “@“ and the actual alignments:
$ samtools view -h [file.aln.bam] | less -S
file.aln.bam, is an alignment in BAM format
-h: Tells Samtools to show the header
-S: Lines longer than the screen width should be chopped (truncated) rather than wrapped. That is, the portion of a long line that does not fit in the screen width is not shown. Try to run the below specific command with and without the “-S” option to see the difference.
$ samtools view -h NZ_CP009361_SRR4114395.aln.bam | less -S
In the SAM format, header lines start with "@"; the "@SQ" lines describe the sequences you mapped against (the reference FASTA file). For the alignments, the fields are: read name, flag, reference it mapped to, start position in the reference where the read mapped, mapping quality (Phred scale), CIGAR string, mate reference, mate position, template length, read sequence and read qualities. Any additional fields are optional. See the SAM specification (http://samtools.github.io/hts-specs/SAMv1.pdf) for a thorough description. A translator for the flags can be found here: http://broadinstitute.github.io/picard/explain-flags.html.
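Since the FLAG field is a bit mask, it can also be decoded by hand with shell (bash) arithmetic. Below is a sketch for the hypothetical flag value 99; the bit meanings come from the SAM specification:

```shell
flag=99   # hypothetical example value; replace with a flag from your own alignments
(( flag & 0x1  )) && echo "read is paired"
(( flag & 0x2  )) && echo "read mapped in proper pair"
(( flag & 0x4  )) && echo "read is unmapped"
(( flag & 0x10 )) && echo "read on reverse strand"
(( flag & 0x20 )) && echo "mate on reverse strand"
(( flag & 0x40 )) && echo "first read in pair"
(( flag & 0x80 )) && echo "second read in pair"
true   # keep the exit status clean when the last test is false
```

For flag 99 the set bits are 1 + 2 + 32 + 64, matching what the explain-flags page reports for that value.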
Q1: Locate the read name “SRR4114395.10”. How many lines include this read name? Why is there more than one line with this read name?
Q2. What is the flag of the first alignment of “SRR4114395.10”, and what does it mean? Hint: Use http://broadinstitute.github.io/picard/explain-flags.html to answer the question.
Identifying contaminating reads
Next, let’s get the statistics for the mapping. This can be done with the “samtools flagstat” command, which should be run on the BAM file generated in the previous step:
$ samtools flagstat [file.aln.bam]
$ samtools flagstat NZ_CP009361_SRR4114395.aln.bam
Q3. From Ex. 3 we know that there are 474123 reads in each of the files SRR4114395_1_5end_qual_trimmed.fastq.gz and SRR4114395_2_5end_qual_trimmed.fastq.gz (as also seen from the “read1” and “read2” values in the samtools flagstat output). Looking at the samtools flagstat output, the number of alignments listed in the first line (QC-passed reads + QC-failed reads) minus the supplementary alignments should equal the sum of reads in the two fastq files. Does it?
Q4. What is the percentage of unmapped reads (contamination)?
As a first step toward SNP calling, sort the BAM file according to the mapping position:
$ samtools sort -O BAM -o [file.sort.aln.bam] [file.aln.bam]
-O BAM specifies that the output format should be BAM
file.sort.aln.bam: The sorted output alignment file in BAM format
file.aln.bam: The unsorted input alignment file in BAM format
$ samtools sort -O BAM -o NZ_CP009361_SRR4114395.sort.aln.bam NZ_CP009361_SRR4114395.aln.bam
Before SNP calling, we have to remove duplicates. Duplicates arise as an artefact during PCR amplification: in the alignment there will be many exact copies of a read. Because all the duplicated reads were sampled from the same DNA molecule, they give an uneven representation of that molecule compared to other molecules and will bias the SNP calling. We will therefore remove them using Picard MarkDuplicates:
$ java -Xmx2g -jar ~/picard/build/libs/picard.jar MarkDuplicates INPUT=[file.sort.aln.bam] OUTPUT=[file.sort.rmdup.bam] ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
ASSUME_SORTED=TRUE: Tells the program that we have sorted the alignment file beforehand even though the header may say otherwise.
METRICS_FILE=/dev/null: Specification of file to write duplication metrics to. It is required to provide a file name, but since we are not interested in this output, we just specify /dev/null, which means it will be discarded.
VALIDATION_STRINGENCY=SILENT: Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance (speed) when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
REMOVE_DUPLICATES=true: Duplicate reads are removed, not just marked.
file.sort.aln.bam: The sorted input alignment file in BAM format.
file.sort.rmdup.bam: The resulting sorted alignment file without duplicate reads.
$ java -Xmx2g -jar ~/picard/build/libs/picard.jar MarkDuplicates INPUT=NZ_CP009361_SRR4114395.sort.aln.bam OUTPUT=NZ_CP009361_SRR4114395.sort.rmdup.bam ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
Q5: Use “samtools flagstat” to see how many reads were removed this way:
$ samtools flagstat NZ_CP009361_SRR4114395.sort.rmdup.bam
Now we are almost ready to call SNPs. First, the reference must be prepared using “samtools faidx”:
$ samtools faidx [reference.fasta]
$ samtools faidx NZ_CP009361.fasta
Then we use “samtools mpileup" and “bcftools” to identify variants (SNPs and INDELS):
$ samtools mpileup -uf [reference.fasta] [file.sort.rmdup.bam] | bcftools call -m -v -O z - > output.var.raw.vcf.gz
samtools mpileup generates genotype likelihoods at each position of the input BAM file; the actual variant calling is done by bcftools
-u: Generate uncompressed output, which is preferred for piping
-f: The faidx-indexed reference file in the FASTA format
Note: The output file in VCF format from the samtools mpileup command is piped into the “bcftools call” command, where the variants are identified and finally saved as a compressed VCF file (“> output.var.raw.vcf.gz”).
-m: Alternative model for multiallelic and rare-variant calling, designed to overcome known limitations of the previous (-c) calling model
-v: Output variant sites only
-O z: Output type. “z” means the VCF file is compressed.
$ samtools mpileup -uf NZ_CP009361.fasta NZ_CP009361_SRR4114395.sort.rmdup.bam | bcftools call -m -v -O z - > NZ_CP009361_SRR4114395.var.raw.vcf.gz
This step takes a while - have a cup of coffee 🙂
To look at the resulting output file write the following:
$ zcat NZ_CP009361_SRR4114395.var.raw.vcf.gz | less -S
Note that the format is VCF. You can read more about the VCF format here: https://samtools.github.io/hts-specs/VCFv4.2.pdf. The header lines start with ##. Next, each variant is described in one line.
You can count how many variants this raw VCF file contains using the following command:
$ zcat NZ_CP009361_SRR4114395.var.raw.vcf.gz | grep -v "#" | wc
grep -v “#”: Outputs all lines except the ones that contain “#” (header lines)
wc: Short for word count. The first column in the output is the number of lines. The second column indicates the number of words. The last column indicates the number of characters.
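The grep/wc idiom can be tried on a tiny mock VCF fragment (mock.vcf below is made up for illustration; real VCF lines have more columns):

```shell
# Two header lines (both contain "#") and two variant lines
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tT\nchr1\t200\t.\tG\tC\n' > mock.vcf
grep -v "#" mock.vcf | wc -l     # only the two variant lines are counted
```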
Q6: How many variants were detected in the raw VCF file?
The raw, unfiltered number of variants is a huge overestimation of the true number of variants. There are many different ways to filter the variants to get a more realistic number. Below is one example:
$ zcat NZ_CP009361_SRR4114395.var.raw.vcf.gz | vcf-annotate --filter d=30/Q=200/c=2,10 | grep -v "#" | grep "PASS" | grep "1/1" | grep -v "INDEL" | wc
d=30: Min. depth = 30
Q=200: Min. variant quality (the QUAL field) = 200
c=2,10: Filters SNP clusters: a site is filtered when 2 or more SNPs fall within a 10 bp window
grep “PASS”: Note that vcf-annotate will only mark the variants, so we afterwards have to specifically select the variants that passed the filtering using “grep “PASS””.
grep “1/1”: Further, we only want homozygous SNPs, since bacteria are haploid and cannot be heterozygous. Heterozygous SNPs will be marked “0/1”.
grep -v “INDEL”: In this example we are only interested in SNPs, not INDELS. When “-v” is added between “grep” and the word that you look for, the output is the lines that do not contain the word.
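The grep chain itself can be tested on a few hypothetical, heavily abbreviated variant lines (real VCF lines carry many more fields; only the parts the greps look at are mocked up here):

```shell
# Four mock lines: only the first is a homozygous, non-INDEL variant that passed the filter
printf 'chr1\t100\tPASS\t1/1\nchr1\t150\tPASS\t0/1\nchr1\t200\tPASS;INDEL\t1/1\nchr1\t300\tMinDP\t1/1\n' > mock_filtered.txt
grep "PASS" mock_filtered.txt | grep "1/1" | grep -v "INDEL" | wc -l
```

Line 2 is dropped by grep "1/1" (heterozygous), line 3 by grep -v "INDEL", and line 4 by grep "PASS", leaving a count of 1.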
Q7. How many SNPs did you find now?
Copy the entire data folder with all its content to your local computer. This should be done from your local computer:
$ scp -i ~/.ssh/[KeyPair.pem] -r ubuntu@[IP-Address]:~/data/ .
Note there is a space before the final dot “.”.
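That trailing “.” simply means “the current directory”. The same convention can be tried locally with cp (the file below is a throwaway example, not exercise data):

```shell
mkdir -p srcdir && echo "x" > srcdir/file.txt
cp srcdir/file.txt .       # "." = copy into the directory you are currently standing in
ls file.txt                # the copy now sits in the current directory
```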
IMPORTANT: If you have no time left and have copied everything you need from AWS to your local computer, terminate the instance via the AWS control panel. Don’t terminate the instance if you have time left and want to go on with the below exercise 5-extra. You will need the instance for the exercise.
Since we ended up finding 0 SNPs in the above exercise (as expected), I have manually mutated the reference strain by substituting an "A" at position 1708109 with a "T". Repeat the above exercise using the mutated reference strain, NZ_CP009361_mut.fasta, instead of the original one to confirm that you can find the SNP.