Advanced workshop ex5


The goal of this exercise is to map the trimmed reads from the previous exercises to the reference genome of the same strain using BWA, in order to identify contaminating reads and SNPs.


The trimmed reads that were generated in Ex. 3:



The reference genome for the same strain:



Launching and accessing the AWS instance

Start by launching an AWS instance similarly to how we did it in Ex. 2: NOTE: In step 1 select the following Amazon Machine Image (AMI): Ubuntu Server 16.04 LTS (HVM), SSD Volume Type. In step 2 choose the instance type t2.large (2 CPUs, 8 GiB RAM, $0.0928 per Hour).

Ubuntu is a different flavour of Linux than we have previously worked with. We choose it this time, because it makes installing some of the software we need for this exercise very easy. Some of the software we use during this exercise (e.g., BWA) is written in C and needs to be compiled before it can be used. This is very easy to do using APT. APT is short for “Advanced Package Tool” and is a set of tools that handles the installation and removal of software on Ubuntu. APT simplifies the process of managing software by automating the retrieval, configuration and installation of software packages. One command that allows us to interact with APT is “apt-get install” that we will be using during this exercise.

Accessing an AWS Ubuntu machine is in general done like this:

$ ssh -i ~/.ssh/[KeyPair.pem] ubuntu@[IP-Address]

Note: While the user (the name you write before the “@”) on the Amazon Linux machines was “ec2-user”, it is “ubuntu” on Ubuntu machines.

Preparing the AWS instance

On the AWS instance, create a folder called “data” (try to recall yourself how this is done) and copy the data files relevant to this exercise into the folder using “scp” as previously (remember that this is done from your local computer, standing in the folder that contains the data files):

$ scp -i ~/.ssh/[KeyPair.pem] [data file] ubuntu@[IP-Address]:~/data/.

We will install all the software needed for this exercise before we start. In the following, always type “Y” for yes if asked whether you are sure you want to continue an installation.

Start by updating APT’s lists of available software packages and dependencies. This will not install anything, but ensures that you later install the newest versions.

$ sudo apt-get update

We add “sudo”, which is short for “superuser do”, before the commands to run them as administrator (the root user). This is necessary to be allowed to run some of the commands.

Now for the actual installation of BWA using APT:

$ sudo apt-get install bwa

We can also install the two tools bcftools and vcftools using APT:

$ sudo apt-get install bcftools vcftools

Unfortunately, the version of Samtools that is installed via APT is not the one we need. Hence we have to compile the program ourselves, which means we first have to install a number of tools needed for the compilation. Just copy and paste each of the 5 commands below:

$ sudo apt-get install build-essential

$ sudo apt-get install libz-dev

$ sudo apt-get install libncurses5-dev libncursesw5-dev

$ sudo apt-get install libbz2-dev

$ sudo apt-get install -y liblzma-dev

Now, for installing Samtools, execute the following commands:

$ wget

$ tar -xvjf samtools-1.8.tar.bz2

$ cd samtools-1.8/

The below step is the actual compilation:

$ make

The below command just ensures that we can call the command without having to specify which directory the Samtools executable is located in:

$ sudo cp samtools /usr/bin/.

Confirm that Samtools has been installed correctly by typing “samtools”. This should print a list of the commands you can run with Samtools.

The final software that needs to be installed is Picard.

Picard needs Java, so we will start by installing that:

$ sudo apt-get install default-jdk

Now, move to your home directory:

$ cd

First, the source code for Picard is downloaded from GitHub:

$ git clone

Note that since the source code is on GitHub, we can use the command “git clone” instead of “wget” for downloading it.

Move to the picard folder:

$ cd picard/

To build a fully-packaged, runnable Picard jar with all dependencies included, run:

$ ./gradlew shadowJar

A successful installation will end with “BUILD SUCCESSFUL”.

Now we are finally ready for the actual work.

Mapping the reads to the reference genome with bwa

On the AWS Ubuntu instance, move to the data folder.

$ cd ~/data

In general, bwa is used like this. First the reference is prepared:

$ bwa index [reference.fasta]

Specific command:

$ bwa index NZ_CP009361.fasta

Next, the reads are mapped (aligned) to the reference:

General command:

$ bwa mem [reference.fasta] [reads_1.fastq.gz] [reads_2.fastq.gz] > [aln.sam]

It is good practice to immediately compress the SAM file into BAM format to save space. Hence, the last command can be written like this:

General command:

$ bwa mem [reference.fasta] [reads_1.fastq.gz] [reads_2.fastq.gz] | samtools view -Sb - > aln.bam

Specific command:

$ bwa mem NZ_CP009361.fasta SRR4114395_1_5end_qual_trimmed.fastq.gz SRR4114395_2_5end_qual_trimmed.fastq.gz | samtools view -Sb - > NZ_CP009361_SRR4114395.aln.bam

Note the “|“, the pipe, which redirects the output from one command to another command for further processing. In the above, the first command is “bwa mem NZ_CP009361.fasta SRR4114395_1_5end_qual_trimmed.fastq.gz SRR4114395_2_5end_qual_trimmed.fastq.gz”, which on its own produces alignments of the reads to the reference and outputs the alignments in SAM format. The alignments are then used as input (piped into) to the second command “samtools view -Sb -”, which compresses the SAM format into BAM format and then lastly (“>  NZ_CP009361_SRR4114395.aln.bam”) writes the compressed alignments to the file NZ_CP009361_SRR4114395.aln.bam.
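The same pattern can be tried with standard shell tools only (nothing from this exercise is needed): pipe the output of one command into another, then redirect the final result to a file with “>”.

```shell
# Toy pipeline: grep filters the lines, ">" redirects the result to a file
printf 'apple\nbanana\ncherry\n' | grep 'an' > matches.txt

# Only "banana" contains "an", so that is the file's only line
cat matches.txt
rm matches.txt
```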

Let’s take a look at the alignments we just made. The SAM format consists of two sections: header lines starting with “@” and the actual alignments:

General command:

$ samtools view -h [file.aln.bam] | less -S


file.aln.bam: An alignment file in BAM format

-h:  Tells Samtools to show the header

-S: An option to “less”, not Samtools: lines longer than the screen width are chopped (truncated) rather than wrapped. That is, the portion of a long line that does not fit in the screen width is not shown. Try running the below specific command with and without the “-S” option to see the difference.

Specific command:

$ samtools view -h NZ_CP009361_SRR4114395.aln.bam | less -S

In the SAM format, header lines start with “@”; the “@SQ” lines describe the sequences you mapped against (the reference FASTA file). For the alignments, the fields are: read name, flag, reference it mapped to, start position in the reference where the read mapped, mapping quality (Phred scale), CIGAR string, mate reference, mate position, template length, read sequence and read qualities. Any additional fields are optional. See the SAM specification for a thorough description, and use a flag translator to decode the flag field.
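The flag is a sum of powers of two, one per read property. As a sketch using pure shell (no Samtools needed), the example flag value 99 can be decoded by testing each bit; the bit names below follow the SAM specification:

```shell
# Decode the example SAM flag 99 (= 64 + 32 + 2 + 1) bit by bit
flag=99
for pair in 1:PAIRED 2:PROPER_PAIR 4:UNMAP 8:MUNMAP 16:REVERSE 32:MREVERSE 64:READ1 128:READ2; do
  bit=${pair%%:*}    # the numeric bit value
  name=${pair#*:}    # the property name
  if [ $(( flag & bit )) -ne 0 ]; then
    echo "$name"
  fi
done
# Prints: PAIRED, PROPER_PAIR, MREVERSE, READ1
```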

Q1: Locate the read name “SRR4114395.10”. How many reads have this name? Why are there more than one read with this name?

Q2. What is the flag for the first read with the name “SRR4114395.10” and what does it mean? Hint: Use the flag translator to answer the question.

Identifying contaminating reads

Next, let’s get the statistics for the mapping. This can be done with the “samtools flagstat” command, which should be run on the BAM file generated in the previous step:

General command:

$ samtools flagstat [file.aln.bam]

Specific command:

$ samtools flagstat NZ_CP009361_SRR4114395.aln.bam

Q3. From Ex. 3 we know that there are 474123 reads in each of the files  SRR4114395_1_5end_qual_trimmed.fastq.gz and SRR4114395_2_5end_qual_trimmed.fastq.gz (as also seen from the “read1” and “read2” values in the samtools flagstat output). Looking at the samtools flagstat output, the number of alignments listed in the first line (QC-passed reads + QC-failed reads) minus the supplementary alignments should equal the sum of reads in the two fastq files. Does it?
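The bookkeeping in Q3 can be sketched with shell arithmetic; the flagstat numbers below are made up for illustration, and your actual output will differ:

```shell
# Hypothetical flagstat values (illustration only)
total=948250           # first line of flagstat: QC-passed + QC-failed
supplementary=4        # supplementary alignments reported by flagstat
reads_per_file=474123  # read count per fastq file, known from Ex. 3

echo $(( total - supplementary ))   # alignments attributable to input reads
echo $(( reads_per_file * 2 ))      # total input reads
# Both print 948246, so the numbers add up in this made-up example
```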

Q4. What is the percentage of unmapped reads (contamination)?

Identifying SNPs

As a first step, sort the BAM file according to the mapping position.

General command:

$ samtools sort -O BAM -o [file.sort.aln.bam] [file.aln.bam]


-O BAM specifies that the output format should be BAM

file.sort.aln.bam: The sorted output alignment file in BAM format

file.aln.bam: The unsorted input alignment file in BAM format

Specific command:

$ samtools sort -O BAM -o NZ_CP009361_SRR4114395.sort.aln.bam NZ_CP009361_SRR4114395.aln.bam

Before SNP-calling, we have to remove duplicates:

General command:

$ java -Xmx2g -jar ~/picard/build/libs/picard.jar MarkDuplicates INPUT=[file.sort.aln.bam] OUTPUT=[file.sort.rmdup.bam] ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true


ASSUME_SORTED=TRUE: Tells the program that we have sorted the alignment file beforehand even though the header may say otherwise.

METRICS_FILE=/dev/null: Specification of file to write duplication metrics to. It is required to provide a file name, but since we are not interested in this output, we just specify /dev/null, which means it will be discarded.

VALIDATION_STRINGENCY=SILENT: Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance (speed) when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

REMOVE_DUPLICATES=true: Duplicate reads are removed, not just marked.

file.sort.aln.bam: The sorted input alignment file in BAM format.

file.sort.rmdup.bam: The resulting sorted alignment file without duplicate reads.

Specific command:

$ java -Xmx2g -jar ~/picard/build/libs/picard.jar MarkDuplicates INPUT=NZ_CP009361_SRR4114395.sort.aln.bam OUTPUT=NZ_CP009361_SRR4114395.sort.rmdup.bam ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
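To see what writing to /dev/null does, try this small sketch with standard shell commands:

```shell
# Output redirected to /dev/null simply disappears; nothing is stored on disk
echo "duplication metrics we do not care about" > /dev/null

# Ordinary output is unaffected
echo "output we keep"
```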

Q5: Use “samtools flagstat” to see how many reads were removed this way:

$ samtools flagstat NZ_CP009361_SRR4114395.sort.rmdup.bam

Now we are almost ready to call SNPs. First, the reference must be prepared using “samtools faidx”:

General command:

$ samtools faidx [reference.fasta]

Specific command:

$ samtools faidx NZ_CP009361.fasta

Then we use “samtools mpileup” and “bcftools call” to identify variants (SNPs and INDELs):

General command:

$ samtools mpileup -uf [reference.fasta] [file.sort.rmdup.bam] | bcftools call -m -v -O z - > output.var.raw.vcf.gz

samtools mpileup generates a VCF file from the input BAM file

-u: Generate uncompressed VCF output, which is preferred for piping

-f: The faidx-indexed reference file in the FASTA format

Note: The output file in VCF format from the samtools mpileup command is piped into the “bcftools call” command, where the variants are identified and finally saved as a compressed VCF file (“> output.var.raw.vcf.gz”).

-m: Alternative model for multiallelic and rare-variant calling designed to overcome known limitations in -c calling (the previous) model

-v: Output variant sites only

-O z: Output type. “z” means the VCF file is compressed.

Specific command:

$ samtools mpileup -uf NZ_CP009361.fasta NZ_CP009361_SRR4114395.sort.rmdup.bam | bcftools call -m -v -O z - > NZ_CP009361_SRR4114395.var.raw.vcf.gz

This step takes a while - have a cup of coffee 🙂

To look at the resulting output file write the following:

$ zcat NZ_CP009361_SRR4114395.var.raw.vcf.gz | less -S

Note that the format is VCF (see the VCF specification for a full description). The header lines start with “##”. Next, each variant is described in one line.
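As a sketch, a VCF file looks roughly like this (all values made up for illustration; real files have many more header lines and the columns are tab-separated):

```
##fileformat=VCFv4.2
##contig=<ID=NZ_CP009361>
#CHROM       POS   ID  REF  ALT  QUAL  FILTER  INFO   FORMAT  SAMPLE
NZ_CP009361  1234  .   A    G    225   .       DP=52  GT:PL   1/1:255,120,0
```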

You can count how many variants this raw VCF file contains using the following command:

$ zcat NZ_CP009361_SRR4114395.var.raw.vcf.gz | grep -v "#" | wc

grep -v “#”: Outputs all lines except the ones that contain “#” (header lines)

wc: Short for word count. The first column in the output is the number of lines. The second column indicates the number of words. The last column indicates the number of characters.
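You can see the effect of this command on a toy example (made-up lines, standard tools only; here “wc -l” is used to count lines only):

```shell
# Two header lines (contain "#") and two variant lines
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\nchr1\t200\n' \
  | grep -v "#" | wc -l
# Prints 2: only the variant lines are counted
```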

Q6: How many variants were detected in the raw VCF file?

The raw, unfiltered number of variants is a huge overestimate of the true number of variants. There are many different ways to filter the variants to obtain a more accurate number. Below is one example:

Specific command:

$ zcat NZ_CP009361_SRR4114395.var.raw.vcf.gz | vcf-annotate --filter d=30/Q=200/c=2,10 |  grep -v "#" | grep "PASS" | grep "1/1" | grep -v "INDEL" | wc

d=30: Min. depth = 30

Q=200: Min. variant quality (the QUAL field) = 200

c=2,10: Filters out clusters of 2 or more variants within a 10 bp window

grep “PASS”: Note that vcf-annotate will only mark the variants, so we afterwards have to specifically select the variants that passed the filtering using “grep “PASS””.

grep “1/1”: Further, we only want homozygous SNPs, since bacteria are haploid and cannot be heterozygous. Heterozygous SNPs will be marked “0/1”.

grep -v “INDEL”: In this example we are only interested in SNPs, not INDELS. When “-v” is added between “grep” and the word that you look for, the output is the lines that do not contain the word.
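The whole filter chain can be tried on a toy file first (made-up lines; the column layout is simplified compared to a real VCF):

```shell
# Four made-up variant lines plus one header line
printf '%s\n' \
  '#CHROM POS FILTER INFO GT' \
  'chr1 100 PASS DP=40 1/1' \
  'chr1 200 MinDP DP=10 1/1' \
  'chr1 300 PASS INDEL;DP=50 1/1' \
  'chr1 400 PASS DP=35 0/1' > demo.vcf

# Keep non-header, PASS, homozygous, non-INDEL lines: only "chr1 100" survives
grep -v "#" demo.vcf | grep "PASS" | grep "1/1" | grep -v "INDEL" | wc -l
rm demo.vcf
```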

Q7. How many SNPs did you find now?

Copy the entire data folder with all its contents to your local computer. This should be done from your local computer:

$ scp -i  ~/.ssh/[KeyPair.pem] -r ubuntu@[IP-Address]:~/data/ .

Note there is a space before the final dot “.”.

Important: When you have copied everything you need to your local computer, terminate the instance via the AWS control panel.