Advanced workshop ex6


The purpose of this exercise is to determine the Multilocus Sequence Type (MLST) and plasmid profile of the Staphylococcus aureus ATCC 25923 isolate that we have been working with in the previous exercises. We will do this by using the draft genomes we created in Ex. 4 as input for the MLST and PlasmidFinder methods from the Center for Genomic Epidemiology (CGE), which we will run from the command line on an AWS Linux instance.


The draft assemblies that were generated in Ex. 4.




Launch a new Amazon Instance

See Ex. 2 if you have forgotten how to launch an AWS instance. As the Amazon Machine Image (AMI), select “Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type”. As the Instance Type, keep the default selection “t2.micro”.

Log in to the instance as previously using the new IP-Address:

$ ssh -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]

Copy data files to AWS instance

On the AWS instance, make a folder called “data” in the home directory (mkdir ~/data). Then, on your local machine, copy the two draft assembly files to the AWS instance (while standing in the directory containing the draft assemblies):

$ scp -i ~/.ssh/[KeyPair.pem] contigs_trimmed.fasta  ec2-user@[IP-Address]:~/data/.

$ scp -i ~/.ssh/[KeyPair.pem] contigs_untrimmed.fasta  ec2-user@[IP-Address]:~/data/.

Confirm that the two draft assemblies are now on the AWS instance in the data folder.
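One way to make this check more informative than a plain ls is to count the contigs in each file, since every contig in a FASTA file starts with a header line beginning with “>”. The sketch below assumes the files end in .fasta; count_contigs is a hypothetical helper name chosen for this example, not part of any CGE tool.

```shell
# Print "<file name><TAB><number of FASTA header lines>" for every
# .fasta file in the folder given as the first argument.
# count_contigs is a hypothetical helper name used only in this sketch.
count_contigs() {
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue        # the pattern matched no files
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}

# Example (on the AWS instance): count_contigs ~/data
```

If a file is listed with 0 contigs, or is not listed at all, the copy most likely failed and should be repeated.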

Remember, if you want to copy an entire folder on your local machine to the data folder on AWS, you could instead write:

$ scp -i ~/.ssh/[KeyPair.pem] -r [path]/[to]/[folder] ec2-user@[IP-Address]:~/data/.

Where [path]/[to]/[folder] is the path to the folder and name of the folder you want to copy.

Setting up databases for CGE tools

First, update the installed packages and package cache on your instance.

$ sudo yum update -y

Install git:

$ sudo yum install git

On the AWS instance, create a folder in the home directory called “databases”:

$ mkdir ~/databases

Move to the new folder.

$ cd ~/databases

Now, download (pull) the CGE MLST database repository:

$ git clone

Similarly, download the CGE PlasmidFinder database repository:

$ git clone

Confirm that you now have two folders called mlst_db and plasmidfinder_db in the databases folder. The MLST and PlasmidFinder methods will expect the databases to be called “mlst”, and “plasmidfinder”, respectively, so we will just rename them:

$ mv ~/databases/mlst_db ~/databases/mlst

$ mv ~/databases/plasmidfinder_db ~/databases/plasmidfinder
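Since the tools expect these exact folder names, it can be worth checking them before moving on. The following is a minimal sketch; check_databases is a hypothetical helper name, and the default location ~/databases is the one used in this exercise.

```shell
# Report whether the database folders expected by the CGE tools exist.
# The first argument is the databases folder (default: ~/databases).
# check_databases is a hypothetical helper name used only in this sketch.
check_databases() {
    db_root="${1:-$HOME/databases}"
    status=0
    for name in mlst plasmidfinder; do
        if [ -d "$db_root/$name" ]; then
            echo "OK       $db_root/$name"
        else
            echo "MISSING  $db_root/$name"
            status=1
        fi
    done
    return "$status"
}

# Example (on the AWS instance): check_databases
```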

Note: All CGE databases can be downloaded in this simple way, except for the KmerFinder database, which is too large to be hosted on BitBucket. How to download the KmerFinder database is described in this document. We will not use KmerFinder in this exercise, though.

Installing Docker

Instead of installing the individual Perl and Python scripts for the CGE methods, we will run them via Docker images. This way we do not need to worry about dependencies. Install the most recent Docker package.

$ sudo yum install -y docker

Start the Docker service.

$ sudo service docker start

Add the ec2-user to the docker group so you can execute Docker commands without using sudo.

$ sudo usermod -a -G docker ec2-user

Exit and log in again (remember you can find old commands using the arrow-up) to make the new docker group permissions take effect.

Verify that the ec2-user can run Docker commands without sudo:

$ docker info

Note: In some cases, you may need to reboot your instance to provide permissions for the ec2-user to access the Docker daemon. Try rebooting your instance if you see the following error:

Cannot connect to the Docker daemon. Is the docker daemon running on this host?
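Before rebooting, you can check whether the docker group is active in your current login session. The sketch below uses the standard id command; in_group is a hypothetical helper name, and its optional second argument exists only so the function can be tried with a made-up group list.

```shell
# Succeeds if the group named in $1 appears in the space-separated
# group list in $2 (default: the groups of the current login session).
# in_group is a hypothetical helper name used only in this sketch.
in_group() {
    groups_list="${2:-$(id -nG)}"
    case " $groups_list " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

if in_group docker; then
    echo "docker group is active in this session"
else
    echo "docker group is not active yet: log out and in again, or reboot"
fi
```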

Downloading Docker images that include the CGE tools

First, confirm that you currently have no Docker images available:

$ docker images

You should see a single line:

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

To download (pull) the docker image with the MLST method, type:

$ docker pull goseqit/mlst_advanced_workshop_docker

And to pull the image with plasmidfinder, type:

$ docker pull goseqit/plasmidfinder_goseqit_docker

Confirm that the two docker images have been downloaded by again typing:

$ docker images

Running CGE tools on the AWS instance

The command for running the Docker images is a bit complicated, but you will see that each command is built up from roughly the same elements and only needs to be adapted slightly depending on which method you want to run.

First make two folders for storing the output files:

$ mkdir ~/MLST_results_trimmed ~/MLST_results_untrimmed

General command for running MLST:

$ docker run -ti --rm -w /output -v ~/[database folder]:/databases -v ~/[folder with input data]:/input -v ~/[folder to write output files to]:/output goseqit/mlst_advanced_workshop_docker MLST -f /input/[inputfile.fsa] -s [mlst scheme] > ~/[path and folder to write output files to]/log_MLST

Specific command (while standing in your home directory), when using contigs_trimmed.fasta as input file:

$ docker run -ti --rm -w /output -v ~/databases:/databases -v ~/data:/input -v ~/MLST_results_trimmed:/output goseqit/mlst_advanced_workshop_docker MLST -f /input/contigs_trimmed.fasta -s saureus > ~/MLST_results_trimmed/log_trimmed_MLST


docker run -ti --rm: This is the general command to run a command within a Docker container. The “--rm” means that the container is deleted when the run has finished.

-w /output: Sets the working directory inside the container to /output. Must be included, or the output files will only be created inside the Docker container.

-v ~/[database folder]:/databases: Here we point to the database folder - the folder to which we pulled the MLST database, which in this case is the folder “databases”.

-v ~/[folder with input data]:/input: Here we point to the folder that contains the input draft genomes, which in this case is the folder “data”.

-v ~/[folder to write output files to]:/output: Here we point to the folder in which the files related to running the MLST method should be saved. In this case it is the folder “MLST_results_trimmed”.

goseqit/mlst_advanced_workshop_docker: Here we point to the relevant Docker image.

MLST: This is the method we want to have run.

-f /input/contigs_trimmed.fasta: Here we point to the input draft genome. Note that “/input/” is not a directory on the system, but corresponds to “~/data” as specified earlier by “-v ~/data:/input”.

-s [scheme]: Here we specify which MLST scheme to use. Since we are working with an S. aureus draft genome, we will use the saureus scheme. In the config file in the MLST database, you can see all the possible MLST schemes to choose from.

> ~/MLST_results_trimmed/log_trimmed_MLST: Here we specify which file to write the log of the run to.

When the run has finished (takes 4-5 minutes), you can have a look at the log like this:

$ less ~/MLST_results_trimmed/log_trimmed_MLST

It initially contains a summary of the run and should contain the sentence “Program finished successfully!” approximately in the middle. It also contains the identified MLST alleles and sequence type written in JSON format. It is easier to look at this in the results files, which can all be found in ~/MLST_results_trimmed/MLST_saureus.

Note: When running MLST via the Docker image mlst_advanced_workshop_docker, a folder called MLST_[scheme] will automatically be generated in the output folder you specified in the docker run command.
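Instead of scrolling through the log with less, the success message can also be searched for directly. This is a minimal sketch that relies only on the sentence quoted above; run_succeeded is a hypothetical helper name, and the log path is the one used in the specific command of this exercise.

```shell
# Succeeds if the log file given as $1 contains the success message.
# run_succeeded is a hypothetical helper name used only in this sketch.
run_succeeded() {
    grep -q 'Program finished successfully!' "$1"
}

log=~/MLST_results_trimmed/log_trimmed_MLST
if [ -f "$log" ] && run_succeeded "$log"; then
    echo "MLST run finished successfully"
else
    echo "Success message not found in $log"
fi
```

If the message is missing, the MLST_saureus.err file described below is the next place to look.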

Below is a description of the relevant files present in the MLST_saureus folder.

results.txt: Contains the identified sequence type and a table in which the identified alleles are specified along with their resemblance to alleles in the MLST database. Also contains alignments of the identified MLST alleles and the corresponding sequence in the input genome.

results_tab.txt: Contains the same result table as in results.txt, but now with tab-separated columns for opening in, e.g., Excel.

MLST_allele_seq.fsa: The sequence of the identified MLST alleles in FASTA format.

Hit_in_genome_seq.fsa: The sequence of the loci in the input genome that align to the MLST alleles in FASTA format.

MLST_saureus.err: If the run is not executed according to plan, errors are written to this file. If there are no errors, the file should just contain:



Q1: What is the sequence type of our S. aureus isolate?

Q2: Are any of the identified MLST alleles in the isolate less than perfect matches to the MLST allele in the MLST database (for a perfect match, the %ID is 100 and the length of the alignment (HSP) equals the length of the database allele)? 

Now, run a similar command for the draft genome in contigs_untrimmed.fasta:

$ docker run -ti --rm -w /output -v ~/databases:/databases -v ~/data:/input -v ~/MLST_results_untrimmed:/output goseqit/mlst_advanced_workshop_docker MLST -f /input/contigs_untrimmed.fasta -s saureus > ~/MLST_results_untrimmed/log_untrimmed_MLST
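The two MLST commands differ only in the input file and the output folder. To make that explicit, the sketch below prints (it does not run) the docker command for a given input file, scheme, and output folder, following the general command shown earlier. mlst_command is a hypothetical helper name, and the log file name log_MLST is taken from the general command.

```shell
# Print (not run) the docker command for one MLST run.
# Arguments: input file name in ~/data, MLST scheme, output folder name in ~.
# mlst_command is a hypothetical helper name used only in this sketch.
mlst_command() {
    input_file="$1"; scheme="$2"; out_dir="$3"
    echo "docker run -ti --rm -w /output" \
         "-v ~/databases:/databases" \
         "-v ~/data:/input" \
         "-v ~/$out_dir:/output" \
         "goseqit/mlst_advanced_workshop_docker MLST" \
         "-f /input/$input_file -s $scheme" \
         "> ~/$out_dir/log_MLST"
}

mlst_command contigs_trimmed.fasta saureus MLST_results_trimmed
mlst_command contigs_untrimmed.fasta saureus MLST_results_untrimmed
```

Printing the command first is a cheap way to spot a wrong folder or file name before starting a run that takes several minutes.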

Q3: What is the sequence type of the draft assembly generated on the basis of the untrimmed reads? Was this to be expected?

Q4: What would you write as MLST scheme (instead of “saureus”) in the above commands, if your draft genome were from an E. coli isolate?

Now let’s look at how to run PlasmidFinder to identify plasmid replicons. The general command is very similar to the one we used when running MLST.

General command for running PlasmidFinder:

$ docker run -ti --rm -w /output -v ~/[database folder]:/databases -v ~/[folder with input data]:/input -v ~/[folder to write output files to]:/output goseqit/plasmidfinder_goseqit_docker PlasmidFinder -f /input/[inputfile.fsa] -s [sub_database] -k [%ID] > ~/[path and folder to write output files to]/log_PlasmidFinder


goseqit/plasmidfinder_goseqit_docker: the relevant Docker image.

PlasmidFinder: The method we want to run.

-s [sub_database]: Instead of MLST scheme, you should specify which PlasmidFinder sub database to use. You can look into the config file of the PlasmidFinder database to see which sub databases to choose among (spoiler alert: enterobacteriaceae or gram_positive).

-k [%ID]: For PlasmidFinder it is possible to specify a minimum threshold for the %identity (an integer between 50 and 100) between a replicon in the PlasmidFinder databases and a corresponding sequence in the input genome for the hit to be reported. The default minimum %identity is 90%.
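If you script your PlasmidFinder runs, it can help to check the -k value before starting a run. The sketch below assumes the allowed range includes both 50 and 100; valid_threshold is a hypothetical helper name and is not part of PlasmidFinder.

```shell
# Succeeds if $1 is an integer between 50 and 100.
# Assumption: both end points (50 and 100) are allowed.
# valid_threshold is a hypothetical helper name used only in this sketch.
valid_threshold() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;     # empty, or contains a non-digit
    esac
    [ "$1" -ge 50 ] && [ "$1" -le 100 ]
}

for k in 90 45 9x; do
    if valid_threshold "$k"; then
        echo "$k: ok"
    else
        echo "$k: not allowed"
    fi
done
```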

If you have many input draft genomes, and not just two as in this exercise, it becomes cumbersome to type in a command for each input file. In that case, follow the procedure below to run the same command for all input files.

1) Make sure all your input files are located in the same folder. For us, this is already the case, as both contigs_trimmed.fasta and contigs_untrimmed.fasta are located in ~/data. Besides the input files, the folder should be empty.

2) Move to the folder with the input files.

$ cd ~/data

3) Now type in the below small bash program (hitting Enter after each line):

$ for file in *; do

$ mkdir ~/"$file"_output

$ done

The first line means “for each of the files in the folder do the following”. The middle line means “make a directory in the home directory called the file name concatenated with “_output””. The third line just specifies that it is the end of the program. The above program should have created the folders “contigs_trimmed.fasta_output” and “contigs_untrimmed.fasta_output” in your home directory. Confirm it is true.
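The loop above processes every entry in the folder, which is why step 1 asks for the folder to contain nothing but input files. A possible variant, sketched below, only picks up files ending in .fasta, so a stray file in the data folder does not get an output folder. The output folder names stay the same as above. make_output_dirs is a hypothetical helper name, and the .fasta ending is an assumption based on the file names used in this exercise.

```shell
# Create "<file name>_output" in out_root for every .fasta file in data_dir.
# Arguments: data_dir (folder with input files), out_root (where to create folders).
# make_output_dirs is a hypothetical helper name used only in this sketch.
make_output_dirs() {
    data_dir="$1"; out_root="$2"
    for path in "$data_dir"/*.fasta; do
        [ -e "$path" ] || continue            # no .fasta files at all
        file=$(basename "$path")
        mkdir -p "$out_root/${file}_output"   # -p: no error if it already exists
    done
}

# Example (on the AWS instance): make_output_dirs ~/data ~
```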

4) Now, to run PlasmidFinder for each of the input files, stand in the ~/data folder and type the following:

$ for file in *; do

$ docker run -ti --rm -w /output -v ~/databases:/databases -v ~/data:/input -v ~/"$file"_output:/output goseqit/plasmidfinder_goseqit_docker PlasmidFinder -f /input/"$file" -s gram_positive -k 90 > ~/"$file"_output/log_PlasmidFinder

$ done

Again, the first line means “for each of the files in the folder do the following”. The second line contains the command for running PlasmidFinder via the Docker image. Notice the three times we write “$file” instead of the actual file name. The command can easily be substituted with, e.g., the command for running MLST.

After a few minutes, you can find the PlasmidFinder result files in the folders ~/contigs_trimmed.fasta_output/PF_gram_positive and ~/contigs_untrimmed.fasta_output/PF_gram_positive. The files are equivalent to the files generated when running MLST.
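With many input files, checking each output folder by hand becomes tedious. The sketch below lists, for every folder ending in _output, whether a result table is present. It assumes that PlasmidFinder writes a results_tab.txt file like MLST does, which follows from the statement above that the files are equivalent; list_plasmidfinder_results is a hypothetical helper name.

```shell
# For every "*_output" folder under $1, report whether the PlasmidFinder
# result table exists.
# Assumption: the table is called results_tab.txt, as for MLST.
# list_plasmidfinder_results is a hypothetical helper name used only here.
list_plasmidfinder_results() {
    for dir in "$1"/*_output; do
        [ -d "$dir" ] || continue             # no output folders found
        table="$dir/PF_gram_positive/results_tab.txt"
        if [ -f "$table" ]; then
            echo "found    $table"
        else
            echo "missing  $table"
        fi
    done
}

# Example (on the AWS instance): list_plasmidfinder_results ~
```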

Q5: Did you find the same plasmid replicons for both assemblies?

All commands for running the most used CGE -finder methods can be found HERE. Besides MLST and PlasmidFinder, they include KmerFinder, FimTyper, pMLST, PointFinder, ResFinder, SerotypeFinder, and VirulenceFinder. The document also describes how to search your own, customised gene database.

End of the exercise.


Extra - if you finish early. AWS instance not needed: Manipulating the CGE databases using SourceTree

If you want to download the CGE databases to your local computer, I suggest you use SourceTree. In a first step, you should create an account on BitBucket. It is free: .

Go through this online tutorial to set up SourceTree: .

You should go through the steps “Install SourceTree”, “Connect your BitBucket or GitHub account”, and “Clone a remote repository” (in this case the remote repository is the CGE database you want a local copy of on your computer, e.g.,  or

If you want to use a static or customised version of a database, and not the one from, which is regularly updated/changed, you can copy the local folder containing the database to a new folder, and add this new folder as a new repository to your own BitBucket account. Creating a new repository via SourceTree is quite easy, see: Now, when you download (pull) the database to the AWS instance, replace the URL to the CGE repository with the URL of your own repository.