The purpose of this exercise is to determine the Multilocus Sequence Type (MLST) and plasmid profile of the Staphylococcus aureus ATCC 25923 isolate that we have been working with in the previous exercises. We will do this by using the draft genomes we created in Ex. 4 as input for the MLST and PlasmidFinder methods from the Center for Genomic Epidemiology (CGE), which we will run from the command line on an AWS Linux instance.
The draft assemblies that were generated in Ex. 4.
Launch a new Amazon Instance
See Ex. 2, if you have forgotten how to launch an AWS instance. As Amazon Machine Image (AMI) select “Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type”. As Instance Type keep the default selection “t2.micro”.
Log in to the instance as previously using the new IP-Address:
$ ssh -i ~/.ssh/[KeyPair.pem] ec2-user@[IP-Address]
Copy data files to AWS instance
On the AWS instance, make a folder called “data” in the home directory (mkdir ~/data). Then, on your local machine, copy the two draft assembly files to the AWS instance (while standing in the directory containing the draft assemblies):
$ scp -i ~/.ssh/[KeyPair.pem] contigs_trimmed.fasta ec2-user@[IP-Address]:~/data/.
$ scp -i ~/.ssh/[KeyPair.pem] contigs_untrimmed.fasta ec2-user@[IP-Address]:~/data/.
Confirm that the two draft assemblies are now on the AWS instance in the data folder.
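One way to make this check on the instance is to count the FASTA header lines (lines starting with ">") in each assembly, which gives the number of contigs in the file. The sketch below is only an illustration: the function name count_contigs is invented for this example and is not part of any CGE tool.

```shell
# count_contigs is a hypothetical helper, not part of any CGE tool.
# It prints the number of FASTA header lines (lines starting with ">") in a file.
count_contigs() {
    grep -c '^>' "$1"
}

# Report every .fasta file in ~/data (prints nothing if there are none).
for f in ~/data/*.fasta; do
    [ -f "$f" ] || continue
    echo "$f: $(count_contigs "$f") contigs"
done
```

If both files were copied correctly, there should be one line of output per assembly.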
Remember, if you want to copy an entire folder on your local machine to the data folder on AWS, you could instead write:
$ scp -i ~/.ssh/[KeyPair.pem] -r [path]/[to]/[folder] ec2-user@[IP-Address]:~/data/.
Where [path]/[to]/[folder] is the path to the folder and name of the folder you want to copy.
Setting up databases for CGE tools
First, update the installed packages and package cache on your instance, and install git.
$ sudo yum update -y
$ sudo yum install git
On the AWS instance, create a folder in the home directory called “databases”:
$ mkdir ~/databases
Move to the new folder.
$ cd ~/databases
Now, download (pull) the CGE MLST database repository:
$ git clone https://bitbucket.org/genomicepidemiology/mlst_db
Similarly, download the CGE PlasmidFinder database repository:
$ git clone https://bitbucket.org/genomicepidemiology/plasmidfinder_db
Confirm that you now have two folders called mlst_db and plasmidfinder_db in the databases folder. The MLST and PlasmidFinder methods expect the databases to be called “mlst” and “plasmidfinder”, respectively, so we will rename them:
$ mv ~/databases/mlst_db ~/databases/mlst
$ mv ~/databases/plasmidfinder_db ~/databases/plasmidfinder
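To confirm that the renaming worked, a small check along the following lines could be used. This is a minimal sketch: check_db_dirs is a name made up for this exercise, and it only tests that the two folders exist, not that their contents are complete.

```shell
# check_db_dirs is a hypothetical helper for this exercise.
# It reports whether the renamed database folders exist under the given folder.
check_db_dirs() {
    root="$1"
    status=0
    for name in mlst plasmidfinder; do
        if [ -d "$root/$name" ]; then
            echo "OK: $root/$name"
        else
            echo "MISSING: $root/$name"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_db_dirs ~/databases should print two lines starting with OK if both folders are in place.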
Note: All CGE databases can be downloaded in this simple way, except for the KmerFinder database, which, due to its large size, cannot be hosted by BitBucket. How to download the KmerFinder database is described in this document. We will not use KmerFinder in this exercise, though.
Instead of installing the individual Perl and Python scripts for the CGE methods, we will run them via Docker images. This way we do not need to worry about dependencies. Install the most recent Docker package.
$ sudo yum install -y docker
Start the Docker service.
$ sudo service docker start
Add the ec2-user to the docker group so you can execute Docker commands without using sudo.
$ sudo usermod -a -G docker ec2-user
Exit and log in again (remember you can find old commands using the arrow-up) to make the new docker group permissions take effect.
Verify that the ec2-user can run Docker commands without sudo:
$ docker info
Note: In some cases, you may need to reboot your instance to provide permissions for the ec2-user to access the Docker daemon. Try rebooting your instance if you see the following error:
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
Downloading Docker images that include the CGE tools
First, confirm that you currently have no Docker images available:
$ docker images
You should see a single line:
REPOSITORY TAG IMAGE ID CREATED SIZE
To download (pull) the docker image with the MLST method, type:
$ docker pull goseqit/mlst_advanced_workshop_docker
And to pull the image with plasmidfinder, type:
$ docker pull goseqit/plasmidfinder_goseqit_docker
Confirm that the two docker images have been downloaded by again typing:
$ docker images
Running CGE tools on the AWS instance
The command for running the docker images is a bit complicated, but you will see that each command is built up from roughly the same elements and only needs to be adapted slightly depending on which method you want to run.
First make two folders for storing the output files:
$ mkdir ~/MLST_results_trimmed ~/MLST_results_untrimmed
General command for running MLST:
$ docker run -ti --rm -w /output -v ~/[database folder]:/databases -v ~/[folder with input data]:/input -v ~/[folder to write output files to]:/output goseqit/mlst_advanced_workshop_docker MLST -f /input/[inputfile.fsa] -s [mlst scheme] > ~/[path and folder to write output files to]/log_MLST
Specific command (while standing in your home directory), when using contigs_trimmed.fasta as input file:
$ docker run -ti --rm -w /output -v ~/databases:/databases -v ~/data:/input -v ~/MLST_results_trimmed:/output goseqit/mlst_advanced_workshop_docker MLST -f /input/contigs_trimmed.fasta -s saureus > ~/MLST_results_trimmed/log_trimmed_MLST
docker run -ti --rm: This is the general command to run a command within a docker container. The ”--rm” means that the container is deleted, when the run has finished.
-w /output: Sets the working directory inside the container to /output. Must be included or the output files will only be created inside the docker container.
-v ~/[database folder]:/databases: Here we point to the database folder - the folder to which we pulled the MLST database, which in this case is the folder “databases”.
-v ~/[folder with input data]:/input: Here we point to the folder that contains the input draft genomes, which in this case is the folder “data”.
-v ~/[folder to write output files to]:/output: Here we point to the folder in which the files related to running the MLST method should be saved. In this case it is the folder “MLST_results_trimmed”.
goseqit/mlst_advanced_workshop_docker: Here we point to the relevant Docker image.
MLST: This is the method we want to have run.
-f /input/contigs_trimmed.fasta: Here we point to the input draft genome. Note that “/input/” is not a directory on the system, but corresponds to “~/data” as specified earlier by “-v ~/data:/input”.
-s [scheme]: Here we specify which MLST scheme to use. Since we are working with a S. aureus draft genome, we will use the saureus scheme. In the config file in the MLST database, you can see all the possible MLST schemes to choose among.
> ~/MLST_results_trimmed/log_trimmed_MLST: Here we specify which file to write the log of the run to.
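The -v mappings can be confusing at first, because the same file has one path on the instance and another path inside the container. The sketch below illustrates that translation with plain string handling only. The function to_container_path is a hypothetical helper written for this explanation; Docker does not provide it and does not need it.

```shell
# to_container_path is a hypothetical helper that mimics a "-v host_dir:mount" mapping.
# Usage: to_container_path <host_file> <host_dir> <mount_point>
to_container_path() {
    host_file="$1"; host_dir="${2%/}"; mount="${3%/}"
    case "$host_file" in
        "$host_dir"/*) echo "$mount/${host_file#"$host_dir"/}" ;;
        *) echo "not under $host_dir: $host_file" >&2; return 1 ;;
    esac
}

# With "-v ~/data:/input", the container sees this file as /input/contigs_trimmed.fasta.
to_container_path "$HOME/data/contigs_trimmed.fasta" "$HOME/data" /input
```

A file outside the mapped folder has no path inside the container, which is why the helper returns an error in that case.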
When the run has finished (takes 4-5 minutes), you can have a look at the log like this:
$ less ~/MLST_results_trimmed/log_trimmed_MLST
It initially contains a summary of the run and should contain the sentence “Program finished successfully!” approximately in the middle. It also contains the identified MLST alleles and sequence type written in JSON format. It is easier to look at this in the results files, which can all be found in ~/MLST_results_trimmed/MLST_saureus.
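Instead of scrolling through the log, you could also search it for the success sentence. The following is a minimal sketch; run_succeeded is an invented name, and the check only looks for the sentence quoted above, so it says nothing about the quality of the MLST result itself.

```shell
# run_succeeded is a hypothetical helper: it succeeds if the log file contains
# the success sentence quoted in this exercise.
run_succeeded() {
    grep -q 'Program finished successfully!' "$1"
}

log=~/MLST_results_trimmed/log_trimmed_MLST
if [ -f "$log" ]; then
    if run_succeeded "$log"; then
        echo "MLST run finished successfully"
    else
        echo "No success message found in $log"
    fi
fi
```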
Note: When running MLST via the Docker image mlst_advanced_workshop_docker, a folder called MLST_[scheme] will automatically be generated in the output folder you specified in the docker run command.
Below is a description of the relevant files present in the MLST_saureus folder.
results.txt: Contains the identified sequence type and a table in which the identified alleles are specified along with their resemblance to alleles in the MLST database. Also contains alignments of the identified MLST alleles and the corresponding sequence in the input genome.
results_tab.txt: Contains the same result table as in results.txt, but now with tab-separated columns for opening in, e.g., Excel.
MLST_allele_seq.fsa: The sequence of the identified MLST alleles in FASTA format.
Hit_in_genome_seq.fsa: The sequence of the loci in the input genome that align to the MLST alleles in FASTA format.
MLST_saureus.err: If the run is not executed according to plan, errors are written to this file. If there are no errors, the file should just contain:
Q1: What is the sequence type of our S. aureus isolate?
Q2: Are any of the identified MLST alleles in the isolate less than perfect matches to the MLST allele in the MLST database (for a perfect match, the %ID is 100 and the length of the alignment (HSP) equals the length of the database allele)?
Now, run a similar command for the draft genome in contigs_untrimmed.fasta:
$ docker run -ti --rm -w /output -v ~/databases:/databases -v ~/data:/input -v ~/MLST_results_untrimmed:/output goseqit/mlst_advanced_workshop_docker MLST -f /input/contigs_untrimmed.fasta -s saureus > ~/MLST_results_untrimmed/log_untrimmed_MLST
Q3: What is the sequence type of the draft assembly generated on the basis of the untrimmed reads? Was this to be expected?
Q4: What would you write as MLST scheme (instead of “saureus”) in the above commands, if your draft genome was an E. coli?
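The config file in the MLST database folder can be inspected with less, but a short listing of the scheme names may be easier to scan. The sketch below is based on an assumption about the file layout that you should verify by looking at the file first: that it is tab-separated, that comment lines start with "#", and that the first column holds the scheme name. The function name list_schemes is invented for this example.

```shell
# list_schemes is a hypothetical helper.
# ASSUMPTION: the config file is tab-separated, comment lines start with "#",
# and the first column holds the scheme name. Check the file to confirm this.
list_schemes() {
    awk -F '\t' '!/^#/ && NF > 0 { print $1 }' "$1"
}

config=~/databases/mlst/config
if [ -f "$config" ]; then
    list_schemes "$config"
fi
```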
Now let’s look at how to run PlasmidFinder to identify plasmid replicons. The general command is very similar to the one we used when running MLST.
General command for running PlasmidFinder:
$ docker run -ti --rm -w /output -v ~/[database folder]:/databases -v ~/[folder with input data]:/input -v ~/[folder to write output files to]:/output goseqit/plasmidfinder_goseqit_docker PlasmidFinder -f /input/[inputfile.fsa] -s [sub_database] -k [%ID] > ~/[path and folder to write output files to]/log_PlasmidFinder
goseqit/plasmidfinder_goseqit_docker: the relevant Docker image.
PlasmidFinder: The method we want to run.
-s [sub_database]: Instead of MLST scheme, you should specify which PlasmidFinder sub database to use. You can look into the config file of the PlasmidFinder database to see which sub databases to choose among (spoiler alert: enterobacteriaceae or gram_positive).
-k [%ID]: For PlasmidFinder it is possible to specify a minimum threshold for the %identity (an integer between 50 and 100) between a replicon in the PlasmidFinder databases and a corresponding sequence in the input genome for the hit to be reported. The default minimum %identity is 90%.
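If the threshold is passed in from a script or typed by hand, it can be worth checking it before starting the run. The sketch below checks only the rule stated above (an integer from 50 to 100); valid_identity is a made-up name and is not an option or function of PlasmidFinder.

```shell
# valid_identity is a hypothetical helper: it succeeds for integers from 50 to 100.
valid_identity() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty input and anything that is not all digits
    esac
    [ "$1" -ge 50 ] && [ "$1" -le 100 ]
}

valid_identity 90 && echo "90 is an acceptable value for -k"
```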
If you have a lot of input draft genomes and not just two as in this exercise, it becomes cumbersome to type in a command for each input file. In that case, follow the procedure below to run the same command for all input files.
1) Make sure all your input files are located in the same folder. For us, this is already the case, as both contigs_trimmed.fasta and contigs_untrimmed.fasta are located in ~/data. Besides the input files, the folder should be empty.
2) Move to the folder with the input files.
$ cd ~/data
3) Now type in the below small bash program (hitting Enter after each line):
$ for file in *; do
$ mkdir ~/"$file"_output
$ done
The first line means “for each of the files in the folder do the following”. The middle line means “make a directory in the home directory named after the file, with _output appended”. The third line just marks the end of the program. The above program should have created the folders “contigs_trimmed.fasta_output” and “contigs_untrimmed.fasta_output” in your home directory. Confirm that this is the case.
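A slightly more defensive variant of the program above is sketched here. It only picks up files ending in .fasta, so other files in the folder are ignored, and it uses mkdir -p so that running it a second time does not fail on folders that already exist. The function name make_output_dirs and its two arguments are invented for this illustration.

```shell
# make_output_dirs is a hypothetical helper.
# Usage: make_output_dirs <input_dir> <parent_dir_for_output_folders>
make_output_dirs() {
    in_dir="$1"; out_root="$2"
    for path in "$in_dir"/*.fasta; do
        [ -f "$path" ] || continue            # skip if no .fasta files match
        file=$(basename "$path")
        mkdir -p "$out_root/${file}_output"   # e.g. contigs_trimmed.fasta_output
    done
}
```

For this exercise, the call would be make_output_dirs ~/data ~ , which creates the same two folders as the program above.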
4) Now, to run PlasmidFinder for each of the input files, stand in the ~/data folder and type the following:
$ for file in *; do
$ docker run -ti --rm -w /output -v ~/databases:/databases -v ~/data:/input -v ~/"$file"_output:/output goseqit/plasmidfinder_goseqit_docker PlasmidFinder -f /input/"$file" -s gram_positive -k 90 > ~/"$file"_output/log_PlasmidFinder
$ done
Again, the first line means “for each of the files in the folder do the following”. The second line contains the command for running PlasmidFinder via the Docker image. Notice the three times we write “$file” instead of the actual file name. The command can easily be substituted with, e.g., the command for running MLST.
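As an illustration of that substitution, the sketch below prints (rather than runs) the MLST command for every .fasta file in a folder, so that the commands can be inspected before anything is executed. The function name print_mlst_cmds and the log file name log_MLST are choices made for this example; the docker arguments are copied from the MLST command used earlier in this exercise.

```shell
# print_mlst_cmds is a hypothetical helper: it prints one MLST command per .fasta file.
# Usage: print_mlst_cmds <input_dir> <scheme>
print_mlst_cmds() {
    in_dir="$1"; scheme="$2"
    for path in "$in_dir"/*.fasta; do
        [ -f "$path" ] || continue
        file=$(basename "$path")
        # The "~" is inside quotes, so it is printed literally and not expanded here.
        echo "docker run -ti --rm -w /output -v ~/databases:/databases" \
             "-v $in_dir:/input -v ~/${file}_output:/output" \
             "goseqit/mlst_advanced_workshop_docker MLST" \
             "-f /input/$file -s $scheme > ~/${file}_output/log_MLST"
    done
}
```

Calling print_mlst_cmds ~/data saureus should print one command per assembly. Printing first makes it easy to spot a wrong folder or scheme name before several minutes are spent on each run.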
After a few minutes, you can find the PlasmidFinder result files in the folders ~/contigs_trimmed.fasta_output/PF_gram_positive and ~/contigs_untrimmed.fasta_output/PF_gram_positive. The files are equivalent to the files generated when running MLST.
Q5: Did you find the same plasmid replicons for both assemblies?
All commands for running the most used CGE -finder methods can be found HERE. Besides MLST and PlasmidFinder, they include KmerFinder, FimTyper, pMLST, PointFinder, ResFinder, SerotypeFinder, and VirulenceFinder. The document also describes how to search your own, customised gene database.
End of the exercise.
REMEMBER TO TERMINATE THE AWS INSTANCE WHEN YOU HAVE COPIED EVERYTHING YOU NEED TO YOUR LOCAL COMPUTER.
Extra - if you finish early. AWS instance not needed: Manipulating the CGE databases using SourceTree
If you want to download the CGE databases to your local computer, I suggest you use SourceTree. In a first step, you should create an account on BitBucket. It is free: https://bitbucket.org/account/signup/ .
Go through this online tutorial to set up SourceTree: https://confluence.atlassian.com/get-started-with-sourcetree/install-sourcetree-847359094.html .
You should go through the steps “Install SourceTree”, “Connect your BitBucket or GitHub account”, and “Clone a remote repository” (in this case the remote repository is the CGE database you want a local copy of on your computer, e.g., https://bitbucket.org/genomicepidemiology/mlst_db or https://bitbucket.org/genomicepidemiology/virulencefinder_db).
If you want to use a static or customised version of a database, and not the one from https://bitbucket.org/genomicepidemiology/, which is regularly updated/changed, you can copy the local folder containing the database to a new folder, and add this new folder as a new repository to your own BitBucket account. Creating a new repository via SourceTree is quite easy, see: https://confluence.atlassian.com/sourcetreekb/create-a-new-repository-with-sourcetree-780870052.html. Now, when you download (pull) the database to the AWS instance, replace the URL to the CGE repository with the URL of your own repository.