Exercise 4

Identification of Unique Core Sequences

In this exercise, we will use the RUCS method to identify sequences that are unique to a set of bacterial strains sharing a particular phenotype. The same method can also be used to design PCR primers that exclusively target a specific set of strains.

Data

To use the method, you need whole-genome sequencing (WGS) data from two sets of strains: a negative set in which no strain displays the phenotype of interest, and a positive set in which every strain displays it. In this exercise we work with E. coli strains. The strains in the negative set are all susceptible to the antibiotic colistin, while the strains in the positive set are all resistant.

NEGATIVE SET:

Exercise4/ecoli_1.fa - ecoli_20.fa

(20 files in total)

POSITIVE SET:

Exercise4/ecoli_mcr1_1.fa - ecoli_mcr1_7.fa

(7 files in total)

Analysis

First, prepare your data in the layout that RUCS expects. Create a folder called "negatives" and move all the files from the negative set into it. Similarly, create a folder called "positives" and move all the files from the positive set into it. The folder names must be spelled exactly "positives" and "negatives", with no capital letters. Now select both folders and compress them into a single zipped file (select both folders, right-click, and choose "Compress 2 Items").
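If you prefer to script this preparation step, the following is a minimal Python sketch of the same folder layout. It assumes the files are named as listed under Data (ecoli_1.fa to ecoli_20.fa and ecoli_mcr1_1.fa to ecoli_mcr1_7.fa). The function name build_rucs_archive and the archive name rucs_input.zip are made up for illustration. The sketch copies the files rather than moving them, so the originals stay in place. Whether the RUCS upload form treats such an archive exactly like one made with "Compress 2 Items" is an assumption.

```python
import shutil
import zipfile
from pathlib import Path


def build_rucs_archive(source_dir, work_dir, archive_name="rucs_input.zip"):
    """Sort FASTA files into positives/ and negatives/ and zip both folders.

    The function and archive names are illustrative, not part of RUCS.
    """
    source_dir, work_dir = Path(source_dir), Path(work_dir)
    positives = work_dir / "positives"
    negatives = work_dir / "negatives"
    positives.mkdir(parents=True, exist_ok=True)
    negatives.mkdir(parents=True, exist_ok=True)

    for fasta in sorted(source_dir.glob("ecoli_*.fa")):
        # Files named ecoli_mcr1_*.fa form the positive set; the rest are negatives.
        target = positives if fasta.name.startswith("ecoli_mcr1_") else negatives
        shutil.copy2(fasta, target / fasta.name)

    archive = work_dir / archive_name
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in (negatives, positives):
            for fasta in sorted(folder.glob("*.fa")):
                # Store paths relative to work_dir so the folder names are kept.
                zf.write(fasta, fasta.relative_to(work_dir).as_posix())
    return archive
```

Calling build_rucs_archive("Exercise4", "rucs_upload") would leave a zipped file in the (hypothetical) rucs_upload folder, containing the lowercase "negatives" and "positives" folders.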

Go to the input page for the RUCS method. As "Entry point" select "Find Unique Core Sequences". Leave the remaining options as they are. As "Isolate File" select the zipped file you just generated. Click "Upload".

When the analysis has finished (or for speedy access, go to RESULTS), answer the following questions:

1. How many sequences do the positive strains have in common?

2. How many basepairs do the sequences from question 1 collectively contain?

3. How many of the sequences that are found in all the positive strains are only found in the positive strains (never in the negative strains)?

4. How many basepairs do the sequences from question 3 collectively contain? 

5. How many of the sequences from question 3 are larger than 200 bp? 

6. How many basepairs do the sequences from question 5 collectively contain? 

The unique core sequences that are larger than 200 bp are the ones that could potentially contain a gene conferring colistin resistance on the strains of the positive set. You can confirm that a colistin resistance gene is part of these sequences by clicking the button marked "UCS Contigs", which downloads these sequences (along with all the unique core sequences shorter than 200 bp) in FASTA format to your computer. Next, upload the file to ResFinder and search for acquired resistance genes.

7. Which resistance gene did ResFinder identify?
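As a cross-check on the sequence and base pair counts, the FASTA file downloaded via "UCS Contigs" can be summarised locally. The sketch below is a plain-Python FASTA reader, not part of RUCS. The function names and the file name ucs_contigs.fa are made up for illustration. It also assumes that the download holds exactly the sequences counted on the results page.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise(path, min_length=200):
    """Count sequences and base pairs, overall and for sequences > min_length."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    long_ones = [n for n in lengths if n > min_length]
    return {
        "sequences": len(lengths),
        "total_bp": sum(lengths),
        "sequences_over_min": len(long_ones),
        "total_bp_over_min": sum(long_ones),
    }
```

For example, summarise("ucs_contigs.fa") returns the number of sequences and their combined length, both for the whole file and for the subset larger than 200 bp.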

Since the colistin resistance gene was already known and part of the ResFinder database, using RUCS to discover the source of the resistance was quite a detour. If the positive phenotype were instead caused by a hitherto unknown gene, you might want to start by looking for open reading frames in the unique core sequences.

Go to the Virtual Ribosome. This tool has been developed at the Technical University of Denmark, although not by CGE. Upload the FASTA file from before via the "Choose File" button. As "Reading Frame" select "All (6 reading frames)", and as "ORF Finder" select "Start codon: Any". Click "Submit query".   

Virtual Ribosome is very fast, so it has not been prerun. When Virtual Ribosome has finished, you will see a translation report with all the possible peptide sequences. You can download the FASTA file with the peptide sequences by clicking the link marked "FASTA" at the top. To proceed from here, you might want to BLAST (using BLASTp) the peptide sequences against the non-redundant database of protein sequences (nr). If you did this with all the peptide sequences, you would find that the longest one has hits to colistin resistance genes.
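To make the idea of searching all six reading frames concrete, here is a minimal Python sketch of an ORF finder. It is not how Virtual Ribosome is implemented and does not reproduce its settings. It assumes the standard genetic code and, as a simplification, accepts only ATG as start codon, whereas the "Start codon: Any" option may admit other start codons. The function names and the min_peptide_length threshold are made up for illustration.

```python
# Standard genetic code, laid out in the usual T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(dna, min_peptide_length=30):
    """Return (strand, frame, peptide) for ATG-to-stop ORFs in all six frames.

    Assumption: only ATG counts as a start codon. The result is sorted with
    the longest peptide first.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Split at stop codons; the last piece has no stop and is skipped.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_peptide_length:
                    orfs.append((strand, frame + 1, segment[start:]))
    return sorted(orfs, key=lambda orf: len(orf[2]), reverse=True)
```

Applied to each sequence in the downloaded FASTA file, find_orfs lists candidate peptides longest first, which mirrors the observation above that the longest peptide is the one of interest.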