GoSeqIt Tools FAQ

General

What is the advantage of using the GoSeqIt Tools web-platform as opposed to running locally installed software?

We see several advantages in using a web-platform instead of locally installed software.

  1. The GoSeqIt Tools web-platform is backed by the computing power of Amazon Cloud Servers. This means you do not have to set up a local server, which is both expensive and time-consuming, or run the analyses on your own laptop, which is likely to be slow. If you upload many files to the GoSeqIt Tools web-platform at the same time, more computing power is added to the web-platform, so you never have to wait long for your results.
  2. GoSeqIt Tools can run from any computer as long as it has an internet connection. You can effortlessly access your account and work on your data from work, at home, or when you are travelling.
  3. You do not have to worry about updating the software or the underlying databases. Both are always up to date.
  4. Help is never far away. We provide support within 1-2 days if you send a mail to mail@goseqit.com.

Which type of data can be analysed using the GoSeqIt Tools platform?

You can analyse whole genome sequence data from bacteria. The sequence data must be pre-assembled into draft genomes before uploading it to the platform. The data must be in the form of FASTA files, which are simple text files with the extension .fasta, .fa, .fsa, or .fna. FASTA files are built up from FASTA entries, where each entry consists of a header line beginning with ">" followed by a unique name/identifier and some optional descriptive text. The remaining lines consist of the DNA sequence. Below, you can see an example of the content of a FASTA file with two FASTA entries:
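As an illustration of the format described above, the following Python sketch holds a small FASTA text with two entries and splits it into identifier, description, and sequence. The entry names (contig_1, contig_2), the description, and the sequences are made up for the example, and parse_fasta is a hypothetical helper, not part of GoSeqIt Tools.

```python
# Made-up FASTA content with two entries (names and sequences are illustrative only).
EXAMPLE_FASTA = """>contig_1 example description
ATGCGTACGTTAGC
GGCTAACGT
>contig_2
TTGACCGATAGC
"""


def _make_entry(header, chunks):
    # The identifier is the first word of the header; the rest is optional description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    entries = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous entry, if there is one.
            if header is not None:
                entries.append(_make_entry(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append(_make_entry(header, chunks))
    return entries


for identifier, description, sequence in parse_fasta(EXAMPLE_FASTA):
    print(identifier, len(sequence))
```

Each header line starts a new entry, and all following lines up to the next ">" are joined into one sequence.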

Are there any rules one should consider when naming the FASTA files before uploading them to the GoSeqIt Tools platform?

In general, it is advisable to restrict the names of the FASTA files to only contain the following characters:

a-z

A-Z

0-9

-

_

.

That being said, the tools will automatically convert other characters to an underscore, "_", to avoid problems when running the methods. If you nevertheless get an unexpected result, try removing any characters not listed above from the file name. Also check that your file contains sequence data in FASTA format (see the question "Which type of data can be analysed using the GoSeqIt Tools platform?"). The file extension has to be either .fa, .fna, .fsa, or .fasta.
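If you want to check or clean file names before uploading, a minimal Python sketch could look like the one below. The function names are hypothetical. The one-to-one replacement of each disallowed character with an underscore and the case-sensitive extension check are assumptions, since the exact behaviour of the platform is not documented here.

```python
import re

# Extensions accepted by the platform according to this FAQ.
ALLOWED_EXTENSIONS = (".fa", ".fna", ".fsa", ".fasta")


def sanitize_name(filename):
    """Replace every character outside a-z, A-Z, 0-9, '-', '_' and '.' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", filename)


def has_fasta_extension(filename):
    """Case-sensitive check against the listed extensions (an assumption)."""
    return filename.endswith(ALLOWED_EXTENSIONS)


print(sanitize_name("my sample (1).fasta"))
print(has_fasta_extension("my_sample.fasta"))
```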

How do I upload my input files to the GoSeqIt Tools platform?

For a full description of how to upload data and run an analysis, please have a look at this tutorial.

Short recap: You either drag-and-drop your files from your computer to the field marked "Drop files" (upper right side) or select the files you wish to have analysed by clicking the link "browse your computer", which is located just below the field marked "Drop files". If you do this while inside a specific project, the files and results will automatically be placed inside that project when you click "Upload and analyse...". Otherwise, you have to select whether the files and results should be placed in an existing or a new project before clicking "Upload and analyse...". Note that it is only possible to upload files with the extension .fa, .fna, .fsa, or .fasta, but you may upload as many as you like in one go.

I try to drop a file into the “Drop files” box, but nothing happens. Why?

Make sure that your files have the extension .fa, .fna, .fsa, or .fasta.

Tip: Windows generally hides known file extensions in file names. To disable this, go to 'Control panel' > 'Appearance and Personalization' > 'Folder options'. A new window will open, click on the 'view' tab and uncheck the box 'Hide extensions for known file types'.

Is it possible to upload raw/short sequence reads to the platform?

No, this is currently not possible. Your files have to contain pre-assembled draft genomes in FASTA format. We plan, however, that later versions of the platform will be able to accept raw/short sequence reads, which we will then assemble into draft genomes for you.

Which type of analysis is performed when uploading data to GoSeqIt Tools?

When you first log into GoSeqIt Tools, only the Bacterial Analysis Pipeline (standard or sensitive) is available. In the first step of this pipeline, the bacterial species is predicted based on the number of kmers (nucleotide stretches with the length k) co-occurring between the input genome and a reference database of draft genomes. Further, acquired antimicrobial resistance genes are identified using a BLAST-based approach, where the nucleotide sequence of the input genome is compared to the genes in the ResFinder database. Depending on the identified species, Multilocus Sequence Typing (MLST) is performed, also using a BLAST-based approach. If the input genome is recognised as belonging to Enterobacteriaceae or the Gram-positive bacteria, BLAST is used to search for plasmid replicons using the PlasmidFinder database. Identified plasmids of the IncF, IncHI1, IncHI2, IncI1, IncN, or IncA/C type are further subtyped by plasmid MLST. Finally, genomes identified as Escherichia coli, Enterococcus spp., Staphylococcus aureus, or Listeria spp. are compared to the VirulenceFinder database containing known virulence genes.

It is also possible to design your own pipeline including any of the available methods. Please watch this video tutorial on how to design your own pipeline for bacterial whole genome sequence analysis.

What is the difference between the standard and sensitive versions of the Bacterial Analysis Pipeline?

If you select your data to be analysed with the standard version of the Bacterial Analysis Pipeline, the minimum %ID (percent identity) between a sequence in the input genome and a corresponding allele in one of the databases is the following:
ResFinder: 90%
VirulenceFinder: 90%
PlasmidFinder: 80%
If you want to pick up more distantly related sequences, you should instead choose the sensitive version of the pipeline. Using the sensitive version, the minimum %ID is 60% for all three databases.

How do I examine the results from an analysis?

For a full description of how to examine the results of an analysis, please have a look at this tutorial.

Short recap: When an analysis has finished, you can browse the results while logged into the GoSeqIt Tools platform. Just click the "Show report" link for the input file of interest. Note that for each of the methods except ContigsAnalyzer, it is possible to download extra files containing, for instance, the alignment between the database genes and the corresponding sequences in the input genome. For each report, you can also download the full report in PDF format by clicking the link marked "PDF", or you can download a 1-line tab-separated summary file that can be opened in, e.g., Excel. To do this, click the button marked "TSV" from inside the report page.

Is it possible to download 1-line summary files for more than one report at a time?

Indeed it is! While logged into the GoSeqIt Tools platform, first click "Reports" in the top left. This will lead you to an overview page showing all the reports available. Tick the reports for which you would like a 1-line summary. Once one or more reports are ticked, a link saying "Download summary" will appear at the top of the list of reports. Click this link to download a file with 1-line tab-separated summaries for each report. The file can, for instance, be opened in Excel.

What is the difference between TSV and TSV long (short and long one-line summaries)?

"TSV" gives the short one-line summaries, while "TSV long" gives the long one-line summaries. In the short one-line summaries, all genes found by the ResFinder method are comma-separated and listed in the same field. Similarly, all genes found by the VirulenceFinder method are comma-separated and listed in the same field. If many genes are found by either method, it can be difficult to compare the gene content between two or more isolates. The long one-line summaries make it easier to compare gene content between isolates, since each gene that is present in at least one of the isolates is listed in a separate field. The value for the corresponding isolate is "1" if the particular gene is present in the isolate, and "0" if it is not. The long one-line summaries are only available for customised pipelines that include the ResFinder and/or VirulenceFinder methods.
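To make the difference concrete, the following Python sketch turns comma-separated gene fields (as in the short summaries) into a "1"/"0" presence table (as in the long summaries). The isolate names, the gene lists, and the function long_summary are illustrative assumptions. The actual column layout of the downloaded files may differ.

```python
def long_summary(short_rows):
    """short_rows maps an isolate name to its comma-separated gene field."""
    gene_sets = {
        isolate: {gene.strip() for gene in genes.split(",") if gene.strip()}
        for isolate, genes in short_rows.items()
    }
    # One column per gene that is present in at least one isolate.
    all_genes = sorted(set().union(*gene_sets.values()))
    header = ["isolate"] + all_genes
    rows = [
        [isolate] + ["1" if gene in present else "0" for gene in all_genes]
        for isolate, present in gene_sets.items()
    ]
    return header, rows


# Made-up short-summary fields for two isolates.
header, rows = long_summary(
    {"isolate_A": "blaTEM-1B, sul1", "isolate_B": "sul1, tet(A)"}
)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```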

The difference between the short and long one-line summaries is also explained in the tutorial on examining the results.

Is my data safe?

Industry standard security is used by GoSeqIt Tools both when it comes to transferring and storing your data. That being said, it is important that you only upload de-identified information. Specifically, you should never name your files in such a way that personally identifying information is deducible from the filename.

ContigsAnalyzer (basic statistics)

What is a contig?

We use the term contig to describe a continuous sequence of nucleotides. Usually the bacterial chromosome (or plasmids) is not fully assembled into one, continuous sequence, but rather separated into several “contigs”. In a multientry FASTA file, a contig corresponds to one FASTA entry.

What is the N50 value?

The N50 value is a statistical measure used to describe the quality of a draft assembly. The N50 value is defined as the length of the shortest contig in the set of largest contigs that together constitute at least half of the total assembly size. In general, a high N50 value signifies a high-quality draft assembly.
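The definition above can be written as a few lines of Python. This is a minimal sketch with made-up contig lengths. The function name n50 is hypothetical, and this is not necessarily how ContigsAnalyzer computes the value internally.

```python
def n50(contig_lengths):
    """Length of the shortest contig among the largest contigs that together
    cover at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Made-up contig lengths (in base pairs); total assembly size is 300.
print(n50([100, 80, 60, 40, 20]))
```

In this example the two largest contigs (100 + 80 = 180) are the first to reach at least half of 300, so the N50 is the shorter of the two.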

FimTyper (E. coli FimH typing)

Where can I find more information about the FimTyper method?

Here and in:

Roer L, Tchesnokova V, Allesøe R, Muradova M, Chattopadhyay S, Ahrenfeldt J, Thomsen MCF, Lund O, Hansen F, Hammerum AM, Sokurenko E, Hasman H. Development of a web tool for Escherichia coli sub-typing based on fimH alleles. J Clin Microbiol. 2017. 55(8):2538-2543. PMID: 28592545.

Where do you get the data for the FimTyper method from and when was it last updated?

We get our data for the FimTyper method from the Center for Genomic Epidemiology, Technical University of Denmark (www.genomicepidemiology.org). Please check the News section for when the database was last updated.

What does DB allele/Alignment length mean?

The DB allele length is the length (in basepairs) of the closest matching FimH allele in the FimTyper database, while the alignment length is the length of the alignment between the closest matching FimH allele and the corresponding sequence in the input genome.

PhylotypeFinder (E. coli phylotyping)

Where can I find more information about the E. coli phylo-grouping method?

Here and in:

Clermont O, Christenson JK, Denamur E, Gordon DM. 2013. The Clermont Escherichia coli phylo-typing method revisited: improvement of specificity and detection of new phylo-groups. Environmental Microbiology Reports. 5(1), 58–65. PMID: 23757131

What do the “1”s and “0”s in the table of the phylo-group loci mean?

"1" means that a PCR product for the specific locus is likely to be generated for this isolate, while "0" means a PCR product would likely not be produced. Note that it is not quite the same as the locus being present/not present in the isolate, since mutations in the primer binding sites might prevent a PCR product from being produced. Although PhylotypeFinder in a first step uses BLAST for identifying the loci of the isolates, particular mutations in the primer binding sites might in the second step lead to a "0" for the locus.

Can I use PhylotypeFinder for typing of other species besides E. coli?

No, PhylotypeFinder can only be used with E. coli isolates. To be sure your isolate is E. coli, it is recommended to initially run it through KmerFinder for species identification.

KmerFinder (species prediction)

How is the species predicted?

In short, the species is predicted by chopping the input genome into 15mers (nucleotide stretches with the length 15) and comparing these 15mers to a database of genomes with known species, which have also been chopped into 15mers. The species of the input genome is predicted as the species of the database genome with which the input genome has the most 15mers in common. The method is called KmerFinder and is described in detail in the following two papers:

Hasman H, Saputra D, Sicheritz-Pontén T, Lund O, Svendsen CA, Frimodt-Møller N, Aarestrup FM. Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples. J Clin Microbiol. 2014 Jan;52(1):139-46.

and

Larsen MV, Cosentino S, Lukjancenko O, Saputra D, Rasmussen S, Hasman H, Sicheritz-Pontén T, Aarestrup FM, Ussery DW, Lund O. Benchmarking of methods for genomic taxonomy. J Clin Microbiol. 2014 May;52(5):1529-39.
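The idea of counting shared kmers can be sketched in Python as follows. This is a simplified illustration under several assumptions: it counts unique kmers only, ignores the reverse strand, and uses tiny made-up sequences with k = 4 so the example stays readable. The function names and the reference "database" are hypothetical, and the published KmerFinder method includes details that are not reproduced here.

```python
def kmers(sequence, k=15):
    """Return the set of unique kmers of length k in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def predict_species(input_contigs, reference_genomes, k=15):
    """reference_genomes maps a species name to a list of contig sequences.

    Returns the species whose reference shares the most kmers with the input,
    together with the number of shared kmers.
    """
    query = set()
    for contig in input_contigs:
        query |= kmers(contig, k)
    best_species, best_shared = None, 0
    for species, contigs in reference_genomes.items():
        reference = set()
        for contig in contigs:
            reference |= kmers(contig, k)
        shared = len(query & reference)
        if shared > best_shared:
            best_species, best_shared = species, shared
    return best_species, best_shared


# Toy data: made-up sequences and species labels, k shortened to 4.
toy_references = {"species_A": ["ACGTACGTGG"], "species_B": ["TTTTGGGGCC"]}
print(predict_species(["ACGTACGTGA"], toy_references, k=4))
```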

Is the species prediction based on the 16S rRNA gene?

No, we base the species prediction on overlapping 15mers (nucleotide stretches with the length 15) between the input genome and the genomes in a reference database. In this way, the entire genome, including plasmids, is sampled, not just one gene. This has been found to be a superior approach compared to only looking at the 16S rRNA gene:

Larsen MV, Cosentino S, Lukjancenko O, Saputra D, Rasmussen S, Hasman H, Sicheritz-Pontén T, Aarestrup FM, Ussery DW, Lund O. Benchmarking of methods for genomic taxonomy. J Clin Microbiol. 2014 May;52(5):1529-39.

The predicted species is listed as unknown. Why might that be?

To understand this, you need to understand how we predict the species in the first place. The species is predicted by chopping the input genome into 15mers (nucleotide stretches with the length 15) and comparing these 15mers to a reference database of genomes with known species that have also been chopped into 15mers. The species of the input genome is predicted as the species of the reference genome with which the input genome has the most 15mers in common. The value "reference coverage", which is reported along with the predicted species, is defined as the number of unique kmers in the best-matching reference genome that match kmers in the input genome, divided by the total number of unique kmers in that reference genome. A low reference coverage indicates that the prediction is less likely to be correct. For a reference coverage < 0.50, the species is always predicted as "Unknown". One reason for failing to predict the species is that your file contains, e.g., viral or fungal nucleotide sequences; only bacterial species can be predicted. Another explanation might be that the quality of the input genome is so poor (for instance, the genome is very small) that all matches with reference genomes are statistically insignificant. The final option is that you have a rare bacterial species that is not yet in the reference database.
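The reference coverage and the 0.50 cut-off can be expressed as a short Python sketch. The function names and the toy kmer sets are assumptions made for illustration. Only the definition of reference coverage and the threshold are taken from the text above.

```python
def reference_coverage(input_kmers, reference_kmers):
    """Fraction of the unique reference kmers that also occur in the input genome."""
    if not reference_kmers:
        return 0.0
    return len(reference_kmers & input_kmers) / len(reference_kmers)


def report_species(species, coverage, threshold=0.50):
    """Report the species only if the reference coverage reaches the threshold."""
    return species if coverage >= threshold else "Unknown"


# Toy sets standing in for unique kmers (real kmers would be 15 nucleotides long).
input_kmers = {"k1", "k2", "k3"}
reference_kmers = {"k1", "k2", "k3", "k4"}
coverage = reference_coverage(input_kmers, reference_kmers)
print(coverage, report_species("Escherichia coli", coverage))
```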

Where do you get the data for the species prediction from and when was it last updated?

We get our data for species prediction from the KmerFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (www.genomicepidemiology.org). This database was constructed from an NCBI download of all available whole genomes and draft genomes, and has subsequently been checked and fixed for bad files, genomes, and annotations. Please check the News section for when the database was last updated.

MLST (Multilocus Sequence Typing)

Where can I find more information about the MLST method used by GoSeqIt Tools?

Here and in:

Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Pontén T, Ussery DW, Aarestrup FM, Lund O. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol. 2012 Apr;50(4):1355-61. PMID: 22238442.

I have been running the Bacterial Analysis Pipeline, but there are no Multilocus Sequence Type (MLST) results. Why is that?

MLST is only performed for species that according to http://pubmlst.org/data/ have an MLST scheme. If the species predicted during the Species Prediction step does not have an MLST scheme, MLST is not performed.

Where do you get the data for MLST from and when was it last updated?

All data is downloaded from www.pubmlst.org. Please check the News section to see when the database was last updated.

A “-like” has been added to the identified Sequence Type (ST), e.g., ST11-like. What does it mean?

A "-like" is automatically added to the identified Sequence Type unless all alleles in the input genome match perfectly to a database allele (% Identity = 100 and Alignment Length = DB allele Length).
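The rule above can be written as a small Python sketch. The function name and the example hit values are hypothetical. Only the condition itself (% Identity = 100 and Alignment Length = DB allele Length for every allele) comes from the text.

```python
def label_sequence_type(st, allele_hits):
    """allele_hits is a list of (percent_identity, alignment_length, db_allele_length),
    one tuple per MLST locus."""
    all_perfect = all(
        identity == 100 and alignment_length == db_length
        for identity, alignment_length, db_length in allele_hits
    )
    return f"ST{st}" if all_perfect else f"ST{st}-like"


# Made-up hits: the second locus is not a perfect match, so "-like" is added.
print(label_sequence_type(11, [(100.0, 450, 450), (99.8, 399, 399)]))
```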

I have been analysing an Escherichia coli/Acinetobacter baumannii/Pasteurella multocida/Streptococcus thermophilus or Leptospira genome, but get two/three different Sequence Types. Why?

For some species, more than one MLST scheme has been developed:

The Acinetobacter baumannii scheme #1 (also called the Oxford scheme) samples the following genes:

Oxf_gltA, Oxf_gyrB, Oxf_gdhB, Oxf_recA, Oxf_cpn60, Oxf_gpi, Oxf_rpoD

and is described in:
Bartual et al. Development of a multilocus sequence typing scheme for characterisation of clinical isolates of Acinetobacter baumannii. J Clin Microbiol 2005, 43:4382-4390

The Acinetobacter baumannii scheme #2 (also called the Pasteur scheme) samples the following genes:

Pas_cpn60, Pas_fusA, Pas_gltA, Pas_pyrG, Pas_recA, Pas_rplB, Pas_rpoB

and is described in:
Diancourt et al. The population structure of Acinetobacter baumannii: expanding multiresistant clones from an ancestral susceptible genetic pool. PLoS One 2010, 5:e10034

The Escherichia coli scheme #1 (Achtman scheme) samples the following genes: adk, fumC, gyrB, icd, mdh, purA, recA

and is described in:

Wirth et al. Sex and virulence in Escherichia coli: an evolutionary perspective. Mol. Microbiol. 2006: 60(5); 1136-1151

The Escherichia coli scheme #2 (Pasteur scheme) samples the following genes: dinB, icdA, pabB, polB, putP, trpA, trpB, uidA

and is described in:

Jaureguy et al. Phylogenetic and genomic diversity of human bacteremic Escherichia coli strains. 2008. BMC Genomics 9:560

The Pasteurella multocida scheme #1 (RIRDC scheme) samples the following genes:

adk, est, pmi, zwf, mdh, gdh, pgi

and was developed by Robert Davies, University of Glasgow, UK.

The Pasteurella multocida scheme #2 (multihost scheme) samples the following genes:

adk, aroA, deoD, gdhA, g6pd, mdh, pgi

and was developed by Sounthi Subaaharan and Pat Blackall (Animal Research Institute, Australia).

The Leptospira spp. scheme #1 samples the following genes:

glmU, pntA, sucA, tpiA, pfkB, mreA, caiB

and is described in:

Boonsilp et al. 2013, PLoS Negl Trop Dis 7:e1954

The Leptospira scheme #2 samples the following genes:

adk_2, glmU_2, icdA_2, lipL32_2, lipL41_2, mreA_2, pntA_2

and is described in:

Varni et al. 2014, Infect Genet Evol 22:216-22

The Leptospira scheme #3 samples the following genes:

adk_3, icdA_3, lipL41_3, rrs2_3, secY_3, lipL32_3

and is described in:

Ahmed et al. 2006, Ann Clin Microbiol Antimicrob 23:28

The Streptococcus thermophilus scheme #1 (Pasteur scheme) samples the following genes:

ddlA, glcK, proA, ptsI, serB, tkt

The Streptococcus thermophilus scheme #2 (Yu scheme) samples the following genes:

carB, clpX, dnaA, murC, murE, pepN, pepX, pyrG, recA, rpoB

What does the % Identity mean?

The % Identity is the percentage of nucleotides that are identical between the best-matching database MLST allele and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.
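As a rough illustration, % Identity can be computed from two aligned sequences of equal length as in the Python sketch below. The function name and the example sequences are made up. Treating gap columns ("-") as mismatches and dividing by the full alignment length are assumptions about how the value is derived.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns where both sequences carry the same nucleotide."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Made-up alignment with one mismatch in eight columns.
print(percent_identity("ACGTACGT", "ACGTACGA"))
```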

What does the Alignment Length mean?

We use BLAST to identify the regions in the input genome that match the alleles in the MLST database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and an allele in the MLST database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.

The Alignment Length does not match the DB allele Length. Should I be concerned?

The Alignment Length corresponds to the length of the local alignment between the best-matching database MLST allele and the corresponding sequence in the input genome. If the Alignment Length is shorter than the database (DB) allele length, it means that the sequence of the entire MLST allele could not be found in the input sequence. Typically, if the lengths only differ by a few nucleotides, it could be due to a hitherto unseen and hence unregistered allele in your input genome. Larger differences might be caused by a poor assembly in which the MLST gene is split across two contigs. You might want to take a look at the alignment file (download the file "Results including alignments (txt)") to decide which is the likely explanation.

What do the different colours of the rows in the MLST table signify?

Green signifies perfect matches between the best-matching MLST allele and a sequence in the input genome (%ID = 100, Alignment Length = DB allele Length).
Orange means that the entire length of the MLST allele is covered by a corresponding sequence in the input sequence, but some nucleotides differ (%ID < 100, Alignment Length = DB allele Length).
Grey means that only part of the MLST allele is covered by a corresponding sequence in the input genome (Alignment Length != DB allele Length, %ID = 100 or %ID < 100).
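The three colour rules can be summarised in a short Python sketch. The function name row_colour and the example values are hypothetical. The conditions are the ones listed above.

```python
def row_colour(percent_identity, alignment_length, db_allele_length):
    """Map one table row to its colour using the three rules listed above."""
    if alignment_length != db_allele_length:
        return "grey"    # only part of the allele is covered
    if percent_identity == 100:
        return "green"   # perfect match
    return "orange"      # full length covered, but some nucleotides differ


# Made-up rows: (%ID, Alignment Length, DB allele Length).
for row in [(100.0, 450, 450), (99.3, 450, 450), (100.0, 430, 450)]:
    print(row, row_colour(*row))
```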

The Sequence Type (ST) is reported as Unknown, although all MLST alleles are perfect matches. How could that be?

The Sequence Type is determined by the specific combination of registered MLST alleles according to www.pubmlst.org. Even though all alleles in the input genome match perfectly to alleles in the MLST database, the combination of alleles might be previously unseen, and hence unregistered.
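Conceptually, determining the Sequence Type is a lookup of the allele combination in a table of registered profiles, as in the Python sketch below. The profile table, the allele numbers, and the function name are made up for illustration and do not come from www.pubmlst.org.

```python
def lookup_sequence_type(allele_numbers, profiles):
    """profiles maps a tuple of allele numbers (one per locus, in a fixed order)
    to a registered Sequence Type."""
    return profiles.get(tuple(allele_numbers), "Unknown")


# Hypothetical profile table with a single registered combination.
toy_profiles = {(1, 1, 2, 1, 3, 1, 1): 11}

print(lookup_sequence_type([1, 1, 2, 1, 3, 1, 1], toy_profiles))
# Every allele number below could be a perfect match to a known allele,
# but the combination is not registered, so the ST is reported as Unknown.
print(lookup_sequence_type([1, 1, 2, 1, 3, 1, 2], toy_profiles))
```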

Is it possible to see the actual alignment between the best-matching MLST alleles and the corresponding sequences in the input genome?

Indeed it is! Just download the file marked "Results including alignments (txt)" below the MLST table.

PlasmidFinder (plasmid replicon identification)

I have been running the Bacterial Analysis Pipeline, but cannot find the output from PlasmidFinder anywhere. Why is that?

Plasmid replicons are only identified for Enterobacteriaceae and Gram-positive bacteria. Missing output from PlasmidFinder is likely due to your input genome not being identified as belonging to either of these groups during the species prediction step.

Where can I find more information about the PlasmidFinder method?

Here and in:

Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 2014 Jul;58(7):3895-903. PMID: 24777092.

Where do you get the data for the PlasmidFinder database from and when was it last updated?

We get the PlasmidFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.

How similar does a database plasmid replicon sequence have to be to a sequence in the input genome for the replicon to be reported?

If you are running the standard version of the Bacterial Analysis Pipeline, the % Identity (the percentage of identical nucleotides) must be at least 80% for a database replicon to be reported. Further, the Alignment Length (the length of the local alignment between the database replicon and a corresponding sequence in the input genome) must be at least 60% of the length of the database replicon. If you are running the sensitive version of the Bacterial Analysis Pipeline, the % Identity must be at least 60% for a database replicon to be reported, and the Alignment Length still at least 60% of the length of the database replicon.
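The reporting rule can be sketched in Python as shown below. The minimum % Identity values per database and pipeline version are the ones given in this FAQ. The function name is hypothetical, and applying the 60% alignment-length rule to VirulenceFinder is an assumption, since the FAQ states it only for PlasmidFinder and ResFinder.

```python
# Minimum % Identity per pipeline version and database, as listed in this FAQ.
MIN_IDENTITY = {
    "standard": {"ResFinder": 90.0, "VirulenceFinder": 90.0, "PlasmidFinder": 80.0},
    "sensitive": {"ResFinder": 60.0, "VirulenceFinder": 60.0, "PlasmidFinder": 60.0},
}
# Minimum alignment length as a percentage of the database sequence length.
# Assumption: the same value is used for VirulenceFinder.
MIN_COVERAGE_PERCENT = 60


def is_reported(database, pipeline, percent_identity, alignment_length, db_length):
    """Return True if a hit passes both the identity and the length criterion."""
    identity_ok = percent_identity >= MIN_IDENTITY[pipeline][database]
    # Integer arithmetic avoids rounding issues at exactly 60%.
    coverage_ok = 100 * alignment_length >= MIN_COVERAGE_PERCENT * db_length
    return identity_ok and coverage_ok


# Made-up hit: 85% identity over 600 of 1000 base pairs.
print(is_reported("PlasmidFinder", "standard", 85.0, 600, 1000))
print(is_reported("ResFinder", "standard", 85.0, 600, 1000))
```

With the made-up hit above, the replicon would be reported by PlasmidFinder in the standard pipeline, while the same values would fall below the 90% identity required for ResFinder.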

What do the different colours of the rows in the PlasmidFinder table signify?

Green signifies perfect matches between the database replicon sequence and a sequence in the input genome (%Identity=100, Alignment Length = DB allele Length).
Orange means that the entire length of the database replicon sequence is covered by a corresponding sequence in the input sequence, but some nucleotides differ (%ID < 100, Alignment Length = DB allele Length).
Grey means that only part of the database replicon sequence is covered by a corresponding sequence in the input genome (Alignment Length != DB allele Length, %ID = 100 or %ID < 100).

What does the % Identity mean?

The % Identity is the percentage of nucleotides that are identical between the database replicon sequence and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

What does Alignment Length mean?

We use BLAST to identify the regions in the input genome that match the replicons in the PlasmidFinder database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and a replicon in the PlasmidFinder database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.

Is it possible to see the actual alignment between the database replicon sequence and the corresponding sequences in the input genome?

Indeed it is! Just download the file marked "Results including alignments (txt)" below the PlasmidFinder table.

pMLST (plasmid Multilocus Sequence Typing)

Where can I find more information about the pMLST method?

Here and in:

Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 2014 Jul;58(7):3895-903. PMID: 24777092.

Where do you get the data for pMLST from and when was it last updated?

We get all data from www.pubmlst.org. Please check the News section for when the database was last updated.

I have been running the Bacterial Analysis Pipeline, but there are no Plasmid Multilocus Sequence Type (pMLST) results. Why is that?

pMLST is only performed if plasmids of the type IncN, IncHI1, IncHI2, IncI1, IncF, or IncA/C have been identified in the input file by PlasmidFinder.

A “-like” has been added to the identified Sequence Type (ST). What does it mean?

A "-like" is automatically added to the identified Sequence Type unless all alleles in the input genome match perfectly to a database allele (% Identity = 100 and Alignment Length = DB allele Length).

What do the different colours of the rows in the pMLST table signify?

Green signifies perfect matches between the best-matching pMLST allele and a sequence in the input genome (%ID = 100, Alignment Length = DB allele Length).
Orange means that the entire length of the pMLST allele is covered by a corresponding sequence in the input sequence, but some nucleotides differ (%ID < 100, Alignment Length = DB allele Length).
Grey means that only part of the pMLST allele is covered by a corresponding sequence in the input genome (Alignment Length != DB allele Length, %ID = 100 or %ID < 100).

What does the % Identity mean?

The % Identity is the percentage of nucleotides that are identical between the best-matching database pMLST allele and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

Is it possible to see the actual alignment between the best-matching pMLST alleles and the corresponding sequences in the input genome?

Indeed it is! Just download the file marked "Results including alignments (txt)" below the pMLST table.

What does Alignment Length mean?

We use BLAST to identify the regions in the input genome that match the pMLST alleles in the pMLST database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and an allele in the pMLST database. The Alignment Length is the length of the alignment measured in basepairs.

PointFinder (chromosomal point mutations causing antimicrobial resistance)

Where can I find more information about the PointFinder method?

Here and in:

Zankari E., Allesøe R., Joensen K.G., Cavaco L.M., Lund O., Aarestrup F.M. PointFinder: a novel web tool for WGS-based detection of antimicrobial resistance associated with chromosomal point mutations in bacterial pathogens. J Antimicrob Chemother. 2017 Oct 1;72(10):2764-2768.

What is the difference between known and unknown point mutations?

For each of the species for which PointFinder is available, a number of genes are examined for point mutations previously found to make isolates harbouring the mutations resistant towards particular antimicrobial agents. If any of these point mutations are found in the genes of the input genome, they are designated "known" mutations. If the same genes contain other, previously uncharacterised point mutations with an unknown effect on the phenotype, they are designated "unknown" mutations. Note that for the unknown mutations, it is not possible to say whether or not they make the input isolate resistant to antimicrobial agents.

Where do you get the data for the PointFinder database from and when was it last updated?

We get the PointFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.

Why are the identified point mutations sometimes associated with an amino acid change and sometimes not?

If a point mutation is found in a ribosomal RNA (rRNA) gene or a promoter region, it is not associated with an amino acid change, since these regions are not translated into protein.

ResFinder (acquired antimicrobial resistance genes)

Where can I find more information about the ResFinder method?

Here and in:

Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. 2012 Nov;67(11):2640-4. PMID: 22782487.

and

Zankari E, Hasman H, Kaas RS, Seyfarth AM, Agersø Y, Lund O, Larsen MV, Aarestrup FM. Genotyping using whole-genome sequencing is a realistic alternative to surveillance based on phenotypic antimicrobial susceptibility testing. J Antimicrob Chemother. 2013 Apr;68(4):771-7. PMID: 23233485.

Where do you get the data for ResFinder from and when was it last updated?

We get the ResFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.

Are there any bacterial species that should not be/are not analysed by ResFinder?

No, all bacterial species can be analysed by the ResFinder method. The Bacterial Analysis Pipeline will search for resistance genes in all input files no matter which species the Species Prediction step predicts, and even if the species is predicted as unknown.

How similar does a database resistance gene have to be to a sequence in the input genome for the gene to be reported?

If you are running the standard version of the Bacterial Analysis Pipeline, the % Identity (the percentage of identical nucleotides) must be at least 90% for a database resistance gene to be reported. Further, the Alignment Length (the length of the local alignment between the database resistance gene and a corresponding sequence in the input genome) must be at least 60% of the length of the database resistance gene. If you are running the sensitive version of the Bacterial Analysis Pipeline, the % Identity must be at least 60% for a database resistance gene to be reported, and the Alignment Length still at least 60% of the length of the database resistance gene.

What do the different colours of the rows in the ResFinder table signify?

Green signifies a perfect match between the database resistance gene and a sequence in the input genome (% Identity = 100, Alignment Length = DB allele Length).
Orange means that the entire length of the database resistance gene is covered by a corresponding sequence in the input genome, but some nucleotides differ (% Identity < 100, Alignment Length = DB allele Length).
Grey means that only part of the database resistance gene is covered by a corresponding sequence in the input genome (Alignment Length differs from DB allele Length; % Identity may be 100 or lower).
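As a rough sketch, the colour rules above amount to the following decision. The function name row_colour is hypothetical and the code only mirrors the conditions given in this answer.

```python
def row_colour(identity, alignment_length, db_allele_length):
    """Map a hit to the row colour described in the FAQ (illustrative only)."""
    full_length = alignment_length == db_allele_length
    if full_length and identity == 100:
        return "green"   # perfect match
    if full_length:
        return "orange"  # full length covered, but some nucleotides differ
    return "grey"        # only part of the database gene is covered


print(row_colour(100, 861, 861))   # green
print(row_colour(99.5, 861, 861))  # orange
print(row_colour(100, 600, 861))   # grey
```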

What does the % Identity mean?

The % Identity is the percentage of nucleotides that are identical between the resistance gene in the ResFinder database and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

What does Alignment Length mean?

We use BLAST to identify the regions in the input genome that match the alleles in the ResFinder database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and an allele in the ResFinder database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.
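To make the two columns concrete, the sketch below derives Alignment Length and % Identity from a pair of already aligned sequences. It assumes the usual BLAST convention that the alignment length counts every aligned column, including gaps written as "-"; the function name alignment_stats is hypothetical and this is not how the platform itself computes the values.

```python
def alignment_stats(aligned_genome, aligned_allele):
    """Return (alignment length, % identity) for two aligned sequences.

    Both strings must have the same length; gaps are written as '-'.
    """
    if len(aligned_genome) != len(aligned_allele):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_genome)
    # A column counts as identical only if both sides carry the same nucleotide.
    matches = sum(
        1 for g, a in zip(aligned_genome, aligned_allele) if g == a and g != "-"
    )
    identity = 100.0 * matches / length if length else 0.0
    return length, identity


# One mismatch in an 8 bp alignment gives 7/8 identical positions.
print(alignment_stats("ACGTACGT", "ACGTACGA"))  # (8, 87.5)
```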

Is it possible to see the actual alignment between the resistance gene in the ResFinder database and the corresponding sequences in the input genome?

Indeed it is! Just download the file marked "Results including alignments (txt)" below the table of identified resistance genes.

SerotypeFinder (E. coli serotyping)

Where can I find more information about the SerotypeFinder method?

Here and in:

Joensen KG, Tetzschner AM, Iguchi A, Aarestrup FM, Scheutz F. Rapid and easy in silico serotyping of Escherichia coli using whole genome sequencing (WGS) data. J Clin Microbiol. 2015;53(8):2410-2426. doi: 10.1128/JCM.00008-15.

Can I use SerotypeFinder for other species besides E. coli?

Unfortunately not, but we are planning to add methods for serotyping of additional species besides E. coli. First in line is Salmonella.

Where do you get the data for SerotypeFinder from and when was it last updated?

We get the SerotypeFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.

What does the % Identity mean?

The % Identity is the percentage of nucleotides that are identical between the gene in the SerotypeFinder database and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

What does DB allele/Alignment Length mean?

The DB allele length is the length (in basepairs) of the closest matching allele in the SerotypeFinder database, while the alignment length is the length of the alignment between the closest matching allele and the corresponding sequence in the input genome.

VirulenceFinder (virulence factor identification)

Where do you get the data for the VirulenceFinder database from and when was it last updated?

We get the VirulenceFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.

I have been running the Bacterial Analysis Pipeline, but cannot find the output from VirulenceFinder anywhere. Why is that?

Virulence genes are only identified for Escherichia coli, Staphylococcus aureus, Listeria, and Enterococcus. Missing output from VirulenceFinder is likely due to your input genome not being identified as one of these species/genera during the species prediction step.
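The species restriction can be pictured as a simple check on the predicted species name. The sketch below is an assumption about how such a check might look; the platform's real matching logic and the exact name strings it uses are not documented here, and VIRULENCE_TAXA and runs_virulencefinder are hypothetical names.

```python
# Species and genera listed in the FAQ answer above (illustrative only).
VIRULENCE_TAXA = (
    "Escherichia coli",
    "Staphylococcus aureus",
    "Listeria",
    "Enterococcus",
)


def runs_virulencefinder(predicted_species):
    """Return True if the predicted species falls under a supported taxon."""
    return any(
        predicted_species == taxon or predicted_species.startswith(taxon + " ")
        for taxon in VIRULENCE_TAXA
    )


print(runs_virulencefinder("Listeria monocytogenes"))  # True
print(runs_virulencefinder("Salmonella enterica"))     # False
```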

How similar does a database virulence gene have to be to a sequence in the input genome for the gene to be reported when running the Bacterial Analysis Pipeline?

If you are running the standard version of the Bacterial Analysis Pipeline, the % Identity (the percentage of identical nucleotides) must be at least 90% for a database virulence gene to be reported. Further, the Alignment Length (the length of the local alignment between the database virulence gene and a corresponding sequence in the input genome) must be at least 60% of the length of the database virulence gene. If you are running the sensitive version of the Bacterial Analysis Pipeline, the % Identity must be at least 60% for a database virulence gene to be reported, and the Alignment Length still at least 60% of the length of the database virulence gene.

What do the different colours of the rows in the VirulenceFinder table signify?

Green signifies a perfect match between the database virulence gene and a sequence in the input genome (% Identity = 100, Alignment Length = DB allele Length).
Orange means that the entire length of the database virulence gene is covered by a corresponding sequence in the input genome, but some nucleotides differ (% Identity < 100, Alignment Length = DB allele Length).
Grey means that only part of the database virulence gene is covered by a corresponding sequence in the input genome (Alignment Length differs from DB allele Length; % Identity may be 100 or lower).

What does the % Identity mean?

The % Identity is the percentage of nucleotides that are identical between the database virulence gene and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

What does Alignment Length mean?

We use BLAST to identify the regions in the input genome that match the alleles in the VirulenceFinder database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and an allele in the VirulenceFinder database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.

Is it possible to see the actual alignment between the identified virulence genes and the corresponding sequences in the input genome?

Indeed it is! Just download the file marked "Results including alignments (txt)" below the VirulenceFinder table.