GoSeqIt Tools FAQ

  • General
    Here you can find general questions and answers about the GoSeqIt Tools platform.
  • What is the advantage of using the GoSeqIt Tools web-platform as opposed to running locally installed software?

    We see several advantages in using a web-platform instead of locally installed software.

    1. The GoSeqIt Tools web-platform is backed by the computing power of Amazon Cloud Servers. This means you do not have to set up a local server, which is both expensive and time-consuming, or run the analysis on your own laptop, which is likely to be slow. If you upload many files to the GoSeqIt Tools web-platform at the same time, more computing power is added to the web-platform, so you should not have to wait long for your results.
    2. GoSeqIt Tools can run from any computer as long as it has an internet connection. You can effortlessly access your account and work on your data from work, at home, or when you are travelling.
    3. You do not have to worry about updating the software or the underlying databases. Both are always up to date.
    4. Help is never far away. We provide support within 1-2 days if you send an email to mail@goseqit.com.
  • Which type of data can be analysed using the GoSeqIt Tools platform?

    You can analyse whole genome sequence data from bacteria. The sequence data must be pre-assembled into draft genomes before you upload it to the platform. The data must be in the form of FASTA files, which are simple text files with the extension .fasta, .fa, .fsa, or .fna. FASTA files are built up from FASTA entries, where each entry consists of a header line beginning with ">" followed by a unique name/identifier and some optional descriptive text. The remaining lines consist of the DNA sequence. Below, you can see an example of the content of a FASTA file with two FASTA entries:
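
    As an illustration, here is a made-up two-entry FASTA file embedded in a short Python sketch that also shows how such a file splits into entries. The entry names and sequences are invented, and `parse_fasta` is a hypothetical helper, not part of the platform.

```python
# Made-up content of a FASTA file with two entries (illustrative only).
EXAMPLE_FASTA = """>contig_1 example description
ATGCGTACGTTAGC
GGCTAAGCTT
>contig_2
TTGACCGATA
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    entries = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines belong to the current entry
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


for name, sequence in parse_fasta(EXAMPLE_FASTA):
    print(name, len(sequence))
```

    The header line of the first entry carries both an identifier and optional descriptive text; the second entry has an identifier only.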

  • Are there any rules one should consider when naming the FASTA files before uploading them to the GoSeqIt Tools platform?

    In general, it is advisable to restrict the names of the FASTA files to only contain the following characters:

    a-z

    A-Z

    0-9

    -

    _

    .

    That being said, the tools will automatically convert other characters to an underscore, "_", to avoid problems when running the methods. If you still get an unexpected result, try removing any characters not listed above from the file name. Also check that your file contains sequence data in FASTA format (see the question "Which type of data can be analysed using the GoSeqIt Tools platform?"). The file extension has to be either .fa, .fna, .fsa, or .fasta.
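
    If you want to check or clean file names before uploading, the rule above can be sketched in Python as follows. This is an illustration only, not the platform's own code; the function names are hypothetical, and the extension check assumes lower-case extensions exactly as listed.

```python
import re

# Extensions listed in this FAQ (assumed to be matched in lower case).
ALLOWED_EXTENSIONS = (".fa", ".fna", ".fsa", ".fasta")


def sanitize_name(filename):
    """Replace every character outside a-z, A-Z, 0-9, '-', '_' and '.' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", filename)


def has_fasta_extension(filename):
    """Return True if the name ends with one of the accepted extensions."""
    return filename.endswith(ALLOWED_EXTENSIONS)


print(sanitize_name("my sample (1).fasta"))
print(has_fasta_extension("genome.txt"))
```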

     

  • Is it possible to download 1-line summary files for more than one report at a time?

    Indeed it is! While logged into the GoSeqIt Tools platform, first click "Reports" in the top left. This will lead you to an overview page showing all the reports available. Tick the reports for which you would like a 1-line summary. Once one or more reports are ticked, a link saying "Download summary" will appear at the top of the list of reports. Click this link to download a file with 1-line tab-separated summaries for each report. The file can, for instance, be opened in Excel.
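
    If you prefer a script to a spreadsheet, a tab-separated summary file can also be read with a few lines of Python. This is a minimal sketch: `read_summary` is a hypothetical helper, and the sample content below is invented, since the actual columns of the summary file are not described here.

```python
import csv
import io


def read_summary(handle):
    """Return the non-empty rows of a tab-separated file as lists of strings."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


# Invented stand-in for a downloaded summary file with one line per report.
sample = io.StringIO("report_A\tvalue_1\tvalue_2\nreport_B\tvalue_3\tvalue_4\n")
for row in read_summary(sample):
    print(row)
```

    For a real download you would pass an open file instead, for example `open(path, newline="")`, where `path` points to the downloaded file.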

  • Is it possible to upload raw/short sequence reads to the platform?

    No, this is currently not possible. Your files have to contain pre-assembled draft genomes in FASTA format. We plan, however, that later versions of the platform will be able to accept raw/short sequence reads, which we will then assemble into draft genomes for you.

  • Which type of analysis is performed when uploading data to the GoSeqIt Tools platform?

    The Bacterial Analysis Pipeline is run. In the first step of the pipeline, the bacterial species is predicted based on the number of kmers (nucleotide stretches with the length k) co-occurring between the input genome and a reference database of draft genomes. Next, acquired antimicrobial resistance genes are identified using a BLAST-based approach, where the nucleotide sequence of the input genome is compared to the genes in the ResFinder database. Depending on the identified species, Multilocus Sequence Typing (MLST) is performed, also using a BLAST-based approach. If the input genome is recognised as belonging to Enterobacteriaceae or the Gram-positive bacteria, BLAST is used to search for plasmid replicons using the PlasmidFinder database. Identified plasmids of the IncF, IncHI1, IncHI2, IncI1, IncN, or IncA/C type are further subtyped by plasmid MLST. Finally, identified Escherichia coli, Enterococcus spp., Staphylococcus aureus, and Listeria spp. are compared to the VirulenceFinder database containing known virulence genes.
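
    The way the steps depend on the predicted species can be summarised in a small Python sketch. It is a simplification for illustration, not the pipeline's code: `planned_steps` and its two boolean arguments are hypothetical, and plasmid MLST is left out because it depends on which replicons PlasmidFinder identifies.

```python
# Species and genera that are compared to the VirulenceFinder database.
VIRULENCE_SPECIES = ("Escherichia coli", "Staphylococcus aureus")
VIRULENCE_GENERA = ("Enterococcus", "Listeria")


def planned_steps(species, has_mlst_scheme, in_plasmid_group):
    """List the pipeline steps that would run for a predicted species.

    has_mlst_scheme:  True if an MLST scheme exists for the species.
    in_plasmid_group: True if the genome is recognised as Enterobacteriaceae
                      or Gram-positive bacteria.
    """
    steps = ["Species prediction", "ResFinder"]  # always run
    if has_mlst_scheme:
        steps.append("MLST")
    if in_plasmid_group:
        steps.append("PlasmidFinder")
    genus = species.split(" ")[0]
    if species in VIRULENCE_SPECIES or genus in VIRULENCE_GENERA:
        steps.append("VirulenceFinder")
    return steps


print(planned_steps("Escherichia coli", True, True))
print(planned_steps("Unknown", False, False))
```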

  • How do I upload my input files to the GoSeqIt Tools platform?

    This is easy! You either drag-and-drop your files from your computer to the field marked "Drop files" (upper right side) or select the files you want to have analysed by clicking the link "browse your computer", which is located just below the field marked "Drop files". If you do this while inside a specific project, the files and results will automatically be placed inside that project when you click "Upload and analyse...". Otherwise, you have to select whether the files and results should be placed into an existing or a new project before clicking "Upload and analyse...". Note that it is only possible to upload files with the extension .fa, .fna, .fsa, or .fasta.

  • What is the difference between the standard and sensitive versions of the Bacterial Analysis Pipeline?

    If you select your data to be analysed with the standard version of the Bacterial Analysis Pipeline, the minimum %ID (percent identity) between a sequence in the input genome and a corresponding allele in one of the databases is the following:
    ResFinder: 90%
    VirulenceFinder: 90%
    PlasmidFinder: 80%
    If you want to pick up more distantly related sequences, you should instead choose the sensitive version of the pipeline. Using the sensitive version, the minimum %ID is 60% for all three databases.
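
    The thresholds above can be written as a small lookup table. The following Python sketch is only an illustration of how the two versions differ; the names `MIN_IDENTITY` and `passes_identity` are hypothetical.

```python
# Minimum %ID per pipeline version and database, as listed above.
MIN_IDENTITY = {
    "standard": {"ResFinder": 90, "VirulenceFinder": 90, "PlasmidFinder": 80},
    "sensitive": {"ResFinder": 60, "VirulenceFinder": 60, "PlasmidFinder": 60},
}


def passes_identity(version, database, percent_identity):
    """Return True if a hit reaches the minimum %ID for the chosen version."""
    return percent_identity >= MIN_IDENTITY[version][database]


# A hit with 85 %ID against a ResFinder gene is below the standard
# threshold but above the sensitive one.
print(passes_identity("standard", "ResFinder", 85))
print(passes_identity("sensitive", "ResFinder", 85))
```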

  • I try to drop a file into the "Drop files" box, but nothing happens. Why?

    Make sure that your files have the extension .fa, .fna, .fsa, or .fasta.

    Tip: Windows generally hides known file extensions in file names. To disable this, go to 'Control panel' > 'Appearance and Personalization' > 'Folder options'. A new window will open, click on the 'view' tab and uncheck the box 'Hide extensions for known file types'.

  • How do I access the results from an analysis?

    When an analysis has finished, a summary report will automatically be mailed to the email address connected to your account. You can also browse the results while logged into the GoSeqIt Tools platform. Just click the "Show report" link for the input file of interest. Note that for each of the methods (Species prediction, MLST, ResFinder, PlasmidFinder, pMLST, and VirulenceFinder) it is possible to download extra files not presented in the summary report mailed to you. The extra files contain, for instance, the alignment between the database genes and the corresponding sequences in the input genome. For each report you can also download a 1-line tab-separated summary file that can be opened in, e.g., Excel. To do this, click the button marked "TSV" from inside the report page.

  • Is my data safe?

    Industry standard security is used by GoSeqIt Tools both when it comes to transferring and storing your data. That being said, it is important that you only upload de-identified information. Specifically, you should never name your files in such a way that personally identifying information is deducible from the filename.

     

  • Contigs Analysis
    Here you can find questions and answers related to the Contigs Analysis.
  • What is a contig?

    We use the term contig to describe a continuous sequence of nucleotides. Usually the bacterial chromosome (or plasmids) is not fully assembled into one, continuous sequence, but rather separated into several “contigs”. In a multientry FASTA file, a contig corresponds to one FASTA entry.

  • What is the N50 value?

    The N50 value is a statistical measure used to describe the quality of a draft assembly. The N50 value is defined as the length of the shortest contig in the set of largest contigs that together constitute at least half of the total assembly size. In general, a high N50 value signifies a high-quality draft assembly.
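
    The definition can be turned into a few lines of Python. This is a minimal sketch that follows the definition above; `n50` is a hypothetical helper that takes a list of contig lengths.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches at least half of
    the total assembly size.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("at least one contig is required")


# Total size is 290; the two largest contigs (80 + 70 = 150) already cover
# at least half of it, so the N50 is the shorter of those two.
print(n50([80, 70, 50, 40, 30, 20]))
```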

  • Species prediction
    Here you can find questions and answers related to the Species Prediction.
  • How is the species predicted?

    In short, the species is predicted by chopping the input genome into 15mers (nucleotide stretches with the length 15) and comparing these 15mers to a database of genomes with known species, which have also been chopped into 15mers. The species of the input genome is predicted as the species of the database genome with which the input genome has the most 15mers in common. The method is called KmerFinder and is described in detail in the following two papers:

    Hasman H, Saputra D, Sicheritz-Pontén T, Lund O, Svendsen CA, Frimodt-Møller N, Aarestrup FM. Rapid whole-genome sequencing for detection and characterization of microorganisms directly from clinical samples.  J Clin Microbiol. 2014 Jan;52(1):139-46.

    and

    Larsen MV, Cosentino S, Lukjancenko O, Saputra D, Rasmussen S, Hasman H, Sicheritz-Pontén T, Aarestrup FM, Ussery DW, Lund O. Benchmarking of methods for genomic taxonomy.  J Clin Microbiol. 2014 May;52(5):1529-39.
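
    The idea of counting shared kmers can be shown with a toy Python sketch. It is not the KmerFinder implementation: it assumes that unique kmers are counted, ignores reverse-complement strands, and uses invented sequences; `kmers` and `predict_species` are hypothetical helpers.

```python
def kmers(sequence, k=15):
    """Return the set of unique kmers (substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def predict_species(input_contigs, references, k=15):
    """Pick the reference species sharing the most unique kmers with the input.

    references maps a species name to a list of sequences for that genome.
    Returns (species, number_of_shared_kmers); species is None if nothing
    is shared.
    """
    query = set()
    for contig in input_contigs:
        query |= kmers(contig, k)
    best_species, best_shared = None, 0
    for species, sequences in references.items():
        reference = set()
        for sequence in sequences:
            reference |= kmers(sequence, k)
        shared = len(query & reference)
        if shared > best_shared:
            best_species, best_shared = species, shared
    return best_species, best_shared


# Tiny invented example with k=3 so the kmers are easy to follow by eye.
toy_references = {"species A": ["ACGTAA"], "species B": ["TTTTAC"]}
print(predict_species(["ACGTAC"], toy_references, k=3))
```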

  • Is the species prediction based on the 16S rRNA gene?

    No, we base the species prediction on overlapping 15mers (nucleotide stretches with the length 15) between the input genome and the genomes in a reference database. In this way, the entire genome, including plasmids, is sampled, not just one gene. This has been found to be a superior approach compared to only looking at the 16S rRNA gene:

     

    Larsen MV, Cosentino S, Lukjancenko O, Saputra D, Rasmussen S, Hasman H, Sicheritz-Pontén T, Aarestrup FM, Ussery DW, Lund O. Benchmarking of methods for genomic taxonomy. J Clin Microbiol. 2014 May;52(5):1529-39.

  • The predicted species is listed as unknown. Why might that be?

    To understand this, it helps to know how the species is predicted in the first place. The input genome is chopped into 15mers (nucleotide stretches with the length 15), and these 15mers are compared to a reference database of genomes with known species that have also been chopped into 15mers. The species of the input genome is predicted as the species of the reference genome with which the input genome has the most 15mers in common. The value "reference coverage", which is reported along with the predicted species, is defined as the number of unique kmers in the best-matching reference genome that match kmers in the input genome, divided by the total number of unique kmers in the reference genome. A low reference coverage indicates that the prediction is less likely to be correct. For a reference coverage < 0.50 the species is always predicted as "Unknown". One reason the prediction can fail is that your file contains, for example, viral or fungal nucleotide sequences; only bacterial species can be predicted. Another explanation might be that the quality of the input genome is so poor (for instance, the genome is very small) that all matches with reference genomes are statistically insignificant. The final option is that you have a rare bacterial species that is not yet in the reference database.
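
    The reference coverage and the 0.50 cut-off can be written out as a short Python sketch. It follows the definition above and works on sets of unique kmers; the function names and the example kmers are hypothetical and for illustration only.

```python
def reference_coverage(input_kmers, reference_kmers):
    """Fraction of the unique reference kmers that also occur in the input."""
    if not reference_kmers:
        return 0.0
    return len(reference_kmers & input_kmers) / len(reference_kmers)


def reported_species(best_species, coverage, threshold=0.50):
    """Report "Unknown" when the reference coverage is below the threshold."""
    return best_species if coverage >= threshold else "Unknown"


# Invented 3mers: two of the four unique reference kmers occur in the input.
input_set = {"AAA", "AAC", "ACG"}
reference_set = {"AAA", "AAC", "CCC", "GGG"}
coverage = reference_coverage(input_set, reference_set)
print(coverage, reported_species("species A", coverage))
```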

  • Where do you get the data for the Species Prediction from and when was it last updated?

    We get our data for Species Prediction from the KmerFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (www.genomicepidemiology.org). This database was constructed from an NCBI download of all available whole genomes and draft genomes, and has subsequently been checked and fixed for bad files, genomes and annotations. Please check the News section for when the database was last updated.

  • MLST
    Here you can find questions and answers related to MLST.
  • Where can I find more information about the MLST method used by GoSeqIt Tools?

    Here and in:

    Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Pontén T, Ussery DW, Aarestrup FM, Lund O. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol. 2012 Apr;50(4):1355-61. PMID: 22238442.

  • I have been running the Bacterial Analysis Pipeline, but there are no Multilocus Sequence Type (MLST) results. Why is that?

    MLST is only performed for species that according to http://pubmlst.org/data/ have an MLST scheme. If the species predicted during the Species Prediction step does not have an MLST scheme, MLST is not performed.

  • A "-like" has been added to the identified Sequence Type (ST), e.g., ST11-like. What does it mean?

    A "-like" is automatically added to the identified Sequence Type unless all alleles in the input genome match perfectly to a database allele (%Identity = 100 and Alignment Length = DB allele Length).
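
    The rule can be expressed in a few lines of Python. This is an illustrative sketch only; `format_sequence_type` is a hypothetical helper, and the allele hits in the example are invented.

```python
def format_sequence_type(st_number, allele_hits):
    """Return "ST<n>" for all-perfect hits, otherwise "ST<n>-like".

    allele_hits is a list of (percent_identity, alignment_length,
    db_allele_length) tuples, one per MLST locus.
    """
    all_perfect = all(
        identity == 100 and alignment_length == db_length
        for identity, alignment_length, db_length in allele_hits
    )
    return f"ST{st_number}" if all_perfect else f"ST{st_number}-like"


# One locus with an imperfect match is enough to add the "-like" suffix.
print(format_sequence_type(11, [(100, 450, 450), (99.8, 450, 450)]))
```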

  • I have been analysing an Escherichia coli/Acinetobacter baumannii/Pasteurella multocida/Streptococcus thermophilus or Leptospira genome, but get two/three different Sequence Types. Why?

    For some species, more than one MLST scheme has been developed:

    The Acinetobacter baumannii scheme #1 (also called the Oxford scheme) samples the following genes:

    Oxf_gltA; Oxf_gyrB; Oxf_gdhB; Oxf_recA; Oxf_cpn60; Oxf_gpi; Oxf_rpoD

    and is described in:
    Bartual et al. Development of a multilocus sequence typing scheme for characterisation of clinical isolates of Acinetobacter baumannii. J Clin Microbiol 2005, 43:4382-4390

    The Acinetobacter baumannii scheme #2  (also called the Pasteur scheme) samples the following genes:

    Pas_cpn60; Pas_fusA; Pas_gltA; Pas_pyrG; Pas_recA; Pas_rplB; Pas_rpoB

    and is described in:
    Diancourt et al. The population structure of Acinetobacter baumannii: expanding multiresistant clones from an ancestral susceptible genetic pool. PLoS One 2010, 5:e10034

    The Escherichia coli scheme #1 (Achtman scheme) samples the following genes: adk; fumC; gyrB; icd; mdh; purA; recA

    and is described in:

    Wirth et al. Sex and virulence in Escherichia coli: an evolutionary perspective. Mol. Microbiol. 2006: 60(5); 1136-1151

    The Escherichia coli scheme #2 (Pasteur scheme) samples the following genes: dinB; icdA; pabB; polB; putP; trpA; trpB; uidA

    and is described in:

    Jaureguy et al. Phylogenetic and genomic diversity of human bacteremic Escherichia coli strains. 2008. BMC Genomics 9:560

    The Pasteurella multocida scheme #1 (RIRDC scheme) samples the following genes:

    adk; est; pmi; zwf; mdh; gdh; pgi

    and was developed by Sounthi Subaaharan and Pat Blackall (Animal Research Institute, Australia).

    The Pasteurella multocida scheme #2 (multihost scheme) samples the following genes:

    adk; aroA; deoD; gdhA; g6pd; mdh; pgi

    and was developed by Robert Davies, University of Glasgow, UK.

    The Leptospira spp. scheme samples the following genes:

    glmU; pntA; sucA; tpiA; pfkB; mreA; caiB

    and is described in:

    Boonsilp et al. 2013, PLoS Negl Trop Dis 7:e1954

    The Leptospira scheme #2 samples the following genes:

    adk_2; glmU_2; icdA_2; lipL32_2; lipL41_2; mreA_2; pntA_2

    and is described in:

    Varni et al. 2014, Infect Genet Evol 22:216-22

    The Leptospira scheme #3 samples the following genes:

    adk_3; icdA_3; lipL41_3; rrs2_3; secY_3; lipL32_3

    and is described in:

    Ahmed et al. 2006, Ann Clin Microbiol Antimicrob 5:28

    The Streptococcus thermophilus scheme (Pasteur scheme) samples the following genes:

    ddlA; glcK; proA; ptsI; serB; tkt

    The Streptococcus thermophilus scheme #2 (Yu scheme) samples the following genes:

    carB; clpX; dnaA; murC; murE; pepN; pepX; pyrG; recA; rpoB

  • What does the % Identity mean?

    The % Identity is the percentage of nucleotides that are identical between the best-matching database MLST allele and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.
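
    As a worked illustration, the percentage can be computed from two aligned sequences of equal length, where "-" marks a gap. This Python sketch uses the convention of dividing the number of identical positions by the alignment length; `percent_identity` is a hypothetical helper and the sequences are invented.

```python
def percent_identity(aligned_input, aligned_allele):
    """Percentage of alignment columns where both sequences carry the same base."""
    if len(aligned_input) != len(aligned_allele):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_input:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for a, b in zip(aligned_input, aligned_allele)
        if a == b and a != "-"  # gap columns never count as identical
    )
    return 100.0 * matches / len(aligned_input)


# Nine of ten positions are identical.
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```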

  • What does the Alignment Length mean?

    We use BLAST to identify the regions in the input genome that match the alleles in the MLST database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and an allele in the MLST database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.

  • The Alignment Length does not match the DB allele Length. Should I be concerned?

    The Alignment Length corresponds to the length of the local alignment between the best-matching database MLST allele and the corresponding sequence in the input genome. If the Alignment Length is shorter than the database (DB) allele length, it means that the sequence of the entire MLST allele could not be found in the input sequence. Typically, if the lengths only differ by a few nucleotides, it could be due to a hitherto unseen and hence unregistered allele in your input genome. Larger differences might be caused by a poor assembly in which the MLST gene is split across two contigs. You might want to take a look at the alignment file (download the file "Results including alignments (txt)") to decide which is the likely explanation.

  • What do the different colours of the rows in the MLST table signify?
    Green signifies perfect matches between the best-matching MLST allele and a sequence in the input genome (%Identity = 100, Alignment Length = DB allele Length).
    Orange means that the entire length of the MLST allele is covered by a corresponding sequence in the input genome, but some nucleotides differ (%ID < 100, Alignment Length = DB allele Length).
    Grey means that only part of the MLST allele is covered by a corresponding sequence in the input genome (Alignment Length != DB allele Length, %ID = 100 or %ID < 100).
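
    The three cases can be summarised as a small decision function. The Python sketch below is an illustration of the colour rules as stated above, not the platform's code; `row_colour` is a hypothetical name.

```python
def row_colour(percent_identity, alignment_length, db_allele_length):
    """Return the row colour for a hit according to the rules above."""
    if alignment_length != db_allele_length:
        return "grey"    # only part of the allele is covered
    if percent_identity == 100:
        return "green"   # full length and identical
    return "orange"      # full length, but some nucleotides differ


print(row_colour(100, 450, 450))
print(row_colour(99.3, 450, 450))
print(row_colour(100, 430, 450))
```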
  • The Sequence Type (ST) is reported as Unknown, although all MLST alleles are perfect matches. How could that be?

    The Sequence Type is determined by the specific combination of registered MLST alleles according to www.pubmlst.org. Even though all alleles in the input genome match perfectly to alleles in the MLST database, the combination of alleles might be previously unseen, and hence unregistered.
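
    In other words, the Sequence Type is looked up from the whole combination of allele numbers. A minimal Python sketch, using an invented profile table (real profiles come from www.pubmlst.org) and a hypothetical helper `lookup_sequence_type`:

```python
def lookup_sequence_type(allele_numbers, registered_profiles):
    """Return the ST registered for this allele combination, else "Unknown"."""
    return registered_profiles.get(tuple(allele_numbers), "Unknown")


# Invented example: one registered profile for a seven-locus scheme.
profiles = {(1, 2, 1, 1, 3, 1, 1): 5}

print(lookup_sequence_type([1, 2, 1, 1, 3, 1, 1], profiles))
# Every allele is known individually, but this combination is not registered.
print(lookup_sequence_type([1, 2, 1, 1, 3, 1, 2], profiles))
```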

  • Is it possible to see the actual alignment between the best-matching MLST alleles and the corresponding sequences in the input genome?

    Indeed it is! Just download the file marked "Results including alignments (txt)" below the MLST table.

  • Where do you get the data for MLST from and when was it last updated?

    All data is downloaded from www.pubmlst.org. Please check the News section to see when the database was last updated.

  • ResFinder
    Here you can find questions and answers related to ResFinder.
  • Where can I find more information about the ResFinder method?

    Here and in

    Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. 2012 Nov;67(11):2640-4. PMID: 22782487.

    and

    Zankari E, Hasman H, Kaas RS, Seyfarth AM, Agersø Y, Lund O, Larsen MV, Aarestrup FM. Genotyping using whole-genome sequencing is a realistic alternative to surveillance based on phenotypic antimicrobial susceptibility testing. J Antimicrob Chemother. 2013 Apr;68(4):771-7. PMID: 23233485.

  • Are there any bacterial species that should not be or are not analysed by ResFinder?

    No, all bacterial species can be analysed by the ResFinder method. The Bacterial Analysis Pipeline will search for resistance genes in all input files no matter which species the Species Prediction step predicts, and even if the species is predicted as unknown.

  • What do the different colours of the rows in the ResFinder table signify?

    Green signifies perfect matches between the database resistance gene and a sequence in the input genome (%Identity=100, Alignment Length = DB allele Length).
    Orange means that the entire length of the database resistance gene is covered by a corresponding sequence in the input sequence, but some nucleotides differ (%ID < 100, Alignment Length = DB allele Length).
    Grey means that only part of the database resistance gene is covered by a corresponding sequence in the input genome (Alignment Length != DB allele Length,  %ID = 100 or %ID < 100).

  • What does the % Identity mean?

    The % Identity is the percentage of nucleotides that are identical between the resistance gene in the ResFinder database and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

  • What does Alignment Length mean?

    We use BLAST to identify the regions in the input genome that match the alleles in the ResFinder database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and an allele in the ResFinder database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.

  • Is it possible to see the actual alignment between the resistance gene in the ResFinder database and the corresponding sequences in the input genome?

    Indeed it is! Just download the file marked "Results including alignments (txt)" below the table of identified resistance genes.

  • How similar does a database resistance gene have to be to a sequence in the input genome for the gene to be reported?

    If you are running the standard version of the Bacterial Analysis Pipeline, the % Identity (the percentage of identical nucleotides) must be at least 90% for a database resistance gene to be reported. Further, the Alignment Length (the length of the local alignment between the database resistance gene and a corresponding sequence in the input genome) must be at least 60% of the length of the database resistance gene. If you are running the sensitive version of the Bacterial Analysis Pipeline, the % Identity must be at least 60% for a database resistance gene to be reported, and the Alignment Length still at least 60% of the length of the database resistance gene.
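
    Both conditions can be combined in one check. The Python sketch below illustrates the rule as described above; `is_reported` and its parameter names are hypothetical, and the defaults correspond to the standard version for ResFinder.

```python
def is_reported(percent_identity, alignment_length, gene_length,
                min_identity=90, min_length_percent=60):
    """Return True if a hit meets both the identity and the length criterion.

    The length criterion is checked with integer arithmetic:
    alignment_length / gene_length >= min_length_percent / 100.
    """
    long_enough = alignment_length * 100 >= min_length_percent * gene_length
    return percent_identity >= min_identity and long_enough


# Standard version: 95 %ID over 600 of 1000 bases is reported.
print(is_reported(95, 600, 1000))
# Sensitive version: lower the identity threshold to 60.
print(is_reported(85, 1000, 1000, min_identity=60))
```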

  • Where do you get the data for ResFinder from and when was it last updated?

    We get the ResFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.

  • PlasmidFinder
    Here you can find questions and answers related to PlasmidFinder.
  • I have been running the Bacterial Analysis Pipeline, but cannot find the output from PlasmidFinder anywhere. Why is that?

    Plasmid replicons are only identified for Enterobacteriaceae and Gram-positive bacteria. Missing output from PlasmidFinder is likely due to your input genome not being identified as belonging to either of these groups during the species prediction step.

  • How similar does a database plasmid replicon sequence have to be to a sequence in the input genome for the replicon to be reported?

    If you are running the standard version of the Bacterial Analysis Pipeline, the % Identity (the percentage of identical nucleotides) must be at least 80% for a database replicon to be reported. Further, the Alignment Length (the length of the local alignment between the database replicon and a corresponding sequence in the input genome) must be at least 60% of the length of the database replicon. If you are running the sensitive version of the Bacterial Analysis Pipeline, the % Identity must be at least 60% for a database replicon to be reported, and the Alignment Length still at least 60% of the length of the database replicon.

  • What does the % Identity mean?

    The % Identity is the percentage of nucleotides that are identical between the database replicon sequence and the corresponding sequence in the input genome. Perfect matches have % Identity = 100.

  • What does Alignment Length mean?

    We use BLAST to identify the regions in the input genome that match the replicons in the PlasmidFinder database. BLAST finds the best local alignment (overlap) between a sequence in the input genome and a replicon in the PlasmidFinder database. The Alignment Length is the length of the alignment measured in basepairs. For perfect matches the Alignment Length equals the DB allele Length.

  • Is it possible to see the actual alignment between the database replicon sequence and the corresponding sequences in the input genome?

    Indeed it is! Just download the file marked "Results including alignments (txt)" below the PlasmidFinder table.

  • What do the different colours of the rows in the PlasmidFinder table signify?

    Green signifies perfect matches between the database replicon sequence and a sequence in the input genome (%Identity=100, Alignment Length = DB allele Length).
    Orange means that the entire length of the database replicon sequence is covered by a corresponding sequence in the input sequence, but some nucleotides differ (%ID < 100, Alignment Length = DB allele Length).
    Grey means that only part of the database replicon sequence is covered by a corresponding sequence in the input genome (Alignment Length != DB allele Length, %ID = 100 or %ID < 100).

  • Where can I find more information about the PlasmidFinder method?

    Here and in:

    Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 2014 Jul;58(7):3895-903. PMID: 24777092.

  • Where do you get the data for the PlasmidFinder database from and when was it last updated?

    We get the PlasmidFinder database from the Center for Genomic Epidemiology, Technical University of Denmark (http://www.genomicepidemiology.org/). Please check the News section for when the database was last updated.