E-Poster Presentation Australian Society for Microbiology Annual Scientific Meeting 2021

Hecatomb: Fast and accurate taxonomic assignment of viral metagenome sequences (#256)

Michael Roach 1, Kathy Mihindukulasuriya 2, Scott Handley 2, Robert Edwards 1
  1. Flinders University, BEDFORD PARK, SA, Australia
  2. Washington University in St. Louis, St. Louis, USA

Viruses and bacteria influence every environment, including those that shape human health and disease, and viral metagenomics is beginning to elucidate their roles and interactions across these vastly different, complex microbial communities. However, accurate identification and classification of viral sequences is fraught with challenges. Contamination from host DNA and non-biological sources (such as primers and adapters) can comprise a significant proportion of many viral metagenomes. The viral dark matter is vast, and search algorithms must deal with sequences that are highly diverged from the few sequences present in current reference databases. False-positive taxonomic assignments are common in viral metagenomes because many sequences share extensive similarity with sequences from cellular organisms.

The Hecatomb pipeline addresses these issues while keeping pace with the staggering output of the current generation of sequencing platforms. A rigorous pre-processing stage removes contaminating host DNA, sequencing primers and adapters, and vector sequences. Next, the search space is greatly reduced by clustering redundant sequences. Potential viral sequences are identified by sequence similarity to viral protein and nucleotide databases, and false positives are then removed by a secondary search against the larger UniRef database. This iterative approach is both fast and accurate.
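The logic of the two-stage search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hecatomb's implementation: the function names, the `(taxon, is_viral)` hit format, and the toy lookup tables are all hypothetical; exact-duplicate collapsing stands in for real sequence clustering; and keeping a sequence when the secondary search finds no hit is a choice made for the sketch.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical hit format: (taxon label, whether the taxon is viral).
Hit = Tuple[str, bool]
Search = Callable[[str], Optional[Hit]]


def dereplicate(reads: Iterable[str]) -> Counter:
    """Collapse identical reads into one representative with a count.

    A simplified stand-in for clustering redundant sequences.
    """
    return Counter(reads)


def assign_viral(
    reads: Iterable[str], viral_search: Search, broad_search: Search
) -> Dict[str, Tuple[str, int]]:
    """Keep sequences that hit the viral database and are not
    contradicted by a secondary search of a broader database."""
    kept: Dict[str, Tuple[str, int]] = {}
    for seq, count in dereplicate(reads).items():
        primary = viral_search(seq)
        if primary is None:
            continue  # no similarity to any viral reference
        secondary = broad_search(seq)
        if secondary is not None and not secondary[1]:
            continue  # best broad hit is non-viral: treat as false positive
        # Assumption: with no secondary hit, fall back to the primary taxon.
        taxon = secondary[0] if secondary is not None else primary[0]
        kept[seq] = (taxon, count)
    return kept


# Toy data: dictionaries stand in for real similarity searches.
viral_db = {"ACGTACGT": ("toy_phage", True), "TTTTGGGG": ("toy_virus", True)}
broad_db = {"ACGTACGT": ("toy_phage", True), "TTTTGGGG": ("toy_bacterium", False)}
reads = ["ACGTACGT", "ACGTACGT", "TTTTGGGG", "CCCCCCCC"]

result = assign_viral(reads, viral_db.get, broad_db.get)
print(result)  # {'ACGTACGT': ('toy_phage', 2)}
```

In this toy run, the duplicated read is searched once and reported with a count of two, the read whose best hit in the broader database is bacterial is discarded, and the read with no viral hit never reaches the more expensive secondary search.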

The Hecatomb pipeline is powered by Snakemake and Anaconda, making it inherently scalable, reproducible, re-entrant, easy to install, and compatible with most Linux computing environments. A launcher makes it easy to download the databases, prepare your own host genomes, and run the pipeline. The output tables load readily into R or Python for data exploration, and the accompanying Shiny app assists with data interrogation. The pipeline has been used to explore links between viruses and Irritable Bowel Diseases, to catalogue the viral and phage compositions of different shark epidermal samples, and to explore potential viral roles in cases of epilepsy triggered by the parasite Onchocerca volvulus.