There are almost one million assembled bacterial and archaeal genomes in the NCBI Assembly Repository, and with the maturation of metagenome-assembled genomes, there will be many millions in the next few years. The prophage finder PhiSpy was developed to identify prophages (temperate phages that have integrated into the host genome) without relying on homology-based signatures. We provide a comprehensive testing framework that we used to compare eight different prophage prediction tools. Only the updated PhiSpy algorithm is fast enough to accurately mine the complete NCBI genome assembly catalogue and identify putative prophages across the entire collection. We developed a new ontology to describe the environments from which Bacteria and Archaea have been isolated, and combined that with extensive metadata from GTDB, NCBI, and PATRIC to demonstrate the key determinants of the number of prophages per genome.
We have identified tens of millions of prophages from hundreds of thousands of Bacteria and Archaea. Most clades are susceptible to phage infection: 84% of bacterial phyla have at least one prophage, with an average of 3.5 prophages per complete genome and an average prophage length of 31 kb. Key drivers of the number of prophages per genome include both the environment from which the Bacteria or Archaea were isolated and the age of the isolate. Organisms exposed to connected environments contained more prophages, while those from isolated environments had fewer. Older isolates have fewer prophages, presumably because their viruses have been cured during isolate storage or resuscitation.
The prophage prediction testing framework will be useful in informing best practices for identifying prophages across large databases. The current PhiSpy Prophage Database provides a unique resource for mining viral genomes to explore the role of phages in shaping the evolution of their hosts.