E-Poster Presentation Australian Society for Microbiology Annual Scientific Meeting 2021

Application of the random forest algorithm to Streptococcus pyogenes response regulator allele variation: from machine learning to evolutionary models (#206)

Sean J Buckley 1, Robert J Harvey 1,2, Zack Shan 3
  1. School of Health and Behavioural Sciences, University of the Sunshine Coast, Maroochydore, Queensland, Australia
  2. Sunshine Coast Health Institute, University of the Sunshine Coast, Birtinya, Queensland, Australia
  3. Thompson Institute, University of the Sunshine Coast, Birtinya, Queensland, Australia

Streptococcus pyogenes (group A Streptococcus: GAS) is a globally significant bacterial pathogen. The current gold standard for genotyping GAS strains is based on the nucleotide variation of emm, which encodes a surface-exposed protein that is recombinogenic and under immune-based selection pressure. In the era of whole-genome sequencing, we tested the utility of the random forest (RF) algorithm and 53 GAS response regulator (RR) allele types to infer six genomic traits (emm-type, emm-subtype, tissue sampled, country of sampling, clinical outcome, and isolate invasiveness) by applying three different RF classifiers (Ordinary, Regularized, and Guided) within a supervised learning methodology. When inferring the emm-type at the highest accuracy, the Guided, Ordinary, and Regularized RF classifiers selected ten, three, and four RR alleles in the feature set to attain 97.8%, 96.2%, and 95.6% accuracy, respectively. Notably, mga2 and lrp were ranked most important by all three classifiers. Using only mga2 and lrp as predictor features, we inferred the emm-type with 93.7% accuracy. Across the three RF classifiers, the mean accuracies for inferring the emm-subtype, country, invasiveness, clinical outcome, and tissue were 89.9%, 88.6%, 84.7%, 56.9%, and 56.4%, respectively. We have demonstrated the utility of RF classifiers for inferring each of the traits tested except the tissue sampled and the clinical outcome; this exception is consistent with the complexity of the pathophysiology and GAS-host interactions during infection. We also identified a novel cell wall-spanning domain (SF5), and propose evolutionary pathways depicting the ‘contrariwise’ and ‘likewise’ chimeric deletion-fusion of emm and enn in emm137.0 and emm77.0 isolates. Finally, we identified a non-typable intermediate strain that stands as evidence of time-dependent excision of mga regulon genes.
Overall, we have developed a process flow using the RF algorithm and an RR-based typing system that has advanced the understanding of the GAS mga regulon and its plasticity.
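
As an illustration of the general idea of inferring emm-type from RR allele types with a random forest, the following is a minimal pure-Python sketch, not the authors' pipeline. It implements only an ordinary random forest (bootstrap sampling plus random feature selection at each split); the Regularized and Guided variants are not shown. The toy isolates, the allele labels ("1", "2"), the emm labels ("emm_A" to "emm_D"), the feature name rr_other, and all function names are hypothetical and chosen only for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, features, rng, n_try):
    """Grow one tree using multiway splits on categorical allele features."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return {"leaf": majority}
    # Random feature subset at each node (the "random" in random forest).
    candidates = rng.sample(features, min(n_try, len(features)))

    def split_impurity(feature):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[feature], []).append(y)
        return sum(len(g) * gini(g) for g in groups.values()) / len(labels)

    best = min(candidates, key=split_impurity)
    remaining = [f for f in features if f != best]
    children = {}
    # sorted() keeps the order of random draws reproducible across runs.
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, rng, n_try)
    return {"feature": best, "children": children, "fallback": majority}


def predict_tree(node, row):
    """Walk the tree; unseen alleles fall back to the node's majority class."""
    while "leaf" not in node:
        child = node["children"].get(row[node["feature"]])
        if child is None:
            return node["fallback"]
        node = child
    return node["leaf"]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap resample of the isolates."""
    rng = random.Random(seed)
    features = sorted(rows[0])
    n_try = max(1, int(len(features) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            features, rng, n_try))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: emm-type is fully determined by (mga2, lrp);
# rr_other is an uninformative allele included as noise.
combos = {("1", "1"): "emm_A", ("1", "2"): "emm_B",
          ("2", "1"): "emm_C", ("2", "2"): "emm_D"}
rows, labels = [], []
for rep in range(10):
    for (mga2, lrp), emm in combos.items():
        rows.append({"mga2": mga2, "lrp": lrp, "rr_other": str(rep % 2)})
        labels.append(emm)

forest = fit_forest(rows, labels)
correct = sum(predict_forest(forest, r) == y for r, y in zip(rows, labels))
accuracy = correct / len(rows)
print(f"training-set accuracy on toy data: {accuracy:.3f}")
```

The accuracy printed here is measured on the same toy isolates used for training, so it says nothing about the accuracies reported above; a real evaluation would hold out isolates or use out-of-bag estimates.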