Flagellin, the protein that makes up the bacterial flagellum, is remarkably variable (DOI: 10.1128/mSystems.00705-19). It has conserved domains for the propeller function, which are covered by highly variable, immunogenic domains. This variation is used for serotyping, and in E. coli, among others, the variants are known as H antigens. We used the 50 H antigens that map to the E. coli fliC locus for strain identification. They are so divergent that there is very little true sequence alignment between them, and they do not recombine, so each H antigen has its own internal variation, which can be greater than that in housekeeping genes. We used primers for the E. coli conserved regions to amplify the variable regions, with barcodes for each human fecal sample. The amplicons were sequenced on the PacBio Sequel platform using circular consensus sequencing, which gives near 100% accuracy, and we aimed for about 2000 or more reads per sample. Dominant strains stand out with a high proportion of the reads, and minor strains can be seen even when present in very low numbers. One sample with several thousand reads had one or more strains with each of 28 H antigens. Some of the flagellin sequences are present in published E. coli strains, and it appears that we are looking at the distribution of these strains in our samples.
Colonies isolated from the samples can be screened by their fliC sequence to obtain isolates of the dominant strains, and further sequencing of these isolates confirmed that we are looking at either known strains or close relatives. We also observed cases of major strain turnover in individuals, in some cases as a result of antibiotic use.
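
As an illustration of how dominant and minor strains might be read off the per-sample read counts, here is a minimal Python sketch. It assumes each read has already been demultiplexed by barcode and assigned to an H antigen; the function name `h_antigen_profile`, the input format, and the 50% cutoff for calling a strain "dominant" are hypothetical choices made for this example and are not taken from the study.

```python
from collections import Counter


def h_antigen_profile(assignments, dominant_fraction=0.5):
    """Summarise reads per H antigen for each sample.

    assignments: iterable of (sample_id, h_antigen) pairs, one pair per read.
    dominant_fraction: hypothetical cutoff; an H antigen holding at least this
        share of a sample's reads is labelled "dominant", otherwise "minor".
    Returns {sample_id: [(h_antigen, read_count, fraction, label), ...]},
    with each list sorted from most to least abundant.
    """
    counts = {}
    for sample_id, h_antigen in assignments:
        counts.setdefault(sample_id, Counter())[h_antigen] += 1

    profile = {}
    for sample_id, counter in counts.items():
        total = sum(counter.values())
        rows = []
        for h_antigen, n in counter.most_common():
            fraction = n / total
            label = "dominant" if fraction >= dominant_fraction else "minor"
            rows.append((h_antigen, n, fraction, label))
        profile[sample_id] = rows
    return profile


# Invented toy data: one sample with 2000 reads spread over three H antigens.
example_reads = (
    [("sample_1", "H7")] * 1500
    + [("sample_1", "H4")] * 498
    + [("sample_1", "H10")] * 2
)

example_profile = h_antigen_profile(example_reads)
for h_antigen, n, fraction, label in example_profile["sample_1"]:
    print(f"{h_antigen}\t{n}\t{fraction:.4f}\t{label}")
```

In this toy sample the rare H antigen is still reported despite contributing only two reads, which mirrors the point that minor strains remain visible at very low numbers when read accuracy is high.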