Previous Article | Next Article ![]()
Journal of Bacteriology, November 2007, p. 8186-8195, Vol. 189, No. 22
0021-9193/07/$08.00+0 doi:10.1128/JB.00690-07
Copyright © 2007, American Society for Microbiology. All Rights Reserved.
,
Benjamin Janto,1,
Justin S. Hogg,1
Robert Boissy,1
Susan Yu,1
Evan Powell,1
Randy Keefe,1
Nathan E. Ehrlich,1
Kai Shen,1
Jay Hayes,1
Karen Barbadora,2
William Klimke,3
Dmitry Dernovoy,3
Tatiana Tatusova,3
Julian Parkhill,4
Stephen D. Bentley,4
J. Christopher Post,1,5
Garth D. Ehrlich,1,5 and
Fen Z. Hu1,5*
Allegheny General Hospital, Allegheny-Singer Research Institute, Center for Genomic Sciences,1 Department of Microbiology and Immunology, Drexel University College of Medicine, Allegheny Campus, Pittsburgh, Pennsylvania 15212,5 Children's Hospital of Pittsburgh, Pittsburgh, Pennsylvania, 15213,2 National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894,3 The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom4
Received 1 May 2007/ Accepted 24 July 2007
|
|
|---|
3,000) that are represented in the S. pneumoniae population at frequencies of
0.1 can be identified if 33 representative genomes are sequenced. These extensive genic diversity data support the DGH and provide a basis for understanding the great differences in clinical phenotype associated with various pneumococcal strains. When these findings are taken together with previous studies that demonstrated the presence of a supragenome for Streptococcus agalactiae and Haemophilus influenzae, it appears that the possession of a distributed genome is a common host interaction strategy. |
|
|---|
S. pneumoniae played a critical role in the demonstration that DNA is the hereditary genetic material. In 1944 Avery and colleagues showed that DNA is the transforming factor identified by Griffith as being capable of making avirulent S. pneumoniae lethal (1). Since then, S. pneumoniae has served as a model organism for the study of bacterial transformation. It contains an inducible system for the uptake of DNA from the environment that allows for extensive recombination (3, 29, 34). Previous work from our laboratory using genomic libraries from eight clinical strains identified novel genes not present in the TIGR4 reference strain and showed a nonuniform distribution of many S. pneumoniae genes, suggesting a significant degree of genomic plasticity among the isolates (37).
A large amount of intraspecies genic variation (which pertains to the absence or presence of genes and should not be confused with allelic variation) has been observed for several bacteria. Tettelin and colleagues analyzed the genomes of six Streptococcus agalactiae strains of multiple serotypes and showed that
20% of the genes are not shared among all strains (43). A similar trend was found for 13 Haemophilus influenzae genomes, where only
50% of the genes are conserved among all strains (18). Both of these studies support the distributed-genome hypothesis, which states that for some bacteria, the full complement of genes available to a given species exists in a "supragenome" pool, one that each member of a population of naturally transformable bacterial strains contributes to and draws genes from, resulting in a high degree of genic diversity (7).
In this study we sequenced the genomes of eight clinical S. pneumoniae isolates and combined these data with sequences from nine other publicly available S. pneumoniae strains. We present a global comparative analysis of the genes and genomes, which demonstrates great genic diversity among the strains. In addition, we use a mathematical model to predict the number of genomes that must be sequenced to provide coverage of the S. pneumoniae supragenome (18).
|
|
|---|
Sequence data for the other nine strains examined came from various sources. Four isolates were sequenced by the S. pneumoniae Sequencing Group at the Sanger Institute: 23F, INV104B, INV200, and OXC141. The sequences of the latter three strains are incomplete and can be obtained from ftp://ftp.sanger.ac.uk/pub/pathogens/spn/. The published genome sequence of strain R6 (GenBank accession number AE007317) was produced by Eli Lilly & Co (19). The Institute for Genomic Research (TIGR) sequenced strains TIGR4 (GenBank accession number AE005672), TIGR670-6B (project information available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview) (23, 44), and strain D39 (s_pneumoniae_d39_1; GenBank accession number CP000410) (23). Unfinished sequence data for strain PAT6420135 are available from reference 22, and the strain from which the sequence was obtained is now an ATCC patent deposit bacterial strain (ATCC 55840). The Entrez Nucleotide database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide) query "patent[TITL] AND 6420135[TITL]" retrieves the 391 DNA sequences that make up this genome assembly, which was produced by Human Genome Sciences, Inc.
Genome assembly of CGS S. pneumoniae sequences. The 454-assembled genomic contigs were ordered and oriented into scaffolds by alignment, using Nucmer software (6), against each of the nine non-CGS S. pneumoniae genomes, which indicated the closest reference strain. Using a maximum-parsimony approach, each genome was reduced to about 60 contigs by a combination of (i) Sanger sequencing of PCR amplicons targeted to fill gaps between neighboring contigs, as inferred from the scaffolding, and (ii) paired-end Sanger sequencing of clones from a library and identification of clones that spanned gaps in the 454 sequence. Gap closure experiments were designed by a custom Perl script, and PCR primers were designed by Primer3 (36). Clones and PCR amplicons were assembled along with 454 contigs by a modified Phred-Phrap-Consed pipeline where 454 contigs were converted to PHD format files and were input into Phrap as long reads (9, 10, 13, 14). Data were manipulated and visualized using CONSED.
CDS prediction. Prediction of putative coding sequences (CDSs) and gene annotation were done by NCBI using the Microbial Genome Annotation Tools and Genome Annotation Pipeline (http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html). Briefly, gene predictions were performed using GeneMark, GeneMark.hmm, and Glimmer 2. To detect any genes missed by this method, a six-frame translation of slices of the nucleotide sequence was done. The predictions and the slices were then searched against the NCBI Entrez Protein Cluster database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=proteinclusters) as well as proteins from all complete microbial genomes. Annotation was supplied from both sets and supplemented with information from the Conserved Domain Database (42) and from Clusters of Orthologous Groups (25). rRNAs were predicted using BLAST to search against an RNA sequence database (4), and tRNAs were predicted using tRNAscan-SE (24).
Clustering algorithm. A complete description of the algorithms used to create the orthologous clusters is given by Hogg et al. (18). Briefly, to allocate the CDSs into core, unique, or distributed clusters, tfasty34 (Fasta package, version 3.4) was used for six-frame translation homology searches of all predicted proteins against all possible translations (33). Proprietary software designed at the CGS was used to parse this output, grouping the genes into clusters. A cluster was defined as a group of genes which share at least 70% identity, over 70% of its length, with one or more of the other genes in the group, and where at least one sequence in the cluster is equal to or longer than 120 residues. These parameters were selected because they minimize the change in the number of clusters per change in parameters and thus may represent a good estimate to distinguish orthologues from paralogues (18). Multiple genes from the same strain can be included in one cluster; therefore, in most strains the number of clusters is lower than the number of CDSs. To account for the cases where CDSs were missed in the gene-calling step, fasta34 (Fasta package, version 3.4) was used to align all predicted genes against the contigs. If an alignment with the required score was detected in the contig, the gene was considered present even if the gene was not identified.
Phylogenetic-tree building.
For the genic-difference-based dendrogram, a gene possession-based phylogenetic tree of the 17 S. pneumoniae strains was constructed by defining the distance between a pair of genomes, i and k, to be
where gn,i is 1 if gene n is present in strain i and zero otherwise. The strains were clustered using tools available at http://www.let.rug.nl/
kleiweg/clustering/#cluster. "difftbl" was used for conversion of vector data into a difference table using city block as a method, and "cluster" was used for data clustering of the difference table using the group average method. For the relationship among MLST types, we used the batch allelic profile from the pneumococcal MLST database (http://spneumoniae.mlst.net/).
Nucleotide sequence accession numbers. The GenBank accession numbers for the following eight clinical isolates, sequenced in this study, are given in parentheses after the isolate designations: CGSSp3BS71 (AAZZ00000000), CGSSp6BS73 (ABAA00000000), CGSSp9BS68 (ABAB00000000), CGSSp11BS70 (ABAC00000000), CGSSp14BS69 (ABAD00000000), CGSSp18BS74 (ABAE00000000), CGSSp19BS75 (ABAF00000000), and CGSSp23BS72 (ABAG00000000).
|
|
|---|
|
View this table: [in a new window] |
TABLE 1. Bacterial strains and sources used for the genomic comparison of S. pneumoniae strains
|
Identification of CDSs and core, distributed, and unique orthologous clusters. Prediction of CDSs and annotations were done using the Microbial Genome Annotation Tools and Genome Annotation Pipeline (http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html). The median number of CDSs per strains is 2,411, and the range is from 2,259 for strain INV200 to 2,763 for strain CGSSp14BS69. The total number of CDSs in all 17 strains is 41,445 (Tables 2 and 3). Note that all 17 genomes were submitted to the NCBI Genome Annotation Pipeline (including those with previously published CDSs) to eliminate errors that may have arisen from differences in the prediction of CDSs. A multi-fasta file with all the CDSs used in this study is available in Fig. S1 in the supplemental material and can also be downloaded from SupraGen at http://centerforgenomicsciences.org/doc_frame/index-old.html.
|
View this table: [in a new window] |
TABLE 2. Summary of CDSs and orthologous gene clusters for 17 S. pneumoniae strains
|
|
View this table: [in a new window] |
TABLE 3. Numbers of CDSs and orthologous clusters for individual S. pneumoniae strains
|
The total number of S. pneumoniae orthologous clusters (which includes unique genes) for all 17 strains is 3,170. Forty-six percent of the total clusters (1,454 core clusters) are found in all strains and include 73% of the total number of CDSs. Eighteen percent of the total clusters (576 clusters) are unique to a single strain, and 36% of the total clusters (1,140 distributed clusters) are present in some combination of two or more strains, but not in all strains (Table 2). Figure 1 shows a histogram plotting the number of orthologous clusters against the number of genomes in which a cluster is found, where 1 and 17 genomes correspond to the unique and core sets, respectively. Table 3 gives the numbers of CDSs, orthologous clusters, and unique genes per strain. The median number of orthologous clusters per strain is 2,012, with a range from 1,850 for INV200 to 2,157 for TIGR670-6B. Note that for each strain, 21 to 32% of the orthologous gene clusters are distributed or unique genes, i.e., not common to all strains. Considering the group of 17 strains collectively, 54% of the gene clusters are noncore.
![]() View larger version (14K): [in a new window] |
FIG. 1. Histogram of the number of observed and predicted (by the finite supragenome model) orthologous gene clusters that are present in a given number of genomes. There were 1,454 orthologous clusters observed in all strains (core); 1,140 distributed among more than one strain, but not all; and 576 in only one strain.
|
Phylogenetic relationships among strains. To evaluate the relationships among strains, we constructed a dendrogram based on noncore genic differences (Fig. 2). As expected, strain D39 and its derivative strain, R6, are the most closely related pair. Interestingly, highly genetically diverse S. pneumoniae strains can be isolated from patients with the same symptoms at the same geographical location, as illustrated by the broad distribution of the eight CGS clinical isolates. Furthermore, strains OXC141 and CGSSp3BS72, which were isolated on different continents, have similar orthologous-cluster distributions. Surprisingly, the strains isolated in Norway (TIGR4) and Spain (23F) are not outliers with respect to the U.S. isolates, and the two Spanish strains (23F and TIGR670-6B) do not group together. The analysis includes two strains each of serotypes 14, 23, and 3 and three strains of serotype 6. There is great similarity between the genic contents of the two serotype 3 strains, but the same is not true for the strains with serotypes 14, 6, and 23.
![]() View larger version (13K): [in a new window] |
FIG. 2. Dendrogram showing the relationship among 17 S. pneumoniae strains based on orthologous cluster differences.
|
![]() View larger version (69K): [in a new window] |
FIG. 3. Global comparison of orthologous gene clusters for 17 S. pneumoniae strains. The similarity score corresponds to the total number of orthologous gene clusters shared within each strain pair; the difference score corresponds to the total number of orthologous gene clusters not shared within a strain pairing; and the comparison equals the similarity score minus the difference score. In addition, the number of clusters shared only within a strain pair is noted as pair-unique. If these values are 1, 2, or 3 standard deviations away from the mean of all pairwise comparisons, the boxes are color coded as noted in the key.
|
Supragenome size estimation. These data show that S. pneumoniae has a supragenome much larger than the genome of any individual strain. Our group developed a finite supragenome model based on clustering analysis of multiple H. influenzae genomes (18). This model accounts for the numbers of core, distributed, and unique orthologous clusters and does not assume that the distributed orthologous clusters are sampled in the population with equal probability. The absence of this assumption yields a model where almost all of the supragenome can be accurately determined after the genomes of a finite number of strains are sequenced.
To visualize how the genome sequence of each additional strain adds to our understanding of the supragenome, we calculated the numbers of total and core orthologous clusters for two strains and then recalculated these parameters on an iterative basis as we added each additional strain. These exact core and supragenome values will differ depending on the order in which the strains are added to the analysis, but the trend is always the same: decay in the number of new orthologous clusters and stabilization of the number of core orthologous clusters at
1,400 (Fig. 4A and B). This suggests that after a finite number of genomes have been sequenced, the number of new orthologous clusters identified will be very low.
![]() View larger version (17K): [in a new window] |
FIG. 4. (A) Plot of the numbers of total and core observed orthologous clusters as a function of the number of strains sequenced. (B) Plot of the number of new observed orthologous clusters as a function of each genome. Numbers were calculated first for two strains and then iteratively for strains added one by one.
|
5,100 orthologous clusters, where
1,380 are core,
2,100 are unique, and the remaining are distributed. The orthologous clusters in the supragenome are represented at different frequencies within the S. pneumoniae population; a cluster that is present at a population frequency lower than 0.1 presumably does not have much impact on the population, although it may be very interesting in itself and could potentially be very important if the one strain it represents causes a pneumococcal pandemic. If all orthologous clusters are considered, the estimated supragenome size is 5,117 clusters and the core set contains 27% of the clusters. If only orthologous clusters with frequencies equal or greater then 0.1 are considered, the estimated supragenome size drops to 2,979 clusters where 46.5% are core. Accounting only for the orthologous clusters that exist in the S. pneumoniae supragenome at a frequency equal to 0.1 to greater than 95% of the supragenome is predicted to be identified after sequencing of 17 strains and 99% after sequencing of 33 strains. If orthologous clusters of all population frequencies are considered, 90% of the supragenome is predicted to be identified after 142 strains are sequenced (Table 4). The predicted and observed distributions are compared in Fig. 1, where the model fits the data well. Figure 5A shows the numbers of total and core orthologous clusters predicted for 50 or fewer strains, and Fig. 5B plots the predicted number of new orthologous clusters as a function of each genome sequenced. This prediction assumes that the 17 strains used in this study are representative of worldwide strains; if they are not, this prediction will underestimate the number of strains required for supragenome coverage. |
View this table: [in a new window] |
TABLE 4. Predicted coverage of the S. pneumoniae supragenome using the finite supragenome model
|
![]() View larger version (20K): [in a new window] |
FIG. 5. Predictions using the finite supragenome model. (A) Plot of the numbers of total and core predicted orthologous clusters as functions of the number of strains sequenced. (B) Plot of the number of new predicted orthologous clusters as a function of each genome sequenced. Numbers were calculated first for two strains and then iteratively for strains added one by one.
|
5.4% have annotations related to phage, prophage, or bacteriophage; and the remaining 33% correspond to a wide range of proteins including putative transporters, transcriptional regulators and activators, lantibiotic biosynthesis and transport proteins, macrolide efflux proteins, and many other enzymes (see Table S3 in the supplemental material).
For the most part, the genes in a given orthologous cluster have the same annotation, and somewhat surprisingly, the genes in the core orthologous clusters include a significant percentage annotated as hypothetical proteins (
30%), perhaps suggesting that many bacterial housekeeping functions remain unknown (see Table S3 in the supplemental material). Among the annotated core orthologous clusters, no phage proteins were detected, suggesting that no single phage is conserved across all strains. Interestingly, there are a few orthologous clusters with more than 100 CDSs that are annotated as transposases, often degenerate or truncated. The role of these proteins in the S. pneumoniae genome, if any, remains to be established. A table with the annotations for genes in the core, distributed, and unique categories is provided as Table S3 in the supplemental material.
|
|
|---|
73% of the total CDSs) are conserved among all species. When the genomes of individual strains were evaluated, 21 to 32% of the orthologous clusters were noncore. Predictions using the finite supragenome model suggest that the total number of orthologous clusters in the S. pneumoniae species is around 5,100 and the total number of core orthologous clusters is around 1,380. These large strain differences illustrate the enormous genic diversity within this species, as postulated in the distributed-genome hypothesis (7, 37). The engines driving this genomic plasticity are threefold: first, it has been demonstrated that chronic infections by nasopharyngeal pathogens are generally polyclonal in nature (11, 16, 30, 31, 39-41; J. R. Gilsdorf, presented at the 9th International Symposium on Recent Advances in Otitis Media, 3 to 7 June 2007); second, the bacteria in these chronic infections adopt a biofilm mode of growth, which greatly increases the kinetics of horizontal gene transfer (8, 15, 28); and third, S. pneumoniae employs highly energetic fratricidal as well as autocompetence and autotransformation mechanisms for the release and uptake of pneumococcal DNA, respectively, from the surrounding environment (35). The pathological consequences of these phenomena, which collectively result in a continual reassortment of genic characters among strains within a polyclonal biofilm infection, are that the host's adaptive immune system continually encounters novel strains, making clearance very difficult, because the pathogen can generate diversity faster than the host can adapt to it, thus ensuring chronicity of infection.
In a previous study, we constructed individual genomic libraries from the eight CGS S. pneumoniae clinical isolates (CGSSp9BS68, CGSSp14BS69, CGSSp11BS70, CGSSp3BS71, CGSSp23BS72, CGSSp6BS73, CGSSp18BS74, and CGSSp19BS75). Of the 4,793 clones sequenced,
16% were not present in the TIGR4 reference strain, suggesting that many genes were not conserved across the species. In addition, the screen identified genes unrelated to any streptococcal sequences; analysis of the allocation of a subset of 58 of these found that they were not uniformly distributed across the eight strains (37). These results are in complete agreement with this study; both studies underscore the genomic plasticity of the S. pneumoniae species.
The use of the finite supragenome model suggests that 99% of orthologous clusters in the supragenome that have population frequencies equal or higher to 0.1 can be identified after sequencing of 33 strains and that the 17 available strains provide
95% coverage of this set. When analyzing the S. agalactiae supragenome, Tettelin and colleagues presented a different mathematical model, generated using the assumption that noncore genes are sampled in the population with equal probabilities (43). Unlike the finite supragenome model, this model predicts that a constant number of new strain-specific genes will be identified with the addition of each genome, such that sequencing a limited number of strains would not provide major coverage of the supragenome.
Our analysis includes clinical strains from multiple locations including the United States, the United Kingdom, Norway, and Spain. Diversity is generated from DNA exchange among strains; thus, it is tempting to consider that strains from the same geographical location may be more similar, since they have a higher probability of exchanging genetic information (directly or indirectly, via other strains). Interestingly, we did not observe this with our limited number of strains. While it is possible that a correlation between geographical distance and genic diversity will be observed when a larger number of strains from multiple geographic regions are sequenced and compared, we must nonetheless consider that this correlation may not exist. This result would be explained if the vast majority of the orthologous clusters in the S. pneumoniae supragenome have been in the species for a very long time, and horizontal transfer from other species and new mutations have introduced only a minority of the supragenome's orthologous clusters, or if the extent of human population migration is now so high (at least in the West) that human pathogens are essentially homogenized around the world.
This enormous genetic diversity calls attention to the need for markers of human virulence phenotypes and highlights the potential difficulty associated with this task. S. pneumoniae strains are presently categorized based on capsule type and MLST. The capsular serotype is an important virulence factor and affects the ability of pneumococci to cause invasive disease (2, 38). For example, the difference in virulence between type 2 D39, which is highly virulent in the murine model of infection, and unencapsulated R6, which is avirulent, is attributed to the loss of the capsule. However, it is critical to remember that even within the same capsular type, virulence is highly related to the genetic background of the strains (20). The virulence phenotypes displayed by the eight strains isolated in Pittsburgh differ significantly in a chinchilla model of S. pneumoniae infection; these differences may be due to distinct serotypes, genotypes, or both (M. Forbes and J. Hayes, personal communication). Our data in Fig. 2 clearly show that the serotype cannot be correlated with the genic content, since strains of serotypes 14, 6, and 23 were not grouped based on genic differences. An analysis of sequence variation of the type 6A and 6B capsular biosynthetic loci was related to the MLST profile, yet there was also ample evidence of horizontal transfer to unrelated lineages (27). Phylogenetic trees using MLST from the seven CGS clinical strains of known MLST types did not closely resemble the phylogenies created from genic differences (data not shown). Together these data suggest that in some cases, the serotype, MLST type, and/or genetic background may correlate, but in other cases, they do not, as would be expected from strains undergoing high rates of intraspecies horizontal gene transfer. Since pathogenesis is probably a consequence not only of capsular type but also of multiple other genes, MLST type and serotype alone are not ideal markers for the disease phenotype of S. pneumoniae strains.
Previous work on six strains of Streptococcus agalactiae described a supragenome with
80% core orthologous clusters and the remaining set consisting of partially shared and strain-specific orthologous clusters (43). In addition, these data resemble, in qualitative and quantitative terms, our comparison of 13 H. influenzae strains (18). The total number of H. influenzae orthologous clusters for the 13 strains was 2,786, of which 52% were core, 29% were distributed, and 19% were unique. Taken together, these studies suggest that a high degree of genic variation is common among multiple species. However, it may not be universal; analyses of eight Bacillus anthracis strains revealed substantially less variation among strains, with no new genes uncovered after analysis of only four genomes (43). It is possible that this degree of variation is intrinsic to naturally transforming bacteria such as H. influenzae and S. pneumoniae, which undergo extensive DNA recombination events. In addition, both of these bacteria exist exclusively in the human mucosa, where they form biofilms (15). Cells in a biofilm are embedded in an extracellular polymeric matrix that is rich in nucleic acids; thus, biofilms may provide ideal environments to foster such genomic plasticity (32). There is also quantitative similarity, since both species have core genomes that seem to stabilize around 1,400 proteins, or
50% of the supragenome. This similarity suggests that similar evolutionary forces may be determining the equilibrium between core genes, noncore genes, and genome size.
This diversity suggests caution in the use of model strains to test and develop vaccines and drugs, since effective targets in one strain may be missing in a significant percentage of the other strains. It is probable that these bacteria have evolved multiple and redundant mechanisms to evade immunity and adapt to variations among hosts and their commensal microbiota.
The sequence data for strain TIGR670-6B was produced by The Institute for Genomic Research with funds from NIAID through the University of Alabama. The sequencing of S. pneumoniae strains at The Sanger Institute was supported by the Wellcome Trust. We also acknowledge the use of the pneumococcal MLST database, which is located at Imperial College London and is funded by the Wellcome Trust. This work was supported by Allegheny General Hospital, Allegheny Singer Research Institute, and grants DC05659 (J.C.P.) DC04173 (G.D.E.), and DC02148 (G.D.E.) from NIDCD, NIH.
Published ahead of print on 3 August 2007. ![]()
Supplemental material for this article may be found at http://jb.asm.org/. ![]()
N.L.H. and B.J. contributed equally to this work. ![]()
|
|
|---|
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Copyright © 2009 by the American Society for Microbiology. For an alternate route to Journals.ASM.org, visit: http://intl-journals.asm.org | More Info»