Journal of Bacteriology, January 2009, p. 65-73, Vol. 191, No. 1
0021-9193/09/$08.00+0 doi:10.1128/JB.01237-08
Copyright © 2009, American Society for Microbiology. All Rights Reserved.

Yuri I. Wolf,1
Inna Dubchak,2,3 and
Eugene V. Koonin1*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894,1 Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720,2 Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 945983
Received 4 September 2008/ Accepted 20 October 2008
Until recently, sets of multiple, closely related genomes suitable for microevolutionary studies have been available, at best, for a few model bacteria, such as Escherichia coli, or bacteria of special interest, such as Bacillus anthracis. However, with the progress in sequencing technology and the resulting rapid increase in the number of completed genomes, this situation has changed. Currently, the number of completely sequenced prokaryotic genomes is growing exponentially, with a doubling time of approximately 21 months (17). As of August 2008, 847 bacterial and 97 archaeal genomes were available, and about 1,900 genome projects were in progress (19). From this diverse collection of prokaryotic genomes, we recently created a data set of alignable tight genome clusters (ATGCs) that includes closely related genomes from numerous groups of bacteria and archaea and that was devised as a flexible platform for research in microevolutionary genomics of prokaryotes (27; http://atgc.lbl.gov). Obviously, there can be many different definitions of a "close" relationship between genomes, and more importantly, different degrees of closeness are optimal for different types of analysis. Given that in prokaryotes gene order is known to evolve much faster than sequences of homologous proteins (4, 14, 24, 35), our main approach in the construction of ATGCs involved selecting genomes that maintained a sufficient amount of synteny to cover a significant fraction of genes and to serve as an aid in the identification of orthologs (16). Typically, the ATGCs include either different bacterial (archaeal) species from the same genus or strains of the same species. The inclusion of only closely related genomes in each ATGC ensures reliable identification of orthologous genes and the availability of high-quality alignments for most, if not all, of them.
The ATGCs constructed with this partially formalized approach include bacterial and archaeal genomes with considerable variation of evolutionary distances that have the potential to inform the investigation of key microevolutionary questions. These questions include the effects of purifying and positive selection on different classes of sites (nonsynonymous, synonymous, and noncoding sites), the connections between these effects, global genome characteristics (size and nucleotide composition), genome architecture (the sizes of genes and intergenic regions, and gene density), various aspects of the organism's life style, and the relationship between sequence evolution and recombination.
Here, we compare the evolutionary features of the ATGCs and show that the ratio of the rates of nonsynonymous to synonymous nucleotide substitutions (dN/dS), which is widely used to characterize the nature and strength of the selective pressure affecting protein sequences (13, 18, 29), is a stable characteristic of prokaryotic lineages, at least at small evolutionary scales. We further reveal the strong relationship between dN and the rate of genome rearrangement and describe nontrivial connections between the purifying selection pressure and other characteristics of genomes.
In the second step, the additional criterion of gene order similarity was applied to increase the reliability of identification of orthologs within a cluster. For each pair of genomes, all BBHs detected in the previous step were considered, and each BBH was tested to determine whether it belonged to a synteny region. A BBH was considered to be supported by synteny if there was a high density of adjacent BBHs in its close vicinity in both genomes. Specifically, the standard dot plot for a given genome pair was constructed from the complete set of BBHs (32). For each BBH, the synteny support was calculated as the maximum number of other BBHs in a sliding window of length 7 that included the original BBH. The BBHs with five or more BBHs in the neighborhood were considered to be supported by synteny.
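The synteny-support rule can be sketched in a few lines of Python. This is one possible reading of the text, not the authors' implementation: each BBH is represented as a hypothetical pair of gene indices (position in genome A, position in genome B), the "sliding window of length 7" is taken to be a 7-by-7 square on the dot plot, and "five or more BBHs in the neighborhood" is taken to mean five or more other BBHs. Genome circularity is ignored.

```python
def synteny_support(bbhs, window=7):
    """Return {bbh: support}, where support is the largest number of OTHER
    BBHs found in any window-by-window square of the dot plot that
    contains the BBH itself (assumed interpretation of the method)."""
    points = set(bbhs)
    support = {}
    for (i, j) in bbhs:
        best = 0
        # Slide the square over every placement that still covers (i, j).
        for a0 in range(i - window + 1, i + 1):
            for b0 in range(j - window + 1, j + 1):
                count = sum(
                    1
                    for (x, y) in points
                    if (x, y) != (i, j)
                    and a0 <= x < a0 + window
                    and b0 <= y < b0 + window
                )
                best = max(best, count)
        support[(i, j)] = best
    return support


def supported_by_synteny(bbhs, window=7, min_neighbors=5):
    """Keep only the BBHs whose support reaches the threshold."""
    support = synteny_support(bbhs, window)
    return [hit for hit in bbhs if support[hit] >= min_neighbors]
```

The brute-force scan is quadratic in the number of BBHs and is meant only to make the rule explicit. Because the square ignores orientation, a run of BBHs inside an inverted segment is still counted as supported.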
To determine whether two genomes were alignable, the rearrangement distance between two genomes was calculated as follows: DY = (Nb – Ns)/Nb, where DY is the synteny distance, Nb is the total number of BBHs and Ns is the number of BBHs supported by synteny.
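As a small worked illustration of this formula, the function below computes DY from the two counts. The function name and the example counts are hypothetical and are not taken from the study.

```python
def synteny_distance(n_bbh, n_supported):
    """DY = (Nb - Ns) / Nb, where Nb is the number of BBHs and
    Ns is the number of BBHs supported by synteny."""
    if n_bbh <= 0:
        raise ValueError("Nb must be positive")
    if not 0 <= n_supported <= n_bbh:
        raise ValueError("Ns must lie between 0 and Nb")
    return (n_bbh - n_supported) / n_bbh
```

For example, a hypothetical genome pair with 2,000 BBHs, 1,800 of them supported by synteny, would have DY = 200/2,000 = 0.1. DY is 0 when every BBH is supported and 1 when none is.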
Finally, to generate alignable clusters, all genomes in a cluster from the previous step were considered nodes in a graph; an edge was added between two genomes if DY was <0.15 (an essentially arbitrary cutoff chosen so that the substantial majority of BBHs are supported by synteny and, accordingly, are included in all analyses), and single-linkage clustering was performed. Connected components of the graph of size 2 or greater represented ATGCs.
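Single-linkage clustering at a fixed cutoff is equivalent to finding the connected components of the thresholded graph. The sketch below does this with a small union-find structure. The input layout (a list of genome names and a dictionary of pairwise DY values) is an assumption made for illustration.

```python
def alignable_clusters(genomes, distances, cutoff=0.15):
    """Group genomes into connected components of the graph whose edges
    join pairs with DY < cutoff; return components of size >= 2."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the root, compressing the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), dy in distances.items():
        if dy < cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values() if len(members) >= 2)
```

Under single linkage, two genomes can fall into the same cluster even when their own DY exceeds the cutoff, provided that a chain of closer genomes connects them.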
Estimation of dN, dS, and dN/dS. For each pair of genomes within an ATGC, alignments of the coding nucleotide sequences of all synteny-supported BBHs were generated using the MUSCLE program (6), and the synonymous (dS) and nonsynonymous (dN) substitution rates were estimated using the maximum-likelihood method implemented in PAML (36). The medians of dS (DS) and dN (DN) over all BBHs supported by synteny were considered two types of intergenomic distances. The DN/DS ratio was used as an estimate of the selective pressure that affected the compared genomes on the path of evolution after their radiation from the last common ancestor.
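The summary step can be illustrated as follows. The sketch assumes that per-gene dN and dS estimates are already available (for example, parsed from PAML output, which is not shown) as hypothetical (dN, dS) tuples. Note that DN/DS is the ratio of the two medians, not the median of the per-gene ratios.

```python
from statistics import median


def genome_pair_distances(per_gene):
    """per_gene: list of (dN, dS) tuples, one per synteny-supported BBH.
    Returns (DN, DS, DN/DS); the ratio is None when DS is zero."""
    dn_values = [dn for dn, _ in per_gene]
    ds_values = [ds for _, ds in per_gene]
    big_dn = median(dn_values)
    big_ds = median(ds_values)
    ratio = big_dn / big_ds if big_ds > 0 else None
    return big_dn, big_ds, ratio
```

Using medians rather than means limits the influence of a few genes with extreme dN or dS estimates.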
The analysis of correlates of selective pressure requires reliable estimates of that pressure. Accordingly, only genome pairs with a DS value of <1.2 and a DN value of >0.01 were selected, and only those ATGCs that included at least one pair of genomes satisfying these criteria were used for further analysis. As a result, several clusters of densely sampled genomes, such as streptococci and staphylococci, which are represented in the ATGC web resource, were excluded from the present analysis.
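A minimal sketch of this two-level filter is given below. The data layout (a dictionary mapping an ATGC name to a list of per-pair records with "DN" and "DS" keys) is hypothetical.

```python
def usable_pairs(pairs, max_ds=1.2, min_dn=0.01):
    """Keep genome pairs with DS < max_ds and DN > min_dn."""
    return [p for p in pairs if p["DS"] < max_ds and p["DN"] > min_dn]


def usable_atgcs(atgc_pairs, max_ds=1.2, min_dn=0.01):
    """Keep ATGCs that retain at least one usable genome pair,
    mapping each kept ATGC to its usable pairs."""
    kept = {}
    for name, pairs in atgc_pairs.items():
        good = usable_pairs(pairs, max_ds, min_dn)
        if good:
            kept[name] = good
    return kept
```

An ATGC in which every pair is nearly identical (DN at or below 0.01) is dropped entirely, which is consistent with the exclusion of densely sampled clusters described in the text.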
Genomic variables and PCA. The values of the following seven genomic variables were calculated for each of the 41 ATGCs by averaging the values for the corresponding constituent genomes: the genome size (log scale), the number of proteins (log scale), the GC content, the median gene length (log scale), the median intergenic-spacer length (log scale), the fraction of pathogenic organisms, and the DN/DS ratio (log scale) (Table 1). All variables were standardized to a mean of 0 and a standard deviation of 1. Principal-component analysis (PCA) as implemented in the R analysis package (33) was performed on the 41-by-7 data table.
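The standardization step that precedes PCA can be sketched in plain Python. This is an illustration, not the authors' R code. It assumes a base-10 logarithm for the log-scaled variables and the sample standard deviation (n - 1 denominator); neither choice is stated in the text.

```python
import math
from statistics import mean, stdev


def standardize_table(rows, log_columns=()):
    """rows: list of equal-length numeric rows (one per ATGC).
    Columns listed in log_columns are log10-transformed first; every
    column is then centered to mean 0 and scaled to SD 1."""
    columns = [list(col) for col in zip(*rows)]
    scaled = []
    for idx, col in enumerate(columns):
        if idx in log_columns:
            col = [math.log10(v) for v in col]
        m, s = mean(col), stdev(col)  # stdev uses the n - 1 denominator
        scaled.append([(v - m) / s for v in col])
    return [list(row) for row in zip(*scaled)]
```

Because changing the base of a logarithm only rescales a column by a constant, the standardized values do not depend on the base chosen. A column with no variation would cause a division by zero and would need to be removed beforehand.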
TABLE 1. Characteristics of the 41 ATGCs
FIG. 1. Distributions of dS, dN, and dN/dS in orthologous gene sets from three genome pairs from different ATGCs. (a) Distribution (probability density) of dN. (b) Distribution (probability density) of dS. (c) Distribution (probability density) of dN/dS. Metma, Methanococcus maripaludis C5-M. maripaludis C7 (Euryarchaeota); Bursp, Burkholderia cenocepacia MC0-3-Burkholderia vietnamiensis G4 (Betaproteobacteria); Salsp, Salinispora arenicola CNS-205-Salinispora tropica CNB-440 (Actinobacteria). The distribution curves were obtained by Gaussian-kernel smoothing of the individual data points (28).
FIG. 2. Dependence of DN/DS on the distance between genomes, measured as DN. Each point corresponds to a pair of genomes in the given ATGC. (a) Xanthomonas sp. (b) Shewanella sp. (c) P. marinus.
The DN/DS values for individual ATGCs span an order of magnitude, between 0.02 and 0.2, with the median at 0.06 (Fig. 3 and Table 1). The distribution of the DN/DS values is supposed to reflect the variance of the purifying selection pressure that affects the evolution of diverse bacteria and archaea. According to the population-genetic theory, the strength of (purifying) selection depends on the effective population size and characteristic mutation and recombination rates of the respective organisms, and these values themselves depend on the life style (20, 21). Examination of the data in Table 1 reveals no overwhelming pattern and few major trends. It is noticeable that the highest DN/DS values, that is, apparently, the weakest purifying selection pressure, are seen in obligate parasites, including intracellular ones. This observation is compatible with previous findings and is most likely explained by the small effective population size, frequent bottlenecks, and low level of recombination that are characteristic of intracellular parasites and symbionts (22).
FIG. 3. Distribution (probability density) of DN/DS in the 41 analyzed ATGCs. The distribution curve was obtained by Gaussian-kernel smoothing of the individual data points (28).
Correlates of purifying selection pressure in prokaryotic genomes. Population-genetic theory predicts that organisms that are subjected to strong selection pressure will experience genome streamlining resulting in compact, typically small genomes with few mobile elements and paralogs, short intergenic regions (high gene density), and possibly even relatively short proteins (20, 21). We examined the connections between the DN/DS value, which is thought to reflect the strength of purifying selection pressure, and six other genome-related variables that were measured for each ATGC, namely, the genome size, the number of annotated proteins, the protein-coding gene size, the intergenic-region size, the GC content, and the fraction of pathogens in the ATGC (Table 1; see Materials and Methods for additional details). The correlations between the DN/DS value and the other variables were found to be moderately strong to weak and not necessarily of the expected sign. In particular, there was a moderate but highly statistically significant negative correlation between the DN/DS and the genome size of prokaryotes (Fig. 4a), i.e., larger genomes, on average, appear to be subjected to a stronger selection pressure than small genomes in an apparent contradiction of the theoretical prediction. Despite this correlation, the majority of the ATGCs are characterized by DN/DS values within the relatively narrow window between 0.04 and 0.08, and that group includes organisms spanning a broad range of genome sizes (Table 1 and Fig. 4a). In agreement with these findings, but unexpectedly considering theoretical predictions, there was also a significant negative correlation between the DN/DS value and the median length of protein-coding genes, that is, organisms encoding longer proteins, on average, seemed to be subjected to stronger purifying selection than those encoding shorter proteins (Fig. 4b). 
In contrast, there was no significant correlation between the median intergenic-region size or gene density and the DN/DS ratio (Fig. 4c and d). On the whole, these correlations (or lack thereof) between the purifying selection pressure (measured through the DN/DS) and other genomic variables seem to be poorly compatible with the concept of streamlining caused by strong purifying selection.
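Correlations of this kind are reported as Spearman rank correlation coefficients (Rs). For readers who wish to reproduce such a calculation, the following is a minimal standard-library sketch that ranks both variables (ties receive the average rank) and computes the Pearson correlation of the ranks. The function names are illustrative, and no data from Table 1 are reproduced here.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because it depends only on ranks, Rs is unaffected by monotonic transformations such as taking the logarithm of genome size or of DN/DS.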
FIG. 4. Correlations between the purifying selection pressure (DN/DS) and other genomic variables. (a) DN/DS versus genome size. (b) DN/DS versus intergenic-region size. (c) DN/DS versus gene density. (d) DN/DS versus protein-coding-gene size. The dashed lines show linear regressions. Rs, Spearman ranking correlation coefficient.
66% of the original data variance.
FIG. 5. PCA of seven genomic variables. (a) Loadings of the first two PCs. (b) Scatter of the ATGCs in the plane of the first two PCs. The red contour encloses the tight cluster of genomes, mostly those of free-living organisms, that are subjected to relatively strong purifying selection. The three ATGCs that include various strains of P. marinus are denoted Pm.
Thus, the analysis of the links between the DN/DS ratio and other characteristics of prokaryotic genomes supports the notion that genomes of parasites, although small and often compact due to extensive gene loss, are typically subjected to weak purifying selection (22). In contrast, these findings do not seem to support the straightforward concept of genome streamlining caused by strong purifying selection pressure. One of the possible interpretations is that, although the DN/DS ratio is nearly constant within most ATGCs (see above), fluctuations are common at greater evolutionary scales, obscuring the effects of purifying selection on genomes. Alternatively or additionally, it is conceivable that the evolution of the genomes of bacteria and archaea, especially those that inhabit complex and variable environments, is shaped by the balance between streamlining under the pressure of purifying selection and selection for the maintenance of adequate complexity of the gene repertoire and regulatory networks.
Sequence evolution and rearrangement of prokaryotic genomes. It was shown in the early days of comparative genomics and subsequently confirmed by numerous observations that gene order in prokaryotes is relatively poorly conserved during evolution, typically changing much faster than protein sequences (4, 14, 24, 35). Comparisons of closely related bacterial and archaeal genomes revealed a characteristic "cross-like" pattern of localization of orthologous genes, indicating that inversions around the origin of replication comprise one of the dominant routes of genome rearrangement (7, 34). We exploited the ATGCs to examine the patterns of genome rearrangement in prokaryotes and the possible effects of selection on rearrangement.
Ideally, analysis of genome rearrangements would involve reconstruction of the history of recombination events that occurred after the radiation of the compared genomes from their last common ancestor. Several algorithms have been developed for this type of analysis (3, 12), but in many cases, the number of rearrangement events even between two strains within a bacterial species is so large (Fig. 6a) that the reconstruction methods fail.
FIG. 6. Patterns of genome rearrangement in prokaryotes. (a) Nearly complete decay of synteny (DY = 0.69; DN = 0.15; DS >> 1); Streptococcus sanguinis SK36 and Streptococcus pneumoniae R6. (b) Virtual absence of rearrangement (DY ≈ 0; DN = 0.06; DS = 1.12); Chlamydophila caviae GPIC and Chlamydophila abortus S26/3. (c) Multiple inversions with limited transposition of individual genes (DY ≈ 0; DN = 0; DS = 0); Yersinia pestis Antiqua and Y. pestis CO92. (d) No inversion; hot spots of transposition of individual genes (DY = 0.04; DN = 0.03; DS = 0.41); P. marinus AS9601 and P. marinus MIT 9215. (e) Multiple inversions and transposition of individual genes (DY = 0.16; DN = 0.08; DS = 1.35); Pseudomonas fluorescens PfO-1 and P. fluorescens Pf-5.
Similarly to the way DN/DS is construed as a measure of purifying selection pressure that affects sequence evolution and, as shown above, is nearly constant within ATGCs, the DY/DS ratio could potentially reflect the purifying selection that affects genome rearrangement. Remarkably, when several ATGCs with a DY value of zero (see below) were left out, a strong positive correlation was observed between the DY/DS ratio and the DN/DS ratio (Fig. 7). This finding strongly suggests that in most prokaryotes the pressure of purifying selection that acts on at least certain types of genome rearrangement and on protein sequences is determined by the same factors.
FIG. 7. Correlation between the mean purifying selection pressures affecting amino acid sequence evolution (DN/DS) and genome rearrangement (DY/DS). Rs, Spearman ranking correlation coefficient.
0 [Fig. 6b]); (ii) multiple inversions centered at the origin of replication, resulting in a cross-like pattern and limited transposition, a pattern that can result in substantially rearranged genomes but DY values close to zero, as inversion does not disrupt synteny blocks (Fig. 6c); (iii) limited rearrangement with hot spots of gene transposition and low (but nonzero) DY values (Fig. 6d); and (iv) multiple inversions and transpositions with high DY values (Fig. 6e). The factors that affect genome rearrangement are not well understood but presumably might have to do with the abundance of mobile elements (transposons) and the state of repair/recombination systems in the respective genomes. Of special interest are those ATGCs that, despite relatively large evolutionary distances reflected in high DN values, show virtually no rearrangement. One plausible view is that the lack of genome rearrangement is a selectively neutral phenomenon, simply reflecting the loss of a recombinational system that is required for rearrangement in the respective organisms. Indeed, it has been suggested that the low frequency of recombination in Corynebacterium compared to Mycobacterium was likely due to the absence of RecBCD, a well-characterized recombinational enzyme complex, in the former (26). However, inspection of the clusters of orthologous genes (30) failed to reveal consistent loss of any major repair/recombination genes (although individually these genomes certainly have lost some); most of these genomes also contained various transposons (data not shown). In particular, the RecBCD system is present in Chlamydia, Chlamydophila, and Borrelia, although it is missing in Rickettsia, an observation that rules out the straightforward explanation of the lack of rearrangement through the loss of recombinational capacity. 
Therefore, it remains unclear to what extent the lack of genome rearrangement in some of the bacterial parasites is due to the deterioration of repair/recombination systems in these genomes and to what extent this phenomenon might be caused by features of the population dynamics of these organisms and/or selective constraints. The latter remain to be investigated but generally might have to do with selection against breaking operons and, accordingly, disrupting gene coregulation.
Conclusions. The ATGCs present a platform for the analysis of various aspects of the microevolution of prokaryotes (27). Here, we have shown that the ratio of the medians of dN and dS over the set of orthologous genes, which is thought to reflect the pressure of purifying selection affecting the protein-coding sequences in the respective genomes, is a highly stable characteristic of ATGCs. Having established the stability of this measure, we examined the connections between the strength of purifying selection and other genomic characteristics. In agreement with previous reports (15, 22), we found that bacterial parasites, especially intracellular ones, despite the sometimes dramatic genome shrinkage caused by gene loss, are typically subjected to weak purifying selection, presumably owing to relatively small characteristic population sizes and frequent bottlenecks. Otherwise, however, the present results seem to emphasize the complexity of prokaryotic-genome evolution and to defy straightforward interpretations based on population-genetic theory. In particular, we did not detect any evidence of genome streamlining caused by a strong pressure of purifying selection (21). Contrary to the streamlining prediction, the organisms that are subjected to strong selection pressure tend to possess larger genomes and longer genes and intergenic regions than organisms evolving under weak selection. Certainly, this is only a statistical trend, so a variety of free-living prokaryotes with very similar purifying selection pressures span nearly the entire range of genome sizes. Conceivably, despite the stability of the DN/DS values at short evolutionary distances (within an ATGC), on a larger evolutionary scale, the effective population sizes of archaea and bacteria (and perhaps, to a lesser extent, the mutation and recombination rates) fluctuate often enough to obscure the expected dependences between selection pressure and genome characteristics.
It also seems possible that the genomes of bacteria and archaea, especially those that inhabit complex and variable environments, are under selective pressure to maintain the minimal metabolic and regulatory complexity that is required to survive in these habitats, so that the evolutionary trajectory depends on the balance between this requirement and the drive for streamlining. Perhaps, nearly "pure" streamlining can be observed only in organisms that live in relatively simple and stable environments and reach extremely high effective population sizes, as suggested, for instance, by the genome analysis of the most common and abundant marine bacterium, Pelagibacter ubique (10).
Notably, although the gene order changes much faster than protein sequences during the evolution of prokaryotes, we observed a strong positive correlation between the "rearrangement distance" and the amino acid distance. Thus, at least some of the events leading to genome rearrangement, such as transposition of individual genes, seem to be subjected to the same type of selective constraints as the evolution of the amino acid sequences of prokaryotic proteins. Remarkably, these findings mimic the observations of the relationship between sequence evolution and genome rearrangement in animals (37).
In our opinion, the ATGCs are a promising resource for evolutionary-genomic studies. Of course, the 41 ATGCs currently available comprise an inadequately small data set, considering the revealed complexity of the patterns of prokaryotic evolution. Within several years, the exponentially growing collection of genomes from bacteria and archaea with diverse life styles should provide opportunities for a more complete analysis and a more representative and appropriately nuanced characterization of the factors that govern microbial-genome evolution.
Published ahead of print on 31 October 2008.
Present address: Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720.