Journal of Bacteriology, October 2004, p. 6575-6585, Vol. 186, No. 19
0021-9193/04/$08.00+0 DOI: 10.1128/JB.186.19.6575-6585.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
Department of Bioengineering and Bioinformatics, Moscow State University,1 Institute for Problems of Information Transmission RAS,4 State Scientific Center GosNIIGenetika, Moscow, Russia,5 Department of Pathology, F. E. Hebert School of Medicine, Uniformed Services University of the Health Sciences,2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland3
Received 7 April 2004/ Accepted 28 June 2004
For several hundred orthologous gene sets representing three well-defined bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for ~70% of the sets, whereas the rest showed substantial anomalies. Subsequent detailed phylogenetic analysis of the genes with the strongest deviations indicated that over one-half of these genes probably underwent a distinct form of horizontal gene transfer, xenologous gene displacement, in which a gene is displaced by an ortholog from a different lineage. The remaining deviations from the clock-like model could be explained by lineage-specific acceleration of evolution. The results indicate that although xenologous gene displacement is a major force in bacterial evolution, a significant majority of orthologous gene sets in three major bacterial lineages evolved in accordance with the clock-like model. The approach described here allows rapid detection of deviations from this mode of evolution on the genome scale.
Deviations from the molecular clock are thought to result from lineage-specific acceleration of evolution, which could be due either to functional changes entailing relaxation of purifying selection or positive selection or to increased mutational pressure caused, at least in part, by effective population size effects (5). These phenomena cause overdispersion of the molecular clock, which is manifested in unequal lengths of tree branches coming out of the same node, under the assumption that the topology of the phylogenetic tree for a given set of orthologs is known. Typically, the same species tree topology is assumed for all genes. This approach is likely to be valid for the multicellular eukaryotes, which, historically, have been the objects of the analyses that led to the molecular clock concept. However, recent comparative genomic studies strongly suggest that in addition to the regular pattern of vertical inheritance, evolution of prokaryotic genomes is dramatically affected by horizontal gene transfer (HGT) (6, 11, 12, 27, 30, 32, 38, 55). On many occasions, HGT seems to occur between evolutionarily distant organisms, although it has been argued that there could be a decreasing gradient of the HGT rate from closely related species to distantly related species; the existence of such a gradient could be one of the reasons why a species tree can be constructed at all, in spite of extensive HGT (20).
There seem to be certain connections between the amount of HGT and the biology of prokaryotic genes. In particular, the so-called complexity hypothesis holds that HGT is much less common among genes that encode subunits of macromolecular complexes, such as those involved in translation, transcription, and replication, than in genes coding for metabolic enzymes (25). While this prediction might hold statistically, subsequent studies have shown that there are very few, if any, genes that are completely refractory to HGT. In particular, evidence of HGT has been obtained for several ribosomal proteins, translation factors, and the major RNA polymerase subunits (3, 4, 24, 34).
From a comparative genomic perspective, HGT events have been classified into three categories: (i) acquisition of genes that are novel to a given phylogenetic lineage; (ii) acquisition of paralogs of genes preexisting in the given lineage; and (iii) xenologous gene displacement (XGD), in which the original gene from a given set of orthologs is displaced by a member of the same set of orthologs from a different lineage (30).
Obviously, if an HGT event, particularly XGD, goes unnoticed in the course of phylogenetic analysis, an apparent gross violation of the molecular clock will be seen when evolutionary rates are measured in the affected set of orthologous genes on the basis of the assumed species tree topology. HGT is detected through anomalies in the topology of phylogenetic trees of individual sets of orthologs or by so-called surrogate approaches, which, in the case of HGT between distant species, are based primarily on the phyletic distribution of the homologs of a given gene (7, 30, 41). Simply put, unexpected phyletic patterns (e.g., the presence of orthologs of a given gene in all or nearly all sequenced bacterial genomes but in only one archaeon) suggest that there has been HGT (in this case from a bacterium to the archaeon) (27, 30). These patterns can be expressed either in terms of presence-absence only or, more quantitatively, by comparing the significance levels of taxon-specific best hits. The general validity of this approach seems to be supported by the biological plausibility of some of the trends in the apparent horizontal gene fluxes that were detected by phyletic pattern analysis. Thus, hyperthermophilic bacteria showed a clear preponderance of genes of possible archaeal origin compared to mesophiles (2, 29, 36), and probable gene transfer from eukaryotic hosts to some bacterial pathogens also has been inferred (17, 40). In principle, phylogenetic tree analysis is supposed to be a more precise indicator of probable HGT events than similarity-based surrogate methods because of inaccuracies in the latter resulting from the lack of exact correspondence between sequence similarity and phylogenetic affinity (31). A genome-wide phylogenetic analysis, aimed specifically at detection of horizontally transferred genes, has been described (43).
However, it is well known that phylogenetic analysis is fraught with its own slew of artifacts, such as long branch attraction, particularly when fast methods, such as minimum evolution or neighbor joining, are employed (14). In addition, phylogenetic analysis can be prohibitively expensive computationally when it is attempted on the genome scale, especially with powerful methods, such as complete maximum-likelihood analysis, and large sets of species. Therefore, surrogate, similarity-based methods have proved to be extremely useful, at least as a rapid, first-tier strategy that allows workers to delineate a set of HGT candidates.
We were interested in investigating a surrogate approach to genome-wide study of prokaryotic evolution, which combines a test of the validity and an analysis of the rate distribution of the molecular clock with detection of potential HGT events and lineage-specific acceleration of evolution. Using the Clusters of Orthologous Groups (COGs) database for proteins (46, 47), we analyzed the molecular clock behavior of COGs from three major bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the low-G+C-content gram-positive bacteria. We found that clock-like evolution was dominant in all three groups, but we also detected many anomalies, some of which are best explained by XGD.
Three bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group of gram-positive bacteria, were selected for the present study. The γ-proteobacterial set included six species: Escherichia coli K-12, Haemophilus influenzae, Pasteurella multocida, Salmonella enterica serovar Typhimurium LT2, Vibrio cholerae, and Yersinia pestis. The α-proteobacterial set included seven species: Agrobacterium tumefaciens C58 Cereon, Brucella melitensis, Caulobacter crescentus CB15, Mesorhizobium loti, Rickettsia conorii, Rickettsia prowazekii, and Sinorhizobium meliloti. The Bacillus-Clostridium set included eight species: Bacillus halodurans, Bacillus subtilis, Clostridium acetobutylicum, Listeria innocua, Lactococcus lactis, Staphylococcus aureus N315, Streptococcus pneumoniae TIGR4, and Streptococcus pyogenes M1 GAS. For each of these species sets, we identified a set of COGs (46, 48) in which each of the relevant species was represented by exactly one protein (i.e., all species present, no paralogs). Additionally, the constituent proteins were required to have a sufficient alignable length (at least 60 amino acids in conserved blocks [see below]). This search resulted in 563 COGs for the γ-proteobacterial set, 274 COGs for the α-proteobacterial set, and 234 COGs for the Bacillus-Clostridium set; the overlap for the three sets comprised 114 COGs.
Alignments. Multiple alignments of sequence families within the bacterial groups were produced by using the MAP program (23). Sequence families involving wider taxonomic sampling of proteins were aligned by using the T-Coffee program (37). Multiple alignments were aggressively filtered for potential incorrectly aligned positions; only conserved blocks with no gaps containing 10 or more positions were retained for further analysis (53).
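The block-filtering step can be illustrated with a minimal Python sketch. It is not the filtering program cited above: it treats a block simply as a run of gap-free columns (the additional conservation criterion is not modeled), and the function names, the gap characters, and the way the 60-position threshold is applied are assumptions made for illustration.

```python
def gap_free_blocks(alignment, min_block=10, gap_chars="-."):
    """Return (start, end) column ranges of gap-free runs of at least min_block columns."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    for col in range(width + 1):
        # The extra iteration (col == width) closes a run that reaches the last column.
        gap_free = col < width and all(row[col] not in gap_chars for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_block:
                blocks.append((start, col))
            start = None
    return blocks


def filter_alignment(alignment, min_block=10, min_total=60):
    """Keep only columns inside qualifying blocks; return None if too few columns remain."""
    blocks = gap_free_blocks(alignment, min_block)
    total = sum(end - start for start, end in blocks)
    if total < min_total:
        return None  # assumed reading of the "at least 60 amino acids" requirement
    return ["".join(row[s:e] for s, e in blocks) for row in alignment]
```

With the default thresholds, a COG whose retained blocks sum to fewer than 60 columns is rejected, which mirrors the alignable-length requirement stated for COG selection.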
Evolutionary distances between genes and genomes and phylogenetic trees.
Maximum-likelihood distances between individual protein sequences were computed for each of the COGs analyzed by using the PAML package, with the JTT substitution model corrected for observed amino acid frequencies and the α parameter of the Γ-distribution of intraprotein evolutionary rate variability set to 1.0; this estimate includes a correction for possible multiple substitutions in the same site (54). Additionally, multiple alignments consisting of 21 sequences each were produced for the 114 COGs in which all three groups were represented, and pairwise distances were similarly computed from this alignment. Phylogenetic trees for individual COGs were constructed by using the maximum-likelihood method implemented in the Tree-Puzzle package, with the expected likelihood weight determined for the tree topology involving split versus topologies in which the respective lineage remained monophyletic (42).
To calculate intergenomic evolutionary distances, all intergenic distances obtained from the same pair of genomes in the set of 114 conserved COGs were pooled, and the median of the distribution of these distances was taken to represent the intergenomic distance (21, 52). Neighbor-joining and least-squares trees were reconstructed from the pairwise genome distance matrices by using the programs NEIGHBOR and FITCH of the PHYLIP package, respectively (15).
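The pooling step described above can be sketched in a few lines of Python. The input layout (a dictionary of per-COG pairwise distances keyed by genome pair) and the function name are hypothetical; only the use of the median over the pooled distances comes from the text.

```python
from collections import defaultdict
from statistics import median


def intergenomic_distances(cog_distances):
    """Map each genome pair to the median of its intergenic distances across COGs.

    cog_distances: {cog_id: {(genome_a, genome_b): distance}} (hypothetical layout).
    """
    pooled = defaultdict(list)
    for pair_distances in cog_distances.values():
        for (a, b), d in pair_distances.items():
            # Sort the pair so that (a, b) and (b, a) are pooled together.
            pooled[tuple(sorted((a, b)))].append(d)
    return {pair: median(values) for pair, values in pooled.items()}
```

The resulting dictionary can be written out as a distance matrix for tree reconstruction programs.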
Limited tree analysis. A comprehensive phylogenetic analysis of a protein family, even if it is limited to a set of orthologs from completely sequenced genomes, is rarely feasible. There are several reasons that usually preclude this type of analysis, as follows: in many cases, the sequences of orthologs from the most distant lineages, such as bacteria and archaea, are only weakly similar, which complicates the construction of an alignment suitable for building a phylogenetic tree; the large number of sequences hampers the use of advanced methods for reconstruction of the phylogeny; and the presence of in-paralogs of various ages hinders the interpretation of results. Therefore, we elected to perform a limited tree analysis for the cases where the split distance analysis indicated a likely split of a set of orthologs from a particular bacterial lineage into two subsets with distinct evolutionary histories; these groups are referred to below as left ({L}) and right ({R}) sets. For each sequence from the given COG, global pairwise alignment scores (calculated by using the ALIGN program with default parameters) (35) for alignments with the other COG members were obtained (sequences from closely related species were excluded). Members of the COG not belonging to either {L} or {R} (i.e., the sequences of orthologs outside the lineage analyzed) were arranged into two ordered lists, <HL> and <HR>, according to their similarity scores against {L} and {R}, respectively. Multiple alignments that included {L}, {R}, and the top five sequences from <HL> and <HR> were constructed by using the T-Coffee program (37). Maximum-likelihood trees were constructed by using the ProtML program of the MOLPHY package (1). For all internal branches separating the {L} and {R} subsets (see Fig. 4), the RELL bootstrap values were determined. The highest bootstrap value observed on such a branch indicated the level of support for the separation hypothesis.
FIG. 4. Limited tree procedure. Formally, let us consider an internal branch B, which partitions all nodes into four subsets, {S1} to {S4}. If the partition satisfies the criteria (i) {S1} contains all of {L} and none of {R}, (ii) {S2} contains all of {R} and none of {L}, and (iii) {S3} and {S4} contain neither {L} nor {R}, the branch provides the evolutionary separation between {L} and {R} regardless of the true position of the tree root.
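Criteria (i) to (iii) of the legend translate directly into a set test. The sketch below checks only those three criteria exactly as stated; the function name and the representation of the four subsets as a flat list of leaf-name sets are assumptions, and the position of the subsets relative to the two ends of the branch is not modeled.

```python
def branch_separates(subsets, left, right):
    """Check the Fig. 4 criteria for one internal branch.

    subsets: the four sets of leaves defined by the branch.
    left, right: the {L} and {R} sets of sequence names.
    """
    subsets = [set(s) for s in subsets]
    if len(subsets) != 4:
        raise ValueError("an internal branch must define exactly four subsets")
    left, right = set(left), set(right)
    with_left = [s for s in subsets if s & left]
    with_right = [s for s in subsets if s & right]
    # Exactly one subset may touch {L} and exactly one may touch {R}.
    if len(with_left) != 1 or len(with_right) != 1:
        return False
    s1, s2 = with_left[0], with_right[0]
    # They must be different subsets, each holding its whole set.
    return s1 is not s2 and left <= s1 and right <= s2
```

Applying this test to every internal branch of a tree and taking the highest bootstrap value among the branches that pass corresponds to the support measure described in the text.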
Under a perfect molecular clock and strict vertical inheritance, the evolutionary rates of all genes differ from each other only by proportionality constants, so the distance between any pair of genes (dAB) is proportional to the distance between genomes A and B (DAB):
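$$d_{AB} = v\,D_{AB} \qquad (1)$$
where v is the gene-specific proportionality constant (the relative evolutionary rate).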
FIG. 1. Clock-like model of evolution: predictions and detection of deviations. (A) Clock-like evolution with vertical inheritance. (B) Clock-like evolution with one HGT from an outside source. The red branches in the trees indicate the genes from species B and C that are inferred to have been transferred from the unknown, distant species, X, and the red points on the plot at the bottom right indicate the evolutionary distances that are taken as evidence of this HGT event.
(ii) Statistical model.
The accumulation of substitutions in protein sequences was treated as a Poisson stochastic process. The number of substitutions in a given gene, accumulated over a given time interval, is a Poisson-distributed variable with the variance proportional to the expected value:
[Equation 2a]
[Equation 2b]
[Equation 3a]
[Equation 3b]
(iii) Statistical analysis: vertical inheritance.
Let us consider a set of N genomes {G}, each with a single ortholog in a given COG. One can measure the distance between orthologs in the given COG (dIJ) and the distance between the genomes (DIJ) for all N' = N(N − 1)/2 pairs ([I,J]) from {G}. Minimizing the square error over all such pairs,
[Equation 4a]
[Equation 4b]
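As an illustration of the fitting step, the following Python sketch estimates the relative rate v by least squares for a straight line through the origin. It is a sketch under stated assumptions, not a reproduction of equations 4a and 4b: it assumes an unweighted square error, and the function name and the choice to return the raw sum of squared residuals (rather than a normalized variance) are illustrative.

```python
def fit_clock(pairs):
    """Fit d = v * D through the origin by unweighted least squares (an assumption).

    pairs: list of (D_IJ, d_IJ) tuples, i.e. intergenomic and intergenic distances.
    Returns (v, sse), where sse is the sum of squared residuals.
    """
    sxx = sum(D * D for D, _ in pairs)
    if sxx == 0:
        raise ValueError("at least one nonzero intergenomic distance is required")
    v = sum(D * d for D, d in pairs) / sxx
    sse = sum((d - v * D) ** 2 for D, d in pairs)
    return v, sse
```

For points that lie exactly on a line through the origin the residual sum is zero, and v is the slope of that line.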
(iv) Statistical analysis: HGT.
If genes in one of the lineages in a COG were acquired via HGT from a distant source, all intergenic distances in this COG fall into two groups: those that reflect vertical inheritance (DAD and DBC in Fig. 1B) and those that reflect the transfer distance (DAB, DAC, DBD, and DCD in Fig. 1B). Considering all pairs that correspond to vertical inheritance ([I,J]) and all pairs that correspond to HGT ([K,L]), the square error of the approximation is:
[Equation 5a]
[Equation 5b]
[Equation 5c]
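The single-transfer fit can be sketched as follows. This is an illustration under assumptions rather than a transcription of equations 5a to 5c: vertical pairs are fitted by unweighted least squares through the origin, the cross-split pairs are fitted by a single constant gene distance, and the names `fit_single_transfer` and `best_split` are hypothetical. The ratio D*/max(DKL) follows the definition of the relative transfer distance used in this article.

```python
def fit_single_transfer(distances, split):
    """Fit the one-transfer model for a given split of the genomes.

    distances: {(genome_a, genome_b): (D, d)} with intergenomic D and intergenic d.
    split: set of genomes on one side of the broken species-tree branch.
    """
    vertical, cross = [], []
    for (a, b), (D, d) in distances.items():
        # A pair is a cross pair when exactly one of its genomes is in the split.
        (cross if (a in split) != (b in split) else vertical).append((D, d))
    if not vertical or not cross:
        raise ValueError("the split must leave both vertical and cross pairs")
    sxx = sum(D * D for D, _ in vertical)
    v = sum(D * d for D, d in vertical) / sxx      # assumed unweighted fit
    if v == 0:
        raise ValueError("cannot convert to a transfer distance when v is zero")
    t = sum(d for _, d in cross) / len(cross)      # constant gene distance across the split
    sse = sum((d - v * D) ** 2 for D, d in vertical) + sum((d - t) ** 2 for _, d in cross)
    d_star = t / v                                 # transfer distance in genome units
    d_t = d_star / max(D for D, _ in cross)        # relative transfer distance
    return {"v": v, "D_star": d_star, "D_T": d_t, "sse": sse}


def best_split(distances, candidate_splits):
    """Return (split, fit) with the smallest squared error among the candidates."""
    fits = [(s, fit_single_transfer(distances, s)) for s in candidate_splits]
    return min(fits, key=lambda item: item[1]["sse"])
```

The candidate splits would be obtained by breaking each branch of the species tree in turn; generating them from a tree is left out of the sketch.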
(v) Statistical analysis: baseline noise.
If the pattern of a gene's inheritance is completely disjointed from the pattern of intergenomic relationships, neither equation 3a nor equation 3b adequately describes the relationships between the intergenomic and intergenic distances. In the absence of a clear dependence between these variables, the scatter plot for dAB versus DAB represents random scatter of points, and the following simple equation applies:
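$$d_{IJ} = \bar{d} + e_{IJ} \qquad (6a)$$
where $\bar{d}$ is simply the mean intergenic distance over all pairs. The baseline variance of eIJ is:
$$u_N^2 = \frac{1}{N' - 1} \sum_{[I,J]} \left(d_{IJ} - \bar{d}\right)^2 \qquad (6b)$$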
(vi) Statistical analysis of COG evolution. Each COG was analyzed with the three models described above.
(a) Noisy data, no clock-like evolution (equation 6a). The baseline variance of intergenic evolutionary distances (uN²) was calculated by using equation 6b.
(b) Simple molecular clock (equation 3a). The residual variance of the straight line fit (uC²) and the fit error (sC²) were calculated by using equation 4b. The relative evolutionary rate (v) was calculated by using equation 4a.
(c) Single significant deviation from molecular clock (equations 3a and 3b). The genomes were partitioned into two sets by breaking each branch of the species tree. For each of the possible splits, the residual variance (uT²) and the fit error (sT²) were calculated by using equation 5c. The split with the minimal sT² was accepted. The relative evolutionary rate (v) was calculated by using equation 5a, and the distance to the transfer source was calculated by using equation 5b. Additionally, to detect the transfers originating from outside the group, the relative transfer distance (DT) was calculated as D*/max(DKL), with the maximum taken over all cross-group pairs.
Two statistical tests were performed for each COG. The first test aimed at discriminating between H0 (the data do not follow either of the two clock-like models) and H1 (the data fit either the simple-clock or single-transfer model). The ratio FC = uN²/min(uC², uT²) was subjected to Fisher's test with (N' − 1, N' − 3) degrees of freedom. If the value of FC exceeded the critical level at the 0.05 level of significance (1.94 to 2.64, depending on the group of species analyzed), H0 was rejected, and the data were considered to conform to the molecular clock model.
The second test discriminated between H0 (the data fit the simple-clock model) and H1 (the data fit the single-transfer model). The ratio FT = sC²/sT² was subjected to Fisher's test with (N' − 2, N' − 3) degrees of freedom. If the value of FT exceeded the critical level at the 0.05 level of significance (1.95 to 2.66 depending on the group of species analyzed), H0 was rejected, and the data were considered to conform significantly better to the single-transfer model. In this case, the value of v calculated by using equation 5a (rather than that calculated by using equation 4a) was used to describe the relative evolutionary rate for the COG in question; a DT value of >1 was considered to be an indication of HGT from an outside source.
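The two-test decision procedure can be summarized in a short Python sketch. The function name and the category labels are hypothetical, and the critical values are passed in by the caller rather than computed, because they depend on the number of genome pairs in the group being analyzed.

```python
def classify_cog(u_n2, u_c2, u_t2, s_c2, s_t2, d_t, f_crit_clock, f_crit_transfer):
    """Classify one COG from its precomputed variances (all names are illustrative).

    u_n2: baseline variance; u_c2, u_t2: residual variances of the two clock models;
    s_c2, s_t2: fit errors of the two clock models; d_t: relative transfer distance.
    """
    for name, value in (("u_c2", u_c2), ("u_t2", u_t2), ("s_t2", s_t2)):
        if value <= 0:
            raise ValueError(f"{name} must be positive")
    # First test: does either clock-like model beat random scatter?
    f_c = u_n2 / min(u_c2, u_t2)
    if f_c <= f_crit_clock:
        return "no clock-like fit"
    # Second test: is the single-transfer model significantly better than the simple clock?
    f_t = s_c2 / s_t2
    if f_t <= f_crit_transfer:
        return "simple clock"
    if d_t > 1:
        return "single deviation, outside source suspected"
    return "single deviation"
```

Keeping the critical values as arguments makes the dependence on the degrees of freedom explicit; they could be taken from an F table for the (N' − 1, N' − 3) and (N' − 2, N' − 3) degrees of freedom quoted above.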
Empirical results. (i) Molecular clock and deviations from clock-like evolution in bacteria.
We applied the theory described above to an analysis of the evolution of three major bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and low-G+C-content gram-positive bacteria. For each of these groups, the COGs that contained a single representative from each species were selected for analysis (Table 1). Examples of relationships between the intergenomic and intergenic distances are shown in Fig. 2. Figure 2B shows clear evidence of the evolutionary heterogeneity of the COG in question. Table 1 shows a breakdown of the COGs analyzed according to their fit to one of the three models of evolution described above. For the great majority of the COGs (90 to 99%), the data could be reconciled with either the clock-like model or the model that involved one significant deviation from the clock (Table 1). For ~70% of the COGs in each lineage, the clock-like model was found to be compatible with the data, whereas the evolution of the rest of the COGs was better explained when the group was split into two subsets with different histories. For 13 to 22% of the COGs in the three lineages, the apparent acceleration of evolution in one of the clades (DT > 1) was significant enough to suspect HGT from an outside source.
TABLE 1. Alternative evolutionary models for COGs
FIG. 2. Three representative examples of the relationships between intergenic and intergenomic distances. (A) Clock-like evolution. (B) Clock-like evolution with one strong deviation probably due to HGT from an outside source. (C) Random scatter of points: uncertain evolutionary scenario. The magenta lines in panels B and C show the horizontal trend lines with the best fit to the data; the evolutionary distances thought to reflect XGD in panel B are indicated by magenta diamonds.
-Proteobacteria and almost 1 order of magnitude within the set of 108 conserved COGs that fit one of the two evolutionary models (Table 2). The distributions of relative evolution rates were close to log normal in all three lineages (Fig. 3). Because the median intergenomic distance was calculated only for the 114 COGs that are conserved in all three bacterial groups (see Materials and Methods for details), the medians of the rate distributions within each of the groups were not equal to one. Of the three groups, the
γ-Proteobacteria showed the fastest median rate and the broadest variation in the relative evolutionary rates (Fig. 3A). This is probably due to the fact that the selected set of γ-Proteobacteria species forms a tighter (more recently diverged) cluster than the sets of α-Proteobacteria and Bacillus-Clostridium group bacteria analyzed, as indicated by the comparison of the calculated intergenomic distances (data not shown). Accordingly, the set of γ-proteobacterial COGs is more functionally diverse and includes faster-evolving genes than the COGs from the other two lineages. This conclusion is supported by the comparison of the rate distributions for the full set of 555 γ-proteobacterial COGs and the 108-COG subset that is conserved in all three lineages; not unexpectedly, the distribution in the latter set is considerably narrower and is shifted toward lower rates (Fig. 3B). Within the set of 108 conserved COGs, the relative evolutionary rates are strongly correlated among the three lineages (Table 3). However, statistically significant differences between the lineages were detected; in the same COG, the rate among
-Proteobacteria tended to be the lowest, and the rate among the
-Proteobacteria tended to be the highest (Table 3).
TABLE 2. Relative evolution rates
FIG. 3. Distributions of the relative evolutionary rates in the three bacterial lineages analyzed and in the conserved set of COGs represented in all lineages. (A) Distributions of the relative evolutionary rates in the three lineages. gamma, γ-Proteobacteria; alpha, α-Proteobacteria; bacil, Bacillus-Clostridium group. (B) Distributions of the relative evolutionary rates in the full set of γ-Proteobacteria (gamma-555) and in the subset of COGs represented in all lineages (gamma-108). The distributions for the three bacterial lineages and the corresponding log-normal approximations (dashed lines) are color coded. The scale for the horizontal axis is logarithmic.
TABLE 3. Evolution of the same COG in different groups
Briefly, the purpose of this procedure was to investigate whether the two sets of species separated by the statistical analysis described above were also separated by a strongly supported internal branch in the phylogenetic tree for the given COG (Fig. 4). Not surprisingly, the COGs that had the longest split distance (DT) also showed a consistent tendency to have anomalies in the reconstructed phylogenies (Table 4).
TABLE 4. COGs with most pronounced deviations from the clock-like model
In COG0060 (isoleucyl-tRNA synthetase, IleS), both Rickettsia species and C. acetobutylicum are separated from their sister groups (α-Proteobacteria and gram-positive bacteria, respectively) and partition into the archaeon-eukaryote part of the tree (Fig. 5A). No alternative topology placing Rickettsia with other α-Proteobacteria and/or Clostridium with bacilli showed a detectable degree of support in statistical tests. The apparent origin of many bacterial IleRS proteins via HGT from eukaryotes is well documented (10, 49, 50); the resulting inconsistency of gene-to-gene distances is readily captured by the analysis presented here.
FIG. 5. Maximum-likelihood trees for detected cases of probable XGD. Colors indicate species belonging to the bacterial lineage(s) for which a significant deviation from the clock-like model was detected, namely, α-Proteobacteria and gram-positive bacteria (A), gram-positive bacteria (B), α-Proteobacteria (C), and γ-Proteobacteria (D). The triangles represent collapsed clades (the numbers of proteins are indicated in brackets). (A) COG0060 (isoleucyl-tRNA synthetase, IleS). The following proteins and species are included: RP617 of R. prowazekii, RC0953 of R. conorii, CC0701 of C. crescentus, BMEII1043 of B. melitensis, AGc1230 of A. tumefaciens, SMc00908 of S. meliloti, mlr8250 of M. loti, CAC3038 of C. acetobutylicum, SA1036 of S. aureus, BS_ileS of B. subtilis, BH2545 of B. halodurans, lin2127 of L. innocua, L0350 of L. lactis, SPy1513 of S. pyogenes, SP1659 of S. pneumoniae, FN0067 of F. nucleatum, TM1361 of Thermotoga maritima, aq_305 of Aquifex aeolicus, DR1335 of Deinococcus radiodurans, BB0833 of Borrelia burgdorferi, and TP0452 of Treponema pallidum. (B) COG1217 (predicted membrane GTPase involved in stress response, TypA). The following proteins and species are included: CAC1684 of C. acetobutylicum, SA0959 of S. aureus, BS_ylaG of B. subtilis, BH2632 of B. halodurans, lin1055 of L. innocua, L0370 of L. lactis, SPy1527 of S. pyogenes, SP0681 of S. pneumoniae, FN0634 of F. nucleatum, and DR1198 of D. radiodurans. (C) COG1282 (NAD/NADP transhydrogenase beta subunit, PntB). The following proteins and species are included: RP074 of R. prowazekii, RC0104 of R. conorii, CC3303 of C. crescentus, BMEII0325 of B. melitensis, AGc4531 of A. tumefaciens, SMc03938 of S. meliloti, mlr5184 of M. loti, slr1434 of Synechocystis sp., and all3408 of Nostoc sp. (D) COG1526 (uncharacterized protein required for formate dehydrogenase activity, FdhD). The following proteins and species are included: FdhD of E. coli K-12, ZfdhD of E. coli O157:H7, ECs4821 of E. coli O157:H7 EDL933, STM4038 of S. enterica serovar Typhimurium, YPO4060 of Y. pestis, HI0005 of H. influenzae, PM0410 of P. multocida, VC1519 of V. cholerae, PA5180 of Pseudomonas aeruginosa, RSp1050 and RSc2367 of Ralstonia solanacearum, CC1242 of C. crescentus, and Cj1508c of Campylobacter jejuni.
In COG1282 (NAD/NADP transhydrogenase beta subunit, PntB) the Agrobacterium sequence clusters with
-Proteobacteria and Neisseria, whereas the rest of the
α-Proteobacteria cluster with Ralstonia (Fig. 5C). No alternative topology is viable according to the tests performed. Thus, in Agrobacterium, the original α-proteobacterial gene apparently has been displaced with the
-proteobacterial ortholog.
COG1526 (uncharacterized protein required for formate dehydrogenase activity, FdhD) exemplifies the frequently observed separation of Vibrio from the rest of the γ-Proteobacteria (Fig. 5D). In this case, the Vibrio protein is closely related to the Ralstonia ortholog, and the Vibrio-Ralstonia clade joins the gram-positive branch instead of the proteobacterial branch. No topology joining Vibrio with other γ-Proteobacteria is supported. In this case, at least two HGT events have to be inferred: displacement of the original proteobacterial gene in either the Vibrio or the Ralstonia lineage by the ortholog from gram-positive bacteria and a secondary HGT between Vibrio and Ralstonia.
We also examined whether the presence of a molecular clock violation in a COG is correlated among the three bacterial lineages studied. In the set of 108 conserved COGs, the Pearson correlation coefficients were 0.02 between α- and γ-Proteobacteria, 0.11 between
-Proteobacteria and gram-positive bacteria, and 0.17 between
-Proteobacteria and gram-positive bacteria. The distribution of the number of anomalies detected in each COG (range, 0 to 3) nearly perfectly agreed with the expectation under the independence hypothesis [P(
χ²) > 0.5]. Thus, somewhat counterintuitively, HGT appeared to occur independently in different lineages, and there was no obvious HGT propensity that could be considered a characteristic of evolution of an entire COG. Nor did we detect any significant connection between the violations of the molecular clock detected and the relative evolution rates. The distributions of v values were very similar for split and nonsplit sets, and the difference between the means was insignificant for all three groups according to the t test (data not shown).
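The expectation under the independence hypothesis can be illustrated with a short Python sketch. Given the per-lineage frequencies of clock violations, it computes the expected distribution of the number of anomalies per COG (0 to 3) and a chi-square statistic against observed counts. The function names are hypothetical, no values from this study are used, and the conversion of the statistic to a P value is left out.

```python
def anomaly_count_distribution(probabilities):
    """Distribution of the number of anomalies per COG if lineages are independent.

    probabilities: per-lineage probability that a COG shows a clock violation.
    Returns a list whose k-th entry is the probability of exactly k anomalies.
    """
    dist = [1.0]
    for p in probabilities:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this lineage shows no violation
            nxt[k + 1] += mass * p     # this lineage shows a violation
        dist = nxt
    return dist


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for observed versus expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Multiplying the distribution by the number of COGs gives the expected counts that are compared with the observed counts.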
General discussion and conclusions. Comparative genomics allows researchers to restate old questions of evolutionary biology on a grander scale and, perhaps, in a more biologically meaningful way. Thus, rather than analyzing the molecular clock for one or a few selected protein families, it is possible to address the issue at the level of complete genomes and to ask what fraction of the genes follow the clock-like model of evolution and which genes demonstrably deviate from it. Here we describe a simple theoretical framework that allowed us to classify orthologous sets of bacterial genes (COGs) into these two categories. A similar relative rate test was described by Syvanen in his analysis of the acceleration of evolution of rRNA in eukaryotes (45).
We found that for several hundred COGs analyzed representing three well-defined bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for ~70% of the COGs, whereas the rest showed substantial anomalies. It should be noted that the null hypothesis employed here is, in fact, a soft clock, which, strictly speaking, does not require constancy of evolutionary rates for the genes analyzed. All that is required is that the rate distribution remains constant (i.e., all genes are allowed to accelerate or decelerate synchronously) (21). Notably, we also found that, within the set of 108 COGs that were represented by a single ortholog in all species analyzed from the three lineages, the relative evolutionary rates were strongly correlated among the lineages. This observation emphasizes the general validity of the soft genomic clock.
We also analyzed the nature of the observed anomalies and found that, for the most conspicuous anomalies, the majority were most readily explained by HGT from phylogenetically distant lineages. Importantly, this is a conservative estimate because we analyzed only a special set of well-behaved COGs, which contained exactly one ortholog from each of the species included. HGT events have been classified into three broad categories: (i) acquisition of genes new to the recipient lineage, (ii) acquisition of paralogs of resident genes, and (iii) XGD, in which a resident gene is displaced by an ortholog from a different lineage (30). The present analysis was designed to identify only cases of XGD (although in some exceptional situations we obtained indications of paralog acquisition where the COG analyzed seemed to contain hidden paralogs). The results suggest that XGD occurs during evolution of
~10 to 15% of the bacterial genes. This relatively low fraction of HGT among single-ortholog bacterial genes is compatible with the notion that, at least within well-defined clades, such as the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, these genes may be combined to produce organismal phylogenies, preferably after exclusion of the genes with detected HGT (9, 33, 51). However, the fraction of likely HGT detected here is considerably greater than that recently reported for single-ortholog gene sets from γ-Proteobacteria (33). It seems likely that the difference is a consequence of the criterion used for selection of orthologous gene sets in the latter study, in which only genes that are highly conserved within γ-Proteobacteria were examined. This criterion probably resulted in exclusion of orthologous sets deviating from the clock-like behavior.
An unexpected finding of this study is the lack of significant correlation among the three bacterial lineages analyzed with respect to the deviations from the clock model and the probable occurrence of XGD. An observation of such a deviation in any one of the lineages was a poor predictor of deviations in other lineages. This result seems to be poorly compatible with the complexity hypothesis (25) and similar notions concerning the dependence of HGT on biological function and is more in line with the ideas on random scatter and lineage-specific trends of HGT events (55). It should be noted that we analyzed a limited gene set and only one form of HGT (XGD). Functional correlates are likely to emerge in larger-scale studies, but our present results indicate that these connections are far from being absolute.
The approach described here allows rapid identification of orthologous gene sets whose evolution significantly deviates from the soft clock model. Interpretation of these deviations as lineage-specific acceleration of evolution, XGD, or a combination of the two requires detailed phylogenetic analysis. Nevertheless, we believe that this methodology has its own advantages and could be useful in the study of genome-wide evolutionary trends. In particular, this approach allows workers to detect significant deviations from the clock-like model of evolution in a particular lineage without using any information on species outside that lineage, such as the (often unknown) source of the potential HGT. More practically, the procedure described here could be suitable for removing anomalous COGs from multigene sets employed for construction of organismal phylogenies.
P.S.N., M.S.G., and A.A.M. were partially supported by grants from the Howard Hughes Medical Institute (grant 55000309), the programs "Molecular and Cellular Biology" and "Origin and Evolution of the Biosphere" of the Russian Academy of Sciences, and the Fund for Support of Russian Science (MSG).