Journal of Bacteriology, April 2002, p. 2072-2080, Vol. 184, No. 8
0021-9193/02/$04.00+0 DOI: 10.1128/JB.184.8.2072-2080.2002
Copyright © 2002, American Society for Microbiology. All Rights Reserved.
NeuroGadgets Inc., Ottawa, Ontario K1G 4B5 (1); Department of Biology, University of Ottawa, Ottawa, Ontario K1N 6N5, Canada (2); The Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland 4072, Australia (3); Program in Evolutionary Biology, Canadian Institute for Advanced Research (4)
Received 31 October 2001/ Accepted 14 January 2002
However, a potentially serious complication has arisen. It is now appreciated that some genes are sometimes transmitted not vertically (along individual branches of the organismal tree through time) but laterally or horizontally (directly from one branch of the tree to another). The closest relatives (nearest orthologs) of a laterally transferred gene thus occur in genomes in the organismal lineage from which the transfer originated, not in the relatives of its new host; as a consequence, the extrapolated organismal phylogeny is incongruent with the phylogeny inferred from families whose members have been transmitted only vertically. If such lateral gene transfer (LGT) were frequent, the number and diversity of anomalous gene trees might erode our ability to reconstruct correct organismal trees or perhaps even empty this concept of meaning. In the extreme case, what we consider species might have been generated and be maintained not by common ancestry but rather by barriers and firewalls to genetic recombination (38; W. F. Doolittle, personal communication). Many studies have suggested that LGT has occurred frequently in many prokaryotic lineages (7-9, 16, 19-21, 24-26, 31, 33, 38).
The degree to which our understanding of genome evolution may be compromised by LGT depends on the extent to which genetic material has been transferred laterally and, more generally, on our ability to detect and control for incongruent data. We hypothesize that removal of incongruent data should leave sets of genes which, analyzed jointly, should yield a phylogeny that better corresponds with organismal (phenotypic, ultrastructural, and physiological) groupings and with trees inferred from gene families that have undergone little or no lateral transfer (e.g., small-subunit rRNA genes [43]). If the incongruent data are topologically biased (e.g., if they can be identified with one or a few major LGT events), eliminating them from the analysis might yield a topologically different tree. On the other hand, if the incongruent data are unbiased (e.g., if they arose from a large number of dissimilar, quantitatively less-significant events), their removal might improve the statistical support for subtrees but have little effect on tree topology. These competing predictions can be tested directly by inferring whole-genome trees before and after suspect data have been identified and eliminated.
Phylogenetic trees are inferred for individual gene or protein families by maximizing or minimizing an appropriate function over all putatively homologous positions. Trees of organisms are typically derived by parsimony analysis of discrete character states. The first genome trees were constructed by distance analysis and were based on distances derived from proportions of statistically similar open reading frames (ORFs) (presumptive orthologs) or protein folds that could be recognized by pairwise comparisons of genomes (14, 27, 41, 42, 44). Basing these trees on shared ORF or fold contents avoided the perhaps intractable complexities of sequence-based alignment of entire genomes but failed to capture information about the degrees to which sets of ORFs (presumed orthologs) are similar. Grishin et al. (17) used a distance measure based on empirically determined distributions of pairwise interprotein amino acid substitution rates for genomes, while Wolf et al. (44) examined transformations of the pairwise percentages of identity between orthologs. Here we utilized an alternative method of constructing genome trees based instead on mean normalized BLASTP (1) scores, and we compared trees produced by this approach with a content-based genome tree. This study did not constitute a test of the idea that lineages are artifacts of recombinational barriers rather than of shared descent but did explore the extent to which sequences that are incongruent for any reason with the majority signals from their own genomes erode support for a single common phylogenetic tree.
Lateral genetic transfer is not the only process that can obfuscate a phylogeny. An ORF can be phylogenetically discordant (i) if its orthologs have been lost in some but not all other genomes, leaving a patchwork of orthologous and paralogous matches; (ii) because of convergent evolution and nonneutral evolution in general; and (iii) in cases in which certain genes exhibit rates or patterns of sequence change substantially different from those of the other genes in the lineage. To avoid circularity in using phylogenetic analysis to assess phylogenetic incongruity, in this study we instead developed a pairwise statistical approach, correlating patterns of observed genomic similarity among species.
If all genes within a genome have the same phylogenetic history, the RBMs for each gene vis-a-vis the target genomes (here, the GenBank virtual genomes) should rank similarly. The rankings should be the same, within statistical variation, for the genome as a whole and for each constituent ORF. If an ORF does not have this common ranking but instead shows a conflicting pattern, it is discordant. Indeed, it is phylogenetically discordant because, by analogy with construction of a tree from a distance matrix, a conflicting pattern of similarity relationships must specify a conflicting tree. Missing orthologs create gaps but not misinformation (incorrectly ordered rankings).
To quantify these relationships, we introduced a normalized similarity score (u) for a query ORF and its RBM in a given target species. We computed this score by dividing the BLASTP-based similarity score (bit score [S']) by the ORF's self-matching score (Fig. 1). The median of the u values for a target species defined w, a measure of the query genome's overall sequence similarity with sequences from that target species. By correlating u and w for all species in which the ORF found a match at a BLASTP expectation value better than a defined threshold (here e = 1.0 × 10⁻¹⁰), we could determine if the pattern of relationships was consistent with the ORF having evolved concordantly with the rest of the genome.
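The scoring scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout, and the use of the Pearson coefficient are assumptions made for illustration, since the text does not state which correlation coefficient was used.

```python
from statistics import median


def normalized_scores(bit_scores, self_scores):
    """Return u = S' / self-match score for every (orf, species) pair.

    bit_scores:  {(orf, species): bit score of the ORF's match in that species}
    self_scores: {orf: bit score of the ORF matched against itself}
    (Both layouts are hypothetical; only the ratio comes from the text.)
    """
    return {
        (orf, species): score / self_scores[orf]
        for (orf, species), score in bit_scores.items()
    }


def genome_medians(u):
    """Return w[species] = median of the u values found in that species."""
    by_species = {}
    for (_orf, species), value in u.items():
        by_species.setdefault(species, []).append(value)
    return {species: median(values) for species, values in by_species.items()}


def pearson(xs, ys):
    """Pearson correlation (an assumed choice); None if a variance is zero."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def orf_correlation(orf, u, w):
    """Correlate one ORF's u values with the w values of the same species."""
    species = sorted(s for (o, s) in u if o == orf)
    return pearson([u[(orf, s)] for s in species], [w[s] for s in species])
```

In this sketch, an ORF whose u values rise and fall with the genome-wide w values gives a correlation near 1, whereas an ORF with the opposite pattern gives a negative correlation and would be a candidate for discordance.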
FIG. 1. Statistical strategy. In this schematic diagram, ORF 1 finds RBMs in species 1, 2, and S; ORF 2 finds RBMs in species 1, 3, and S; and ORF L finds RBMs in species 2 and S. Thus, w1 = {u1,1, u2,1, . . .}; w2 = {u1,2, . . ., uL,2}; w3 = {u2,3, . . .}; and wS = {u1,S, u2,S, . . ., uL,S}. In this example, ORF 1's set of w values is {w1, w2, . . ., wS}; ORF 2's set of w values is {w1, w3, . . ., wS}; and ORF L's set of w values is {w2, . . ., wS}. An ORF's set of u values is correlated with its set of w values as described in the text.
r0 and NR is the total number of randomizations. (NR was the lesser of 199,999 and ni, ni > 0, where ni is the number of ORFs shared by the query genome and the target genome [i].) For each pair of genomes (the query and each target), the size of each resampled set was equal to the number of RBMs (Table 1).
TABLE 1. Proportions of ORFs analyzable and proportions of ORFs found to be phylogenetically discordant in each genome
Genome trees were inferred by Fitch-Margoliash least-squares analysis of distance-type matrices (17), each element of which was an ORF-based measure of pairwise dissimilarity between genomes, either 1.0 minus the proportion of ORFs shared by a given pair of genomes (41) or 1.0 minus the mean of normalized pairwise BLASTP scores. BLASTP scores (42) were normalized (4) by dividing an ORF's score against the target genome by the ORF's score against itself. The target database was the set of ORFs identified for the genome itself, not, as described above, all genes deposited under the species designation in GenBank. Only ORFs with matches better than a defined threshold (BLASTP e = 1.0 × 10⁻¹⁰) were used; the reciprocal best-match criterion was not used to derive genome trees. Distance matrices were generated using the NeuroGadgets Inc. web service (http://www.neurogadgets.com). Distance analysis was carried out using the FITCH program in the PHYLIP software package (11), with global rearrangements and randomized (jumbled) species input order. For bootstrap analysis, samples (n = 100) were taken with replacement from the ORFs shared by each pair of genomes, and from these a mean distance was calculated for that pair of genomes. Distance trees were again generated using FITCH, and the majority rule consensus tree was computed using the CONSENSE program in PHYLIP. Trees were visualized and bootstrap values were added using TREEVIEW (36).
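The two dissimilarity measures described above can be sketched in a few lines of Python. This is an illustrative sketch, not the NeuroGadgets web service: the tuple layout of `hits`, the function names, and the strict "e below threshold" comparison are assumptions. The writer at the end emits the square distance-matrix layout read by PHYLIP programs such as FITCH (a count line, then one row per taxon with a 10-character name field).

```python
E_THRESHOLD = 1.0e-10  # BLASTP expectation threshold quoted in the text


def proportion_distance(hits, n_query_orfs, e_threshold=E_THRESHOLD):
    """1.0 minus the proportion of query ORFs with a match in the target.

    hits: list of (bit_score, self_score, e_value), one per query ORF that
    found any match in the target genome (hypothetical layout).
    """
    shared = sum(1 for _bit, _self, e in hits if e < e_threshold)
    return 1.0 - shared / n_query_orfs


def score_distance(hits, e_threshold=E_THRESHOLD):
    """1.0 minus the mean normalized BLASTP score over qualifying ORFs."""
    kept = [bit / self_bit for bit, self_bit, e in hits if e < e_threshold]
    if not kept:
        raise ValueError("no ORF matched better than the threshold")
    return 1.0 - sum(kept) / len(kept)


def to_phylip_square(names, matrix):
    """Format a square distance matrix for PHYLIP distance programs."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # PHYLIP names occupy 10 characters
        lines.append(label + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"
```

Both distances fall between 0 and 1, so matrices built from either measure can be passed to the same tree-building step and the resulting trees compared directly.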
As described above, the first genome phylogenies were based on a distance-type measure derived from the proportions of ORFs shared pairwise by genomes. Figure 2 shows a genome tree for the 37 microbial genomes based on the proportion in each query genome of ORFs that had an initial BLASTP match better than the threshold (e = 1.0 × 10⁻¹⁰) in each other genome. It was not immediately obvious how to assess statistical support (confidence intervals for internal nodes) for such a tree. Snel et al. (41) assessed confidence by the half-delete jackknife method (45), deleting random halves of ORFs in each query genome, reassessing the proportions of shared genes, inferring a new tree for each replicate, and counting the number of times out of 100 that a particular cluster was found. Wolf et al. (44) instead used the nonparametric bootstrap method (12), resampling from among identified orthologs. The proportion-based genome tree for the 37 genomes (Fig. 2) resembles the tree based on 16S rRNA sequences (Fig. 3) in many respects, but there are a number of discrepancies. In the proportion-based genome tree the Thermotoga maritima genome branches basally among bacterial genomes, as the Thermotoga 16S rRNA does. However, in the proportion-based tree, the genome of Aquifex aeolicus does not branch second deepest; instead, it groups with the genome of Synechocystis as the third-deepest branch. The Mycobacterium tuberculosis genome does not group with the genomes of other Firmicutes; instead, it joins the genome of Deinococcus radiodurans on the second-most-basal branch. In addition, the Haemophilus influenzae genome does not group with the genomes of enteric members of the γ subclass of the class Proteobacteria (γ-proteobacteria). Even more discrepant is the archaeal subtree, in which the genomes of Halobacterium salinarum and A. pernix constitute the two most-basal branches (the former does not group with genomes of other euryarchaeotes). The methanogen genomes, together with the genomes of Archaeoglobus fulgidus and the two Pyrococcus species, form a group not seen in 16S rRNA trees or indeed in most other molecular sequence trees.
FIG. 2. Genome tree based on the proportions of query ORFs that found a match in each target genome better than the threshold BLASTP expectation (e = 1.0 × 10⁻¹⁰). Proportions were computed (http://www.neurogadgets.com) by determining the number of ORFs shared by the query genome (smaller of the pair) and the target genome (larger of the pair) and then dividing this number by the number of ORFs in the query genome. A distance matrix was generated by computing 1.00 minus the proportion of shared loci; this matrix was imported into PHYLIP (12) for analysis by FITCH (see Materials and Methods). The branch lengths reflect distances, as assessed by these criteria, between genomes.
FIG. 3. 16S rRNA gene tree, adapted from the Ribosomal Database Project (29). Where 16S rRNA sequences were not available from the Ribosomal Database Project, close relatives were selected. The major organism classifications are consistent with those described by Olsen et al. (34).
FIG. 4. Genome tree based on normalized BLASTP scores, constructed by using all query ORFs that found a match in the target genome better than the BLASTP expectation (e = 1.0 × 10⁻¹⁰). The tree was constructed as described in the legend to Fig. 2, but 100 random replicates (pairwise genome scores reconstructed from individual ORF scores, with resampling) were examined. The numbers at the nodes are bootstrap values, which were generated by the CONSENSE program in PHYLIP.
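The resampling step behind these bootstrap values can be sketched as follows. This is a hedged illustration only: it assumes that each replicate draws, with replacement, as many normalized ORF scores as the genome pair shares, and that 100 is the number of replicates; the function name and the fixed seed are hypothetical.

```python
import random


def bootstrap_distances(normalized_scores, n_replicates=100, seed=0):
    """Rebuild one genome pair's distance from resampled ORF scores.

    normalized_scores: normalized BLASTP scores (match score divided by
    self-match score) for the ORFs shared by one pair of genomes.
    Returns one distance (1.0 minus the resampled mean) per replicate.
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    n = len(normalized_scores)
    distances = []
    for _ in range(n_replicates):
        sample = rng.choices(normalized_scores, k=n)  # with replacement
        distances.append(1.0 - sum(sample) / n)
    return distances
```

Taking the k-th distance from every genome pair would give the k-th replicate matrix; each such matrix yields one tree, and the replicate trees can then be summarized by a majority-rule consensus.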
γ-proteobacterial genome group, as it is in the proportion tree, although this group includes genomes of β-proteobacteria (represented by the two Neisseria meningitidis genomes). 16S rRNA sequence trees (Fig. 3) (34) also suggest that the β-proteobacterial sequences might be included among the sequences of the γ-proteobacteria. The genome of H. influenzae groups with the genomes of the other enteric γ-proteobacteria, as it does in the 16S rRNA tree, which is consistent with the findings of nonmolecular bacterial systematics. The archaeal portion of the score-based genome tree is rearranged compared to the archaeal portion of the proportion-based genome tree (Fig. 2) and is topologically quite different from the archaeal portion of the 16S rRNA tree (Fig. 3), with genomes of both H. salinarum and Thermoplasma acidophilum branching earlier than the genome of the crenarchaeote A. pernix. Excluding sequences that could not be analyzed for potential phylogenetic discordance (but retaining the discordant sequences) yielded a tree that was essentially identical to the tree shown in Fig. 4 (data not shown) and differed in only three topological features. The group of five Bacillus, Mycoplasma, and Ureaplasma genomes was resolved as a sister group of the spirochete and chlamydial genomes, albeit with very weak bootstrap support (55%), instead of branching next from the backbone as in Fig. 4. The genome of Pseudomonas aeruginosa branches immediately after the two Neisseria genomes, instead of forming a sister group with them as in Fig. 4, although the bootstrap support is weak in both cases (67 and 62%). Finally, the branching order of the H. influenzae and Buchnera sp. genomes is reversed, again with low bootstrap support. Thus, excluding the 47.1% of the ORFs that had relatively few strong matches in these genomes had an almost negligible effect on the topology of the genome tree based on mean normalized BLAST scores.
Exclusion of PDSs (P < 0.05) from the latter set yielded the tree shown in Fig. 5. The topology of this tree is almost identical to that of the tree described above, differing only in the position of the P. aeruginosa genome, which groups weakly with the two Neisseria genomes, as in Fig. 4. The two other topological changes compared with Fig. 4 resulted from excluding the 38,235 nonanalyzable ORFs, not from removing the 4,331 discordant ORFs. Exclusion of discordant sequences did, however, substantially improve the resolution of some subtrees, as measured by the nonparametric bootstrap method. Within the γ- and β-proteobacteria, for example, the mean bootstrap values increased from 86.1% (Fig. 4) and 86.6% (in the tree containing all analyzable ORFs [data not shown]) to 91.3% (Fig. 5).
FIG. 5. Genome tree based on normalized BLASTP scores better than the threshold (e = 1.0 × 10⁻¹⁰), after removal of PDSs (P < 0.05). Tree construction and bootstrapping were performed as described in the legend to Fig. 4.
We identified 4,331 ORFs (10.1% of the 42,925 analyzable ORFs) that have patterns of BLASTP matches which are significantly different from the patterns of their host genomes and would thus probably specify incongruent trees. Removing these ORFs from the analysis affected the topology of the genome tree very little, although it did improve the confidence in a number of subtrees as assessed by the nonparametric bootstrap method. We therefore concluded that both according to the proportions of shared ORFs and according to sequence divergence, genome phylogeny largely agrees with the canonical view of prokaryote evolution based on 16S rRNA gene sequences. It is especially noteworthy that restricting the analysis to ORFs that were found not to be discordant produced so few topological changes; this demonstrates that neither the set of nonanalyzable ORFs (47.1% of the ORFs) nor the 5.3% of the total ORFs identified as PDSs at P < 0.05 is specifically biased toward a preferred alternative topology. Removal of PDSs improved bootstrap support most noticeably in the γ- and β-proteobacterial subtree, although the comparison was imperfect, as there was limited or no room for improvement in several regions of the tree.
Trees based on overall pairwise genomic distances do not, however, agree completely with the 16S rRNA tree. This is especially true for the archaeal subtree, since in all our genome trees the genome of the crenarchaeote A. pernix grouped with the genomes of euryarchaeotes. Euryarchaeota are paraphyletic in other genome trees as well (Fig. 3 and 5 of reference 44) and in a multigene tree based on sequences of ribosomal proteins (Fig. 6 of reference 44), while Crenarchaeota are paraphyletic in trees based on radA (40). Thus, single-gene trees, such as 16S rRNA trees (5, 35), are increasingly isolated in suggesting that Euryarchaeota and Crenarchaeota are monophyletic. The basal position of the genome of Halobacterium sp. (Fig. 2, 4, and 5) is unexpected based on small-subunit ribosomal DNA analysis (Fig. 3) but occurs in the distance-based genome tree of Wolf et al. (Fig. 8 of reference 44) and in a multigene tree based on concatenated ribosomal proteins (Fig. 6 of reference 44). We suspect that this could be an artifact of the high G+C content of this organism and the resultant systematic bias in the amino acid composition of the proteins (15) or an artifact of the elevated content of acidic amino acids that help stabilize proteins in the presence of the high concentrations of intracellular salt characteristic of halobacteria (10). Haloarchaeal genomes may also contain significant numbers of laterally transferred genes (6). It is more difficult to explain why the genome of T. acidophilum branches more basally than expected, although this occurs in other genome trees (Fig. 3 and 5 of reference 44) and in the tree based on ribosomal proteins (Fig. 6 of reference 44). Methanogens are monophyletic in our trees (Fig. 2, 4, and 5) and other genome trees (Fig. 3 of reference 17; Fig. 3 and 5 of reference 44) but not in the trees based on small-subunit ribosomal DNA (5, 40) (Fig. 3), radA (40), or ribosomal proteins (Fig. 6 of reference 44).
Among the bacteria, T. maritima and A. aeolicus appear on the deepest (most basal) branches in the trees based on 16S rRNA (5, 35) (Fig. 3) and ribosomal proteins (Fig. 6 of reference 44) and in some (Fig. 3 and 5 of reference 44) but not all genome trees. In our analyses, the genome of T. maritima always appeared to be the most basal genome, but the genome of A. aeolicus was second deepest only in our distance-based trees (with or without removal of discordant sequences). The numbers of and degrees of similarity between proteins in one or both of these hyperthermophilic bacteria and some archaea have opened a debate about common ancestry versus LGT (2, 3, 23, 28).
Spirochete and chlamydial genomes form a single clade in our distance-based genome trees (Fig. 4 and 5), in the distance-based genome trees of Grishin et al. (Fig. 3 of reference 17) and Wolf et al. (Fig. 5 of reference 44), and in a multigene ribosomal protein-based tree (Fig. 6 of reference 44). They do not group together in genome trees based on proportions of shared genes (Fig. 2 of reference 42; Fig. 3 of reference 44) (Fig. 2) or in many 16S rRNA-based trees (34) (Fig. 3). No genome trees support monophyletic grouping of all Firmicutes (low-G+C-content and high-G+C-content gram-positive organisms), as some 16S rRNA trees do (Fig. 3). Previously, the low-G+C-content gram-positive bacteria (Bacillus-Clostridium group of Firmicutes) were polyphyletic in genome trees based on proportions of shared genes (Fig. 2 of reference 42; Fig. 3 of reference 44) but monophyletic in distance-based genome trees (Fig. 3 of reference 17; Fig. 5 of reference 44), 16S rRNA-based trees (34) (Fig. 3), and concatenated ribosomal protein gene-based trees (Fig. 6 of reference 44). These organisms were monophyletic in all of our genome trees (Fig. 2, 4, and 5).
Our genome trees are the first trees to resolve the Proteobacteria as a stable monophyletic group, in agreement with 16S rRNA trees (34) (Fig. 3) and concatenated ribosomal protein trees (Fig. 6 of reference 44); the levels of bootstrap support were high (99 and 100% before and after removal of discordant sequences, respectively). Our proportion-based genome tree (Fig. 2) may be unique in resolving the β-proteobacteria as a sister lineage of the γ-proteobacteria; in our distance-based trees (Fig. 4 and 5), as in the trees based on 16S rRNA (34) (Fig. 3) and ribosomal proteins (Fig. 6 of reference 44), Neisseria groups with the γ-proteobacteria.
Details aside, many genome trees are substantially similar to the 16S rRNA gene tree and to many other single-gene trees and contain physiologically coherent groups largely consistent with modern prokaryote systematics. This finding is obviously consistent with the Darwinian model of descent along genealogical lineages, as on a bifurcating tree, but in and of itself it does not prove that prokaryotes have evolved mostly by traditional vertical descent. Doolittle (8, 9; personal communication) has suggested a scenario of genome evolution in which LGT plays a predominant role and lineages owe their existence to selective sharing of gene pools that might be constrained by environmental, physiological, and other nongenealogical factors. In this scenario, distances such as those underlying the construction of genome trees might reflect the frequency of LGT, not the time since a common ancestor. The genes that confuse genome phylogenies are a testament to this lateral exchange; we explore them further elsewhere (39). Thus, it remains to be determined whether the topologies in genome trees reflect Darwinian evolutionary lineages or are artifacts of an elaborate network of differential lateral genetic exchange. We strongly urge that our results be interpreted in this context.
This work was funded by the Natural Sciences and Engineering Research Council of Canada and by the Institute for Molecular Bioscience, University of Queensland. We are grateful to G. Drouin for financial support of G.D.P.C.