| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
,
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061
Received 16 February 2007/ Accepted 24 April 2007
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
|
|
|---|
Special interest attaches to the Alphaproteobacteria as the ancestral group for mitochondria. The Rickettsiales are most often cited as the alphaproteobacterial subgroup from which mitochondria arose, but there has been disagreement on this point (17, 19, 32). Proper placement of the mitochondrial branch on the alphaproteobacterial species tree benefits the study of eukaryotic nuclear genomes, which house many genes of mitochondrial origin.
Improvement in phylogenetic tree reconstruction can come from increasing the number of characters used, which whole-genome sequences have provided in abundance. However, even long character matrices can produce artifacts, for example, when taxon sampling is limited. Recently, the number of completely or nearly completely sequenced alphaproteobacterial genomes has become large, allowing for the assembly of a carefully selected character matrix that is both long and broad, in turn enabling robust phylogenetic inference. With such a matrix, we have generated a species tree for these bacteria, into which we have placed the mitochondrial branch. The reliability of the tree is indicated not only by its high Bayesian and bootstrap support values, which are generally very high for such large matrices, but by the agreement of maximum-likelihood (ML) and Bayesian methods, by convergence to this same tree for subsets of the full protein set, and by its good agreement with the highly supported bipartitions among the single-protein trees.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Key programs for phylogenetic analysis. Unless otherwise stated, the following programs were employed. MUSCLE (15) was used with up to 100 iterations for sequence alignment. Gblocks (11) was used for masking gapped and other noisy portions of the alignments, with a minimum block length of 10 amino acids and gaps allowed in any alignment position for no more than half of the sequences. We performed ML phylogenetic analysis with PHYML v2.4.4 primed with the BIONJ tree (22), using the Whelan and Golding (WAG) amino acid substitution matrix, estimating all parameters, and using four substitution rate categories. We performed Bayesian phylogenetic analysis using MrBayes v3.1.2 (28) in model-jumping mode with a single chain and primed with the BIONJ tree, assessing burn-in (arrival at a likelihood plateau) as described previously (7).
rRNA gene sequences and trees. The 16S and 23S rRNA gene sequences of the bacteria under study were identified using author annotations or homology searching and were trimmed or extended so that their endpoints corresponded to those of the E. coli RNAs. The initial trees showed that multiple versions of an rRNA gene within any genome were more closely related to each other than to those of any other genome (except for 23S rRNA gene sequences in Brucella melitensis 16 M and Brucella melitensis Abortus), so single representative 16S and 23S rRNA genes were taken from each genome. The rRNA gene sequences from all study genomes were aligned and guided by secondary structure, using ClustalW with a profile. Seed profiles came from structure-based prealignments: a 16S alignment for 59 ingroup and 8 outgroup species from the Ribosomal Database Project II (RDP) release 9.39 (14) and a 23S alignment for 20 species from the comparative RNA website (10). Masking by Gblocks (with a minimum block length of 5 nucleotides) left 1,337 and 2,449 positions for 16S and 23S sequences, respectively. The concatenation of these two masked rRNA alignments was also prepared.
Relationships among aligned sequences were examined by ML and Bayesian phylogeny inference analyses using the Hasegawa-Kishino-Yano substitution matrix. ML analysis employed 100 bootstrap resamplings. We employed Bayesian analysis with two MrBayes runs, each with four chains for 100,000 generations, taking the best (highest likelihood) and the consensus of trees after burn-in within 10,000 to 15,000 generations.
Bacterial protein families and trees. Homology groups suitable for multiprotein phylogenetic analysis were identified using the GeneTrees database (29), which at the time covered 14 alphaproteobacteria and 254 other prokaryotes. Searching this large set of protein trees for subtrees spanning all 14 alphaproteobacteria and containing no more than four additional taxa yielded 216 seed groups. Hidden Markov models were built for each seed group (4) and used to search the full protein sets for all genomes under study. We filtered the resulting gene families, rejecting those missing an ortholog in more than four ingroup strains, those with more than four strains with multiple members, and those with the outgroup widely dispersed upon visual inspection of the initial trees. This resulted in a set of 115 homology groups that approximated the ideal of representing each strain once (see Table S2 in the supplemental material). TBLASTN helped to find 8 missing proteins that were previously unannotated and helped to reconstruct 22 missing or incomplete proteins that had a putative single-nucleotide frameshifting sequencing error.
Because the criteria for determining N termini can vary among genome projects, we used conservation as a criterion for their revision. All protein sequences were extended in the N-terminal direction until blocked by an in-frame stop codon. The N-extended sequences were aligned and masked. We used the N-terminal conserved sequence block to trim the extended sequences, identifying 225 genes in which the true N terminus may be further upstream than the annotated one; these trimmed N-extended sequences were used in further analysis.
The integrity of the 115 families with respect to possible paralogs was reexamined by first allowing each family to repopulate even with second-best BLAST hits. Each family member protein was used as a BLASTP query against the whole set of proteins for each strain. Any protein that was either a best hit for a query or a second-best hit with a bit score at least 80% of that of the best hit for that strain was added to the protein family. For the ingroup, this analysis left 80 families with no paralogs, 27 slightly expanded families (1 to 8 paralogs), and 8 greatly expanded families (41 or more paralogs). The sequences of the expanded families were aligned and masked, and the resulting neighbor-joining (BIONJ) trees were rooted with outgroup species. For the slightly expanded families, the cases of multiple-ortholog candidates for a strain were resolved by retaining the candidate with the shortest distance to the shared node if this distance was at least 20% shorter than any other; otherwise, no candidate was retained for that family. This simple technique always identified the candidate that best fit our emerging picture of the species tree and, for ingroup strains, resulted in 50 retentions of the original family member, five replacements of the original, and nine ambiguous cases in which neither candidate was retained. Among the greatly expanded families, the tree for one of them could be resolved into two subtrees corresponding to the
32 and
70 RNA polymerase subunit families and a third small group that was rejected. The trees for the other seven greatly expanded families could not be readily resolved, and these families were entirely rejected.
The tests of Novichkov et al. (25) were applied to identify protein families that might follow evolutionary models other than vertical inheritance. Distance scores were taken for each family alone and for the concatenation of all 109 remaining families, using TreePuzzle with the WAG amino acid substitution model and a setting of 1.0 for the shape parameter of the gamma distribution. As a tentative species tree, the neighbor-joining tree was prepared for the concatenated alignment by using PHYLIP with jumbling (18). For each family, distances were fit to each of three models: (i) no relationship between family member distances and species distances (noise), (ii) proportionality between family member distances and species distances (vertical inheritance), and (iii) a model for a single horizontal transfer evaluated for each branch of the tentative species tree. Five families were eliminated by these tests because they did not reach the critical value of the F distribution (95% confidence) for rejection of the noise model. The ability of this method to detect horizontal transfer within the Alphaproteobacteria was tested by switching a single protein within a family; not all such artificial transfers were detected.
The remaining 104 families closely approached 100% representation by the genomes, reaching 99.8% for the ingroup and 96.7% for the outgroup (see Table S2 in the supplemental material). No ingroup strain was missing more than 2 of the 104 families (see Table S1 in the supplemental material).
Ten amino acid substitution models were tested, and WAG was found to be ideal or nearly so for each family. For each masked alignment, two independent MrBayes runs proceeded in the model-jumping mode for 500,000 generations primed with the neighbor-joining tree for the same family. Most families (85) used the WAG matrix exclusively, and for four more families, WAG was preferred in more than 75% of the post-burn-in trees. For the remaining 15 families, which preferred a variety of other models, ML trees were generated with either the preferred substitution model or WAG. The log likelihood for the WAG-constrained ML tree was lower than for the preferred-model ML tree, on average only by 0.24% and by no more than 1.0%, justifying an expedient of the fixed use of the WAG substitution model in further ML analyses of these families.
Sets of families. Several subsets of the 104 alignments were selected: the group of 16 protein families found in mitochondria (see below), 26 mutually exclusive groups of 4 randomly selected proteins, 5 mutually exclusive groups of 10 randomly selected proteins, 4 mutually exclusive groups of 26 randomly selected proteins, and the 5 groups based on similar alpha-versus-Pinvar values (see below). Alignments were concatenated for each subset, and BIONJ trees were prepared and used to prime Bayesian and ML analyses. A single MrBayes run proceeded for 200,000 generations, reaching burn-in at 10,000 to 80,000 generations.
To further explore the validity of concatenating protein alignments, parameters were evaluated for the plateau set of trees from a WAG-constrained MrBayes run for each family. In particular, the shape of the gamma distribution (alpha) had a 3.3-fold range (0.76 to 2.52) and the proportion of invariant sites (Pinvar) had a 15.6-fold range (0.016 to 0.25). Families were segregated into five partitions containing from 7 to 40 proteins that fell into approximately equal-sized areas in an alpha-versus-Pinvar plot.
Full concatenated alignment. For the full concatenation of all 104 alignments, a neighbor-joining tree was determined using PHYLIP with jumbling. This was used to prime a MrBayes run of 100,000 generations for the partitioned alignment, unlinking the statefreq, alpha, and Pinvar parameters for the partitions; the topology reached by 8,180 generations was unchanged by the end of the run. A second MrBayes run for 100,000 generations with the unpartitioned alignment was primed with the ML tree for the concatenated rRNAs; the topology reached by 5,480 generations was identical to that of the previous run and was unchanged by the end of the run. Fifty bootstrap resamplings of the full concatenation were generated and subjected to ML analysis primed with the neighbor-joining tree. The extended majority-rule consensus tree matched the topology of the two MrBayes runs.
To examine concordance, the 5,320 highly supported bipartitions (found in
95% of post-burn-in trees) from each of the 106 single-gene (104 proteins and two rRNAs) MrBayes analyses were identified. All trees generated in the course of this work, in ML bootstrap and MrBayes runs for single genes and concatenations, comprising at least 2,200 different topologies, were evaluated by taking the bipartitions of the tree and counting the number of matches to the highly supported bipartitions from the single-gene trees.
Links between nodes on the species tree.
We used the EEEP program (6) to determine, when possible, the minimal edit paths between each MrBayes consensus protein tree (considering only nodes with
95% posterior probability) and the final species tree, applying the built-in ratchets when necessary. The same minimum number of edits may be shared by different edit paths such that two types of edits may occur for any tree comparison, obligate and nonobligate edits (6). The edits specify links between nodes on the species tree, and the links were scored by adding the counts of the corresponding obligate edits and the weighted nonobligate edits for all tree comparisons.
Mitochondrial protein families. Sixteen of our 104 alphaproteobacterial protein families had homologs in at least one of the mitochondrial genomes, and each of the eight mitochondria were represented in at least 6 of these families. Mitochondrial homologs were added to each family, and sequences were aligned, masking ambiguous portions by using Gblocks in two steps. Two steps were applied because a single application of Gblocks either (i) trimmed too heavily, removing informative portions of the alphaproteobacterial alignment, or (ii) left unrelated portions of the more-divergent mitochondrial sequences (mainly at the termini) inappropriately aligned to the alphaproteobacterial sequence. The first Gblocks run was for ingroup Alphaproteobacteria and mitochondrial members of the alignment, and the two outermost endpoints of the resulting blocks were used for trimming only terminal portions of mitochondrial sequences. The second Gblocks run was for ingroup Alphaproteobacteria only, creating a mask for the ambiguous positions in all sequences in the alignment.
Based on the results of initial MrBayes runs for each family, the 16 families were segregated into five partitions containing one to five proteins that fell into approximately equal-sized areas in an alpha-versus-Pinvar plot. A BIONJ neighbor-joining tree was prepared and used to prime two parallel MrBayes runs of 200,000 generations, unlinking statefreq, alpha, and Pinvar parameters for the partitions, in the model-jumping mode, with burn-in by 80,000 generations. The BIONJ tree was also used to prime ML analysis for 100 bootstrap resamplings of the alignment.
| RESULTS |
|---|
|
|
|---|
|
We used a novel approach to standardize the starting position for protein sequences by extending protein sequences N terminally according to genome sequence, realigning, and marking the N-terminal endpoint of conservation. As a second test of whether the remaining families might mix members of paralogous subgroups, the families were enlarged again by the addition of high-scoring second-best BLAST hits. Of eight greatly enlarged families, only one could be resolved into two orthologous subfamilies and the others were rejected. Among the families still under consideration, 27 had one or more paralogs. These were resolved, when possible, by a simple test based on tree branch lengths (55 of 64 cases); otherwise, no candidate was retained. The families were then subjected to tests for both noisy phylogenetic signal and possible horizontal gene transfer (25); five families were rejected due to noisy signal and none due to possible horizontal transfer.
The molecular functions of the 104 retained protein families were approximately equally distributed among ribosome structure, other protein synthesis roles, protein fate, RNA and DNA metabolism, and other metabolism; one family had no functional characterization (see Table S2 in the supplemental material).
Tree building. For each protein family, sequences were aligned, ambiguous portions of the alignments were masked out, and trees were built using ML and Bayesian methods. As with the rRNA trees, tree topologies did not agree either between the two methods or between any two families and support for nodes was often weak. Of 7,968 nodes among the ML trees, 2,201 (28%) had <50% bootstrap support.
It was observed that the pair of tree topologies produced by the two methods had better agreement for the families with longer alignments (Fig. 2A). This suggested that part of the problem with single families is that they have insufficient information content and that trees from even longer alignments would be more reliable. Producing a longer alignment by the concatenation of alignments for multiple-protein families is justified if the families have evolved under similar parameters or if those parameters are allowed to vary independently in partitions of the family set. The WAG amino acid substitution matrix was found to be either the best available or near optimal in each of the single-protein analyses, so this matrix was used for analyzing concatenated alignments. Initial trees also showed that two parameters, the shape parameter of the gamma distribution and the proportion of invariant sites, varied somewhat among the protein families. The families were sorted into five subgroups according to their position in a plot of these two parameters, and the concatenation of all 104 alignments was partitioned accordingly. Bayesian analysis, primed with either a neighbor-joining tree for the concatenated alignment or an rRNA tree, quickly and stably settled upon a single tree topology. The consensus tree from ML analysis of bootstrap samples of the concatenated alignment had an identical topology. Thus, a single tree topology emerged from multiple analyses of the concatenated alignment, and the support values were very high: 100% for each node by Bayesian analysis and nearly so for ML bootstrap analysis. Additional measures described below provided further support for this topology, and we term the most likely Bayesian tree (Fig. 3) our "final tree."
|
|
95% support, 4,362 (82.0%) were in agreement with the final tree (7). Thus, the genes for these molecules show a strong trend toward the same pattern of vertical inheritance. Concordance was also used to assess each of the 77 nontrivial nodes in the final tree. Values for the percentage of single-molecule trees that showed high (
95%) support for a node were low (53.6%) on average and ranged from 7 to 100% (Fig. 3). Node concordance should be considered an extremely stringent criterion for evaluating support within the tree and should not be interpreted as support values usually are, i.e., as the probability that a node is correct. Nonetheless, the values do appear to provide some measure of the reliability of the nodes, in that the nodes that have ML bootstrap values below 100% are among those with the lowest concordance values. As a second approach to evaluating the reliability of the tree, we examined how well its topology was reproduced by subgroups of the full set of 104 proteins, in what might be considered whole-protein bootstrap analysis of the concatenated alignment (Fig. 2B). Random groups of 4, 10, or 26 of the proteins and other subgroups were taken, and their concatenated alignments were evaluated by ML and Bayesian methods. With longer alignments, the trees built by the two methods agreed better and also agreed better with the final tree. The curves fitting these two trends were extrapolated to the number of characters in the full concatenated alignment, showing that the full alignment was expected to produce a very reliable topology (Fig. 2). Comparison with the trends for the rRNA alignments indicates that the phylogenetic resolving power of nucleotide characters was six- to ninefold lower than for amino acids in our masked alignments.
Overall concordance with the single-molecule trees, a criterion quite different from the original one (likelihood), was used to reassess all the thousands of tree topologies encountered in all the MrBayes runs and ML studies for individual sequence alignments and concatenations. The final tree had the second-best concordance; another tree exceeded its concordance by 1 (agreeing with 4,363 of the highly supported nodes from the single-molecule trees). A single subtree prune-and-regraft operation (edit) to the final tree, grouping the Caulobacterales/Parvularculales/Hyphomonadaceae with the Rhizobiales rather than with the Rhodobacterales, was sufficient to produce the most concordant tree. No additional edits to the final tree that improved its concordance were identified among the next 10 most concordant topologies.
From tree to network. When a single-protein tree disagrees with a species tree, a series of edits can often be proposed that bring the trees into agreement (6). An extreme interpretation of the minimal set of such edits is that they represent a minimum number of horizontal transfers in the history of the protein gene, although it should be noted that discordance between protein and species trees may have other causes, such as inappropriate alignment and masking. These edits create links between nodes of the species tree, converting the tree to a network, and the links can be ranked by the edit counts for all the protein trees under study. The systematic determination of such links, which have been termed "highways of gene sharing," on a large prokaryotic species tree showed that most edits occurred within the major phylogenetic divisions but that a large number of edits occurred between divisions (7). We examined the minimal edit series that converted our final species tree into each of the single-protein trees of our study (except for 13 protein trees for which no series could be determined); the number of minimum edits averaged 7.5 per tree. The edits defined 556 links on the species tree, with 243 scoring at least 1.0. Figure 4 shows that the top links tend to cross the nodes with poorer concordance support. They illustrate some of the more likely phylogenetic avenues along which even the highly conserved genes in our collection may have been successfully transferred during the evolution of these genomes.
|
For these 16 protein families, the representative sequences from outgroup Proteobacteria, Alphaproteobacteria, and mitochondria were aligned, masked, and concatenated. Two trees were generated using MrBayes and ML bootstrap analyses; their topologies were compatible and, in the bacterial portion, matched that of the 104-protein tree. The mitochondria grouped together on a single branch that emerged from within the Rickettsiales, with the combined Anaplasmataceae/Rickettsiales as a sister group, and were subtended by the Pelagibacter branch (Fig. 5).
|
| DISCUSSION |
|---|
|
|
|---|
This tree splits the Hyphomonadaceae from the Rhodobacterales and places them with the combined Caulobacterales and Parvularculales. The clustering of Hyphomonadaceae with Caulobacterales has been noted before. In one study, the Hyphomonadaceae were observed to cluster with the Caulobacterales in protein-based trees, yet with the Rhodobacterales in 16S rRNA trees (1). We obtained these same results; indeed, the unfavored Hyphomonadaceae/Rhodobacterales clustering was obtained for 16S rRNA trees whether we used (i) an alignment based on a profile from RDP and masked using Gblocks (Fig. 1), (ii) a masked alignment prepared directly by a server at RDP, or (iii) a de novo alignment by MUSCLE masked using Gblocks (data not shown). Horizontal transfer of the 16S rRNA gene has been invoked for other alphaproteobacteria and specifically for Hyphomonas (1, 30). However, the favored Hyphomonadaceae/Caulobacterales clustering did appear in a recent analysis of a manual alignment of 16S rRNA sequences (24), suggesting that horizontal transfer need not be invoked and that RDP-based and other 16S rRNA sequence alignments may have been misleading. Members of the Hyphomonadaceae and Caulobacterales share an unusual dimorphism, exhibiting both a nonmotile stalked reproductive cell type and a motile cell type, and comparison of Hyphomonas and Caulobacter genomes have revealed several close relationships. There have been calls to unite these groups within the order Caulobacterales (1, 2, 24). Our tree supports this unification and suggests that the order should additionally include the Parvularculaceae, with the abandonment of the order Parvularculales, which was designated mainly on the basis of 16S rRNA analysis (12). It can be noted that one characteristic linking the Hyphomonadaceae and Caulobacterales, a proliferation of TonB-dependent receptors (2), is shared by Parvularcula; 29 of its proteins are so annotated. However, this characteristic may not be highly distinctive; 18 of 34 non-Rickettsiales Alphaproteobacteria in a recent study had more than 10 TonB- dependent receptors, although only 6 had more than 25 (8). Comprehensive comparative genome analysis of the single Parvularcula genome will be an important next step in understanding its affiliation, especially since no candidate strain for a second genome project has been described that is both closely related and sufficiently distinct to provide a phylogenetic perspective. It has been further suggested that the remaining Rhodobacterales (exclusive of the Hyphomonadaceae) should be subsumed within the Caulobacterales (24); however, neither our data nor the data in that study strongly support this grouping.
Despite its success in grouping Hyphomonadaceae with Caulobacterales, the 16S tree used in the study of Lee et al. disagreed in several important ways with our multiprotein tree (24). It split the Rhodospirillales. It grouped Parvularcula with Sphingomonadales and Sphingomonadales with Rhizobiales. These discrepancies may be due to the usual problems with inference based on 16S rRNA gene sequences, possibly insufficient information content, or difficulties in proper alignment and masking. By concatenating masked alignments for multiple sequence families, the information content is increased and the effects of nonsystematic errors in alignment and masking are diluted.
The evolutionary branching order for six alphaproteobacterial orders inferred based on conservation patterns for indels in protein sequences (22) agrees with the branching order obtained for our tree, although that study did not resolve the branching order of the Rhodospirillales and Sphingomonadales as ours has. Our tree also confirms the deep split in the Rhizobiales noted frequently in previous studies (23, 24).
Much evidence supports the origin of mitochondria from within the Alphaproteobacteria, although various positions within the alphaproteobacterial phylogeny have been proposed. One study has pointed outside the Rickettsiales to Rhodospirillum (17), but most have placed the mitochondrial ancestor within or basal to the Rickettsiales, some specifically within the Rickettsiaceae (16), and others, like ours, as a sister to the combined Rickettsiaceae and Anaplasmataceae (19, 32). The availability of many recently sequenced genomes for our analysis adds new perspective to this placement, showing the free-living marine bacterium Pelagibacter closely subtending the mitochondria/Rickettsiaceae/Anaplasmataceae group. Thus, Pelagibacter will serve as a useful member of the outgroup in future phylogenetic analysis of mitochondrial genes.
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Published ahead of print on 4 May 2007. ![]()
Supplemental material for this article may be found at http://jb.asm.org/. ![]()
| REFERENCES |
|---|
|
|
|---|
-proteobacterial genome evolution. Proc. Natl. Acad. Sci. USA 101:9722-9727.
-Proteobacteria. Int. J. Syst. Evol. Microbiol. 53:1031-1036.This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Appl. Environ. Microbiol. | Infect. Immun. | Eukaryot. Cell |
|---|---|---|
| Mol. Cell. Biol. | J. Virol. | Microbiol. Mol. Biol. Rev. |
| ALL ASM JOURNALS |