Journal of Bacteriology, September 2006, p. 6429-6434, Vol. 188, No. 17
0021-9193/06/$08.00+0 doi:10.1128/JB.00484-06
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan,1 Division of Bioinformatics, Medical Institute of Bioregulation, Kyushu University, 3-1-1, Maidashi, Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan2
Received 6 April 2006/ Accepted 12 June 2006
The amino acid sequences of ComC and ComD have highly diverged (8, 30), and each CSP predominantly interacts with its cognate ComD (7, 8, 10, 22). Therefore, each bacterial species or strain with a distinct CSP and ComD pair displays a specific induction of natural competence (8, 30). The mechanisms that generate sequence variation are considered to be recombination (8, 30) and point mutations (8). The former mechanism was discussed by Håvarstein et al. (8). However, the latter mechanism has not been fully examined yet. Positive selection is one of the possible mechanisms for the accumulation of point mutations. We examined here whether positive selection plays a role in the sequence diversity of ComC and ComD.
Sequence data. The nucleotide sequences of streptococcal comC genes, collected by keyword and sequence similarity searches, were classified into 10 groups, each consisting of almost identical sequences. We selected a sequence from each group as a representative if the nucleotide sequence of the cognate comD gene was available. Among the 10 cognate pairs of comC and comD genes, nine pairs have been registered as the same entries in the GenBank nucleotide sequence database. The identification (ID) codes and the source strains of the nine pairs in GenBank are as follows: AJ240763 (Streptococcus mitis Col16), AJ240787 (S. pneumoniae R6), AJ240790 (S. pneumoniae Fs), AJ240791 (S. pneumoniae 101/87), AJ240792 (S. pneumoniae Pn13), AJ240794 (S. oralis Col19), AJ240795 (S. mitis NCTC10712), AJ000866 (S. mitis Hu8), and AJ000871 (S. mitis B5). As for the remaining pair, derived from S. mitis strain NCTC12261, the nucleotide sequence of the comC gene is available in GenBank (ID code AJ000875). The cognate comD gene (ID code SMT0012) was obtained from The Institute for Genomic Research Microbial Database (http://www.tigr.org/). The nucleotide sequences of some comD genes lacked the 3' regions. The 5' segmental region consisting of 384 nucleotides, which is shared by the available comD genes, is considered to encode the receptor domain for the CSPs (7, 30). Accordingly, we used the 5' segmental sequences of the comD genes in the present study.
Calculation of kA and kS. The numbers of nonsynonymous (amino acid-altering) and synonymous (silent) nucleotide substitutions per site between two related sequences, KA and KS, are commonly used to study the mechanisms of nucleotide sequence evolution (4, 16-19, 23, 31). We used these values to evaluate the recombination and positive selection of the comC and comD genes. Here, KA and KS were calculated by Miyata and Yasunaga's method (18). The initiation and termination codons and gap sites were excluded from this calculation. In addition, we applied the Jukes-Cantor method (11) to KA and KS to correct for the effect of multiple substitutions. The corrected values are indicated as kA and kS values.
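As an illustration of the correction step, the following Python sketch applies the standard Jukes-Cantor formula, k = -(3/4) ln(1 - (4/3)K), to an uncorrected per-site value K. The function name and the treatment of saturated values are assumptions made for illustration; the text states only that the Jukes-Cantor method (11) was applied to KA and KS.

```python
import math


def jukes_cantor(k_uncorrected: float) -> float:
    """Correct a per-site substitution proportion for multiple substitutions.

    Implements k = -(3/4) * ln(1 - (4/3) * K). Returning infinity for
    K >= 0.75 (saturation) is an illustrative choice, not taken from the source.
    """
    if k_uncorrected < 0:
        raise ValueError("substitution proportion must be non-negative")
    if k_uncorrected >= 0.75:
        return math.inf  # the logarithm is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * k_uncorrected)


# Hypothetical uncorrected values for one sequence pair.
K_A, K_S = 0.05, 0.10
k_A, k_S = jukes_cantor(K_A), jukes_cantor(K_S)
```

The corrected value is always at least as large as the uncorrected one, and the gap widens as K grows, which is why the correction matters most for diverged pairs.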
Analysis of recombination and classification of genes. To accurately evaluate whether the point mutations had accumulated by positive selection, the effect of recombination must be eliminated, because recombination distorts the estimation of evolutionary distance (2). Moreover, recombination has reportedly occurred in the comC and comD genes (8, 30). Therefore, we searched for putative recombination points and located their positions in the comC and comD genes. Synonymous substitutions, measured by kS, are free from selective constraint at the protein level, whereas nonsynonymous substitutions, measured by kA, are under selective constraint at the protein level (18, 19, 24). We used this property of kS to find the recombination points as follows. Consider a pair of aligned nucleotide sequences consisting of N codons. We calculated the difference in kS between the 5'-terminal m codons and the remaining N-m codons, changing m from x to N-x. To avoid an artificially large difference due to a small number of codons in a terminal region, divisions that left a terminal region of fewer than x codons were excluded from the detection. In the present study, x was set to six. When no sequence under comparison has undergone recombination, the difference is small at any division, because the kS value is expected to be similar across the entire gene (19). In contrast, when recombination has occurred in the genes under comparison, the kS values are expected to differ between the two regions divided by the recombination point, reflecting the different evolutionary histories of each region. In this analysis, we assumed that at most one recombination point is present in a pair of comC gene sequences, because only one recombination point was reportedly observed within the comC gene (8). The same procedure was applied to the comD genes, under the same assumption as for the comC genes.
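The scan over division points described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes that per-codon synonymous differences and synonymous site counts have already been obtained for an aligned pair (for example, by Miyata and Yasunaga's method), it pools them within each segment before applying the Jukes-Cantor correction, and it uses the absolute difference in kS. All function and variable names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor correction; returns inf at saturation (illustrative choice)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def segment_ks(syn_diffs: Sequence[float], syn_sites: Sequence[float]) -> float:
    """Corrected kS for a run of codons, pooling differences over sites."""
    sites = sum(syn_sites)
    if sites == 0:
        return 0.0
    return jukes_cantor(sum(syn_diffs) / sites)


def scan_divisions(syn_diffs: Sequence[float],
                   syn_sites: Sequence[float],
                   x: int = 6) -> List[Tuple[int, float]]:
    """Return (m, |kS(5' m codons) - kS(remaining N-m codons)|) for m = x..N-x."""
    n = len(syn_diffs)
    results = []
    for m in range(x, n - x + 1):
        left = segment_ks(syn_diffs[:m], syn_sites[:m])
        right = segment_ks(syn_diffs[m:], syn_sites[m:])
        # Saturated segments (inf) are not handled specially in this sketch.
        results.append((m, abs(left - right)))
    return results


def best_division(syn_diffs: Sequence[float],
                  syn_sites: Sequence[float],
                  x: int = 6) -> Tuple[int, float]:
    """Division point with the largest kS difference (a candidate breakpoint)."""
    results = scan_divisions(syn_diffs, syn_sites, x)
    if not results:
        raise ValueError("alignment is too short for the chosen x")
    return max(results, key=lambda item: item[1])


# Hypothetical pair of 20 codons: identical in the first 10 codons,
# diverged at synonymous sites in the last 10.
diffs = [0.0] * 10 + [0.5] * 10
sites = [1.0] * 20
breakpoint_m, delta = best_division(diffs, sites)
```

In this toy input the largest difference falls at m = 10, the boundary between the two artificially constructed halves. How large a difference must be before a division counts as a putative recombination point is not specified in the text, so no threshold is applied here.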
The putative comC recombination points were clustered in a region that corresponds to the amino acid sequence alignment sites 18 to 27 of Fig. 1A. The region was close to the boundary between the coding regions of the leader and functional peptides. Based on the presence of the putative recombination points, the 10 comC genes were classified into two groups. No recombination was observed within each group, but putative recombination points were detected between the members of the two groups. One of the groups consisted of the comC genes from six strains (Fs, R6, Col16, Col19, 101/87, and B5), whereas the other comprised the comC genes from four strains (NCTC10712, Hu8, Pn13, and NCTC12261). These groups are hereafter referred to as groups 1 and 2, respectively. In contrast, the putative recombination points of the comD genes were clustered into two regions. One was clustered in a region corresponding to the amino acid sequence alignment sites 8 to 19 of Fig. 1B. As in the case of the comC genes, the 10 comD genes were classified based on the presence of the putative recombination points. Then, two groups without recombination were obtained. The source strains of each comD gene group were identical to those of each comC gene group. Therefore, the comD gene groups are also referred to as groups 1 and 2. The other region was present at the boundary corresponding to the amino acid sequence alignment sites 111 to 112 of Fig. 1B. As shown in the figure, the region is close to the C termini of the ComD fragment sequences used in the present study. The putative recombination points further divided group 2 of the comD genes into two groups. One consisted of the comD genes from three strains: NCTC10712, Hu8, and Pn13. The other included a single gene, from NCTC12261. 
In further analyses, we excluded the portion of the comD genes downstream of the region corresponding to the latter recombination points, because with this exclusion both the comC and comD genes could be classified into the same two groups, groups 1 and 2. Then, the sequences within the same group were compared to examine the possibility of positive selection.
FIG. 1. Multiple alignments of the ComC (A) and ComD (B) amino acid sequences. The nucleotide sequences of the 10 comC genes and the cognate comD genes were aligned by using CLUSTAL W 1.82 (27). The amino acid sequence alignment was constructed based on the nucleotide sequence alignment. The upper six sequences belong to group 1, whereas the lower four sequences belong to group 2. When more than 50% of the sequences of a group have the same amino acid residue at an alignment site, the residue at that site is shaded. The amino acid residues of ComC from strain R6, which correspond to the residues used to determine the specificity of receptor recognition (10), are enclosed by rectangles. The dashed lines 1, 2, and 3 indicate the regions where putative recombination points were detected from pairwise sequence comparisons. The regions represented by the dashed lines 1 and 2 were detected between the members of groups 1 and 2. The putative recombination points detected within group 2 for the comD gene are included in the region indicated by the dashed line 3. The asterisks, plus signs, and period symbols indicate the amino acid sites that correspond to the codon sites that have an ω of >1.0, with posterior probabilities higher than 99, 95, and 50%, respectively. The posterior probabilities were calculated by the codeml program.
ω, the ratio of kA to kS (18), was examined for every pair of aligned nucleotide sequences for each group. ω is used as a criterion for detecting genes under positive selection (18, 31). Some nonsynonymous substitutions, which induce replacements with amino acids with different physicochemical properties, are likely to be deleterious to the protein and thus to the individual (24). Such substitutions are unlikely to become fixed in the population and are removed by negative selection, which makes kA lower than kS. In this case, therefore, the ω value would be less than 1. When nonsynonymous substitutions are advantageous to the protein and the individual, such substitutions are likely to become fixed in the population, and consequently amino acid substitutions in the gene product are promoted, which makes the ratio ω greater than 1.
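The interpretation of ω described above can be expressed as a small Python sketch. The function names and the labels returned are hypothetical; the explicit handling of kS = 0, where the ratio is undefined, is an assumption added because a pair with no synonymous substitutions cannot yield a meaningful ratio.

```python
import math


def omega(k_a: float, k_s: float) -> float:
    """Ratio of nonsynonymous to synonymous substitutions per site (kA / kS)."""
    if k_s == 0.0:
        return math.nan  # ratio undefined when no synonymous change is observed
    return k_a / k_s


def interpret(k_a: float, k_s: float) -> str:
    """Map a (kA, kS) pair onto the selection regime suggested by omega."""
    w = omega(k_a, k_s)
    if math.isnan(w):
        return "undefined (kS = 0)"
    if w > 1.0:
        return "omega > 1: consistent with positive selection"
    if w < 1.0:
        return "omega < 1: consistent with negative selection"
    return "omega = 1: consistent with neutrality"


# Hypothetical values for two sequence pairs.
label_a = interpret(0.10, 0.40)
label_b = interpret(0.50, 0.25)
```

A single pairwise ω above 1 is only suggestive; with short sequences and few substitutions, the ratio can exceed 1 by chance.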
In this analysis, we divided the alignment of the comC genes into two parts, based on the functions of the encoded peptides. Then, the kA and kS values were calculated for every pair within the same group in each part. One of the parts was the 5' region consisting of 72 nucleotides, which encodes the leader peptide. The other was the 3' region consisting of 48 nucleotides, which encodes the functional peptide. Hereafter, these regions are referred to as the leader region and the functional region, respectively. The kA and kS values of 15 pairs were calculated for group 1, and those of six pairs were calculated for group 2. The results of the comC gene calculations are summarized in Fig. 2A. A plot with an ω of <1.0 lies in the lower right region, below the diagonal line in Fig. 2, whereas a plot with an ω of >1.0 lies in the upper left region, above the diagonal line. All of the plots for the pairs of the leader region, except for two, had an ω of <1.0. Most protein-coding genes generally have an ω of <1.0 (24). The ω values of the leader region thus agree with those of ordinary protein-coding genes. The two exceptional plots, which come from group 1, may be artifacts of the small number of mutations between closely related sequences, because both plots had a kS of 0.0 and a kA of <0.04. Of the 21 plots for the pairs of the functional region, 11 had an ω of <1.0, as did most of those for the leader region. However, the remaining 10 plots had an ω of >1.0. Such plots with an ω of >1.0 were found in both groups. This observation indicates the possibility that the functional region has undergone positive selection. The single outlier of the functional region, identified in the vicinity of kS = 0.9, is considered to be an artifact due to the small number of sites available for calculation.
FIG. 2. Plots of kS (horizontal axis) versus kA (vertical axis). The kA values are plotted as a function of the kS values. If a plot is present above the diagonal dashed line, then the ω value for the pair corresponding to the plot is >1.0. Likewise, the sequence pair corresponding to a plot present under the diagonal dashed line has an ω value of <1.0. (A) Open and closed circles indicate the plots for the functional and leader regions, respectively. Of 10 open circles with an ω of >1.0, two circles are overlaid. (B) Open and closed circles indicate the plots for the conserved and the variable regions, respectively.
ω of <1.0. The two exceptional plots may be artifacts, for the same reason as in the case of the comC genes, because both plots had kS = 0.0 and kA < 0.03. In contrast, of 21 plots for the variable region of the comD genes, 12 showed an ω of >1.0. This result suggests that the variable region of the comD genes has been the target of positive selection. Considering that the functional region of ComC is subjected to positive selection, the variable region may be involved in directly binding to CSP.
Analysis of positive selection by ω at each codon site. To confirm the above results, we used the codeml program of the PAML package (version 3.14) (31) to examine, by the likelihood ratio test (LRT), whether the comC and comD gene alignments include sites with an ω of >1. In this analysis, the nucleotide sequence alignments were not divided; instead, the entire regions of the comC and comD gene alignments were analyzed by the codeml program. LRT by the codeml program requires information about the topology and branch lengths of the phylogenetic tree constructed from the alignment. We obtained this information by constructing maximum-likelihood (ML) trees of the comC and comD genes with a tree estimation program, gamt (12). Given this information, the parameters, including ω, for a model of evolution were inferred by the ML method. Then, the log-likelihood value of the alignment was calculated with the estimated parameters under the model. The log-likelihood was calculated for a null model and an alternative model. The null model does not allow sites with an ω of >1, whereas the alternative model allows the alignment to include sites with an ω of >1. Then, twice the difference in log-likelihood between the two models was calculated, which follows a χ² distribution with 2 degrees of freedom. First, two models, M7 and M8 implemented in the PAML package, were used as the null and alternative models for LRT, respectively. In M7, codon sites are classified into 10 categories, each corresponding to a distinct ω value within the interval from 0 to 1. That is, M7 does not include sites with an ω of >1. The alternative model, M8, is constructed by adding to M7 an 11th category, which has an ω of >1. If M7 is rejected by LRT, then the alignment includes sites subjected to positive selection. The LRT results for the comC and comD genes are summarized in Table 1. For either group of the comC and comD genes, the null model M7 was rejected with statistical significance. We also examined the LRT between two models, M1a and M2a, which are regarded as the simple versions of M7 and M8. In the null model M1a, the codon sites are classified into two categories: the conserved category with 0 < ω < 1 and the neutral category with ω = 1. The alternative model, M2a, is constructed by adding to M1a a third category of sites, which has an ω of >1. As with M7 and M8, M1a was rejected for each group of the two genes (see Table 1). These results suggest that the comC and comD genes have been subjected to positive selection.
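The test statistic used in these comparisons can be computed with a few lines of Python. The sketch below is illustrative only: the log-likelihood values are hypothetical placeholders rather than values from Table 1, and the function names are assumptions. It relies on the fact that, for 2 degrees of freedom, the upper-tail probability of the χ² distribution has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Twice the log-likelihood difference between alternative and null models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_upper_tail_df2(statistic: float) -> float:
    """Upper-tail probability of a chi-square distribution with 2 degrees of freedom."""
    if statistic <= 0.0:
        return 1.0
    return math.exp(-statistic / 2.0)


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P value for comparing nested models such as M7 vs. M8 or M1a vs. M2a."""
    return chi2_upper_tail_df2(lrt_statistic(lnl_null, lnl_alt))


# Hypothetical log-likelihoods for a null model and an alternative model.
lnl_m7, lnl_m8 = -1000.0, -995.0
stat = lrt_statistic(lnl_m7, lnl_m8)   # 10.0
p = lrt_p_value(lnl_m7, lnl_m8)        # exp(-5), about 0.0067
reject_null = p < 0.05
```

With these placeholder values the null model would be rejected at the 5% level, since the statistic exceeds the critical value of about 5.99 for 2 degrees of freedom.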
TABLE 1. Summary of the likelihood ratio tests by the PAML package
ω of >1.0 under either model, M8 or M2a, by the naive empirical Bayes method. We estimated such sites from the alignments of the comC and comD genes with the program. Many codon sites with an ω of >1.0 estimated under M2a agree with the sites estimated under M8 (data not shown). The amino acids that correspond to the sites with an ω of >1.0 under M8 are shown in Fig. 1. Recently, five amino acid residues of a CSP from S. pneumoniae strain Rx, which determine the specificity of receptor recognition, have been identified by structural and biochemical studies (10). The amino acid sequence of the CSP from strain Rx is identical to that from strain R6, which belongs to group 1. The alignment sites corresponding to the five residues were 28, 36, 37, 40, and 41 (see Fig. 1A). Of the five sites, three (28, 37, and 41) were identified as the targets of positive selection in group 1, whereas only site 37 was subjected to positive selection in group 2. This observation suggests that the positive selection is related to the determination of the specificity of receptor recognition in group 1 and that the mechanism for receptor recognition is somewhat different between the two groups.
We examined whether the codon sites with an ω of >1.0 estimated by the codeml program were unevenly distributed. As shown in Table 2, the number of such codon sites in the functional region of the comC genes was greater than that in the leader region (P < 0.01) in group 1. The number of such sites in the variable region of the comD genes was also greater than that in the conserved region (P < 0.01) in group 1. As for group 2, the codon sites with an ω of >1.0 also seemed to be abundant in the functional region of the comC genes and the variable region of the comD genes, although the biases did not satisfy the statistical criterion of a P value of <0.05. These results suggest that the functional region of the comC gene and the variable region of the comD gene have been the targets of positive selection, in agreement with the results of the pairwise analysis of the ratio of kA to kS.
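The text does not state which statistical test produced these P values. As one possible way to assess such a bias, the following Python sketch applies a one-sided Fisher's exact test to a 2 x 2 table of codon sites (region by whether ω exceeds 1.0). The choice of test, the function name, and the counts are all assumptions made for illustration; the counts are not taken from Table 2.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for enrichment in the first row.

    Table layout (hypothetical):
                      omega > 1   omega <= 1
        region 1          a           b
        region 2          c           d
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # codon sites in region 1
    col1 = a + c          # codon sites with omega > 1 in total
    total = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, col1)


# Hypothetical counts: 3 of 4 sites selected in region 1, 1 of 4 in region 2.
p_value = fisher_one_sided(3, 1, 1, 3)
```

With so few sites the test has little power, which is consistent with the caution expressed in the text about short sequences and small sample sizes.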
TABLE 2. Number of codon sites with an ω of >1.0, estimated by the codeml program
Driving force of positive selection. The results of our study suggest that positive selection, in addition to recombination, is an important mechanism for generating sequence variability in the CSPs and ComDs. Positive selection has been found in many molecules (4, 16, 23). However, positive selection due to competition among bacterial populations has not previously been detected. Recent studies have suggested that the competence system of S. pneumoniae is involved in the competition for DNA resources and nutrition by promoting the bacteriolysis of noncompetent cells (5, 13, 20, 25, 26) and protecting the competent cells against their own bacteriolytic agents (5, 9). CSPs and ComDs display variability (8, 30) and specificity (7, 8, 10, 22), which may guarantee the strict self-recognition of a strain and consequently the occupation of DNA resources by the same strain. Therefore, the competition for DNA resources through competence-mediated cell lysis may be the driving force of positive selection for the ComC-ComD system, and the present study provides the first example of positive selection due to competition between closely related bacterial populations. The sequences used in the present study were short, and the number of available sequences was small. Further accumulation of comC and comD gene sequences and the development of theoretical treatments of positive selection will clarify the evolutionary mechanism of this system.