Journal of Bacteriology, October 2002, p. 5733-5745, Vol. 184, No. 20
0021-9193/02/$04.00+0 DOI: 10.1128/JB.184.20.5733-5745.2002
Copyright © 2002, American Society for Microbiology. All Rights Reserved.
Department of Biological Sciences and Department of Mathematics, Stanford University, Stanford, California 94305
Received 29 April 2002/ Accepted 22 July 2002
The Shine-Dalgarno (SD) sequence plays an important role in formation of the initiation complex by base-pairing with the anti-SD sequence found at the 3' end of 16S rRNA. This has been demonstrated by extensive experiments with Escherichia coli (9, 19, 46), other bacteria, and even archaea (8, 32, 35, 37). SD sequences can be different subsequences of the complement of the anti-SD sequence (see Table 1); however, most are slight variations of the GGAGG core (43). The effectiveness of an SD sequence is determined by both its base-pairing potential with the anti-SD sequence and its spacing from the start codon (10, 34). The aligned spacing of SD sequences (see the legend to Fig. 1A for the definition) generally varies from 5 to 13 bases, with optimal spacings of about 8 to 10 bases for E. coli genes (7, 34). Although an SD sequence is not mandatory for translation initiation, a strong one may compensate for a weak start codon and counteract mRNA secondary structures that hinder access to the start codon (10, 55).
TABLE 1. Features of the prokaryotic genomes studied
FIG. 1. Distribution of aligned spacings of SD sequences for RP, PHX, and PMX gene classes. (A) Simplified diagram of the translation initiation complex formed between an E. coli mRNA and the 30S ribosomal subunit. The aligned spacing is designated as the distance between the center of the SD sequence GGAGG and the start codon AUG. (B and C) Histograms of the SD aligned spacings in the genomes of Escherichia coli K-12 (ESCCO) and Pyrococcus abyssi (PYRAB), respectively.
The main objective of this paper is to investigate, in 30 complete prokaryotic genomes (available as of June 2001), the correlation between the presence of an SD sequence and predicted expression levels of genes based on codon usage biases, functional gene classes, type of start codon, and distance between successive genes.
Detection of SD sequences.
To detect putative SD sequences, we calculated the free energy (designated ΔGSD, in kilocalories per mole) for all possible duplexes between the anti-SD sequence and the 20 bases upstream of the start codon of a gene. Dynamic programming was implemented to find the duplex that gave the lowest free energy. This method has been described in several publications and is well accepted in SD detection (12, 32, 37, 43). The stacking energy was calculated based on the rules developed by Freier et al. (13). Only canonical Watson-Crick base pairs and G-U pairings flanked by Watson-Crick base pairs were allowed, and the free energy loss by duplex initiation, 3.4 kcal/mol (13), was subtracted. To reduce ambiguity, a cutoff value of ΔGSD = -4.4 kcal/mol was used, which is the ΔGSD for the core SD motifs GGAG, GAGG, and AGGA in bacteria (43). A specific anti-SD sequence was used for each genome (Table 1).
There are several rationales in favor of a threshold of -4.4 kcal/mol. (i) An effective SD sequence usually binds to the core CCUCC of the anti-SD sequence, which is conserved in all but one of the genomes (Table 1). It seems unlikely that the basic mechanism of the SD interaction will change from genome to genome, given the conservation of the core anti-SD motif. Thus, we define the SD sequences GGAG, GAGG, and AGGA as core SD motifs, all of which have a free energy of binding of -4.4 kcal/mol. (ii) Constrained by the base composition of the anti-SD sequence, its complementary motifs with a free energy of binding greater than -4.4 kcal/mol (i.e., weaker binding) most often bind parts other than the core CCUCC and are likely to be random motifs. Thus, we have designated this threshold to exclude these random motifs. (iii) We analyzed several genomes with SD sequences defined by the core SD motifs GAGG, GGAG, and AGGA (SD sequences were defined as sequences harboring any of these motifs) and obtained highly concordant results (data not shown). We also relaxed the stringency and accepted the 3-bp motifs GGA, GAG, and AGG as SD sequences; the final results were consistent (see Supplementary Data Fig. S-1 and Table S-1; all supplementary data can be accessed at http://gnomic.stanford.edu/jiongm/SD/).
For most genes, this cutoff value effectively leaves only one or no SD sequence in the 5' region (20 nucleotides). In rare cases, two or more competing motifs qualified as SD sequences; we then chose the one with the lowest free energy of binding.
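The motif-based variant mentioned in rationale (iii), in which an SD sequence is any upstream region harboring GGAG, GAGG, or AGGA, can be sketched in a few lines. This is only an illustration of that simplified check, not the free-energy dynamic programming used for the main analysis; the function names `has_core_sd` and `sd_percent` are hypothetical.

```python
# Minimal sketch of the simplified, motif-based SD check (rationale iii).
# Assumption: each input is the sequence immediately upstream of a start
# codon, written 5' to 3'; only the last 20 bases are examined.
CORE_SD_MOTIFS = ("GGAG", "GAGG", "AGGA")


def has_core_sd(upstream, window=20):
    """Return True if the last `window` upstream bases contain a core SD motif."""
    region = upstream.upper().replace("T", "U")[-window:]
    return any(motif in region for motif in CORE_SD_MOTIFS)


def sd_percent(upstream_regions):
    """Percentage of genes whose upstream region carries a core SD motif."""
    regions = list(upstream_regions)
    if not regions:
        return 0.0
    hits = sum(has_core_sd(region) for region in regions)
    return 100.0 * hits / len(regions)
```

Because DNA and RNA spellings are normalized, the same call works on upstream regions taken directly from a genome file.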
The aligned spacing of an SD sequence is defined as the number of bases between the first base of the start codon and the U in the core anti-SD motif CCUCC (Table 1) in the duplex formed (Fig. 1A). The aligned spacing of the SD sequence GGAGG in Fig. 1A is 7 bases. There are 22 possible spacings (0 to 21 bases). However, generally more than 80% of all the SD sequences occur at spacings of 5 to 13 bases to the start codon (see below).
Theoretical measures of gene expression.
We used a method introduced by Karlin and Mrázek (22) to assess codon biases of a class of genes (or a single gene) relative to a second class of genes. Let G be a group of genes with average codon frequencies g(x,y,z) for the codon triplet (x,y,z) such that Σ g(x,y,z) = 1, summed over the codons of each amino acid family. Similarly, let {f(x,y,z)} indicate the average codon frequencies for the gene group F, normalized to 1 in each amino acid codon family. The codon usage difference of F relative to G is calculated by the formula
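$$B(F|G) = \sum_{a} p_a(F) \left[ \sum_{(x,y,z)=a} \left| f(x,y,z) - g(x,y,z) \right| \right] \qquad (1)$$
where p_a(F) denotes the average frequency of amino acid a in the genes of F and the inner sum runs over the codons (x,y,z) that encode a. Following reference 22, the expression level of a gene g is assessed relative to the ribosomal protein (RP), chaperone/degradation (CH), and translation/transcription processing factor (TF) gene classes, with C denoting the collection of all genes:
$$E_{RP}(g) = \frac{B(g|C)}{B(g|RP)}, \quad E_{CH}(g) = \frac{B(g|C)}{B(g|CH)}, \quad E_{TF}(g) = \frac{B(g|C)}{B(g|TF)} \qquad (2)$$
$$E(g) = \frac{B(g|C)}{\tfrac{1}{2}B(g|RP) + \tfrac{1}{4}B(g|CH) + \tfrac{1}{4}B(g|TF)} \qquad (3)$$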
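As an illustration of how such measures could be computed, the sketch below implements the codon usage difference B(F|G) and the overall expression measure E(g) in the form published for the Karlin and Mrázek method (22): B weights the absolute codon frequency differences within each amino acid family by the amino acid frequencies of F, and E(g) divides B(g|C) by a 1/2, 1/4, 1/4 weighted combination of B(g|RP), B(g|CH), and B(g|TF). That form, the standard genetic code table, and all function names are assumptions of this sketch, not details taken from the surrounding text.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons (assumed table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(genes):
    """Count sense codons over a collection of in-frame coding sequences."""
    counts = Counter()
    for seq in genes:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if CODON_TO_AA.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def family_freqs(counts):
    """Codon frequencies normalized to 1 per amino acid, plus amino acid frequencies."""
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TO_AA[codon]] += n
    freqs = {c: (counts[c] / aa_totals[a] if aa_totals[a] else 0.0)
             for c, a in CODON_TO_AA.items() if a != "*"}
    total = sum(aa_totals.values())
    aa_freq = {a: n / total for a, n in aa_totals.items()} if total else {}
    return freqs, aa_freq


def codon_bias(f_genes, g_genes):
    """B(F|G): amino-acid-weighted sum of |f - g| over all sense codons."""
    f, p = family_freqs(codon_counts(f_genes))
    g, _ = family_freqs(codon_counts(g_genes))
    return sum(p.get(CODON_TO_AA[c], 0.0) * abs(f[c] - g[c]) for c in f)


def expression_level(gene, all_genes, rp, ch, tf):
    """E(g) = B(g|C) / (1/2 B(g|RP) + 1/4 B(g|CH) + 1/4 B(g|TF))."""
    denom = (0.5 * codon_bias([gene], rp)
             + 0.25 * codon_bias([gene], ch)
             + 0.25 * codon_bias([gene], tf))
    return codon_bias([gene], all_genes) / denom
```

A gene whose codon usage resembles the RP, CH, and TF classes more than it resembles the average gene yields E(g) above 1.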
Definition of PHX and putative alien (PA) gene classes.
We defined a gene as PHX if the following two conditions were satisfied: (i) at least two among the three expression values ERP(g), ECH(g), and ETF(g) exceeded 1.05 and (ii) the overall expression level E(g) was ≥1.00. A gene was defined as PA if it fulfilled the following criteria: B(g|RP) ≥ M + 0.10, B(g|CH) ≥ M + 0.10, B(g|TF) ≥ M + 0.10, and B(g|C) ≥ M + 0.10, where M is the median value among B(g|C) for all g (22, 23, 27). Predicted moderately expressed (PMX) genes are genes that are neither PHX nor PA. PMX genes constitute roughly more than 90% of a genome and thus represent average genes. We often use PMX as a standard to compare to PHX and PA genes.
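The classification rules can be written as a small decision function. The sketch below assumes that the codon usage differences B(g|C), B(g|RP), B(g|CH), and B(g|TF) are already available for a gene, that the class-specific expression values are the ratios B(g|C)/B(g|RP) and so on, and that the PA comparisons are "at least M + 0.10"; the name `classify_gene` and its parameters are hypothetical.

```python
from statistics import median


def classify_gene(b_c, b_rp, b_ch, b_tf, m, margin=0.10, phx_cut=1.05):
    """Return "PHX", "PA", or "PMX" for one gene.

    b_c, b_rp, b_ch, b_tf: codon usage differences B(g|C), B(g|RP),
    B(g|CH), B(g|TF); all are assumed to be positive.
    m: median of B(g|C) over all genes of the genome.
    """
    e_rp, e_ch, e_tf = b_c / b_rp, b_c / b_ch, b_c / b_tf
    e = b_c / (0.5 * b_rp + 0.25 * b_ch + 0.25 * b_tf)
    # PHX: at least two class-specific values exceed 1.05 and E(g) >= 1.00.
    if sum(v > phx_cut for v in (e_rp, e_ch, e_tf)) >= 2 and e >= 1.0:
        return "PHX"
    # PA: codon usage far from every reference class and from the average gene.
    if all(b >= m + margin for b in (b_rp, b_ch, b_tf, b_c)):
        return "PA"
    return "PMX"


# Usage with made-up B(g|C) values: M is their median.
example_m = median([0.25, 0.30, 0.35])
example_class = classify_gene(0.6, 0.3, 0.4, 0.5, example_m)
```

Checking PHX before PA is a choice of this sketch; the text does not say how a gene meeting both sets of criteria would be handled.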
Logistic regression analysis.
To study the correlation between SD presence and E(g) values, we used a logistic regression model (18). Considering a genome with n genes (each ≥100 codons), we observe n pairs (xi, yi), i = 1, 2, . . . , n, where xi = E(gi) is the predicted expression level of gene gi calculated by equation 3 and where yi designates the presence or absence of the SD sequence in gi (1 if present and 0 if not). We attempted to fit the data pairs to the logistic regression model
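$$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \qquad (4)$$
where π(x) = E(Y|x) represents the conditional mean of Y (SD presence), given x = E(g). The logit transformation is defined as
$$\operatorname{logit}[\pi(x)] = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \alpha + \beta x \qquad (5)$$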
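A one-predictor logistic regression of this kind can be fitted with Newton's method, and the significance of the slope assessed with a likelihood ratio test against the intercept-only model. The following is a minimal standard-library sketch under those assumptions; the parameter names `a` (intercept) and `b` (slope) and both function names are illustrative, and the data must contain both outcomes and must not be perfectly separable.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1|x) = 1 / (1 + exp(-(a + b*x))) by Newton's method."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        # Newton step: (a, b) += inverse(information) * score
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def likelihood_ratio_test(xs, ys, a, b):
    """Return (G, P) for the fitted model versus the intercept-only model."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    n, k = len(ys), sum(ys)
    null_ll = k * math.log(k / n) + (n - k) * math.log(1.0 - k / n)
    g = max(2.0 * (ll - null_ll), 0.0)
    # Chi-square survival function with 1 degree of freedom.
    return g, math.erfc(math.sqrt(g / 2.0))
```

A positive fitted slope with a small P value corresponds to SD presence becoming more likely as the predicted expression level rises.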
TABLE 2. SD% for RP, PHX, PMX, and PA genes
≥100 amino acids), and the optimal aligned spacings (OAS) for the SD sequences (discussed below). In bacterial genomes, the anti-SD sequence is AUCACCUCCUUU, although the archaeal genomes show some variation in their anti-SD sequences around the conserved core CCUCC (Table 1).
Using the free-energy method and a cutoff value of -4.4 kcal/mol, all the SD sequences detected were at least 4 bases in length, and most harbored the motif GGAG, GAGG, or AGGA (e.g., 88% of the SD sequences in Escherichia coli K-12). In some natural mRNAs an SD sequence can consist of a weaker motif, e.g., AAGG, with a ΔGSD of -2.9 kcal/mol (57). For our purposes we prefer to find only unambiguous SD sequences. In terms of base-pairing potential with the anti-SD sequence, the SD sequences defined by our method may be considered strong SD sequences. Most of them are present at an aligned spacing of between 5 and 13 bases, as verified by histograms of spacings of all the SD sequences in a genome (Fig. 1B and C; also see Supplementary Data Fig. S-2). An SD sequence at this range of spacings has been established to be effective (7, 16, 34).
Of the 30 genomes, 22 had an SD% exceeding 40% for all genes. Bacillus subtilis and Thermotoga maritima registered the highest SD%, 89.4% and 90.1%, respectively. The lowest genome SD% occurred for Rickettsia prowazekii, Mycoplasma genitalium, Mycoplasma pneumoniae, Halobacterium sp. strain NRC-1, Thermoplasma acidophilum, Sulfolobus solfataricus, and Pseudomonas aeruginosa, each at around 20%. In general, fast-growing bacteria, gram-negative thermophiles, spirochetes, methanogens, and hyperthermophilic archaea achieved relatively high SD%, while obligate intracellular parasites, surface parasites, pathogens, and cyanobacteria had diminished genome SD%.
We carried out a simulation study to determine whether these SD% values represent real DNA elements or just random motifs. For each genome, we generated 100 (1,000 for Escherichia coli K-12) data sets of random sequences 20 nucleotides long according to the base composition of the original 20-nucleotide 5' end sequence data set, each with the same number of sequences as in the given genome. SD sequences were detected and SD% was calculated for each set of these random sequences. The SD% values shown in Table 1 were found to represent real motifs in all the genomes except for Mycoplasma genitalium and Halobacterium sp. strain NRC-1, as assessed by distributions of the SD% for these simulated data sets (the probability of these SD% values coming from random sequences was <0.01).
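The simulation can be sketched as follows: random 20-nucleotide sequences are drawn according to the pooled base composition of the real upstream regions, the SD% of each random set is recorded, and the observed SD% is compared with that null distribution. For brevity this sketch detects SD sequences with the simplified core-motif check rather than the free-energy method, and the function names and the seed parameter are assumptions.

```python
import random

CORE_SD_MOTIFS = ("GGAG", "GAGG", "AGGA")


def sd_percent(regions):
    """Percentage of sequences containing a core SD motif (simplified check)."""
    hits = sum(any(m in r for m in CORE_SD_MOTIFS) for r in regions)
    return 100.0 * hits / len(regions)


def simulate_sd_percent(regions, n_sets=100, seed=0):
    """SD% of `n_sets` random data sets matching the observed base composition."""
    rng = random.Random(seed)
    regions = [r.upper().replace("T", "U") for r in regions]
    pooled = "".join(regions)
    bases = "ACGU"
    weights = [pooled.count(base) for base in bases]
    length = len(regions[0])
    results = []
    for _ in range(n_sets):
        fake = ["".join(rng.choices(bases, weights=weights, k=length))
                for _ in regions]
        results.append(sd_percent(fake))
    return results


def empirical_p(observed, simulated):
    """Fraction of simulated SD% values at least as large as the observed one."""
    return sum(s >= observed for s in simulated) / len(simulated)
```

An observed SD% that exceeds every simulated value gives an empirical probability of 0 at this resolution, i.e., below 1/n_sets.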
Correlation between SD presence and predicted gene expression levels. It is known that not all genes contain an SD sequence. In some genomes, the majority of genes do not have such a motif (Table 1). Although an SD sequence is not compulsory for the translation of many genes (21), it may still be effective for genes that contain such a motif. This raises the question of how the SD sequences are distributed in different gene classes.
First we examined SD sequences for the RP genes. Primarily highly expressed during fast growth, the RP gene class showed a very high SD%, around 80% in most genomes (Table 2). Even for genomes with a low overall SD%, the RP SD% was significantly high. For example, the SD% was 85.7% for RP genes in Thermoplasma acidophilum (23.5% for the genome) and 58.5% for RP in Sulfolobus solfataricus (23.0% for the genome). This is consistent with a greater SD presence for highly expressed genes.
We then divided the genes of a genome (≥100 codons) into three classes, PHX, PA, and PMX, based on codon usage biases (22). The percentage of PHX genes in different genomes ranged from 2% to 19%, whereas PA genes ranged from 0 to 13% (Table 2). PMX genes constituted the bulk of a genome and consisted mostly of average genes. The major PHX genes were RP, TF, and CH genes. Other PHX genes included those encoding enzymes of essential energy metabolism pathways and the principal genes of amino acid and nucleotide biosyntheses (22, 23, 27). Our results on PHX agree well with two-dimensional gel experimental assessments in several prokaryotes (1, 22, 23, 42, 53, 54). The PHX genes in most of the 30 genomes carried a significantly higher SD% than PMX genes. PA genes generally showed an SD% about the same as or less than that of the PMX genes (Table 2). Since PA genes are largely composed of putative lateral transfer genes, they tend to have low expression levels (28).
To verify the positive correlation of SD presence and gene expression levels, we applied logistic regression analysis. The regression coefficient β and its estimated standard error for each genome are given in Table 2. All but six genomes (Borrelia burgdorferi, Bacillus subtilis, Mycoplasma genitalium, Methanococcus jannaschii, Halobacterium sp. strain NRC-1, and Pyrobaculum aerophilum) recorded a significant positive correlation between SD presence and E(g) values (P < 0.01 for a likelihood ratio test of the regression). For the genomes of Borrelia burgdorferi, Methanococcus jannaschii, and Halobacterium sp. strain NRC-1, the P value for the likelihood ratio test was between 0.05 and 0.1, indicating a relatively strong correlation. Of the three genomes that did not record a significant correlation (P > 0.1), Mycoplasma genitalium had the lowest SD% (10.8%); Bacillus subtilis was among the highest in SD%; and Pyrobaculum aerophilum was low at about 23% (Tables 1 and 2).
Since all the data sets used were original genome annotations, a reasonable concern was that incorrect annotations of the gene start sites may have affected the accuracy of our SD analysis. To better determine how the genome data would compare to more reliable data sets, we analyzed the SD% for genes from several human-curated Escherichia coli K-12 data sets and achieved very similar results, as shown in Table 3. The data sets on essentiality were from the Profiling of E. coli Chromosome (PEC) database (http://www.shigen.nig.ac.jp/ecoli/pec/). The PEC data set classifies all E. coli genes into three groups: genes essential for cell growth ("essential"; total of 191 genes), those dispensable for cell growth ("nonessential"), and those unknown to be essential or nonessential ("unknown"), mainly using information from the literature. The "verified" (total, 656 genes) data set was extracted from EcoMap12 (http://bmb.med.miami.edu/EcoGene/EcoWeb/), which consists of genes whose starts have been confirmed by N-terminal protein sequencing (41). There are 65 genes in the verified set whose start sites were incorrectly annotated in the NCBI genome (4), giving an accuracy of about 90% for start site annotation, which is consistent with the average accuracy estimated for various gene-finding programs (25).
TABLE 3. E. coli data sets
To further reduce potential errors caused by annotation inaccuracies, we compiled a "single-start genes" data set for each genome, which consists of genes with a single start codon (AUG, GUG, or UUG as the first codon) within 90 nucleotides of their annotations. Of the 65 wrongly annotated genes in the E. coli "verified" data set, the correct start was found within 30 codons of the annotations for 54 (83%). Therefore, the single-start genes may have a chance of <0.02 of being wrongly annotated if the error rate for the genome annotations is 10%, or only about 0.04 if the error rate reaches as high as 25% in certain genomes, as estimated by some authors (3). In general, these genes constitute about 26% of a genome (29% PHX genes, 25% PMX, and 26% PA; see Supplementary Data Table S-2). Compared to the whole-genome data, they registered highly comparable SD% for the three gene classes PHX, PMX, and PA, indicating that the inaccuracies in start site prediction could only slightly affect the validity of our results obtained from genome annotations (see Supplementary Data Table S-2).
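One way to reproduce these figures is sketched below. It assumes that a single-start gene can be wrongly annotated only when the true start lies outside the 30-codon (90-nucleotide) window, which happened for 11 of the 65 misannotated genes; this reading is an interpretation of the text, not a stated derivation.

```latex
\begin{align*}
\text{annotation error rate in the verified set} &\approx \frac{65}{656} \approx 0.10 \\
\text{fraction of errors outside the 30-codon window} &= 1 - \frac{54}{65} = \frac{11}{65} \approx 0.17 \\
P(\text{single-start gene wrong} \mid \text{error rate } 10\%) &\approx 0.10 \times 0.17 \approx 0.017 < 0.02 \\
P(\text{single-start gene wrong} \mid \text{error rate } 25\%) &\approx 0.25 \times 0.17 \approx 0.042 \approx 0.04
\end{align*}
```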
There was also evidence suggesting that wrong starts are likely to be distributed evenly among the different classes of genes (PHX, PMX, and PA) that we used and thus would not significantly affect our comparisons of SD presence between PHX and PMX or PA gene classes. Of the 65 E. coli genes with incorrect starts mentioned above, 20% were PHX, 77% were PMX, and 3% were PA, indicating that incorrect annotations do not tend to bias strongly toward PMX or PA genes.
Taken together, our results on the correlation of SD presence and predicted expression levels have been verified by both human-curated E. coli data sets and the high-quality single-start gene data sets. The validity of the results holds despite the existence of a few incorrectly predicted gene start sites in the genome data.
It is also evident that the increased SD% for PHX genes is not due solely to the presence of RP genes, as shown in Table 3 for Escherichia coli K-12. The collection of PHX genes, excluding RP genes, achieved an SD% similar to that of the complete PHX class for the verified, essential, and whole-genome data sets (Table 3).
The results corroborate our assignment of genes as PHX based on codon usage, even in the many prokaryotes for which little direct information on protein abundances is available. Although many factors affect protein abundances, a high rate of translational initiation is essential to achieve a high level of expression and is the factor most simply observed by genome analysis.
SD sequences for PHX and PMX genes.
We also tried to determine whether the SD sequences of RP and PHX genes are stronger than those of PMX genes in terms of base-pairing potential with the anti-SD sequence and with respect to their aligned spacings, which reflect the two major determinants of the strength of an SD sequence (17). Ringquist et al. (34) showed experimentally that the SD sequence UAAGGAGG is about fourfold more effective than AAGGA. The former SD has a ΔGSD of -12 kcal/mol, while the latter has a ΔGSD of -5.3 kcal/mol. Spacing has a substantial effect only when the SD sequence is short (17). Experimental evidence demonstrated that an aligned spacing of 8 to 10 bases is optimal for E. coli genes (7, 34).
We first determined the OAS for each genome based on the distribution of SD spacings for all the genes in general and the PHX and RP gene classes in particular. The genomes of Escherichia coli K-12 and Pyrococcus abyssi are shown as two examples in Fig. 1. The OAS are 7, 8, and 9 bases for Escherichia coli K-12 and 9, 10, and 11 bases for Pyrococcus abyssi (Fig. 1B and C). Notably, 6, 7, and 8 bases are the most occupied SD spacings for PMX genes from Escherichia coli K-12, whereas 7, 8, and 9 bases are preferred by PHX and RP genes (Fig. 1B). In fact, no SD sequence for the Escherichia coli K-12 RP genes occurs at an aligned spacing of 6 bases.
Assuming that the SD sequences for RP genes are the closest to optimal, the three aligned spacings of 7, 8, and 9 bases were chosen as the OAS for SD sequences in Escherichia coli K-12. These OAS agree well with experimental evidence that 8 to 10 bases are optimal for SD sequences in Escherichia coli K-12 genes (7, 34). They also indicate that SD sequences for PHX genes may have a distribution closer to the actual optimal spacings than those for PMX genes.
For the genomes of Haemophilus influenzae, Vibrio cholerae, Campylobacter jejuni, Helicobacter pylori 26695, Chlamydophila pneumoniae, and Chlamydia trachomatis, the OAS were determined in a way similar to that used for Escherichia coli K-12. In other genomes, the OAS were aligned spacings occupied by the largest fraction of SD sequences for both PHX and PMX genes, e.g., for Pyrococcus abyssi (Fig. 1C; see also Supplementary Data Fig. S-2). However, the SD sequences in the genomes of Mycoplasma genitalium and Pyrobaculum aerophilum were spread to all positions. Their OAS were chosen in the same way but may not represent optimal spacings (see Supplementary Data Fig. S-3).
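For genomes in which the OAS are the spacings occupied by the largest fraction of SD sequences, the choice can be illustrated as a search for the most populated run of three consecutive spacings. Treating the three OAS as a contiguous window is an assumption of this sketch (it matches the examples of 7 to 9 and 9 to 11 bases), and the function names are hypothetical.

```python
from collections import Counter


def optimal_aligned_spacings(spacings, width=3):
    """Return the `width` consecutive spacings that hold the most SD sequences."""
    counts = Counter(spacings)
    lo, hi = min(counts), max(counts)
    last_start = max(lo, hi - width + 1)
    best = max(range(lo, last_start + 1),
               key=lambda s: sum(counts[s + k] for k in range(width)))
    return list(range(best, best + width))


def oas_percent(spacings, oas):
    """OAS%: percentage of SD sequences located at one of the OAS."""
    return 100.0 * sum(s in oas for s in spacings) / len(spacings)
```

When two windows tie, this sketch keeps the one with the shorter spacings, which is an arbitrary choice.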
Table 1 displays the OAS for each genome. In general, bacterial genomes attain similar OAS, with position 8 being the most common optimal spacing. Archaeal genomes show a preference for OAS about 2 bases longer than that of most bacterial genomes, usually at positions of 9 to 11 bases (Table 1, Fig. 1B and C).
We display in Fig. 2 for each genome the mean ΔGSD of the SD sequences and the frequencies of the SD sequences at the OAS (designated OAS%) in RP, PHX, and PMX genes. The mean ΔGSD indicates the average affinity of the SD sequences for a given gene class. The 30 genomes are divided into three groups (Fig. 2). The first group consists of the proteobacteria. Their SD sequences were among the weakest, with a mean ΔGSD of -6.5 kcal/mol, and about 50% to 70% occurred at the OAS. The most common SD sequence for these genomes was AGGAG (ΔGSD = -6.5 kcal/mol). In comparison, AGGAGG had a ΔGSD of -9.8 kcal/mol. It is also noteworthy that these genomes were highly similar in SD sequences for all three classes of genes (Fig. 2).
FIG. 2. SD sequences for RP, PHX, and PMX gene classes. (A) The y axis, OAS%, is the fraction of SD sequences present at the three OAS (given in Table 1) for each gene class. * indicates genomes where the OAS% for RP is significantly higher than for PMX genes (P < 0.05 for a χ2 test using the Yates correction). ** indicates that the OAS% for both the RP and PHX genes are significantly higher than for the PMX genes. (B) The y axis shows mean ΔGSD, the mean free energy of binding of the SD sequences in a gene group. * indicates genomes where the mean ΔGSD for the RP genes is significantly less than that for the PMX genes (the difference is at least 20% of the bacterial mean ΔGSD, or 1.3 kcal/mol); ** indicates that the mean ΔGSD for both the RP and PHX genes is significantly less than for the PMX genes. Abbreviations: ESCCO, Escherichia coli; HAEIN, Haemophilus influenzae; VIBCH, Vibrio cholerae; PSEAE, Pseudomonas aeruginosa; CAMJE, Campylobacter jejuni; HELPY, Helicobacter pylori; RICPR, Rickettsia prowazekii; NEIME, Neisseria meningitidis; CHLPN, Chlamydophila pneumoniae; CHLTR, Chlamydia trachomatis; BORBU, Borrelia burgdorferi; TREPA, Treponema pallidum; BACSU, Bacillus subtilis; MYCGE, Mycoplasma genitalium; MYCPN, Mycoplasma pneumoniae; UREUR, Ureaplasma urealyticum; MYCTU, Mycobacterium tuberculosis; SYNSQ, Synechocystis sp. strain PCC6803; DEIRA, Deinococcus radiodurans; AQUAE, Aquifex aeolicus; THEMA, Thermotoga maritima; METJA, Methanococcus jannaschii; METTH, Methanobacterium thermoautotrophicum; ARCFU, Archaeoglobus fulgidus; PYRAB, Pyrococcus abyssi; PYRHO, Pyrococcus horikoshii; THEAC, Thermoplasma acidophilum; HALSP, Halobacterium sp. strain NRC-1; SULSO, Sulfolobus solfataricus; PYRAE, Pyrobaculum aerophilum.
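The significance test named in the legend, a chi-square test with the Yates continuity correction on a 2 x 2 table, can be sketched with the standard library alone. The table layout (for example, RP versus PMX genes in rows, SD sequences at the OAS versus elsewhere in columns) and the function name are assumptions made for illustration.

```python
import math


def yates_chi2(a, b, c, d):
    """Chi-square test with Yates correction for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi2, p) with p taken from the chi-square distribution
    with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    chi2 = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

With hypothetical counts such as `yates_chi2(40, 10, 300, 300)`, a P value below 0.05 would earn the genome an asterisk under the criterion in the legend.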
ΔGSD and with an OAS% of around 40%. Bacillus subtilis was the only genome in this cluster to have very strong SD sequences (lower mean ΔGSD).
The third group consisted of Aquifex aeolicus, Thermotoga maritima, and all the archaea. The SD sequences in this cluster were the strongest, except for the genomes with a very low genome SD% (Halobacterium sp. strain NRC-1, Sulfolobus solfataricus, and Pyrobaculum aerophilum). In Bacillus subtilis, Aquifex aeolicus, Thermotoga maritima, and the euryarchaea, the SD sequences for RP genes were significantly higher in OAS% and significantly lower in mean ΔGSD than the PMX genes. This was mostly valid also for the PHX gene classes in these genomes (Fig. 2). In particular, Bacillus subtilis did not show a significant correlation between SD presence and predicted expression levels (Table 2), but the SD sequences for its PHX genes did tend to be stronger than those of its PMX genes in both ΔGSD and OAS% (Fig. 2). In contrast, the genomes of Mycoplasma genitalium and Pyrobaculum aerophilum appeared to have SD sequences that were weak and not at optimal spacings, even for the PHX and RP genes (Fig. 2). The SD sequences in these genomes may not play as significant a role in translation initiation as in other genomes, which is also implied by the logistic regression analysis (Table 2; see below).
It was previously suggested that there is no direct correlation between the affinity of the SD sequence for the anti-SD sequence and the efficiency of initiation complex formation under certain experimental conditions (10). An SD interaction that involves the center of the anti-SD sequence, CCUCC, may be more efficient in facilitating translation initiation than when it involves off-center sequences (24). This could explain the results of Ringquist et al. (34) and also the twofold-higher yields for GAGGU (ΔGSD = -6.6 kcal/mol) than for UAAGG (-4.2 kcal/mol) found by Chen et al. (7). Not coincidentally, the core anti-SD sequence CCUCC provides the greatest contribution to ΔGSD, as a G:C pair is more stable than an A:U pair.
Since a majority of the SD sequences that we detected involved interaction with the core anti-SD sequence, it might be reasonable to speculate that a lower mean ΔGSD indeed signifies a higher efficiency for SD sequences of PHX genes. We also found that, in Escherichia coli K-12, SD sequences for PHX genes had a higher frequency of GGAG and GAGG (24.7%) and a lower frequency of AGGA (5.0%) than the PMX genes (16.7% and 7.8%, respectively). These three SD sequences had the same ΔGSD of -4.4 kcal/mol, but AGGA was apparently a weaker SD sequence than the other two. In fact, 72% of all the SD sequences for PHX genes in Escherichia coli K-12 harbored the core SD motif GGAG or GAGG, compared to 62% for PMX genes. This trend appears to be valid for most genomes, even those for which no significant decreases in the mean ΔGSD were found for PHX genes versus PMX genes, e.g., proteobacterial genomes (see Supplementary Data Fig. S-4). Therefore, it appears that PHX genes tend to have an SD sequence that has higher affinity to the anti-SD sequence, occurs at a more optimal spacing, and involves interaction with the core anti-SD region. Such an SD sequence is very likely to have a higher efficiency in translation initiation.
Variation of SD% for different functional gene classes. We also tried to find out whether SD presence is correlated with certain gene classes by assessing the SD% for different functional classes defined in the Cluster of Orthologous Groups (COG) database (50, 51). The two COG categories that are persistently highest in SD% are J (translation, ribosome structure, and biogenesis) and C (energy production and conversion) (see Supplementary Data Table S-3), consistent with the recognition that most genes in these groups are PHX (22). In contrast, the COG categories with low SD% include L (DNA replication, recombination and repair), M (cell envelope biogenesis, outer membrane), and I (lipid metabolism) (see Supplementary Data Table S-3). Genes in these classes usually attain the expression levels of PMX genes (22). Thus, variations in SD% for different COG classes seem to reflect an association with the expression levels of the genes in the class.
Relationship between SD presence and start codon. Most genes rely on AUG as a start codon, while GUG and UUG are used sparsely (Table 4). Moreover, genes with an AUG start codon tend to have a higher SD% than genes with either GUG or UUG. The increase was significant in 12 genomes and most pronounced in the five euryarchaeal genomes with SD% exceeding 40% (Table 4).
TABLE 4. SD% for genes with different start codons
We have shown that SD presence is significantly correlated with predicted gene expression levels in most prokaryotic genomes. In particular, the RP genes and more generally the PHX genes display a higher SD% than the PMX genes (i.e., the average genes). Also, in some genomes the SD sequences of RP and PHX genes are closer to optimal in both base-pairing potential with the anti-SD sequence and spacing to the start codon (Fig. 2). This provides further evidence that the SD sequence is important in translation of these genes. A strong SD sequence may also work together with other features of the highly expressed genes that improve translation initiation efficiency, e.g., the stronger start codon AUG and favorable secondary structure around the translation initiation region (16).
Relationship between SD presence and distance between successive genes. The intergenic distance (Dg) is another important feature of prokaryotic genes that might correlate with the SD presence. For ease of discussion, we refer to the Dg of gene g as the distance (in base pairs) from g's start codon to the end of its immediate upstream gene in the same orientation. Negative values of Dg signify genes that overlap their immediate upstream genes. In most genomes, the most prevalent value of Dg is -4 bp (the junction is always AUGA; also see reference 38), which is observed for 7.8% of genes on average and for as many as 18% in Thermotoga maritima.
The median Dg in a genome varies from 9 bp for Campylobacter jejuni and 11 bp for both Thermotoga maritima and Mycoplasma genitalium to 187 bp for Methanococcus jannaschii and 201 bp for Halobacterium sp. strain NRC-1 (see Supplementary Data Table S-4). In most archaeal genomes, the SD% for genes with a Dg of -4 bp is markedly higher than the SD% for all the other genes, at a level comparable to the SD% of the RP genes. In contrast, many genomes recorded a reduced SD% for the collection of genes with a Dg of >20 bp, compared to genes with a Dg of <20 bp. This is especially valid for all the archaeal genomes (see Supplementary Data Table S-4).
We then assessed SD% for genes with different Dg ranges. Since the SD% does not show much variation among the groups with a Dg of greater than 30 bp, we focused on genes with a Dg of below 30 bp, which on average constitute 35% of a genome. We divided all the genes in a genome into seven Dg groups: genes with a Dg below -20 bp; five groups with a Dg of from -20 to 30 bp, with 10-bp intervals; and genes with a Dg exceeding 30 bp (see Supplementary Data Table S-5). In most genomes, each group contained more than 30 genes. The gene group with a Dg of -10 to 0 bp was the largest among the five groups of 10-bp intervals. Figure 3 shows the SD% for these Dg groups.
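The Dg computation and the seven-group binning can be illustrated as below. The sketch assumes co-oriented genes given as 1-based inclusive (start, end) coordinates on the forward strand, so that Dg = start - previous end - 1 (which yields -4 for an AUGA junction), and it assumes half-open 10-bp bins with 30 bp included in the last interval; the text does not state how boundary values were assigned, and the names are hypothetical.

```python
DG_GROUP_LABELS = ["< -20", "-20 to -10", "-10 to 0", "0 to 10",
                   "10 to 20", "20 to 30", "> 30"]


def dg_values(genes):
    """Dg for each gene in a list of (start, end) pairs; None for the first gene."""
    genes = sorted(genes)
    distances = [None]
    for (start, _), (_, prev_end) in zip(genes[1:], genes[:-1]):
        distances.append(start - prev_end - 1)
    return distances


def dg_group(dg):
    """Index (0 to 6) of the Dg group, in the order of DG_GROUP_LABELS."""
    if dg < -20:
        return 0
    if dg > 30:
        return 6
    # Five 10-bp intervals covering -20 to 30; a Dg of exactly 30 joins the last.
    return min((dg + 20) // 10 + 1, 5)
```

Computing the SD% per group then amounts to grouping genes by `dg_group` and averaging an SD presence flag within each group.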
FIG. 3. Relationship between SD% and distances between successive genes (Dg). The y axis represents SD%. The symbols for the lines and points for each plot are shown. In each plot, the seven data points represent seven Dg groups (from left to right): genes with a Dg of less than -20 bp; five groups of genes with a Dg from -20 to 30 bp, at 10-bp intervals; and genes with a Dg of more than 30 bp (see Supplementary Data Table S-5 for details of the groups). For abbreviations, see the legend to Fig. 2.
Genes with a Dg of 0 to 20 bp may have strong biases in base composition in their translation initiation region because their 5' end is located in the regions around the stop codon of the upstream gene (49). Rocha et al. (35) found that the 6 bases following the stop codon in Bacillus subtilis genes are AU rich. Such biases could discount the occurrence of an SD sequence, which might be the reason for the somewhat reduced SD% for the group with a Dg of 0 to 10 bp in bacterial genomes (Fig. 3). On the other hand, Eyre-Walker (12) showed that Escherichia coli K-12 genes overlapping a downstream gene tend to have low codon preferences at the 3' end, which would more easily enable the presence of an SD for the downstream gene (e.g., with a Dg of -20 to 0 bp).
The archaeal genomes revealed a common trend distinct from that of the bacteria. The genes with a Dg of less than 20 bp (Fig. 3F) or less than 10 bp (Fig. 3G) were strongly biased toward carrying an SD sequence compared to genes with a larger Dg. This was even more pronounced for genomes with less than 30% overall SD%, especially for gene groups with a Dg of between -20 and 10 bp (Fig. 3G). These increased SD% were again not correlated with higher expression levels (data not shown). It is interesting that Bacillus subtilis, Aquifex aeolicus, and Thermotoga maritima were distinctively like bacteria in their relationship between Dg and SD presence (Fig. 3D), even though they were very similar to the archaea in the SD sequences with respect to ΔGSD and OAS (Table 1; Fig. 2). Thus, the parameters of translation initiation do not sort along simple phylogenetic lines.
Relationship between SD presence and operon structure. The greatly increased SD presence in genes in close proximity to their upstream genes led us to investigate the connection between the SD sequence and operon structure. Apparently many genes in the groups with a Dg of -20 to 20 bp are genes within operons (38). It has been suggested that operon structure might have arisen during the evolution of both bacteria and archaea by thermoreduction from a common thermophilic ancestor (14). The operon structures in the two kingdoms thus might have some common features, such as the SD sequence. The high SD presence suggests that the SD sequences may play an essential role in translation of these genes.
We analyzed SD sequences for 391 documented operons from Escherichia coli K-12 (each with at least two genes) extracted from the RegulonDB database (39). Of the 601 internal genes within these operons, 69.2% had a Dg of between -20 and 30 bp, compared to only 6.6% of the 391 initial operon genes. The SD% was 71.0% for genes within operons and 67.3% for initial genes.
We then conducted a more general analysis over the 30 genomes. Based on the Dg, we partitioned the genes in a genome into three classes, types I to III, as illustrated in Fig. 4A. Type I consists of genes at least 100 bp in distance from both the upstream and downstream genes; type I genes are presumably single genes. Type II consists of genes with a Dg larger than 50 bp and followed by at least two consecutive downstream genes with a Dg below 20 bp; type II genes are likely initial genes of operons. Type III comprises all genes with a Dg below 20 bp following a type II gene; type III genes are likely genes within operons. The three classes encompass about half of a genome. We found that more than one third of the type II and type III genes in Escherichia coli K-12 were present in the 391 known operons, and most of them were also predicted to be operons by Salgado's method (38). On average, there were three type III genes following each type II gene (see Supplementary Data Table S-6). Figure 4B presents the SD% for these three gene classes.
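The three-way partition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for Fig. 4: the function name `classify_genes` is hypothetical, the input is assumed to be the list of Dg values for consecutive genes in transcription order (strand handling is ignored), a gene with no upstream or downstream neighbor is treated as infinitely distant from it, and the type II test is applied before the type I test.

```python
from math import inf


def classify_genes(dg):
    """Label genes 'I', 'II', 'III', or None from their upstream distances.

    dg[i] is the distance (bp) from gene i to its upstream gene; negative
    values mean overlap, and None means "no upstream gene" (assumption:
    treated as infinitely far away).
    """
    n = len(dg)
    d = [inf if x is None else x for x in dg]
    labels = [None] * n
    i = 0
    while i < n:
        # Type II: Dg > 50 bp, followed by at least two consecutive
        # downstream genes that each have Dg < 20 bp.
        if d[i] > 50 and i + 2 < n and d[i + 1] < 20 and d[i + 2] < 20:
            labels[i] = "II"
            j = i + 1
            # Type III: every gene in the run with Dg < 20 bp that follows.
            while j < n and d[j] < 20:
                labels[j] = "III"
                j += 1
            i = j
            continue
        # Type I: at least 100 bp from both the upstream and downstream gene.
        downstream = d[i + 1] if i + 1 < n else inf
        if d[i] >= 100 and downstream >= 100:
            labels[i] = "I"
        i += 1
    return labels


# Hypothetical Dg values for eight consecutive genes.
example = [None, 150, 120, 5, -4, 12, 300, 40]
print(classify_genes(example))
```

Genes that satisfy none of the three definitions stay unlabeled (`None`), which is consistent with the statement that the three classes cover only about half of a genome.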
FIG. 4. SD sequences for genes with different internal positions. (A) How the three types of genes were classified (see text for details). (B) Asterisks indicate genomes where the SD% for type III genes is significantly higher than that for type I genes. Boldface indicates that the SD% for the type II genes was significantly higher than for the type I genes (P < 0.05 for a χ² test using the Yates correction). For abbreviations, see the legend to Fig. 2.
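For readers who want to reproduce this kind of comparison, a chi-square test with the Yates continuity correction on a 2 × 2 table (SD present or absent versus gene class) can be sketched as follows. The function name `yates_chi2` and the gene counts are hypothetical and are not taken from the study; the statistic is clamped at zero when the correction exceeds |ad − bc|, which is one common convention, and the P value uses the identity for one degree of freedom, P = erfc(√(χ²/2)).

```python
from math import erfc, sqrt


def yates_chi2(a, b, c, d):
    """Yates-corrected chi-square for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be nonzero")
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    diff = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 df.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts: type III genes with/without an SD sequence (300, 100)
# versus type I genes with/without an SD sequence (200, 200).
chi2, p = yates_chi2(300, 100, 200, 200)
print(f"chi2 = {chi2:.3f}, significant at 0.05: {p < 0.05}")
```

With real data, the four cells would be the per-genome counts of genes with and without an SD sequence in the two classes being compared.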
This conservation was even more significant in the genomes where the overall SD% was very low and/or no correlation between SD presence and predicted expression levels was observed. Such genomes included those of Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp. strain PCC6803, Halobacterium sp. strain NRC-1, Sulfolobus solfataricus, and Pyrobaculum aerophilum (Fig. 2, 3, and 4B). Thus, it is tempting to speculate that the SD sequence may have coevolved with the operon gene structure in both bacteria and archaea (14). The correlation of SD presence with gene expression levels might have been established later. This would explain the observation that, in all archaeal genomes and in Aquifex aeolicus, PHX genes with a Dg below 50 bp had a significantly higher SD% than other PHX genes (data not shown). The RP genes are both highly expressed and very frequently expressed in operons, and not surprisingly, they always attained the highest SD% (Table 2).
The archaeal genomes provide an excellent system with which to analyze the evolution of both the SD sequence and the bacterial translation mechanism utilizing the SD-anti-SD interaction. Some euryarchaea (Thermoplasma acidophilum and Halobacterium sp. strain NRC-1), and especially crenarchaea (Sulfolobus solfataricus and Pyrobaculum aerophilum), seem to have gradually lost conservation of both the anti-SD and the SD sequences (Table 1; Fig. 2). Accumulating evidence suggests that many single genes, or initial genes of operons, in these genomes are translated through leaderless mRNA by mechanisms that do not involve the SD-anti-SD interaction (45, 47, 52). The SD sequence may thus become dispensable for these genes. However, for genes within operons, the SD sequence appears to be particularly important, evidenced by the prevalence of the SD motifs in those genes (Fig. 3F and G). Experimental evidence supporting this hypothesis has been reported for Sulfolobus solfataricus (8).
SD presence and other gene features. It has been suggested that the SD sequence is especially important in a genome where an S1 ribosomal protein is missing, e.g., Bacillus subtilis, which has only a reduced S1 homologue and achieves the second highest SD% of all the genomes (Table 1) (35). However, we did not find such a correlation for other genomes. Three bacteria (Ureaplasma urealyticum, Mycoplasma genitalium, and Mycoplasma pneumoniae) and all of the archaea lacked an S1 gene or any S1 homologue. However, unlike Bacillus subtilis, the genomes of Ureaplasma urealyticum, Mycoplasma genitalium, and Mycoplasma pneumoniae had a very low SD% (Table 1). On the other hand, genomes with an S1 gene can achieve a very high SD%, e.g., Thermotoga maritima, which had the highest SD% (Table 1). Thus, SD presence is not correlated with the presence or absence of an S1 RP gene. The SD sequence also seems to be uncorrelated with factors such as copy number of the 16S rRNA, G+C content, total number of genes, gene length, or lifestyle (data not shown).
Further comments. Given the correlation between the SD sequence and other gene features, especially expression levels and distances between successive genes, we suggest that the SD sequence should be incorporated in algorithms for gene start determination, expression level prediction, and operon prediction to improve accuracy. Most of the genomes studied in this report were annotated with the programs GeneMark (20, 26) and GLIMMER (40) or with a combination of automatic gene-finding methods and similarity searches in protein databases. SD information has now been incorporated into more recent programs, such as GeneMark.hmm and GeneMarkS (3, 25). It appears to work well for genomes with a high SD%, such as those of low-G+C gram-positive bacteria (e.g., Listeria monocytogenes [15]). However, for many genomes, the SD% is around 30 to 50%, and SD information would thus provide only marginal improvements (36).
On the other hand, the relationship between SD presence and intergenic distances may contribute greatly to operon prediction, an important part of prokaryotic genomics. To date, no highly reliable method has been developed for operon prediction (38). Also, little is known about operons in archaeal genomes. Our finding that archaeal genes that are presumably within operons have a remarkably increased SD presence should help in developing an effective method for operon characterization in these genomes.
Recently, the crystal structures of both the 50S and 30S complexes of the bacterial ribosome have been determined at high resolution (2, 41, 56). A structure of the 80S ribosome from Saccharomyces cerevisiae was also reported (48). These accomplishments greatly augment our understanding of the mechanisms of protein synthesis at the atomic level (5, 6, 29-31, 33, 44). Furthermore, Yusupova et al. (58) directly observed the path of mRNA in the 70S ribosome from Thermus thermophilus at 7 Å resolution. The model mRNA was based on the phage T4 gene 32 mRNA except that the SD sequence was expanded to AAGGAGGU. They found that about 30 nucleotides are bound to the 30S subunit (15 bp upstream of the initiator to 15 bp downstream), which is roughly the whole translation initiation region. The SD interaction was clearly observed to form a helix, which was accommodated in a cleft formed by 16S rRNA elements and the ribosomal proteins S11 and S18 (58). These results provide additional proof that the SD interaction can be an important part of translation initiation.
The SD sequence in the mRNA, AAGGAGGU, had an aligned spacing of 7 bases. It is interesting that of the 67 AAGGAGGU SD sequences in the 21 bacterial genomes (Table 1), only 4 occurred at an aligned spacing of 7 bases, while 10, 19, and 12 occurred at spacings of 8, 9, and 10 bases, respectively. A total of 55 (82.1%) were present at a spacing larger than 7 bases. Thus, an aligned spacing of 9 bases would most likely be preferable for the mRNA in the structure. There are apparently structural constraints that require such an optimal spacing, and three-dimensional simulation studies based on the structure using different SD sequences and spacings could provide insights into these structural constraints and a better understanding of the SD interaction.
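The arithmetic behind these spacing counts can be checked with a short script. Only the numbers quoted in the paragraph above are used; the counts below 7 bases and beyond 10 bases are not reported directly and are derived here by subtraction, and the variable names are illustrative.

```python
# Counts of AAGGAGGU SD sequences quoted in the text (21 bacterial genomes).
total = 67
at_spacing = {7: 4, 8: 10, 9: 19, 10: 12}  # aligned spacing (bases) -> count
above_7 = 55  # sequences with an aligned spacing larger than 7 bases

# Share of sequences at a spacing larger than 7 bases.
pct_above_7 = 100 * above_7 / total

# Derived by subtraction (not stated explicitly in the text).
below_7 = total - at_spacing[7] - above_7
beyond_10 = above_7 - sum(at_spacing[s] for s in (8, 9, 10))

# Most frequent spacing among the four spacings that are reported.
most_common = max(at_spacing, key=at_spacing.get)

print(f"{pct_above_7:.1f}% above 7 bases")
print(f"{below_7} below 7 bases, {beyond_10} beyond 10 bases")
print(f"most common reported spacing: {most_common} bases")
```

Because the two derived bins together hold fewer sequences than the 9-base bin alone, 9 bases remains the most frequent spacing regardless of how the unreported counts are distributed.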