Predicted highly expressed (PHX) genes are characterized for the
completely sequenced genomes of the four fast-growing bacteria Escherichia coli, Haemophilus influenzae,
Vibrio cholerae, and Bacillus subtilis. Our
approach to ascertaining gene expression levels relates to
codon usage differences among certain gene classes: the collection
of all genes (average gene), the ensemble of ribosomal protein genes,
major translation/transcription processing factors, and genes for
polypeptides of chaperone/degradation complexes. A gene is predicted
highly expressed (PHX) if its codon frequencies are close to those of
the ribosomal proteins, major translation/transcription processing
factor, and chaperone/degradation standards but strongly deviant from
the average gene codon frequencies. PHX genes identified by their codon
usage frequencies among prokaryotic genomes commonly include those for
ribosomal proteins, major transcription/translation processing factors
(several occurring in multiple copies), and major chaperone/degradation
proteins. PHX genes also generally include those encoding enzymes of
essential energy metabolism pathways of glycolysis, pyruvate oxidation,
and respiration (aerobic and anaerobic), genes of fatty acid
biosynthesis, and the principal genes of amino acid and nucleotide
biosyntheses. Gene classes generally not PHX include most repair
protein genes, virtually all vitamin biosynthesis genes, genes of
two-component sensor systems, most regulatory genes, and most genes
expressed in stationary phase or during starvation. Members of the set
of PHX aminoacyl-tRNA synthetase genes contrast sharply between
genomes. There are also subtle differences among the PHX energy
metabolism genes between E. coli and
B. subtilis, particularly with respect to genes of the tricarboxylic acid cycle. The good agreement of
PHX genes of E. coli and B. subtilis
with high protein abundances, as assessed by two-dimensional gel
determination, is verified. Relationships of PHX genes with
stoichiometry, multifunctionality, and operon structures are also
examined. The spatial distribution of PHX genes within each genome
reveals clusters and significantly long regions without PHX genes.
INTRODUCTION
Escherichia coli, Vibrio
cholerae, and Haemophilus influenzae are gram-negative
γ-proteobacteria that can grow in human tissue and produce
or contribute to disease. The principal habitat of E. coli
is the human gut, V. cholerae is mainly a freshwater
microbe, and H. influenzae is found in the human lung. On
the other hand, Bacillus subtilis is a gram-positive,
nonpathogenic soil bacterium. The minimal doubling time for these four
bacteria in cultures is significantly less than 1 h. Fast
growth implies many ribosomes, and these four bacteria have large
numbers of rRNA operons per genome.
Predicted highly expressed (PHX) genes are characterized for the
rapidly dividing bacteria E. coli, H. influenzae,
V. cholerae, and B. subtilis using a method based
on codon usage differences among gene classes
(21). For complete lists of PHX genes, consult the website
ftp://gnomic.stanford.edu/pub (see also Table 2).
MATERIALS AND METHODS
Assessments of gene or protein expression levels from codon usage.
High expression is predicted from codon usage as follows. Let G be a family of genes with
average codon frequencies g(x, y, z) for the codon nucleotide triplet (x, y, z), normalized so that
$$\sum_{(x,y,z)=a} g(x,y,z) = 1,$$
where the sum extends over all codons (x, y, z) translated to amino acid a. Let
f(x, y, z) indicate the
average codon frequencies for the gene family F,
normalized to 1 in each amino acid codon family. The codon
usage difference of the gene family F (or a single gene)
relative to the gene family G, termed the codon bias of
F with respect to G, is calculated with the
following formula:
$$B(F|G) = \sum_{a} p_a(F) \left[ \sum_{(x,y,z)=a} \left| f(x,y,z) - g(x,y,z) \right| \right],$$
where p_a(F) are the
average amino acid frequencies of the genes of F (cf.
references 19 and 20). Let C denote the collection of all genes, RP the ribosomal protein genes, CH the
chaperone/degradation protein genes, and TF the
translation/transcription processing genes. Qualitatively, a gene
g is deemed PHX if B(g|C) is appropriately high, whereas
B(g|RP),
B(g|CH), and
B(g|TF) are suitably low. Predicted
expression levels with respect to individual standards are based on the
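ratios
$$E_{RP}(g) = \frac{B(g|C)}{B(g|RP)}, \quad E_{CH}(g) = \frac{B(g|C)}{B(g|CH)}, \quad E_{TF}(g) = \frac{B(g|C)}{B(g|TF)},$$
and the combined expression measure is
$$E(g) = \frac{B(g|C)}{\tfrac{1}{2}B(g|RP) + \tfrac{1}{4}B(g|CH) + \tfrac{1}{4}B(g|TF)}.$$
Other weighted combinations are also possible, but the results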
do not qualitatively change when different weights are used. We impose
higher weight on the RP standard because the RP genes are generally the
most PHX in all current completely sequenced genomes (21).
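As an illustration only, the codon bias and the combined expression measure can be sketched in Python. The function names are hypothetical and not part of the original analysis. The default weights (1/2, 1/4, 1/4) are an assumption that gives the RP standard the highest weight; they are exposed as a parameter because other weightings are possible. Stop codons and triplets containing ambiguous bases are skipped, which is also an assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split an in-frame coding sequence into sense codons.

    Stop codons and triplets with ambiguous bases are dropped (assumption).
    """
    seq = seq.upper()
    triplets = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return [c for c in triplets if CODON_TO_AA.get(c, "*") != "*"]


def codon_freqs(seqs):
    """Codon frequencies normalized to sum to 1 within each amino acid family."""
    counts = Counter(c for s in seqs for c in codons(s))
    aa_totals = Counter()
    for c, n in counts.items():
        aa_totals[CODON_TO_AA[c]] += n
    return {c: n / aa_totals[CODON_TO_AA[c]] for c, n in counts.items()}


def amino_freqs(seqs):
    """Amino acid frequencies p_a of a gene family."""
    counts = Counter(CODON_TO_AA[c] for s in seqs for c in codons(s))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def codon_bias(f_seqs, g_seqs):
    """B(F|G): sum over sense codons of p_a(F) * |f - g|."""
    f, g, p = codon_freqs(f_seqs), codon_freqs(g_seqs), amino_freqs(f_seqs)
    return sum(
        p.get(aa, 0.0) * abs(f.get(c, 0.0) - g.get(c, 0.0))
        for c, aa in CODON_TO_AA.items() if aa != "*"
    )


def expression_measure(gene, all_genes, rp, ch, tf, weights=(0.5, 0.25, 0.25)):
    """E(g): B(g|C) divided by a weighted mean of B(g|RP), B(g|CH), B(g|TF).

    The default weights are an assumption, not taken from the text.
    """
    b_c = codon_bias([gene], all_genes)
    denom = sum(w * codon_bias([gene], std) for w, std in zip(weights, (rp, ch, tf)))
    return b_c / denom  # raises ZeroDivisionError if g matches all standards exactly
```

A gene whose codon usage matches the standards but departs from the average gene yields a small denominator and a large numerator, and hence a high E(g).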
The specification of the RP, CH, and TF gene classes as standards
derives from the observation that these gene classes are consistently
highly expressed in most genomes (21). Thus, these three
gene classes (RP, CH, and TF) serve as representatives of highly
expressed genes, and our method specifies genes with similar codon
usages as PHX genes. These assignments are reasonable under fast growth
conditions, where there is a need for many ribosomes, for proficient
transcription and translation, and for many chaperone/degradation proteins needed to ensure correctly folded, modified, and translocated protein products.
A gene is predicted highly expressed (PHX) if the following two
conditions are satisfied: at least two of the three expression values
ERP(g),
ECH(g), and
ETF(g) exceed 1.05, and the
general expression level E(g) is ≥1.00. We
sometimes refer to genes that do not unequivocally satisfy this
definition but that have an E(g) of approximately
1.00 as marginally PHX.
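The two-part rule can be written as a short predicate. This is a minimal sketch: the function name is hypothetical, the overall cutoff is read as "at least 1.00", and the loosely defined "marginally PHX" category is deliberately not modeled.

```python
def is_phx(e_rp, e_ch, e_tf, e, ratio_cutoff=1.05, overall_cutoff=1.00):
    """Return True if a gene satisfies both PHX conditions.

    Condition 1: at least two of E_RP, E_CH, E_TF exceed ratio_cutoff.
    Condition 2: the combined measure E is at least overall_cutoff (assumed reading).
    """
    n_high = sum(v > ratio_cutoff for v in (e_rp, e_ch, e_tf))
    return n_high >= 2 and e >= overall_cutoff
```

For example, `is_phx(1.20, 1.10, 0.90, 1.05)` is true, whereas a gene with only one ratio above 1.05 fails regardless of its combined value.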
RESULTS
To expose the significance of the PHX gene classes, we plotted
B(g|C) versus
B(g|RP) traversing all individual
genes g (≥100 codons in length). The plots are given in
Fig. 1 for each of the four rapidly
growing bacteria. The distribution of points reveals two horns. The
left horn effectively corresponds to the PHX genes. The right horn we
refer to as putative alien genes. It consists of genes that
significantly differ in their codon usages from the four classes C,
RP, CH, and TF and will be discussed in a separate publication. If we
replace the horizontal axis B(g|RP) with the coordinates of B(g|TF) or
B(g|CH), the plots in Fig. 1 remain
largely unchanged (data not shown).

FIG. 1. Genes of ≥100 codons in the four fast-growing bacteria. Each gene is represented by a single point. Its position is determined by its bias relative to all genes, B(g|C), and by its bias relative to the RP genes, B(g|RP). PHX genes are indicated by red circles. Partial overlaps among the PHX and normal gene clusters are due to minor differences among B(g|RP), B(g|CH), and B(g|TF), which all contribute to the PHX predictions (see Materials and Methods). The upper right horn corresponds to putative alien genes (22, 28).
Top 20 PHX genes.
The distribution of PHX genes among the four
fast-growing bacteria is displayed in Table
1. The highest
E(g) value exceeds 2 in all four genomes. Such
high values are rare among the completely sequenced genomes (cf.
reference 21). These four bacteria have a substantial
number of PHX genes, ranging from 142 to 306.
Table 2 presents the 20 genes with the
highest predicted expression levels in the genomes of E. coli, V. cholerae, H. influenzae, and
B. subtilis. In those few instances when the homologous
genes in the other genomes are not PHX, their
E(g) values are shown in parentheses. The genes
are segregated into functional categories. Almost all ribosomal
proteins attain high expression levels in all rapidly growing bacteria
(Tables 2 and 3). The S1 ribosomal protein gene (exceeding 500 codons in length in most bacteria) in
B. subtilis is found at the diminished length of 327 codons but is still PHX, with an E(g) value
of 1.20. Ribosomal protein genes are present in single copies, in
contrast to rRNA genes, and are predominantly of a high expression
level, presumably conforming with stoichiometric requirements for
ribosome formation between proteins and RNA and among the proteins
themselves.
The major (eubacterial) chaperone/degradation proteins HSP70 (DnaK) and
HSP60 (GroEL) and the mRNA degradation protein polynucleotide phosphorylase (Pnp) are prominently PHX. Pnp in B. subtilis, however, is not PHX [E(g) = 0.79]. The corresponding genes in H. influenzae achieve E(g) values of 1.29, 1.47, and 1.72, respectively. The enolase gene (eno), listed under energy
metabolism as part of the glycolysis pathway, is potently PHX. Enolase is
also a component of the mRNA degradosome in a multifunctional
capacity (27), which gives a second reason for its being potently PHX.
Processing factors for protein synthesis are outstandingly PHX,
especially the ATP-dependent DNA-directed RNA polymerase units RpoB and
RpoC and the elongation factors EF-G (fus), EF-Tu
(tuf), and EF-Ts (tsf). The elongation factor EF-Tu
often is present in two copies, both dramatically PHX. B. subtilis has but one copy, and it is PHX. EF-G (fusA)
is present in two copies in V. cholerae, with
E(g) values of 2.02 and 0.96. The RNA helicase
DeaD is PHX in E. coli, V. cholerae, and B. subtilis but not in H. influenzae, nor is the second
copy in B. subtilis PHX. It will be interesting to see if
there are functional differences between the two copies of DeaD and the
EF-G proteins. DeaD box proteins protect mRNA from
endonucleases (27).
Many glycolysis genes are among the top PHX genes; they include genes
for pyruvate kinase (pykA and pykF),
fructose-1,6-bisphosphate aldolase (fba), phosphoglycerate
kinase (pgk), enolase (eno), and
glyceraldehyde-3-phosphate dehydrogenase (gap).
PHX genes contributing to anaerobic fermentation include the
alcohol/acetaldehyde dehydrogenase gene (adhE). Other PHX
genes of energy metabolism include several, but significantly not all, genes of the tricarboxylic acid (TCA) cycle. Genes for several subunits of the
pyruvate dehydrogenase complex are among the top PHX genes. These
include genes for multiple copies of the three enzymatic components:
pyruvate dehydrogenase E1 (aceE; except in B. subtilis), dihydrolipoamide acetyltransferase E2 (aceF), and lipoamide dehydrogenase E3 (lpdA), all part of the
pyruvate oxidation pathway. Genes contributing to proton
gradient-driven ATP synthesis (namely, the genes for the two major
subunits of the ATP synthase catalytic domain, atpA and
atpD) are potently PHX. The PHX gene for
adenylosuccinate synthetase (purA) stands out, except in
B. subtilis. It participates in the de novo
biosynthesis pathway of purine nucleotides and in the first step of AMP
biosynthesis. However, the genes for the other enzymes of that pathway
are not PHX.
Several porin genes of E. coli, H. influenzae, and V. cholerae are PHX. These are absent from B. subtilis,
which is a gram-positive bacterium, lacking the distinctive
gram-negative outer membrane. The peptidoglycan-associated lipoprotein
(Pal) attached to the outer membrane by a lipid anchor is PHX in
gram-negative bacteria. Several lipid biosynthesis PHX genes are among
the top 20. The first enzyme of the glyoxylate shunt pathway,
isocitrate lyase (AceA), is PHX in the moderately fast-growing
Deinococcus radiodurans (90-min average doubling time) and
the slow-growing Mycobacterium tuberculosis (24 to 36 h). It exists in E. coli and V. cholerae but is
not PHX there, and it has not been detected in most prokaryotic genomes.
Isocitrate lyase is widespread in plant and fungal organisms. There is
an open reading frame (ORF) (yeiM) of unknown function but
possibly encoding a nucleoside transporter, with an
E(g) value of 2.00, in V. cholerae,
and there is a homolog with an E(g) value of 1.07 in H. influenzae but not PHX in E. coli and
B. subtilis. Differences among genes in predicted expression
levels present challenging questions for experimentation.
PHX genes in H. influenzae parallel PHX genes in E. coli. These include genes for mainstream glycolysis and TCA
enzymes and genes for detoxification and DNA damage control, such as
the sodA and catalase genes. The highest
E(g) value is 2.01, attained by the elongation
factor EF-G (fusA). The heat shock proteins GroEL and DnaK
are among the most highly expressed. The ribosome release factor (Rrf)
is the top PHX protein in H. influenzae. Rrf is responsible for the release of ribosomes from mRNA at the termination of
protein synthesis (37). Rrf is present and generally
highly expressed in all eubacterial organisms with completely sequenced
genomes but is absent from archaea (35).
Comparison of predicted levels of expression in E. coli with 2D gel patterns.
For many E. coli
proteins, two-dimensional (2D) gel electrophoresis data for their
abundances during growth in minimal medium are available. We compared
the molar abundances of 96 proteins (with lengths of ≥100 amino acids
[aa]) (45, 46) with the set of PHX genes (Table
4). Among the 20 most abundant of the 96 proteins, 17 were identified as PHX by our method. Among the 20 least
abundant proteins of the 96, only 7 qualified as PHX. Of the remaining
56 proteins, which have intermediate molar abundances on 2D gels, 28 were identified as PHX. This agreement between high 2D gel abundances
and high E(g) values supports naming the genes
"highly expressed."
Three exceptions to the good agreement between high protein molar
abundances and PHX status are MetE, FolA, and IlvE, which are involved
in amino acid biosynthesis and methylation. These proteins are among
the most abundant in 2D gel determinations but do not qualify as PHX.
The enzymatic turnover rate for MetE, determined by kinetic studies, is
low but is compensated for with a high molar abundance
(12). In E. coli, the methionine
biosynthesis pathway includes MetK, with a very high
E(g) value, 2.21, whereas MetE has an
E(g) value of 0.69 and MetH has an
E(g) value of 0.60. MetE and MetH offer strict
alternative pathways for L-methionine synthesis.
MetK acts on methionine to produce S-adenosylmethionine, which serves as a methyl donor for a broad range of metabolites, lipids, and vitamins (41). It has been conjectured that
the metE gene or the entire Met operon in E. coli, because of its codon usage, may have been recently acquired by lateral
transfer, analogous to the Cob operon of Salmonella
enterica serovar Typhimurium (24). FolA
(dihydrofolate reductase) registers high 2D gel assessments but has a
low E(g) value, 0.60.
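The tallies from the comparison of the 96 E. coli proteins with 2D gel abundances can be checked with a few lines of arithmetic. The counts are those quoted above; the variable names are illustrative only.

```python
# (number identified as PHX, number of proteins) per 2D gel abundance tier
tiers = {
    "20 most abundant": (17, 20),
    "56 intermediate": (28, 56),
    "20 least abundant": (7, 20),
}

# Fraction of each tier that is PHX
phx_fraction = {name: phx / n for name, (phx, n) in tiers.items()}

total_proteins = sum(n for _, n in tiers.values())
total_phx = sum(phx for phx, _ in tiers.values())

for name, frac in phx_fraction.items():
    print(f"{name}: {frac:.0%} PHX")
```

The PHX fraction falls steadily from the most abundant tier to the least abundant tier, which is the trend the comparison relies on.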
Hecker and colleagues (e.g., reference 3) have conducted
extensive 2D gel assessments of B. subtilis proteins.
Consulting their 2D database
(http://microbio2.biologie.uni-greiswald.de:8880), we
compared the brightest spots on their gels with the
E(g) values for the corresponding proteins: RpS2,
1.84; SerA, 0.62; IlvC, 1.03; AroA, 0.93; Gap, 1.80; PdhC, 2.05; CitC,
1.33; TufA, 1.97; Fus, 2.34; YwjH, 1.01; RpL10, 1.46; ClpP, 1.05; SodA,
1.64; and CitH, 0.81. Most of these proteins are PHX, and several
achieve an E(g) value of >1.8. Thus, there is a
good correlation of PHX proteins with high 2D gel abundances in
B. subtilis, as in E. coli.
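The B. subtilis values listed above can be tabulated in the same way. This sketch checks only the combined measure E(g) against 1.00 and 1.8; it does not test the second PHX condition on the individual ratios, which are not listed here. Variable names are illustrative.

```python
# E(g) values for the brightest 2D gel spots, as quoted in the text
E_VALUES = {
    "RpS2": 1.84, "SerA": 0.62, "IlvC": 1.03, "AroA": 0.93, "Gap": 1.80,
    "PdhC": 2.05, "CitC": 1.33, "TufA": 1.97, "Fus": 2.34, "YwjH": 1.01,
    "RpL10": 1.46, "ClpP": 1.05, "SodA": 1.64, "CitH": 0.81,
}

# Proteins whose combined measure reaches the overall PHX threshold
meets_overall = sorted(n for n, e in E_VALUES.items() if e >= 1.00)

# Proteins with E(g) strictly above 1.8
above_1_8 = sorted(n for n, e in E_VALUES.items() if e > 1.8)

print(len(meets_overall), "of", len(E_VALUES), "have E(g) >= 1.00")
print("E(g) > 1.8:", ", ".join(above_1_8))
```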
Classes of PHX genes.
Tables 3 and 5 through 9 compare, for the four fast-growing bacteria, predicted levels of
expression of all ribosomal protein genes, of the genes for the major
transcription/translation processing factors, of the
chaperone/degradation protein genes, and of the major energy metabolism
genes. The extended repair gene repertoire of the four genomes and the
vitamin biosynthesis genes of E. coli are evaluated in terms
of E(g) levels (Tables
10 and 11). Each class is discussed in turn.
TABLE 5. Predicted expression levels for translation/transcription processing genes among four fast-growing bacteria^a
TABLE 8. Relationship between aminoacyl-tRNA synthetase expression levels and amino acid frequencies in E. coli proteins^a
(i) Ribosomal protein genes (Table 3).
Ribosomes of the four
fast-growing bacteria have practically the same numbers of small- and
large-subunit proteins. However, among all prokaryotic genomes, that
number ranges from 50 to 65, while in eukaryotes, the number is
constant at 79 (except in yeast, 78) (48, 50). This
information suggests a greater range of variation in the patterns of
protein synthesis among prokaryotes, consistent with the constrained
phylogenetic origin of eukaryotic cells compared with the less
constrained origin of prokaryotic species.
Thirty-five RP genes are shown in Table 3 (only those ≥100 codons
long). Unlike those of yeast and Drosophila, many of the bacterial RP genes are concatenated to form a large operon
encompassing 20 to 40% of all RP genes. Genes for some of the major
translation/transcription processing factors, including tuf,
fus, rpoA, rpoB, and rpoC, are within or near the large RP operon. Other RP operons typically consist of two to five genes. In E. coli, the cluster of
L7/L12, L10, L1, L11, rpoB, and rpoC is
noteworthy. B. subtilis possesses an RP cluster that
effectively combines the two largest clusters of E. coli. In
these fast-growing bacteria, most of the eubacterial RP genes are
positioned near the origin of replication, oriC. It is
evident from Table 3 that virtually all RP genes are PHX. The EF-Tu
gene is often duplicated, with both copies being PHX and incorporated
near or in an RP cluster. groEL, rpoB, and
rpoC also tend to localize to the vicinity of the main RP
cluster. Many eukaryotic and eubacterial ribosomal proteins are
multifunctional (50).
The "giant" RP (labeled S1 or RpsA, generally exceeding 500 amino
acids in length) has a remarkable phylogeny. It is recognized in most
eubacteria but is not part of an RP operon, and it generally reaches
among the highest expression levels. In B. subtilis, there is an S1 homolog, but it is only 327 codons long, and the S1 gene is entirely missing from the three current completely sequenced mycoplasma genomes. The S1 gene is essential in E. coli,
where it is thought to contribute to the initiation of polypeptide
synthesis. The absence of a full-length S1 protein in B. subtilis can
possibly be compensated for by a strong ribosome binding site
(34). The evolutionarily deep branching bacterium
Aquifex aeolicus has a giant S1 gene. Thermotoga
maritima, allowing for a frameshift, also has an S1 homolog. None
of the archaeal genomes has an S1 homolog, and eukaryotic genomes also
lack an S1 homolog.
The origin of replication (oriC) for E. coli is
identified within the 232-bp interval from 3923372 to 3923603. The
major RP cluster is proximal to oriC at 3436600 to 3476134 and contains, in addition to RP genes, genes for the elongation factors
EF-Tu and EF-G and two flanking chaperones of the peptidyl-prolyl
cis-trans isomerase (PPIase) family. Proximity to
oriC implies a higher-than-average gene copy number per
rapidly growing cell. A second RP cluster occurs proximally on the
other side of oriC and includes genes for a duplicate copy
of EF-Tu (tufB) and the DNA-directed RNA polymerase units
rpoB and rpoC. The E(g)
values for RP genes (≥100 codons long) in E. coli range
from 2.44 to 1.13. All but one of the RP genes are PHX; the single
exception is L9 in B. subtilis. The majority have
E(g) values exceeding 1.50. The correlations of
E(g) values among the RP genes of E. coli, V. cholerae, and H. influenzae are high (Table 3).
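The coordinate arithmetic behind the statements about oriC can be sketched as follows. The chromosome length of 4,639,221 bp is an assumption (the published E. coli K-12 MG1655 sequence length) and is not stated in the text; the helper names are hypothetical.

```python
# Assumed: E. coli K-12 MG1655 chromosome length in bp (not given in the text)
GENOME_LENGTH = 4_639_221

ORIC = (3923372, 3923603)        # oriC interval quoted in the text
RP_CLUSTER = (3436600, 3476134)  # major RP cluster quoted in the text


def interval_length(start, end):
    """Length in bp of an inclusive coordinate interval."""
    return end - start + 1


def circular_distance(a, b, genome_length=GENOME_LENGTH):
    """Shortest distance in bp between two positions on a circular chromosome."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)


oric_len = interval_length(*ORIC)
cluster_len = interval_length(*RP_CLUSTER)
gap = circular_distance(RP_CLUSTER[1], ORIC[0])

print(f"oriC spans {oric_len} bp; RP cluster spans {cluster_len} bp")
print(f"cluster end to oriC: {gap} bp ({gap / GENOME_LENGTH:.1%} of the chromosome)")
```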
Does stoichiometry matter? For example, among the RP genes, why aren't
all 50S units PHX at the same expression level? A partial answer may be
that not all ribosomal proteins play an exclusive role in determining
ribosome structure. Some may have a regulatory role (e.g., S1 is
proposed to function in translation initiation) (M. Nomura, personal
communication) (34). The acidic ribosomal protein
component P0 is PHX in archaea but is absent from
eubacteria. L7/L12 is also acidic and is thought to act in adapting
mRNA chains to the ribosome. Actually, L7/L12 forms dimers with an
elongated shape. Two dimers associate with a copy of L10 to form a very strong complex (4). Notably, several
ribosomal proteins are multifunctional (50). For example,
S9 provides ancillary utility in certain repair activities
(49); S16, in part, acts as an endonuclease
(31).
(ii) Genes for transcription/translation processing factors (Table 5).
The majority of protein synthesis factors are PHX over all
prokaryotic genomes. Expression levels correlate highly across species (Table 5, footnote a). As with the ribosomal
proteins, the E(g) values cover a wide range.
Elongation factor EF-G (fus) is distinctive, with an
E(g) value exceeding 2 for each genome. The
highest expression levels in E. coli occur for the RpoB and RpoC subunits of the core RNA polymerase. RpoA is PHX in B. subtilis but not in E. coli, V. cholerae,
and H. influenzae. Why are the predicted expression levels
for the RpoB and RpoC subunits higher than that for RpoA? Based on the
RNA polymerase stoichiometry (one copy of RpoB, one copy of RpoC, but
two RpoA units), should one expect elevated expression levels for RpoA
compared to RpoB and RpoC? A possible explanation relates to the
differences in protein sizes, RpoB and RpoC being larger proteins than
RpoA. It has been observed for E. coli that codon
choices in long genes tend to be more biased than those in short genes
(10). Interestingly, Mycoplasma genitalium, its
relative Ureaplasma urealyticum, and the spirochete
Treponema pallidum feature PHX RpoA but not RpoB and RpoC.
(iii) Chaperone/degradation protein genes (Table 6).
Among the top PHX genes in most eubacterial genomes are those for
the major chaperone protein archetypes, DnaK and GroEL. These reach
E(g) values exceeding 1.3 (>2 in E. coli). The gene for the multifunctional enzyme Pnp, fundamental in
RNA processing and mRNA degradation, attains the highest predicted
E(g) value, 2.66, among all E. coli
genes. Pnp is PHX in many eubacterial genomes but not in B. subtilis.
Thioredoxin (trxA) facilitates protein folding by catalyzing
the formation or disruption of disulfide bonds. The eukaryotic thioredoxin homolog is protein disulfide isomerase, operating in the
endoplasmic reticulum. It has been verified experimentally that protein
disulfide isomerase assists protein folding (7, 15,
47). The highest E(g) values for
thioredoxin occur in B. subtilis (1.35) and then in other
fast-growing bacteria in the order D. radiodurans (1.23)
(data not shown), V. cholerae (1.21), H. influenzae (1.11), and E. coli (1.06).
Peptidyl-prolyl cis-trans isomerases (PPIases) accelerate
the proper folding of proteins by promoting the cis-trans
isomerization of imide bonds in proline within oligopeptides. E. coli has at least nine PPIases defined by sequence similarity. One
of these, the survival protein SurA, enhances the folding of
periplasmic and outer membrane proteins. As expected, SurA does not
exist in gram-positive B. subtilis, which has neither
compartment. Trigger factor (Tig) is a ribosome-associated chaperone
that can complement DnaK (8). Tig and DnaK cooperate in
the folding of newly synthesized proteins. Simultaneous deletion of Tig
and DnaK is lethal under usual growth conditions (43). Tig
is broadly PHX for eubacterial genomes but is not found for archaeal
genomes. Expression levels of Tig in fast-growing bacteria
are quite similar (Table 6).
DegP is a chaperone folding factor that is significantly PHX,
with an E(g) value of 1.26; it acts
primarily in degrading misfolded proteins in the
periplasm. Also associated with periplasmic and cytoplasmic
chaperones are several PPIases, including PpiC [E(g) = 1.02], PpiB (1.53), FkpA (1.40), SlyD (2.08), PpiA (0.95), PpiD (1.11), SurA (1.10), FhlB (0.85), and YaaD (0.77); four are
active in the periplasm, and five are active in the cytoplasm.
Another relevant chaperone protein is disulfide oxidase (DsbA), which is marginally PHX, with an E(g) value of
1.02;
it senses misfolded proteins in the periplasm.
Correlations among the fast-growing bacteria for levels of expression
of major chaperone genes are generally significantly high (Table 6,
footnote a). However, E. coli and B. subtilis are marginally correlated (0.3). In E. coli,
degradation proteins are mostly PHX, but this is not consistently the
case for the other fast-growing bacteria. Why are the major
chaperone genes so often PHX? Chaperone/degradation proteins are
vitally needed both during rapid growth and in stationary phase.
In normal cell physiology, these proteins have multiple functions: they
contribute decisively in ensuring correct protein folding, in remedying
misfolded structures, in directing protein trafficking, and in
coordinating protein secretion. Chaperone proteins also contribute to
conformational changes and to minimizing protein damage during stress.
(iv) Levels of expression of aminoacyl-tRNA synthetases (Table 7).
There are 19 PHX tRNA synthetase polypeptides in E. coli, including two subunits of phenylalanyl-tRNA synthetase
(PheS-α and PheT-β) and two subunits of glycyl-tRNA
synthetase (GlyQ-α and GlyS-β). However, there are only eight
in V. cholerae, seven in H. influenzae, and three in B. subtilis. IleS is missing
from H. influenzae, and GlnS is missing from B. subtilis, which uses amidotransferase modifications to produce
Gln-tRNA^Gln from Glu-tRNA^Glu synthetase. In fact, the GlnS
gene is absent from most prokaryotic genomes (14).
Expression level correlations for the tRNA synthetase genes among the
three rapidly dividing gram-negative genomes are generally positive but
low. On the other hand, the expression levels in B. subtilis and E. coli are uncorrelated (0.04) and those
in B. subtilis and V. cholerae are
modestly negatively correlated (−0.24). LysS is the only PHX tRNA
synthetase for all four genomes.
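Cross-genome comparisons of this kind reduce to a correlation between E(g) values of homologous genes. The sketch below assumes the Pearson coefficient, which the text does not specify, and uses hypothetical helper names. Only genes present in both genomes enter the calculation, so a missing homolog (such as IleS or GlnS above) is simply dropped from the pair.

```python
from math import sqrt


def paired_values(a, b):
    """Return E(g) values for the genes present in both dictionaries."""
    shared = sorted(set(a) & set(b))
    return [a[k] for k in shared], [b[k] for k in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)
```

Given two dictionaries that map gene names to E(g) values for two genomes, `pearson(*paired_values(genome_a, genome_b))` returns the coefficient over their shared genes.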
There are three aminoacyl-tRNA synthetases in E. coli which occur at only moderate predicted expression
levels: CysS, with an E(g) of 0.89; TrpS, with an
E(g) of 0.91; and HisS, with an E(g) of 0.74. The average amino acid usage
frequencies for E. coli genes correlate positively with
the predicted expression levels for tRNA synthetases. Interestingly,
the three lowest amino acid usage frequencies in E. coli are
for Cys (1.2%), Trp (1.5%), and His (2.3%) (Table 8).
(v) Levels of expression of major energy metabolism genes (Table 9).
Enzymes of major catabolic pathways can be divided into four
groups: glycolysis, pyruvate metabolism, the pentose phosphate pathway,
and the TCA cycle. The glycolysis genes are predominantly PHX in all
four fast-growing bacteria, with very high
E(g) values, >2.00, for several of these genes
in E. coli. Hexokinase and glucokinase are
prominent glycolysis proteins in most eukaryotes, but the former is not
found in most prokaryotes, including the four fast-growing bacteria under analysis in this study. Why? In glycolysis,
hexokinase converts glucose to glucose-6-phosphate. However,
glucose-6-phosphate arises from other hexoses and from glucose
transported into the cell via the phosphotransferase system. Perhaps
the multiplicity of sources means that glucokinase need not be PHX.
Glucokinase occurs in many (but not all) eubacteria, normally at low to
moderate E(g) values, 0.3 to 0.8.
The genes for pyruvate dehydrogenase are commonly PHX in the four
genomes. The TCA genes are generally PHX in E. coli but generally not PHX in H. influenzae and B. subtilis. In B. subtilis, two TCA genes are PHX
and the others cover the range 0.4 to 1.0. Many prominent TCA genes
appear to be absent from H. influenzae. Why are TCA
genes in B. subtilis mostly not PHX? The TCA cycle, apart
from energy (ATP) production, can contribute in myriad ways to cellular
needs, especially in making precursors and intermediates to
macromolecules, e.g., in amino acid, vitamin, and heme biosyntheses (see Discussion). The order of actions in the TCA cycle is as follows: citrate synthase (GltA; in B. subtilis, there
are two versions, designated CitZ and CitA), aconitate hydratase
(AcnA/AcnB), isocitrate dehydrogenase (Icd), 2-oxoglutarate
dehydrogenase (SucA), succinyl coenzyme A (succinyl-CoA) synthetase
(SucD and SucC), succinate dehydrogenase (SdhB, SdhC, and SdhD),
fumarate hydratase (FumA, FumB, FumC, or CitG), and malate
dehydrogenase (Mdh/CitH). The initial enzymes of the TCA pathway in
E. coli are all PHX, with E(g) values
≥1.29, whereas those beyond succinyl-CoA synthetase (except for Mdh)
all have E(g) values
≤1.10, and most are not PHX. Apart from the differences in the expression levels among the TCA
cycle genes, correlations among genomes for energy metabolism gene
expression levels across all four fast-growing bacteria are high,
suggesting similar uses for this set of enzymes (Table 9, footnote a).
Certain gene groups generally not PHX.
Specific regulatory
proteins or proteins responding to special demands and used few times,
as in the highly specialized DNA repair processes, are not expected to
be PHX. Also, specific transcription proteins and DNA replication
proteins, because the cell assembles few replication machines, tend not
to be PHX.
(i) Genomic repair proteins.
Table 10 reports predicted
expression levels for the main collection of repair proteins for the
four genomes. Only two repair proteins of E. coli reach PHX
levels: RecA and Ssb (single-stranded DNA binding protein)
[E(g) for both, 1.48]. Two other repair
proteins are borderline PHX: Dut (deoxyuridine 5'-triphosphate
nucleotide hydrolase) and HepA [E(g) = 0.97 and 0.99, respectively]. Other repair proteins have low to
moderate predicted expression levels, the E(g)
values almost always in the range from 0.35 to 0.80. These evaluations
parallel those for D. radiodurans, in which RecA
[E(g), 2.04] has a dramatically high
predicted expression level and MutT (gene no. DR2358) reaches an
E(g) of 1.29, these being the only two proteins
qualifying as PHX (22). The other repair proteins of
D. radiodurans have E(g) values in the
range 0.40 to 0.80.
(ii) Vitamin biosynthesis proteins (Table 11).
Pathways to the
synthesis of vitamins, of which only small amounts are needed to
provide adequate cofactor function, have largely low predicted
expression levels, with E(g) values of about 0.40 to 0.75. In E. coli, the genes acting in the synthesis of six vitamin cofactors, biotin, thiamine, riboflavin, lipoate, pyridoxal, and cobalamin, were examined. Only RibH, which participates in riboflavin biosynthesis, is PHX in E. coli. Although the
enzymes of the biosynthetic pathways are poorly expressed, some of the enzymes that utilize the vitamins as cofactors are highly expressed, for example, biotin carboxylase (a subunit of E. coli
acetyl-CoA carboxylase). In B. subtilis, RibE, which
is not PHX, acts in the same pathway and forms an oligomeric complex with
RibH in which the structural union (RibE-RibH) combines 3 units of RibE
with 60 units of RibH (23). This anomalous stoichiometry
makes it likely that RibH furnishes structural support and, for this
reason, is PHX; in this guise, RibH may be used in other
capacities. Paradoxically, RibH is not PHX in B. subtilis.
Interestingly, M. tuberculosis features nine PHX proteins
among the vitamin biosynthesis pathways. Synechocystis and
A. aeolicus each have three PHX vitamin biosynthesis
genes, Borrelia burgdorferi has one, Archaeoglobus
fulgidus has two, T. pallidum has one, and
D. radiodurans has one. The biotin carboxylase protein
is PHX in the E. coli, H. influenzae, V. cholerae,
Helicobacter pylori, Synechocystis, Chlamydia trachomatis, and
A. fulgidus genomes.
(iii) Genes of signal transduction pathways.
In Table 8 of
reference 21, the predicted expression levels for several
two-component sensor genes (histidine kinases) of E. coli and B. subtilis are reported. In
all of those examples, the predicted expression levels were low, the
E(g) values ranging from 0.30 to 0.70.
One particular example is the Cpx regulon of the sensor
kinase/phosphatase periplasmic family, which encompasses the
genes encoding CpxA and CpxR (components of a histidine
kinase), CpxP (downregulates the Cpx pathway), and NlpE
(membrane lipoprotein), believed to eliminate abnormal proteins in
the periplasm and to recover amino acids during nitrogen
starvation (32). These proteins regulate a hierarchy of
factors, including σ32 and σE, active in autoregulation and repression.
The predicted expression levels are low [for CpxA,
E(g) = 0.70; for CpxR,
E(g) = 0.57; for CpxP,
E(g) = 0.62; and for NlpE,
E(g) = 0.61], as is common with specific
regulatory proteins. Cpx is a sensor kinase acting in the
periplasm. The Cpx pathway apparently also monitors pilus assembly
during infection of tissues by uropathogenic E. coli (17).
(iv) Principal starvation genes of E. coli and their predicted levels of expression (Table 12).
The genes shown in Table 12
are associated with starvation states, as discussed in the review
(26). Three genes in this category are PHX:
dps, also labeled pexB
[E(g), 1.13], which provides protection from
oxidative radicals; rpoH, which encodes σ32 [E(g), 1.46]; and
the survival protein, SurA [E(g), 1.10], a
chaperone which is a member of the PPIase family. We expect these
proteins, by virtue of their codon usage patterns, to be capable of
high levels of expression, especially when induced by starvation. Other starvation proteins (Table 12) have low to moderate
E(g) values. The σE factor, which regulates the activity of other periplasmic proteins, is not PHX, and the same is true for
σ54 and σ38, which respond to nitrogen and carbon
starvation, respectively. However, σ32
(rpoH), the principal chaperone sigma factor, pervasively
registers as PHX, presumably to establish high levels of chaperone
production.
Homologous PHX genes among the fast-growing bacteria.
Table
13 compares the numbers of homologous
PHX gene families among the four rapidly dividing bacteria. There are
60 gene families common to the four fast-growing bacteria, with each
member PHX. Thirty-two of these are families of RP genes, eight are
families of TF genes, and nine are families of genes essential for
energy metabolism. Twenty-three gene families, including five CH genes and five TF genes, have PHX representatives in E. coli but are not PHX in the other
three fast growers.
E. coli and V. cholerae share 124 pairs of homologous
genes in which both genes are PHX and a total of 236 pairs in which one or
both genes are PHX; the respective values for E. coli and
H. influenzae are 105 and 226, and the values for
V. cholerae and H. influenzae are 94 and
156. Paired PHX genes between fast-growing bacteria and
non-fast-growing bacteria are fewer in numbers (Table
14). Of homologous
genes among genomes with at least one PHX gene, the expression levels
for E. coli versus archaeal genomes and E. coli versus H. pylori and M. genitalium genomes are uncorrelated or negatively
correlated (Table 14). Similarly, V. cholerae,
H. influenzae, and B. subtilis expression
levels correlate negatively with homologous genes of archaeal genomes,
possibly reflecting differences in lifestyles, habitats, and energy
sources.
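The pair counts and correlations summarized above can be illustrated with a minimal sketch. The function names, the toy E(g) values, and the single cutoff E(g) >= 1.00 used here to call a gene PHX are assumptions made for illustration only; the actual PHX criteria are those given in Materials and Methods.

```python
from math import sqrt

PHX_CUTOFF = 1.00  # assumed simplification of the PHX criteria


def pearson(pairs):
    """Pearson correlation of paired values; None if it is undefined."""
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    if saa == 0 or sbb == 0:
        return None
    return sab / sqrt(saa * sbb)


def summarize_homolog_pairs(pairs, cutoff=PHX_CUTOFF):
    """pairs: list of (E(g) in genome A, E(g) in genome B) for homologs.

    Returns (pairs with both genes PHX, pairs with one or both genes PHX,
    correlation of E(g) values over the latter set).
    """
    both = sum(1 for a, b in pairs if a >= cutoff and b >= cutoff)
    one_or_both = [(a, b) for a, b in pairs if a >= cutoff or b >= cutoff]
    return both, len(one_or_both), pearson(one_or_both)


# Hypothetical E(g) values, not taken from any table in this paper.
toy_pairs = [(1.2, 1.1), (1.5, 0.6), (0.5, 0.4), (0.8, 1.3)]
both, one_or_both, r = summarize_homolog_pairs(toy_pairs)
```

In this toy set the correlation is computed only over pairs with at least one PHX member, which mirrors the restriction described for Table 14.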
TABLE 14. Numbers of pairs of homologous genes with one or both genes PHX and correlations between their E(g) values
Codon usages along the gene and expression levels.
For relatively long genes (≥600 codons long), we determined expression
levels with the gene length divided into three equal parts (5', middle,
and 3' parts). The pairwise correlations among the three parts of the
E. coli genes are high (0.86, 0.85, and 0.88), indicating that expression levels calculated from
codon biases are effectively the same for the three parts of genes.
Independent of gene size, we observed (20) that the middle
and 3' end of the genes show quite similar codon frequencies, whereas the 5' third-codon ensemble possesses somewhat different codon frequencies. This finding may reflect differences in
translation initiation versus later stages of translation elongation. A
prominent example concerns encoding of arginine with major codons
(CGN) versus minor codons (AGR). The AGR codons are scarce in
E. coli genes and are restricted mostly to the 5'
end of the genes (especially to the initial 30 bp), whereas CGN
codons are preferred elsewhere in the genes (6).
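As a rough illustration of the partition into 5', middle, and 3' parts and of the arginine codon contrast, the following sketch splits a coding sequence into thirds and counts CGN versus AGR codons in each part. The function names and the toy sequence are hypothetical, and the handling of gene lengths not divisible by three is an assumption.

```python
def split_into_thirds(cds):
    """Split a coding sequence into 5', middle, and 3' codon lists.

    Trailing bases that do not form a full codon are ignored. If the codon
    count is not divisible by three, the 3' part receives the extra codons
    (an assumption of this sketch).
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    k = len(codons) // 3
    return codons[:k], codons[k:2 * k], codons[2 * k:]


def arginine_usage(codons):
    """Return (CGN count, AGR count) for a list of codons."""
    cgn = sum(1 for c in codons if c.startswith("CG"))
    agr = sum(1 for c in codons if c in ("AGA", "AGG"))
    return cgn, agr


# Toy sequence with AGR codons placed near the 5' end.
toy_cds = "ATGAGAAGG" "CGTCGCCGA" "CGGCGTTAA"
usage_by_part = [arginine_usage(part) for part in split_into_thirds(toy_cds)]
```

Applied to real genes, the per-part codon lists could be passed to whatever expression measure is in use, and the three resulting values compared across genes.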
PHX ORFs shared by the four fast-growing genomes.
Genes are
considered homologous if their SSPA (significant segment pair
alignment) score (percent similarity; see reference 5) is ≥40%. Examples include three ORFs (yaaH, yajC,
and yeeX) common to E. coli and V. cholerae, three similar ORFs (yfiD, yjjK, and yebC) present in the genomes of E. coli,
V. cholerae, and H. influenzae, respectively, and
one ORF (ybaB) common to E. coli and B. subtilis. These PHX genes of unknown function offer attractive candidates for mutagenesis and knockout studies to determine
their functions.
Distributions of PHX genes over the chromosomes.
Clusters of
PHX genes are displayed in Table 15.
Statistical significance was assessed using the r-scan
analysis protocol described elsewhere (18).
The PHX genes in each cluster generally possess the same transcription
orientation, mostly that of the leading strand. However, E. coli features the PHX fumarate reductase operon genes (kb 4380→4376) frdD, frdB, and frdA untypically
located in the lagging strand (the direction of transcription is
indicated by the arrow). The genes encoding the principal units of NADH
dehydrogenase I, N, L, I, G, F, and C cover positions 2402→2387 (about a 5-kb extent) on the leading strand.
The PHX gene clusters of E. coli, apart from the segments at
kb 450→447 and kb 4380→4376 of the cytochrome o
ubiquinol oxidase operon and the fumarate reductase
operon, respectively, are all located in the leading strand.
Note that the two RP clusters near oriC (kb 3476→3437 and kb 4174→4183) include a number of TF genes and some PPIase
genes. There are no extended intervals devoid of PHX genes in the
E. coli genome.
The V. cholerae large chromosome contains two significantly
long segments, at kb 43 to 327 and kb 1657 to 1985, each devoid of PHX
genes and positioned antipodal in the chromosome. The main PHX clusters
correspond to long RP operons located in the leading strand.
These descriptions indicate that PHX genes are irregularly distributed
in the V. cholerae chromosomes. The V. cholerae
genome has two chromosomes (chromosome I, 2.96 Mb, and chromosome II, 1.07 Mb) containing 138 PHX genes and 14 PHX genes, respectively. The
PHX genes in the large chromosome comprise 7% of its genes. V. cholerae has a single PHX RP gene on chromosome II.
In H. influenzae, the PHX clusters are of RP genes and
protein synthesis genes.
B. subtilis contains a PHX cluster which features a
conglomerate of 27 RP genes (kb 118→154) intermeshed with the
protein synthesis genes rpoB, rpoC,
fus, tuf, and rpoA. A compact
operon of PHX genes distinguishes five glycolysis genes (kb 3482→3475), enolase (eno), phosphoglycerate mutase
(pgm), triosephosphate isomerase (tpi),
phosphoglycerate kinase (pgk), and
glyceraldehyde-3-phosphate dehydrogenase (gap),
located in the leading strand. The cluster at kb 3475 to 3482 ostensibly renders the main glycolysis genes highly efficient,
putatively making it less important to express many respiration genes.
All clusters are located in the leading strand. B. subtilis
also has a 245-kb stretch devoid of PHX genes, at kb 35 to 280.
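Intervals devoid of PHX genes, such as those noted above, can be located with a simple scan over sorted gene positions on a circular chromosome. This is only a sketch: the function name and the toy coordinates are hypothetical, and it does not implement the r-scan significance test (18), which is what decides whether a gap is statistically unusual.

```python
def largest_phx_free_interval(phx_positions_kb, chromosome_length_kb):
    """Return (start_kb, end_kb, length_kb) of the longest stretch between
    consecutive PHX genes on a circular chromosome.

    The interval may wrap through the origin of the coordinate system, in
    which case end_kb is smaller than start_kb.
    """
    if not phx_positions_kb:
        raise ValueError("at least one PHX gene position is required")
    pos = sorted(phx_positions_kb)
    best = None
    for i, start in enumerate(pos):
        end = pos[(i + 1) % len(pos)]
        # Modulo handles the wrap-around gap from the last gene to the first.
        length = (end - start) % chromosome_length_kb
        if len(pos) == 1:
            length = chromosome_length_kb  # a single gene leaves one full-circle gap
        if best is None or length > best[2]:
            best = (start, end, length)
    return best


# Toy coordinates (kb) on a hypothetical 400-kb circular chromosome.
gap = largest_phx_free_interval([300, 10, 35, 280], 400)
```

A fuller treatment would use gene start and end coordinates rather than single positions, so that gene length does not inflate the gaps.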
DISCUSSION
Gene expression can be evaluated in several ways. One currently
popular way centers on DNA microarrays (DNA chips) aiming to dissect
gene expression under varied physiological, clinical, and environmental
conditions. These DNA chips have been applied to the monitoring of
genes in different situations for the discovery of genes associated
with diseases; for assessment of gene expression under inducements from
drugs, chemicals, or toxins; for ascertainment of genes compensatory
for knockout mutations; and for profiling of gene expression patterns
in temporal and tissue-specific localizations. The current microarray
methodology is restricted to discriminating transcription levels and
not levels of translation or protein abundances (33, 42).
Also, DNA chip hybridizations are generally unable to detect
unambiguously low-abundance gene transcripts. Experimental evaluations
of protein abundances under different cellular conditions can be
assayed by 2D gel electrophoresis (reviewed in reference
46) supplemented by mass spectrometry (51),
by antibody associations, and by biochemical tests. Also, correlations of 2D gel proteomes and microarray assessments of transcriptomes generally appear to be weak (13). However, Futcher and
coworkers (11) reexamined these correlations in yeast and
found generally good agreement.
Codon choice is presumably influenced by protein structure via
evolutionary selection for the most accurately translated sequences at
structurally important locations. Codon choices may be different at the
beginning of a gene than at the central part of the gene (6). It has been suggested that translation pause sites,
especially early in the coding sequence, can slow translation
initiation (16). Accordingly, there appear to be
conflicting selection pressures imposed by constraints on ribosomal
binding for the rate of initiation, rate of elongation, and overall
translation fidelities. In rapidly growing cells, where ribosomes are
limiting for protein synthesis, a ribosome stalled at a rare codon
is unavailable for the synthesis of other proteins, and the higher the
molar abundance of the stalled protein, the greater the disruption of cellular growth (52). Protein structure may be correlated
with codon usage (e.g., see references 30 and
44). Thanaraj and Argos (44) argue the
rare-codon hypothesis for domains and secondary structures, in
which repetition of rare codons reduces translation rates and
introduces translation pauses, allowing time for protein domains and
secondary structures to fold into native structural conformations.
Codon usage offers another way to evaluate gene expression with a
different set of limitations. Our sequence methods are effectively complementary to the experimental procedures of 2D gel electrophoresis and DNA microarray analysis in assessing gene expression levels. By our
methods, genes similar in codon frequencies to RP, TF, and CH genes
but strongly deviant in codon usage from the average gene are
identified as PHX. Our analyses and data support the hypothesis that
each genome has evolved codon usage patterns indicating "optimal" gene expression levels for most situations of its
habitat, energy sources, and lifestyle. The three protein
families (ribosomal proteins, major translation/transcription
processing factors, and chaperone/degradation proteins) are fundamental
at many stages of the cell life in promoting growth and
stability. Generally, PHX genes exploit favorable codon usages,
tend to possess strong Shine-Dalgarno sequences, and putatively possess
strong promoter sequences (cf. reference 21). Some
limitations of our method result from an implicit assumption that the
codon usage of a gene is not affected by its location in the
genome, e.g., G+C-rich versus A+T-rich regions. The high variance of
G+C composition (isochores) along mammalian genomes may be prohibitive
with respect to predicting gene expression levels from codon
usages. However, the nucleotide compositions of bacterial genomes are
largely homogeneous. Some genes that deviate in G+C content (e.g.,
those for transposases or specialized pathogenicity islands) tend to be
detected as "putative alien" genes (22, 28).
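The idea that a PHX gene is close in codon usage to the RP, TF, and CH standards but far from the average gene can be sketched numerically. The formal definitions are those of Materials and Methods and are not reproduced in this section, so everything below is an assumed illustrative form: the bias measure (an amino-acid-weighted sum of absolute differences in within-family codon frequencies), the weights 1/2, 1/4, and 1/4 on the three standards, the function names, and the toy two-amino-acid code.

```python
from collections import defaultdict


def family_frequencies(counts, codon_to_aa):
    """Codon frequencies normalized to sum to 1 within each amino acid."""
    totals = defaultdict(float)
    for codon, n in counts.items():
        totals[codon_to_aa[codon]] += n
    return {c: n / totals[codon_to_aa[c]] for c, n in counts.items() if n > 0}


def codon_bias(gene_counts, class_counts, codon_to_aa):
    """Assumed bias of a gene relative to a gene class standard."""
    g = family_frequencies(gene_counts, codon_to_aa)
    c = family_frequencies(class_counts, codon_to_aa)
    total = sum(gene_counts.values())
    aa_weight = defaultdict(float)  # amino acid frequencies of the gene
    for codon, n in gene_counts.items():
        aa_weight[codon_to_aa[codon]] += n / total
    return sum(
        aa_weight[aa] * abs(g.get(codon, 0.0) - c.get(codon, 0.0))
        for codon, aa in codon_to_aa.items()
    )


def expression_measure(gene, all_genes, rp, tf, ch, codon_to_aa):
    """Ratio of bias from the average gene to bias from the standards."""
    numerator = codon_bias(gene, all_genes, codon_to_aa)
    denominator = (0.5 * codon_bias(gene, rp, codon_to_aa)
                   + 0.25 * codon_bias(gene, ch, codon_to_aa)
                   + 0.25 * codon_bias(gene, tf, codon_to_aa))
    return float("inf") if denominator == 0 else numerator / denominator


# Toy two-amino-acid code and hypothetical codon counts.
code = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}
average = {"AAA": 5, "AAG": 5, "GAA": 5, "GAG": 5}
standard = {"AAA": 8, "AAG": 2, "GAA": 8, "GAG": 2}
biased_gene = {"AAA": 9, "AAG": 1, "GAA": 8, "GAG": 2}

e_biased = expression_measure(biased_gene, average, standard, standard, standard, code)
e_average = expression_measure(average, average, standard, standard, standard, code)
```

A gene resembling the standards yields a ratio well above 1, while a gene with average codon usage yields a ratio near 0; a real analysis would use the full genetic code and genome-wide codon counts.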
What does the expression level E(g) for a gene
g reflect? Gene expression in prokaryotes is regulated at
initiation, elongation, and termination of transcription and of
translation, by different rates of transcription and translation, by
differential mRNA stabilities, by segmental stability differences
in polycistronic messages, by codon preferences, and by
interactions with chaperone and other proteins. Expression is also
influenced by lifestyle, habitat, and energy sources. The classes (RP,
TF, and CH) of proteins that we have chosen to represent highly
expressed genes are needed in high molar abundances when a high rate of
protein synthesis is essential.
Multifunctional proteins and PHX levels.
A protein that belongs to a PHX class and that performs several functions
might be expected to register higher E(g)
values than the average PHX gene. We offer several examples.
Polynucleotide phosphorylase (Pnp) is fundamental in RNA processing and
mRNA degradation, and the gene attains the highest E(g) value, 2.66, among all the E. coli genes. Pnp is also a component of the mRNA degradosome,
which involves RNase E, DnaK, RhlB helicase, and enolase
(27). RNase E is also PHX in E. coli, with an
E(g) value of 1.22, but it is not PHX in H. influenzae and V. cholerae and it is missing from
B. subtilis. As an important multifunctional protein, Pnp is
expected to be PHX at an increased level. The Pnp gene also has the
highest E(g) value among all the genes in B. burgdorferi. This gene is also significantly PHX in the
genomes of H. influenzae, V. cholerae, Synechocystis, M. tuberculosis, T. pallidum, Chlamydia pneumoniae, A. aeolicus, and
T. maritima.
Enolase obtains the very high E(g) values of 2.11 in E. coli, 1.93 in H. influenzae, 1.59 in
V. cholerae, and 1.92 in B. subtilis. Again,
enolase is multifunctional, acting in energy metabolism (glycolysis)
and partly in RNA degradation.
The enzyme aconitate hydratase (aconitase) interconverts citrate and
isocitrate in the TCA cycle. Aconitase also serves as a sensor,
detecting changes in the redox state and assaying iron content within
the cell (36). This protein can further function as a
transcriptional activator that specifically regulates gene expression
for the transferrin receptor and controls quantities of ferritin
(2). At its iron sulfur center, aconitase can be inactivated by oxidative stress or iron deprivation. Aconitase has the
highest E(g) value, 2.56, in D. radiodurans (see also reference 22), and its gene is
PHX in many genomes.
Apart from structural roles in ribosome formation, several ribosomal
proteins act in multifunctional capacities (50). For example, the S9 protein is an accessory protein functioning in DNA
repair (49). The E(g) values for S9
in the four fast-growing bacteria studied here are all >1.50,
particularly
1.90 in E. coli and V. cholerae,
significantly higher than the average ribosomal protein
E(g) value. The L25 ribosomal protein (93 aa in
E. coli) is homologous to the general stress protein (Ctc).
This protein achieves the very high E(g) values
of 1.90 in E. coli and 1.89 in D. radiodurans.
Ctc is PHX also in C. trachomatis, Campylobacter jejuni, H. pylori, T. maritima, and A. aeolicus, none of which carries the L25 gene. In contrast, Ctc is absent in E. coli, V. cholerae, and H. influenzae, but their genomes encode
the L25 ribosomal protein. In almost all genomes, the Ctc and L25
protein genes are mutually exclusive. The large ribosomal protein S1
gene is almost always among the top levels of eubacterial PHX
genes. We conjecture that the S1 protein (generally
500 aa) possesses multifunctional activity yet to be determined. Interestingly, the S1
protein is composed of repetitions of an 86-aa element, usually
involving six or more copies.
Other multifunctional PHX proteins from many genomes include
glyceraldehyde-3-phosphate dehydrogenase, acting primarily in the first step of the second phase of glycolysis. This protein is very
promiscuous, showing uracil DNA glycosylase activity, and binds to tRNA
and DNA and to proteins with glutamine repeats. In eukaryotes, it also
structurally binds filaments of actin and microtubules
(39).
The elongation factor EF-1α is an essential component of the
translation apparatus and also has a major function in severing microtubules (38). Phosphoglycerate kinase also functions
as a disulfide reductase (25). Many different metabolic
proteins serve as crystalline components for the lenses of different
animal eyes; these include PPIases, aldehyde dehydrogenase,
arginosuccinate lyase, enolase, and aldose reductase.
Contrasts in PHX levels among genes involved in energy metabolism
in E. coli and B. subtilis.
As
indicated earlier, certain genes of energy metabolism are
predominantly PHX in all four fast-growing bacteria and have high
expression levels [E(g), often >2.00]. This is
manifestly valid (Table 9) for glycolysis genes and for genes of
pyruvate oxidation. Why should most of the TCA genes of E. coli be PHX but not those of B. subtilis? We
suggest four possible contributing causes. (i) Perhaps
B. subtilis makes less use of the TCA cycle for ATP
production than E. coli. The principal glycolysis genes of
B. subtilis, unlike those of E. coli
(dispersed all over the E. coli genome), are encoded from a
single cluster (gap, pgk, tpi, pgm, and eno); see
our earlier discussion of PHX clusters. (ii) The TCA cycle has at least
two main tasks: the first, aerobic energy (ATP) production, and the
second, synthesis of carbon chain precursors to various essential
metabolites, such as amino acids. Can many of these precursors be more
easily acquired by other means in B. subtilis? B. subtilis, in marked contrast to E. coli, has four
PHX flagellin genes (flagellin [hag], flagellar hook protein [flgE], flagellar hook basal body
[fliE], and flagellin homolog [yvzB]),
whereas a single flagellin gene of E. coli is PHX
(21). Moreover, flagellar genes are strictly regulated and inducible in E. coli but constitutive in B. subtilis (40). Assuming that soil is the primary
B. subtilis habitat and that the human gut is the primary
habitat for E. coli, different metabolic patterns may be
appropriate. The swimming movements of B. subtilis
mediated by its PHX flagellar proteins may facilitate the acquisition
of nutrients, such as amino acids, from an assortment of soil sources. B. subtilis also excretes many digestive enzymes in
gathering macromolecular nutrients for possible predatory objectives
(1). (iii) There are also differences between E. coli and B. subtilis in energy pathways, which can
influence expression levels. For example, E. coli uses
succinyl-CoA as a precursor in the biosynthesis of lysine and
methionine, whereas B. subtilis uses acetyl-CoA for this
objective. E. coli possesses isocitrate lyase (AceA), the first enzyme of
the glyoxylate shunt pathway, which competes with isocitrate dehydrogenase
for isocitrate; this shunt is very effective for
acquiring a net carbon gain in the metabolism of fatty acids.
B. subtilis lacks AceA. The early genes in the TCA cycle of
B. subtilis, those for aconitase and isocitrate
dehydrogenase, are PHX, whereas the remaining genes are only
predicted moderately expressed. Apparently, the order of TCA genes can
be important. (iv) B. subtilis and E. coli
are both facultative aerobic organisms (29). For
anaerobic respiration, B. subtilis relies exclusively
on nitrate or nitrite as its terminal electron acceptor, whereas
E. coli has many alternative acceptors.
Highly expressed genes under varying conditions.
Can our
methods be applied in conjunction with microarray analysis? We cannot
change the codon usage of a given gene, but we can change the gene
class standards for discerning expression levels relative to these gene
classes (see Materials and Methods). Here, the gene class standards are
RP, TF, and CH. It is hypothesized that similarity of codon usages,
as characterized in Materials and Methods, for two or more natural gene
classes may identify new genes with similar properties, as in the
defining gene classes. Effectively, codon usage patterns provide a
means to correlate genes and functional categories (20).
By using several gene classes as standards, a figure corresponding to
Fig. 1 but in multiple dimensions, when coupled to a suitable
clustering analysis, may discriminate additional genes highly expressed
relative to the different gene class standards. For example, when we
compare codon usages of genes with respect to the B. subtilis sporulation genes versus the class of all genes, the two
coordinates plot a straight line. In another example, yeast
mitochondrial genes feature a melange of PHX genes, putative alien
genes, and average genes, and the genes for the ribosomal proteins
functioning in the mitochondrion tend to show codon usages akin to
average genes.
We thank G. Miklos, F. Neidhardt, A. L. Sonenshein, and A. Spormann for valuable discussions on the manuscript.
This work was supported in part by NIH grants 5R01GM10452-35 and
5R01HG00335-12.