In Silico Prediction and Functional Validation of σ28-Regulated Genes in Chlamydia and Escherichia coli

ABSTRACT σ28 RNA polymerase is an alternative RNA polymerase that has been proposed to have a role in late developmental gene regulation in Chlamydia, but only a single target gene has been identified. To discover additional σ28-dependent genes in the Chlamydia trachomatis genome, we applied bioinformatic methods using a probability weight matrix based on known σ28 promoters in other bacteria and a second matrix based on a functional analysis of the σ28 promoter. We tested 16 candidate σ28 promoters predicted with these algorithms and found that 5 were active in a chlamydial σ28 in vitro transcription assay. hctB, the known σ28-regulated gene, is expressed only late in the chlamydial developmental cycle, and two of the newly identified σ28 target genes (tsp and tlyC_1) also have late expression profiles, providing support for σ28 as a regulator of late gene expression. One of the other novel σ28-regulated genes is dnaK, a known heat shock-responsive gene, suggesting that σ28 RNA polymerase may be involved in the response to cellular stress. Our σ28 prediction algorithm can be applied to other bacteria, and by performing a similar analysis on the Escherichia coli genome, we have predicted and functionally identified five previously unknown σ28-regulated genes in E. coli.

Genome sequencing has indicated that all Chlamydia species encode two alternative sigma factors, suggesting a role for alternative forms of RNA polymerase in chlamydial gene regulation. We have demonstrated that one of these alternative RNA polymerases, σ28 RNA polymerase, transcribes hctB (24), a gene whose transcript is detectable only at late time points in the chlamydial developmental cycle (6,16). hctB encodes Hc2, one of two histone-like proteins in Chlamydia that have been shown to be responsible for the condensation of DNA during conversion of the metabolically active form of chlamydiae, known as a reticulate body, to the infectious extracellular form, the elementary body (8). To date, hctB is the only σ28-regulated gene that has been identified in Chlamydia, and it is not known whether the role of σ28 RNA polymerase is confined to the regulation of late gene expression in the developmental cycle.
To identify additional σ28-regulated genes in Chlamydia, we have combined the use of bioinformatics, to predict σ28-regulated promoters in the chlamydial genome, with testing of promoter activity in a chlamydial σ28 in vitro transcription assay. We used two in silico approaches, identifying candidate promoters on the basis of sequences that either resemble the consensus bacterial σ28 promoter (9,10) or are predicted to be highly transcribed by σ28 RNA polymerase based on functional studies (25). Using information from both approaches, we have developed a computer algorithm to identify candidate σ28 promoters in the chlamydial genome and have shown that five promoters are transcribed by chlamydial σ28 RNA polymerase. This method can be applied to other bacterial genomes, and we have also identified five new σ28-regulated genes in Escherichia coli.

MATERIALS AND METHODS
Development of a program for extracting sequences. We developed a program called SequenceExtractor to extract user-defined DNA sequences from a genome. The program requires two input files, consisting of a genome sequence file and a file containing the start and stop coordinates for each gene within the DNA sequence being examined. We applied this program to extract two files in fasta format from each of the genomes of C. trachomatis serovar D, E. coli K-12, and Salmonella enterica serovar Typhimurium using sequences obtained from TIGR (http://www.tigr.org). For each organism, the first output file contained 200 bp of sequence upstream for each gene ("200 bp upstream"). The second output file was more restrictive and contained up to 200 bp of upstream sequence for each gene, provided that these sequences were in the intergenic region and not within the coding region of the nearest upstream gene ("200-bp nonoverlap").
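The two extraction modes can be illustrated with a minimal sketch. This is not the SequenceExtractor source, which is not reproduced here; the function names, the tuple layout for gene coordinates, and the restriction to forward-strand genes are all assumptions made for illustration (genes on the reverse strand would additionally require taking the reverse complement of the downstream flank).

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (gene name, start, stop), 1-based inclusive,
# forward strand only (an assumption of this sketch).
Gene = Tuple[str, int, int]


def extract_upstream(genome: str, genes: List[Gene], length: int = 200,
                     nonoverlap: bool = False) -> Dict[str, str]:
    """Return up to `length` bp of sequence immediately upstream of each gene.

    With nonoverlap=True the region is trimmed so that it never enters the
    coding region of a preceding gene ("200-bp nonoverlap" mode).
    """
    regions: Dict[str, str] = {}
    prev_stop = 0  # furthest coding coordinate seen so far (1-based)
    for name, start, stop in sorted(genes, key=lambda g: g[1]):
        left = max(start - 1 - length, 0)   # 0-based slice start
        if nonoverlap:
            left = max(left, prev_stop)     # stay inside the intergenic region
        regions[name] = genome[left:max(start - 1, left)]
        prev_stop = max(prev_stop, stop)
    return regions


def to_fasta(regions: Dict[str, str]) -> str:
    """Format the extracted regions as fasta records."""
    return "".join(f">{name}\n{seq}\n" for name, seq in regions.items())
```

Calling `extract_upstream(genome, genes)` corresponds to the "200 bp upstream" output, and `extract_upstream(genome, genes, nonoverlap=True)` to the "200-bp nonoverlap" output; a gene that overlaps its upstream neighbor yields an empty region in the second mode.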
Development of a program for predicting promoters. We also developed a program, called PromoterMatcher, that uses a probability weight matrix to predict promoters in a genome. We generated two probability weight matrices, each based on complementary information about σ28 promoter structure, and applied them to an input file consisting of extracted sequences in fasta format. The frequency matrix was based on a set of 21 known bacterial σ28 promoters and takes into account the frequency of occurrence of each nucleotide at each position within this promoter set. The activity matrix used functional data in the form of the relative promoter strength attributable to each nucleotide at each promoter position and was derived from a comprehensive mutational analysis of a σ28 promoter (25). For each promoter position, the algorithm assigned a probability value to the four possible nucleotides, with a total probability of 1. Both matrices also contained probability-weighted models for the length of the spacer between the two promoter elements based on either nucleotide frequency or relative promoter activity. The final score for each candidate promoter was determined by summing the log of the probability at each position (which is the mathematical equivalent of multiplying the probabilities). Only the highest-scoring promoter was recorded per upstream region. The predicted promoters were sorted by score, from best to worst.
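The scoring rule described above (sum of log probabilities over the promoter positions plus the spacer length, keeping only the best site per upstream region) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the PromoterMatcher code: the function name and data layout are hypothetical, and the choice to skip any site containing a zero-probability nucleotide is an assumption, since the text does not say how zero entries were handled.

```python
import math
from typing import Dict, List, Optional, Tuple

Column = Dict[str, float]  # nucleotide -> probability; each column sums to 1


def best_promoter(seq: str, m35: List[Column], m10: List[Column],
                  spacer: Dict[int, float]) -> Optional[Tuple[float, int, int]]:
    """Return (log score, offset, spacer length) of the best site in `seq`.

    `m35` and `m10` hold one probability column per position of the two
    promoter elements; `spacer` maps each allowed spacer length to its
    probability. Returns None if no site can be scored.
    """
    best = None
    for offset in range(len(seq)):
        for gap, p_gap in spacer.items():
            start10 = offset + len(m35) + gap
            if start10 + len(m10) > len(seq) or p_gap <= 0:
                continue
            site = seq[offset:offset + len(m35)] + seq[start10:start10 + len(m10)]
            probs = [col.get(nt, 0.0) for col, nt in zip(m35 + m10, site)]
            if min(probs) <= 0:
                continue  # assumption: sites with a zero-probability base are skipped
            # Summing logs is equivalent to multiplying the probabilities.
            score = math.log(p_gap) + sum(math.log(p) for p in probs)
            if best is None or score > best[0]:
                best = (score, offset, gap)
    return best
```

Applying `best_promoter` to every extracted upstream region and sorting the results by the first tuple element, in descending order, gives a ranked list of the kind described in the text.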
Generation of the sequence logo. All sequence logos were derived using SEQLOGO, which is available online at http://ep.ebi.ac.uk/EP/SEQLOGO/. The format for data input into this site is a series of numbers representing either nucleotide frequency or relative promoter activity. The resulting sequence logo consists of stacks of letters at each position. The height of the stack indicates the importance of a particular position for promoter activity. The height of an individual letter within a stack indicates the relative preference for that nucleotide based on transcriptional activity or frequency (with a maximum height defined as 2 bits).
Cloning of transcription plasmids. Each candidate σ28-regulated promoter to be tested was cloned into a plasmid so that promoter activity could be measured with a σ28 in vitro transcription assay. The promoter insert, consisting of sequence from approximately 300 bp upstream of the transcription start site to the +5 position, was amplified by PCR using either C. trachomatis serovar D or E. coli K-12 genomic DNA and respective primers (see Table S1 in supplemental material). This PCR insert was cloned upstream of a synthetic G-less cassette transcription template in plasmid pMT1125 (23). Transcription from the predicted promoter by σ28 RNA polymerase was expected to produce a 130-nt transcript.
In vitro transcription reactions. Transcription reactions were performed as previously described (24,25). C. trachomatis σ28 RNA polymerase was reconstituted by mixing 1 µl C. trachomatis recombinant His6-σ28 with 1 µl heparin-agarose-purified C. trachomatis RNA polymerase at 4°C for 15 min, immediately prior to the transcription reaction. E. coli σ28 RNA polymerase was reconstituted from 1 µl E. coli recombinant His6-σ28 and 0.03 units E. coli core enzyme (Epicenter, Madison, Wis.). For antibody inhibition reactions, 8 µg of rabbit polyclonal antichlamydial σ28 antibodies (24) was preincubated with the RNA polymerase for 20 min at room temperature prior to transcription.
Purification of C. trachomatis RNA polymerase from chlamydiae grown in tissue culture. C. trachomatis serovar L2 was grown in mouse L929 cells and harvested at 18 h postinfection (hpi). RNA polymerase was partially purified by heparin-agarose chromatography as previously described (21).
Purification of reticulate body RNA. L929 cells grown in suspension and infected with C. trachomatis serovar D were recovered by centrifugation and lysed by Dounce homogenization as previously described, with slight modifications (21). A second centrifugation step separated chlamydiae from host cellular debris. Chlamydial RNA was extracted using RNA STAT-60 (Teltest, Inc., Friendswood, TX).

Mapping transcription start sites by primer extension. The primer was prepared from 100 ng of a DNA oligonucleotide that was labeled with 50 µCi of [γ-32P]ATP in the presence of T4 polynucleotide kinase at 37°C for 1 h. Unincorporated free nucleotides were removed with a DNA mini-Quick Spin DNA column (Roche Diagnostics, Indianapolis, Ind.). Radioactive samples were counted with a scintillation counter. Fifty micrograms of reticulate body RNA and 5 × 10^6 cpm labeled primer were preheated at 65°C for 10 min and chilled on ice. cDNA was synthesized by adding Superscript II reverse transcriptase (Invitrogen, Carlsbad, Calif.) and 10 mM deoxynucleoside triphosphates, followed by incubation at 42°C for 50 min. The reaction was stopped by the addition of distilled water and a 1/10 volume of 3 M sodium acetate to a total volume of 100 µl, followed by phenol-chloroform extraction and chloroform extraction. cDNA was recovered by ethanol precipitation, dried, and resuspended in 9 µl formamide stop solution (95% formamide, 20 mM EDTA, 0.05% bromophenol blue, 0.05% xylene cyanol). The primer extension products were electrophoresed on a 6% acrylamide-urea sequencing gel together with a single-stranded M13mp18 DNA sequence ladder and exposed to X-ray film.

FIG. 2. (A) Each value in the frequency matrix indicates how many of the 21 known σ28 promoters in the training set contained the given nucleotide at that promoter position. A sequence logo depicting the relative nucleotide preference at each position in the promoter is shown below the matrix (25). (B) Activity matrix for the -35 and -10 promoter elements with values derived from a mutational analysis of the C. trachomatis hctB promoter as described in the text (25). At each promoter position, the values are proportional to the relative promoter activity attributable to that nucleotide for a total of 100%. The sequence logo for the promoter is shown below the matrix. Details of the sequence logo format are presented in Materials and Methods and Results.

RESULTS
Development of computer algorithms to identify σ28 promoters. We developed two computer algorithms, which we used in parallel to identify candidate σ28 promoters within a genome (Fig. 1). The first program, called SequenceExtractor, selects sequences from a genome for analysis by the second program, PromoterMatcher, which makes predictions on the basis of the σ28 promoter structure and sequence. We used the structure of the extended bacterial σ28 promoter (12) with eight positions in the -35 element and another eight positions in the -10 element separated by a spacer of variable length.
We focused our search on the intergenic region upstream of each predicted gene where promoters were most likely to be present. SequenceExtractor was used to select sequences up to 200 bp upstream of each gene, provided that they were not in a coding region (200-bp nonoverlap region). In Chlamydia, however, many intergenic regions are short, and promoters have been located within the upstream gene (13). Thus, we also separately examined all sequences in the region 200 bp upstream of each gene even if they were beyond the intergenic region (200-bp upstream region).
To identify candidate σ28 promoters within these upstream sequences, we used the PromoterMatcher algorithm to apply a weighted matrix and assign probability scores for the 16 promoter positions and the spacer length. To increase the likelihood of identifying σ28 promoters, we used two weighted matrices, each based on a different measure of the contribution of sequence to promoter activity. The first matrix, called the frequency matrix, was based on the occurrence of a given nucleotide at each promoter position within a compilation of 21 known σ28 promoters, including 20 promoters from E. coli and Salmonella, and the C. trachomatis hctB promoter. For example, at the seventh position in the -10 element (Fig. 2A), an A was present in 19 of the 21 promoters, and the remaining two promoters had a G at this position. Accordingly, an A was given a strong weighting of 19/21, while the weighting for a G was 2/21. As all known σ28 promoters, with the exception of the chlamydial promoter, have a spacer of 11 nt, this spacer length was also heavily weighted.
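As a worked illustration of the frequency weighting, the sketch below derives one matrix column from the nucleotides observed at a single aligned position, reproducing the 19/21 and 2/21 weights from the example above. The function names are hypothetical. The optional `pseudocount` parameter is an assumption added for illustration: the text does not state how nucleotides that never occur at a position were weighted, and a raw frequency of zero cannot be log-transformed.

```python
from collections import Counter
from typing import Dict, List


def frequency_column(nucleotides: str, pseudocount: float = 0.0) -> Dict[str, float]:
    """Weight each nucleotide by its frequency at one aligned position.

    `nucleotides` holds the base observed at this position in each training
    promoter. `pseudocount` (an assumption, not from the paper) keeps
    unobserved bases from receiving a weight of exactly zero.
    """
    counts = Counter(nucleotides)
    total = len(nucleotides) + 4 * pseudocount
    return {nt: (counts[nt] + pseudocount) / total for nt in "ACGT"}


def frequency_matrix(aligned: List[str], pseudocount: float = 0.0) -> List[Dict[str, float]]:
    """Build one column per position from equal-length aligned promoter elements."""
    return [frequency_column("".join(col), pseudocount) for col in zip(*aligned)]


# Seventh position of the -10 element: A in 19 promoters, G in 2.
column = frequency_column("A" * 19 + "G" * 2)
```

Here `column["A"]` is 19/21 and `column["G"]` is 2/21, while T and C receive zero unless a pseudocount is supplied.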
A second weighted matrix, called the activity matrix, assigned a weighting to each of the four possible nucleotides for every position based on the promoter activity attributed to that nucleotide in a mutational analysis of the hctB promoter (25). For example, at the seventh position of the -10 element, hctB promoter activity with C. trachomatis σ28 RNA polymerase was greatest when an A was present and was reduced by 2.6-, 3.4-, and 19.3-fold with a G, T, or C, respectively (25). We thus assigned probability scores for A (58/100), G (22/100), T (17/100), or C (3/100) that were proportional to these promoter activities (Fig. 2B). The probability weighting for the spacer length was based on the measured effect of a spacer length from 9 to 13 nt on hctB promoter activity (25).
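The arithmetic behind these activity weights can be checked with a short sketch: each fold reduction is converted to a relative activity (1/fold, with the best nucleotide set to 1), and the four activities are normalized to sum to 1. The function name is hypothetical; only the fold-reduction values come from the text.

```python
from typing import Dict


def activity_weights(fold_reduction: Dict[str, float]) -> Dict[str, float]:
    """Convert fold reductions in promoter activity into probability weights.

    A fold reduction of 1.0 marks the most active nucleotide; the weights are
    proportional to relative activity (1 / fold) and sum to 1.
    """
    relative = {nt: 1.0 / fold for nt, fold in fold_reduction.items()}
    total = sum(relative.values())
    return {nt: activity / total for nt, activity in relative.items()}


# Seventh position of the -10 element of the hctB promoter.
weights = activity_weights({"A": 1.0, "G": 2.6, "T": 3.4, "C": 19.3})
percent = {nt: round(100 * w) for nt, w in weights.items()}
print(percent)  # {'A': 58, 'G': 22, 'T': 17, 'C': 3}
```

The rounded percentages match the probability scores quoted in the text.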
By applying these two weighted matrices to the two sets of upstream sequences, we generated four lists of candidate σ28 promoters [...] Table S2 (frequency matrix), and Table S3 (activity matrix) in the supplemental material.

Five candidate chlamydial promoters were transcribed by σ28 RNA polymerase. We chose 16 of the top candidate promoters (Table 1) for functional testing with our chlamydial σ28 in vitro transcription assay. In general, these promoters were among the top-50-scoring promoters in at least two of the four prediction lists. Since the source of our core enzyme contains chlamydial σ66 RNA polymerase activity (24), we tested for transcription in the absence and presence of recombinant chlamydial σ28 as a measure of σ66-specific and σ28-specific activity, respectively. We also assayed for σ28-dependent activity by testing for inhibition of transcription by anti-σ28 antibodies.
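The general selection pattern described above, in which candidates recur among the top 50 of at least two of the four prediction lists, can be expressed as a simple filter. This sketch only illustrates that pattern; the text presents it as a tendency rather than a strict rule, so the function name, its parameters, and its use as a hard cutoff are assumptions.

```python
from collections import Counter
from typing import Dict, List


def recurrent_candidates(ranked_lists: Dict[str, List[str]], top_n: int = 50,
                         min_lists: int = 2) -> List[str]:
    """Return genes found in the top `top_n` of at least `min_lists` lists.

    `ranked_lists` maps a list label (for example, matrix type plus upstream
    set) to gene names sorted from best to worst score.
    """
    hits: Counter = Counter()
    for genes in ranked_lists.values():
        hits.update(set(genes[:top_n]))  # count each gene once per list
    return sorted(gene for gene, n in hits.items() if n >= min_lists)
```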
Five of the 16 candidate promoters tested showed σ28-specific activity (Fig. 3), and an additional three promoters (yebL, yhbZ, and CT425) were weakly transcribed (data not shown). Four of the strongly transcribed promoters (tsp, dnaK, tlyC_1, and bioY) produced a transcript only when recombinant chlamydial σ28 was added, as was the case with the hctB positive control promoter. Transcription of these four promoters was also abrogated by rabbit polyclonal anti-σ28 antibodies (Fig. 3, lane 3). The results were less clear-cut with the pgk promoter, because although there was a large increase in transcription when σ28 was added, there was baseline transcription in the absence of σ28, raising the possibility of some σ66-dependent activity. Anti-σ28 antibodies decreased transcription of the pgk promoter, but there was still residual transcript present. These results provide evidence that the promoters for tsp, dnaK, tlyC_1, and bioY are transcribed by σ28 RNA polymerase and suggest that the pgk promoter is recognized by both σ28 and σ66 RNA polymerases.
For in vivo validation of these results, we used primer extension to map the transcription start sites for the three strongest promoters, hctB, tsp, and pgk, to within 6 nt of the predicted σ28 -10 promoter element, at a location consistent with the predicted promoter (Fig. 4). A previously mapped transcription start site for dnaK (5) was located within 8 nt of the σ28 promoter that we have predicted for this gene.

pgk is regulated by two overlapping promoters. Analysis of the sequence in the pgk promoter revealed a possible σ66 promoter overlapping the predicted σ28 promoter. To confirm the presence of an active σ28 promoter, without the confounding effect of a second promoter, we introduced substitutions predicted to disrupt the putative σ66 promoter but not the σ28 promoter (Fig. 5A). With this mutant promoter, there was no baseline σ66-dependent activity, and all transcription was dependent on the addition of chlamydial σ28 (Fig. 5B, lanes 1 and 2). Transcription was specifically inhibited by anti-σ28 antibodies (Fig. 5B, lane 3). These results provide good experimental support for the predicted σ28-dependent pgk promoter and an overlapping σ66 promoter.
Five predicted E. coli σ28 promoters were transcriptionally active. As we also had functional data for promoter recognition [...] Tables S4 and S5 for E. coli and Tables S6 and S7 for Salmonella (see the supplemental material). Many of these promoters are known σ28 promoters in E. coli and Salmonella. We tested seven candidate σ28 E. coli promoters that had not been previously studied (Table 2) and found that five (modA, ynjH, yecF, yhiL, and yjcS) were functionally active in an E. coli in vitro σ28 transcription assay (Fig. 6).

DISCUSSION
This study demonstrates how a combination of a bioinformatic analysis and functional validation can be used to identify previously unrecognized target genes of an alternative RNA polymerase. From a genome-wide search for sequences resembling known σ28 promoters and sequences that have been shown to be highly transcribed by σ28 RNA polymerase, we identified five novel σ28-regulated genes in Chlamydia and another five new σ28-regulated genes in E. coli. Although we did not test any of the predicted Salmonella σ28 promoters, our list of top-scoring promoters includes three (STM3152, STM3216, and STM2314) of four newly identified σ28 target genes in S. enterica serovar Typhimurium (7). These results demonstrate that our promoter prediction algorithm can successfully identify σ28 promoters, and it is likely that additional σ28-regulated genes remain to be discovered in many bacterial genomes.
Our results show that the combination of a frequency matrix, derived from known σ28 promoter sequences, and an activity matrix, based on a mutational analysis of a chlamydial σ28-dependent promoter, increased our chances of identifying additional σ28 promoters. hctB and tsp had the two strongest promoters in terms of transcriptional activity and sequence conservation with the bacterial σ28 consensus promoter, and both ranked equally high with the two matrices (see Tables S2 and S3 in the supplemental material). For promoters with weaker sequence conservation, such as bioY, the activity matrix may be a better predictor. For instance, bioY ranked in the top 10 using the activity matrix (Table 1) but was not in the top 50 with the frequency matrix.
In general, we found that a strict pattern-matching algorithm based only on the bacterial σ28 consensus sequence would not be very sensitive as a means of identifying σ28-dependent promoters in Chlamydia. With the exception of the hctB promoter, the other chlamydial σ28 promoters identified in this study are not well conserved with the bacterial consensus promoter. For example, while the dnaK promoter (TAAAGGAA-N11-AACGAAGA) contains the signature TAAA of the -35 promoter element, the -10 promoter element has only a 4/8 match with the consensus sequence. The CGA motif was the only recognizable sequence in the dnaK -10 promoter element, highlighting the importance of this motif for σ28 promoter activity (25). In all, the CGA motif of the -10 element was present in five of the six transcriptionally active chlamydial σ28 promoters.
In Chlamydia, two of the five newly identified σ28-regulated genes have been shown to be expressed late in the developmental cycle in a fashion similar to that of the original σ28 target gene, hctB. Transcripts for hctB, tsp, and tlyC_1 were each first observed at 16 hpi by microarray expression analysis (3). These late expression profiles support a role for σ28 RNA polymerase [...] (3). It is worth noting, however, that this microarray analysis measures only steady-state transcript levels and would not be able to distinguish between the temporal activity of multiple promoters, such as transcription of pgk by both σ28 and σ66 RNA polymerases. Thus, it is entirely possible that σ28-regulated expression of these target genes may also be restricted to late time points, and as yet, there is no definitive evidence that σ28 RNA polymerase is transcriptionally active at earlier times in the chlamydial developmental cycle. In summary, there is accumulating evidence for σ28-dependent regulation of a subset of late genes in Chlamydia, distinct from the late genes transcribed by σ66 RNA polymerase (6).

FIG. 6. In vitro transcription of predicted E. coli σ28-dependent promoters. Promoters were transcribed with E. coli σ28 RNA polymerase reconstituted from E. coli core enzyme and recombinant E. coli σ28 as described in the text.

pgk and dnaK are the first examples of genes in Chlamydia that can be transcribed by more than one form of RNA polymerase. With pgk, the promoters for σ28 RNA polymerase and σ66 RNA polymerase overlap and appear to have the same transcription start site (Fig. 5 and 7A), which raises the question of how promoter occupancy by the two forms of RNA polymerase is regulated. dnaK is known to be transcribed as part of the hrcA-grpE-dnaK operon by σ66 RNA polymerase (21) under the control of the HrcA repressor (23). We now show that dnaK has an independent promoter that is transcribed by σ28 RNA polymerase (Fig. 4 and 7B), and we predict that this promoter is responsive to heat shock. We base this prediction on the observation that elevated temperatures have been shown to upregulate levels of the dnaK transcript by greater than 10-fold, while hrcA and grpE mRNA levels were not similarly increased (5).
While we have identified a total of six σ28-regulated genes in Chlamydia, it is not clear whether these genes belong to a specific functional group. hctB encodes Hc2, a histone-like protein that causes DNA condensation (1,2,4,14,15). Tsp is a predicted protease with similarity to CPAF, a secreted chlamydial protease that cleaves host transcription factors involved in major histocompatibility complex class I and class II antigen expression (18,26). tlyC_1 encodes a hypothetical protein, which may be involved in hemolysis (20). Of the remaining three target genes, dnaK encodes a heat shock chaperone, pgk encodes a phosphoglycerate kinase, and bioY encodes a hypothetical protein with homology to a predicted biotin synthase in Bacillus subtilis and Treponema pallidum. Thus, unlike other bacteria, where σ28 RNA polymerase regulates particular classes of genes involved in chemotaxis, motility, and flagellum synthesis, it is not clear how these σ28 target genes in Chlamydia are related.
Our studies support a role for σ28 as a developmental regulator of late gene expression in Chlamydia, but little is known about how σ28 activity is itself regulated. Although Chlamydia encodes a predicted anti-sigma factor, RsbW, as part of a partner-switching mechanism, doubts have been raised about its ability to regulate σ28 (11). The discovery that one of the target genes of σ28, dnaK, is a known heat shock gene is intriguing and may provide clues about the signal for σ28-dependent transcription. Perhaps σ28 RNA polymerase is involved in the general stress response in Chlamydia, as supported by the finding that σ28 transcript levels were increased under conditions of heat shock (19). By extension, σ28-regulated transcription late in the developmental cycle may be triggered in response to cellular stress, such as nutrient deprivation or other conditions within the chlamydial inclusion, although the details remain to be elucidated.
Our promoter search algorithm is versatile and can be applied to predict σ28 promoters in other bacteria or promoters for other forms of RNA polymerase. σ28 promoter recognition appears to be conserved among bacteria (25), and thus, our existing frequency- and activity-weighted matrices can be readily used for other prokaryotic genomes. With the appropriate probability weight matrix, the algorithm can also be used to identify promoters recognized by different forms of RNA polymerase. More generally, this same algorithm could be applied to any DNA sequence, such as a protein-binding site, as long as examples are available to build the weighted matrix. As our results have shown, however, an essential component of this bioinformatic approach is the validation of the in silico predictions with functional testing.