Journal of Bacteriology, July 2005, p. 4928-4934, Vol. 187, No. 14
0021-9193/05/$08.00+0 doi:10.1128/JB.187.14.4928-4934.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegen, 6525ED Nijmegen, The Netherlands,1 Wageningen Centre for Food Sciences, Wageningen, The Netherlands,2 NIZO Food Research, Ede, The Netherlands3
Received 19 January 2005/ Accepted 17 April 2005
One class of covalently bound surface proteins is characterized by a cell wall-sorting motif called LPxTG (based on the main conserved residues). The motif is located at the C terminus of the protein, followed by a stretch of hydrophobic residues and a number of positively charged amino acids (13, 29). The hydrophobic domain and the charged tail probably keep the protein from being secreted into the medium, thereby allowing recognition of the LPxTG motif by a membrane-associated transpeptidase called sortase. Sortase cleaves the LPxTG motif between the T and G residues and covalently attaches the threonine carboxyl group to the peptidoglycan (23).
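The tripartite signal described above (motif, hydrophobic stretch, positively charged tail) can be illustrated with a minimal sketch. This is not the detection procedure used in this study; the hydrophobic residue set, the window sizes, the 70% cutoff, and the function name `has_sorting_signal` are all assumptions chosen for illustration.

```python
import re

# Canonical motif only; x (any residue) is written as "." in regex syntax.
MOTIF = re.compile(r"LP.TG")

# Assumed set of residues treated as hydrophobic in this sketch.
HYDROPHOBIC = set("AVLIFMWGCP")


def has_sorting_signal(protein, search_window=50, helix_len=15, min_fraction=0.7):
    """Return True if the C terminus resembles a cell wall-sorting signal.

    Looks for LPxTG in the last `search_window` residues, followed by a
    mostly hydrophobic segment and at least one K or R after that segment.
    """
    tail = protein[-search_window:]
    for match in MOTIF.finditer(tail):
        downstream = tail[match.end():]
        for start in range(len(downstream) - helix_len + 1):
            segment = downstream[start:start + helix_len]
            fraction = sum(aa in HYDROPHOBIC for aa in segment) / helix_len
            charged_after = any(aa in "KR" for aa in downstream[start + helix_len:])
            if fraction >= min_fraction and charged_after:
                return True
    return False


# Toy sequences (hypothetical, not taken from any genome).
toy_substrate = "M" + "A" * 30 + "LPETG" + "EENPFIGTTVFGGLSLALGAALLAGRRREL"
toy_secreted = "M" + "A" * 30 + "LPETG" + "DDEEDDEEDDEENNQQ"
print(has_sorting_signal(toy_substrate), has_sorting_signal(toy_secreted))
```

A fixed pattern of this kind misses deviating motifs such as NPQTN, which is one reason to prefer profile-based models over a single regular expression.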
Not all proteins that have been experimentally verified to be sortase substrates contain a cell wall-sorting motif that fits the pattern LPxTG. The sortase SrtB from Staphylococcus aureus recognizes the motif NPQTN (21), and Bierne and coworkers showed that a protein with an NAKTN motif is attached to the cell wall of Listeria monocytogenes by a sortase-like enzyme (8). Recently a protein with the strongly deviating QVPTGV motif was discovered to be a sortase substrate (5).
Many bacterial genomes encode more than one sortase (28), and five distinct subfamilies can be distinguished among these transpeptidases (10). It has been suggested that it is possible to predict the specificity of a sortase for a group of substrates based on the amino acid sequence of the sortase, the cell wall-sorting signal of potential substrates, and the relative positioning of genes encoding sortases and substrates on the bacterial chromosome (10). Genome context in particular seems a strong indicator of functional relationship, as sortases and their substrates are often encoded in gene clusters on bacterial chromosomes.
In this study, a comprehensive set of putative sortase substrates was identified by in silico analysis of 199 sequenced bacterial genomes.
Because the sortase recognition sequence LPxTG is very short, searching only for this motif (and its variants) yields many hits that are probably not sortase substrates, as judged by other characteristics such as the predicted number of transmembrane helices and the predicted protein function. Therefore, we have applied a combination of methods, including secondary structure prediction, pattern detection, genome context, and homolog detection, to reduce the number of incorrect predictions. Some bacteria preferentially encode sortase substrates that contain target sequences deviating slightly from the canonical LPxTG motif. The predicted sortase substrates of Lactobacillus plantarum, for example, contain an LPQTxE motif instead of an LPxTG motif (17). Because of this variation, optimization of the sequence pattern used for the detection of sortase substrates for a specific bacterium increases the sensitivity and selectivity of the analysis (16). We have applied species-specific hidden Markov models (HMMs) to identify putative sortase substrates and have determined the extent and nature of the species-specific variation for the LPxTG motif. Use of the hframe algorithm allowed us to detect putative sortase substrates on the DNA level that were not detected by the other methods, for example, due to errors in open reading frame calling.
Sequence analysis. Sequence similarity was detected with BLAST (1), while multiple sequence alignments were made with T-Coffee (27). Transmembrane helices were predicted with TMHMM 2 (18), and signal peptides were predicted with SignalP 2.0 (26).
The HMMER package (12) was used to construct HMMs based on these alignments and to scan protein sequences with HMMs. Pattern recognition analysis was performed with FindPatterns (32). Conserved sequence patterns were identified with MEME and MAST (3, 4). The hframe algorithm provided by Paracel was used to scan translated nucleotide sequences with protein-based HMMs.
Identification of sortase enzymes. Two HMMs from the Pfam database (7) were used to detect sortases: the sortase A HMM (PF04203, sortase) and the sortase B HMM (PF07170, sortase_B). All protein sequences were scanned with these HMMs, and all proteins with an E-value below 1e-05 were considered putative sortases. A search of the NCBI bacterial genome database for proteins annotated as sortases did not yield any additional hits.
Identification of sortase substrates. The identification of putative sortase substrates was performed as described below and is depicted in Fig. 1 (an in-depth description of these methods can be found at http://bamics3.cmbi.kun.nl/sortase_substrates/supplementary).
FIG. 1. Detecting sortase substrates. The steps shown in the dashed rectangle were carried out for each of the 154 genomes individually. Gray arrows indicate that all proteins meeting the selection criteria described in the box were taken to the next step. Black arrows indicate that the proteins had to meet additional criteria, as follows. (i) Proteins should have a transmembrane helix following the sortase recognition motif LPxTG. (ii) This helix should be followed by positively charged amino acid residues. (iii) Proteins should have three or fewer transmembrane helices in their complete precursor sequence. (iv) Proteins should not have a predicted function indicating intracellular localization.
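Criteria (i) to (iv) in the legend can be read as a simple filter over candidate proteins. The sketch below assumes that motif positions and transmembrane helix coordinates have already been computed elsewhere (for example, parsed from TMHMM output); the `Candidate` class, the keyword list, and the coordinate convention (0-based, end-exclusive) are hypothetical and serve only to illustrate the logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str          # complete precursor sequence
    motif_end: int         # index just past the LPxTG-like motif
    tm_helices: list       # (start, end) pairs, 0-based, end-exclusive
    annotation: str = ""   # predicted function, free text


# Hypothetical keywords taken to indicate intracellular localization.
INTRACELLULAR_KEYWORDS = ("ribosomal", "dna polymerase", "trna")


def passes_criteria(candidate, max_helices=3):
    """Apply criteria (i)-(iv) from the legend of Fig. 1 to one candidate."""
    # (i) a transmembrane helix must follow the sorting motif
    following = [h for h in candidate.tm_helices if h[0] >= candidate.motif_end]
    if not following:
        return False
    # (ii) the helix must be followed by positively charged residues
    helix_end = following[0][1]
    if not any(aa in "KR" for aa in candidate.sequence[helix_end:]):
        return False
    # (iii) three or fewer transmembrane helices in the complete precursor
    if len(candidate.tm_helices) > max_helices:
        return False
    # (iv) no predicted function indicating intracellular localization
    text = candidate.annotation.lower()
    return not any(keyword in text for keyword in INTRACELLULAR_KEYWORDS)


# Toy precursor (hypothetical): 66 residues, motif ends at index 36.
toy_seq = "M" + "A" * 30 + "LPETG" + "EENPFIGTTVFGGLSLALGAALLAGRRREL"
good = Candidate(toy_seq, 36, [(5, 25), (39, 54)], "surface protein")
too_many = Candidate(toy_seq, 36, [(1, 8), (9, 16), (17, 24), (39, 54)])
ribosomal = Candidate(toy_seq, 36, [(39, 54)], "50S ribosomal protein L2")
print([passes_criteria(c) for c in (good, too_many, ribosomal)])
```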
The second method involved the use of MEME and MAST to predict sortase substrates. The last 60 amino acids of all proteins containing a signal peptide were used as input for a MEME motif search. From the resulting list of motifs, the pattern with the highest resemblance to the C terminus of known sortase substrates was used in a genome-wide MAST search. For each organism, no more than one pattern was found that fit the characteristics of a cell wall-sorting signal. The results of the FindPatterns-HMM and MEME-MAST methods were combined to create an improved set of predicted sortase substrates.
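The input preparation for the motif search, as described above, amounts to taking the last 60 residues of every protein with a predicted signal peptide. A minimal sketch follows, assuming signal peptide predictions are available as a name-to-boolean mapping; the function name `c_terminal_fasta` and the data layout are illustrative and not part of the published workflow.

```python
def c_terminal_fasta(proteins, has_signal_peptide, length=60):
    """Build FASTA text with the C-terminal `length` residues of each
    protein that has a predicted signal peptide.

    proteins: dict mapping protein name to amino acid sequence
    has_signal_peptide: dict mapping protein name to True/False
    """
    records = []
    for name, sequence in proteins.items():
        if has_signal_peptide.get(name, False):
            # Slicing keeps the whole sequence if it is shorter than `length`.
            records.append(f">{name}\n{sequence[-length:]}\n")
    return "".join(records)


# Toy data (hypothetical identifiers and sequences).
proteins = {"p1": "M" * 10 + "A" * 70, "p2": "MKK"}
predictions = {"p1": True, "p2": False}
fasta_text = c_terminal_fasta(proteins, predictions)
print(fasta_text)
```

The resulting text can be written to a file and supplied to a motif-discovery tool; the command-line options needed for that step depend on the tool version and are not shown here.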
Additional substrates were found (i) by identifying proteins homologous to the putative sortase substrates of the improved set and (ii) by checking all proteins in gene clusters containing at least one sortase substrate or sortase enzyme. Then, on the basis of the resulting complete set, a new HMM was created which was used to rescan all protein sequences and to scan all chromosomal DNA sequences using the hframe algorithm, resulting in a final set of putative sortase substrates.
TABLE 1. Predicted sortase substrates in original set of 154 genomes
TABLE 3. Newly identified CDSs
Inspection of putative sortase substrate sequences showed that many proteins are detected with one or more mismatches in the LPxTG-like motif. Nevertheless, these proteins all met the criteria for sortase substrates as outlined in Materials and Methods. We evaluated the sensitivity of our method by searching the literature for proteins that were experimentally verified to be attached to the bacterial cell wall in a sortase-dependent manner. All of the 24 proteins for which we found experimental verification (5, 6, 8, 9, 19, 21, 25) were present in our data set of predicted sortase substrates, including those with highly deviating LPxTG-like motifs, illustrating the high sensitivity of our methods. These substrates are listed at http://bamics3.cmbi.kun.nl/sortase_substrates/supplementary.
Newly identified sortase substrates. The first set of putative sortase substrates we found by the initial FindPatterns-HMM and MEME-MAST methods was similar to the set of putative substrates that others have identified using methods very similar to the FindPatterns-HMM method (10). However, in the same set of genomes we found 65 additional putative sortase substrates (11% more) that were not identified by their methods. Most of the additional 65 putative substrates were identified with the help of homology, genome context, and the use of the hframe algorithm. Manual inspection showed that the main reasons why these additional substrates were not detected by the FindPatterns-HMM and MEME-MAST methods were (i) the deviation of some organism-specific sortase cleavage motifs from the generic LPxTG motif, (ii) the lack of a signal peptide (caused, for example, by the incorrect prediction of translation starts), and (iii) the fact that some substrates had not previously been recognized as protein-encoding genes.
Eight genomes contain at least one predicted sortase gene, while no sortase substrates were predicted by either the FindPatterns-HMM or MEME-MAST methods (Tables 1 and 2). In six of these genomes, one or two sortase substrates could be predicted by one of our other methods. One of these proteins, the single putative sortase substrate of Bradyrhizobium japonicum, had not been previously identified. The other proteins were already classified as putative sortase substrates by Interpro (22). In the two genomes without predicted sortase substrates (Methanobacterium thermoautotrophicum and Corynebacterium glutamicum), the role of the sortase-like transpeptidases remains unclear.
TABLE 2. Predicted sortase substrates in 45 recently sequenced genomes
With a bit-score threshold of 5, the LPxTG-HMM predicted 34 potential sortase substrates not identified by any of the other methods. Only four of these fulfilled the criteria of sortase substrates as described in Materials and Methods and unpublished data and hence were added to Tables 1 and 2. The other 30 proteins (5% of the total number of hits) did not meet these criteria, for example, due to the presence of too many predicted transmembrane helices. The bit-score threshold of 5 was determined empirically: a higher threshold causes many proteins fitting the criteria for sortase substrates as outlined in Materials and Methods to be missed, while a lower threshold of 4 leads to the inclusion of many proteins with a C-terminal membrane helix, followed by positively charged residues, but without an LPxTG-like motif.
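The empirical choice of a threshold can be pictured as a sweep over candidate cutoffs, counting at each cutoff how many retained hits do and do not meet the substrate criteria. The sketch below uses invented scores and assumes that a hit is retained when its bit score is greater than or equal to the threshold; it illustrates the trade-off only and does not reproduce the counts reported here.

```python
def sweep_thresholds(hits, thresholds):
    """Count hits that meet or fail the substrate criteria at each cutoff.

    hits: list of (bit_score, meets_criteria) pairs
    Returns a dict: threshold -> (meeting criteria, not meeting criteria)
    """
    table = {}
    for threshold in thresholds:
        kept = [meets for score, meets in hits if score >= threshold]
        meeting = sum(kept)
        table[threshold] = (meeting, len(kept) - meeting)
    return table


# Invented example scores (hypothetical, for illustration only).
toy_hits = [
    (12.0, True), (7.5, True), (5.2, True),
    (6.0, False), (4.6, False), (4.1, False),
]
result = sweep_thresholds(toy_hits, [4, 5, 6])
for threshold, (meeting, failing) in result.items():
    print(threshold, meeting, failing)
```

In this toy data set, lowering the cutoff from 5 to 4 adds only hits that fail the criteria, while raising it to 6 loses a hit that meets them, which mirrors the reasoning given for the threshold of 5.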
As mentioned earlier, application of the hframe algorithm revealed 13 additional genes encoding putative substrates (Table 3). Furthermore, the hframe algorithm identified another six sequences with all of the characteristics of sortase substrates, but for which no correct translation start could be identified without introducing a frameshift or removing an internal stop codon. In some cases, the introduction of a frameshift or the removal of a stop codon would merge a novel CDS encoding a putative sortase substrate (i.e., not previously recognized as a CDS) with a CDS already identified on the chromosome. It remains to be established whether these six additional sequences represent pseudogenes or sequencing errors.
Compared to the gram-positive anchor HMMs and suggested thresholds of the Pfam (7) and TIGRFAM (http://www.tigr.org/TIGRFAMs/) databases, LPxTG-HMM detects many more putative sortase substrates. Although the LPxTG-HMM slightly overpredicted the number of sortase substrates, the incorrectly identified substrates (i.e., proteins not fitting the criteria for sortase substrates as outlined in the methods section) were easily filtered out by application of the simple additional criteria mentioned in Materials and Methods. Furthermore, LPxTG-HMM outperformed the other methods in the detection of sortase substrates with a sortase recognition signal deviating from the consensus signal. For example, only 2 of the 17 sortase substrates of Streptomyces coelicolor are detected by the Pfam and TIGRFAM HMMs.
To determine whether or not cell wall-sorting-like signals are only present in the C termini of proteins, we scanned the complete sequences of all of the proteins taken from the NCBI bacterial genome database with the LPxTG-HMM. We identified only three proteins with a putative cell wall-sorting signal at a position other than the C terminus: two proteins with orthologs in Streptococcus pneumoniae R6 and S. pneumoniae TIGR4 and one protein with orthologs in L. monocytogenes EGD-e, L. monocytogenes 4b F2365, and Listeria innocua. The presence of orthologs in different strains indicates that these proteins are not the results of a sequencing anomaly (e.g., a frameshift caused by a sequencing error, leading to the fusion of two CDSs). All three proteins contained an N-terminal signal peptide, and the predicted function of the two pneumococcal proteins was consistent with an extracellular localization: one of the proteins was predicted to be a zinc metalloprotease, and the other was predicted to be an immunoglobulin A1 protease. The unusual position of the LPxTG motif in these sequences could be the result of a gene fusion event. The N-terminal parts of the three proteins did not have significant sequence homology to any sequence in the UniProt protein database (2).
Signal peptides. Each protein that is destined to become attached to the peptidoglycan via the LPxTG anchor should also have an N-terminal signal peptide with consensus cleavage motif AxA↓A (30, 31) for initial translocation of the protein across the cell membrane. Nevertheless, of our final list of 568 putative sortase substrates identified, 56 did not appear to have a signal peptide (as predicted by SignalP). However, upon closer inspection we were able to identify an N-terminal signal peptide for 43 of them (http://bamics3.cmbi.kun.nl/sortase_substrates/supplementary). In 25 cases, this required the selection of a different start codon than the one specified by the NCBI genome annotation; in 5 cases, this required the removal of a stop codon; and in 13 cases, it required the introduction of a frameshift. To determine whether or not such a stop codon or frameshift could be the result of a sequencing error would require access to the trace files of the sequencing projects. The gene identifiers and suggested changes to the CDSs for the 56 predicted sortase substrates without a signal peptide are shown at http://bamics3.cmbi.kun.nl/sortase_substrates/supplementary.
Species-specific anchoring motifs. Closely related organisms have similar sortase recognition consensus sequences, leading to similar HMMs. For instance, the organism-specific HMMs of B. anthracis Ames and B. cereus ATCC 10987 detect the same set of 10 putative sortase substrates in the B. anthracis genome. As expected, HMMs from less-similar organisms have less overlap; when the HMM based on the putative sortase substrates of S. coelicolor is used to scan the B. anthracis genome, only two putative substrates are recognized.
A graphic representation of the species-specific LPxTG consensus of every bacterium with two or more predicted sortase substrates can be found in our LPxTG-DB database (http://bamics3.cmbi.kun.nl/sortase_substrates). In some organisms, many putative sortase substrates have a cleavage motif that is highly conserved, but which deviates significantly from the generic LPxTG consensus and the motifs found in other organisms. Examples of such organisms and the frequency with which specific motifs are found in these organisms are shown in Fig. 2. The high conservation of these motifs suggests that these sortase substrates are species specific and implies that they either were not acquired through horizontal gene transfer or, if they were, have been rapidly optimized under selective pressure.
FIG. 2. Organism-specific cleavage motifs. The consensus sortase cleavage sites of L. plantarum (LPQTxE, found in 23 of 27 predicted sortase substrates), Lactobacillus johnsonii (LPQTG, found in 12 of 16 substrates), L. monocytogenes (LPxTGD, found in 33 of 42 substrates), and S. coelicolor (LAxTG, found in 15 of 17 substrates) are organism-specific variations on the generic LPxTG consensus. The overall height of each stack indicates the sequence conservation at that position (measured in bits), whereas the height of symbols within the stack reflects the relative frequency of the corresponding amino acid at that position. The Weblogo software (11) was used to visualize the motifs.
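The stack heights described in the legend can be approximated as follows: the information content of a column is the maximum entropy for a 20-letter alphabet minus the observed Shannon entropy, and each symbol receives a share of that height proportional to its frequency. The sketch below omits the small-sample correction that logo software may apply, so its values need not match the published figure; the function names and the toy motifs are illustrative.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content (bits) of one alignment column, uncorrected."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(motifs):
    """Per-position symbol heights for equal-length motif strings."""
    heights = []
    for column in zip(*motifs):
        bits = information_content(column)
        counts = Counter(column)
        heights.append({aa: bits * c / len(column) for aa, c in counts.items()})
    return heights


# Toy motif instances (hypothetical, not the data behind Fig. 2).
toy_motifs = ["LPQTG", "LPQTG", "LPKTG", "LAQTG"]
for position, stack in enumerate(logo_heights(toy_motifs), start=1):
    print(position, {aa: round(h, 2) for aa, h in stack.items()})
```

A fully conserved position reaches the maximum of log2(20), about 4.32 bits, whereas a variable position such as the x in LPxTG yields a much shorter stack.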
Searching in new genomes. Finally, we used the LPxTG-HMM to identify putative sortase substrates in the 45 new genomes that were made public after the date on which we took our original set of genomes from GenBank. For 10 of these 45 additional genomes, all from gram-positive bacteria, we predicted a total of 164 sortase substrates (Table 2), 7 of which had not been identified as CDSs in the GenBank annotation. The other 35 genomes did not encode any putative sortase substrates or sortases. The results of this analysis can also be found in our database of sortase substrates, LPxTG-DB.
Concluding remarks. We developed an HMM which quickly and reliably recognizes the putative sortase substrates in any sequenced genome. Although the model does not explicitly incorporate all of the information available, when used together with the hframe algorithm it recovers >99% of the putative substrates detected by several other methods combined. When the combination of methods we have described in this research is used, an average of 11% additional putative sortase substrates can be identified compared to previously used methods.
Our sortase-substrate website contains information on the species-specific sortase recognition sites identified, the LPxTG-HMM, and brief instructions on its use.
This work was supported by a grant from The Netherlands Organization for Scientific Research (NWO-BMI project 050.50.206).