Journal of Bacteriology, April 2006, p. 2761-2773, Vol. 188, No. 8
0021-9193/06/$08.00+0 doi:10.1128/JB.188.8.2761-2773.2006
Copyright © 2006, American Society for Microbiology. All Rights Reserved.
M. Leena Priya,2,
A. Tamil Selvan,2
Martin Madera,3
Julian Gough,4
L. Aravind,1 and
K. Sankaran2*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894,1 Centre for Biotechnology, Anna University, Chennai 600025, India,2 MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom,3 RIKEN Genomic Sciences Centre, W121 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan4
Received 10 May 2005/ Accepted 28 October 2005
Bacteria, the major class among prokaryotes, possess an interesting N-terminal lipid modification, N-acyl-S-diacylglyceryl-Cys (Fig. 1A), which is unique to and ubiquitous among its known members. More than 2,000 such proteins have been identified to date. Three fatty acyl groups at the N terminus, which are derived from bacterial phospholipids, provide tight anchorage to the membrane surface, allowing the rest of the protein to perform relevant biochemical functions in the aqueous phase or at the aqueous-membrane interface. Since its discovery in 1969 (5) in a major outer membrane protein of Escherichia coli called Braun's lipoprotein (named after the discoverer), the same modification has been seen in different proteins in a variety of bacteria. The primary structural features required for this modification and the biosynthetic pathway containing three enzymes (the first enzyme in the pathway attaches the diacylglyceryl group from phosphatidylglycerol to the thiol of Cys, the first amino acid after the signal peptide; the second enzyme cleaves off the signal peptide after the initial lipid modification; and the third enzyme acylates the N-terminal amino group with a fatty acid from any available phospholipid) have been elucidated since then (16, 26, 31, 46, 47, 59, 60).
FIG. 1. (A) The structure of the lipid modification in lipoproteins. The sulfhydryl group of N-terminal cysteine is modified with a diacylglyceryl group attached through a thioether linkage, and the amino group is acylated with a fatty acid. (B) Tripartite structure of the lipoprotein signal sequence. The n-region is made up of five to seven residues and has at least two positively charged residues; the h-region, or the hydrophobic region, is made up of 7 to 22 predominantly hydrophobic and uncharged residues; and the c-region, which has the consensus [LVI][ASTVI][GAS] sequence, along with C, the invariant lipid-modified N-terminal residue in all bacterial lipoproteins, is referred to as the lipobox.
There is in fact a renewed interest in lipoproteins from the point of view of their roles in bacterial pathogenesis, as these lipid-modified proteins play a variety of roles in host-pathogen interactions, which necessarily take place at the solid-aqueous interface, from surface adhesion to translocation of virulence factors into the host cytoplasm. Those aiding pathogenesis include PsaA in Streptococcus pneumoniae (4); MxiM, a lipoprotein of the type III secretory pathway in Shigella flexneri important for translocation of invasins (48); MAA1 of Mycoplasma arthritidis, required for adherence to joint tissues early in the infectious process (62); and a gamut of surface lipoproteins specifically expressed by mycoplasmas upon infection (45). Recently, an lsp mutant of Listeria monocytogenes was found to be ineffective in phagosomal escape of bacteria during infection (44). Those that help to activate the inflammatory response or evade host defense include lipoproteins released from Enterobacteriaceae that induce cytokine production in the macrophage (66); a 19-kDa lipoprotein of Mycobacteria that elicits antibody and T-cell responses in humans and mice and induces an innate immune response in dendritic cells and neutrophils (40, 56); LipL41, a surface-exposed lipoprotein of pathogenic Leptospira species (52); and LpK, a lipoprotein from Mycobacterium leprae that induces human interleukin 12 (36). Owing to the above roles in bacterial pathogenesis, lipoproteins are also attractive candidates in vaccine development. For example, Lpp20, a lipoprotein, is a vaccine candidate against Helicobacter pylori (30). In the case of Lyme disease, vaccines based on lipoproteins OspA and DbpA of the spirochete Borrelia burgdorferi have been demonstrated to be effective in several animal models (7, 14, 15, 22).
One of the initial focuses of bacterial lipoprotein study was to analyze the signal peptides of experimentally verified lipoproteins and derive primary structure determinants for posttranslational lipid modification. Limited sequence analysis of precursors of only 26 distinct lipoproteins by Hayashi and Wu (23) already indicated a characteristic four-amino-acid sequence at the C-terminal end of the signal peptide including the modifiable Cys. Appropriately this was called the "lipobox," and site-directed mutagenesis in the region further helped to define the roles of individual amino acids. Later, similar analysis of 75 lipoproteins by Braun and Wu (6) revealed the lipobox consensus sequence L[AS][GA]C. With more reports of experimentally verified lipoproteins, the roles and composition of the lipobox and the signal sequence features such as a stretch of positively charged n-region and uncharged h-region became more accurately defined (Fig. 1B). Accordingly, more robust predictive rules evolved to recognize lipoproteins from the amino acid sequences, mainly deduced from genomic sequences. The first such predictive rule was adapted by the Prosite pattern (PS00013), and later a refined one with better predictive capability was used in the maiden version of DOLOP, the first dedicated website for bacterial lipoproteins (34).
In the past few years there has been intensive bioinformatic analysis of bacterial lipoproteins and comparison of different predictive algorithms (3, 13, 18, 19, 28, 34, 53, 57). Predictive rules that work better for gram-positive bacterial lipoproteins were proposed as G+LPP (53), and recently a trained set of predictive rules was used and an algorithm called LipoP (28) was proposed to predict membrane proteins, lipoproteins, and cellular proteins by looking for signal sequence features. In the last year a detailed comparative analysis of DOLOP and other algorithms was carried out on experimentally verified lipoproteins from one model taxon, E. coli K-12, and a highly fine-tuned algorithm with the best predictive ability was proposed (19). As a result of all these efforts over the last decade, the number of known bacterial lipoproteins is expected to exceed several thousand, thanks to the reliable predictive rules that are now applied to identify lipoproteins.
One of the intriguing aspects in the biosynthesis of a lipoprotein is its targeting to either the inner or outer membrane. Initial sequence analysis of inner and outer membrane lipoproteins suggested a targeting role for Asp or Ser at the +2 position in the mature sequence (50, 64); Asp led to inner membrane localization, whereas Ser led to outer membrane localization. A series of recent elegant studies by Tokuda and coworkers have led to the identification of outer membrane localization (LOL) machinery for lipoproteins and the effect of amino acids in the vicinity of the modifiable Cys in the mature sequence on their recognition (37, 39, 54, 58, 63, 65). Accordingly, it was realized that Asp at the +2 position is not the sole inner membrane retention signal, and amino acid residues at the +3 and +4 positions were found to affect the membrane localization (55). The rules for membrane localization are not as straightforward to obtain by simple sequence comparison as those for lipid modification. However, a large database with experimentally verified data on localization could help.
Each bacterium has a common as well as a unique set of lipoproteins, whose numbers vary widely, and their proteomics would be interesting as well as challenging. To aid this study, we have introduced a new feature which provides domain assignments to identified lipoproteins in the updated version of DOLOP, and this paper is meant to (i) propose the refined lipoprotein identification algorithm based on a larger data set, (ii) highlight the updated list of genome-wide prediction of lipoproteins, and (iii) introduce readers to the new feature in the domain search, as it would give a better idea about the relatedness of various lipoproteins in terms of function between themselves and with nonlipoproteins. A case study, where integration of other external information such as gene expression data with information on predicted lipoproteins leads to the identification of differentially expressed lipoproteins under quorum-sensing conditions in Pseudomonas aeruginosa, will also be discussed.
Statistical analysis of the lipoprotein signal sequence. The first 45 amino acids from each of the 278 lipoprotein sequences were aligned using the T-Coffee multiple sequence alignment tool (41) to identify the consensus sequence. Additionally, in-house PERL scripts were written to calculate the various statistics such as the amino acid charge distribution in the n-region (Fig. 1), the length of the hydrophobic region, and the amino acid choices available in the lipobox sequence.
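The kind of per-position tally described above can be sketched in a few lines. This is only an illustration in Python, not the in-house PERL scripts used in the study; the function name `lipobox_position_frequencies` and the toy input are assumptions introduced here, and the input is taken to be four-residue lipoboxes that have already been extracted from an alignment.

```python
from collections import Counter

# Positions of the four lipobox residues relative to the signal peptidase II
# cleavage site: -3, -2, -1 and the lipid-modified Cys at +1.
POSITIONS = (-3, -2, -1, 1)


def lipobox_position_frequencies(lipoboxes):
    """Return {position: {residue: percent}} for a list of 4-residue lipoboxes.

    Hypothetical helper: each entry of `lipoboxes` is a string such as "LAGC".
    """
    counts = {pos: Counter() for pos in POSITIONS}
    for box in lipoboxes:
        if len(box) != 4:
            raise ValueError(f"expected a 4-residue lipobox, got {box!r}")
        for pos, residue in zip(POSITIONS, box.upper()):
            counts[pos][residue] += 1

    total = len(lipoboxes)
    if total == 0:
        return {pos: {} for pos in POSITIONS}
    return {
        pos: {res: 100.0 * n / total for res, n in counter.items()}
        for pos, counter in counts.items()
    }


# Toy input, not data from the study.
toy_frequencies = lipobox_position_frequencies(["LAGC", "LSGC", "VAAC", "LTSC"])
```

Applied to the 278 aligned signal sequences, a tally of this kind would yield the percentages reported for each lipobox position.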
Prediction of lipoproteins from completely sequenced bacterial genomes. The complete genome sequences of the 234 organisms listed in Table 1 were downloaded from the NCBI website. A PERL script incorporating the algorithm discussed in Results was developed to predict potential lipoproteins. The script also calculates the fraction of the genome encoding potential lipoproteins. It should be noted that the predicted list does not contain entries that have been predicted to be lipoproteins by the authors of the original study describing the genome sequence, for it can give rise to false positives. This is because the procedure used to assign function by the authors relies on sequence similarity of the mature sequence, and a protein which is lipid modified in one organism need not be modified in another organism. Thus, the predicted lipoproteins were identified purely based on the presence of the lipoprotein signal sequence as discussed above.
TABLE 1. Number of predicted lipoproteins from 234 completely sequenced bacterial genomes
With the advent of genomic study and discovery of new lipoproteins, a large-scale bioinformatics analysis to define the lipoprotein signal sequence was performed to obtain the 278 distinct clusters, where each cluster represents proteins with the same function (34). Our results corroborated the general observations made by previous investigators and also helped to define a more accurate lipoprotein signal sequence. Our studies show that the n-region contains five to seven residues with two positively charged Lys or Arg residues (Fig. 2A). The length of the h-region varies between 7 and 22 residues, with a modal value of 12 residues. The c-region has a consensus [LVI][ASTVI][GAS]C sequence. It is important to mention here that the PS00013 signature provided by Prosite (25) was one of the first available prediction algorithms to identify bacterial lipoproteins. However, the amino acid choices available at each position in the signature sequence are quite broad, thus resulting in a large number of false positives. The results of the statistical analysis of the lipobox are shown in Fig. 2B. The lipid-modifiable Cys (+1 position) is invariant. In about 70% of the cases, the -3 position is Leu (71%), followed by Val (9%) and Ile (6%). We also see A, F, G, C, and M in the -3 position, but at low frequencies (<5%); therefore, we do not include them in the algorithm. The -2 position is more flexible and can accommodate uncharged, polar, and nonpolar residues: Ala (30%), Ser (28%), Thr (12%), Val (10%), and Ile (8%). Again, we do find G, L, and M at low frequencies in this position, but we have not included these amino acids in the predictive algorithm. The -1 position is shared almost equally by Gly (45%) and Ala (39%); significantly, Ser has been observed in 16% of the cases.
FIG. 2. (A) Positive charge distribution in the n-region. This graph shows that most lipoproteins have at least two positively charged amino acids in their n-region. (B) Amino acid distribution in the lipobox. Leucine has the highest propensity to occur at the -3 position; alanine and serine at the -2 position; alanine, glycine, or serine at the -1 position; and the invariant cysteine that gets lipid modified at the +1 position. Please refer to the text for details.
A predictive algorithm based on these rules has been incorporated in the website http://www.mrc-lmb.cam.ac.uk/genomes/dolop/analysis.shtml to analyze a user-given query sequence and to pull out probable lipoproteins from completely or partially sequenced bacterial genomes.
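The rules stated above can be illustrated with a minimal sketch. This is not the DOLOP implementation: the function name `find_lipobox`, the way the n-region and h-region boundaries are tried, the strict exclusion of charged residues from the h-region, and the 40-residue search window (taken from the query length mentioned in the Discussion) are all assumptions made for illustration.

```python
import re

# Lipobox consensus from the text: [LVI][ASTVI][GAS] followed by the invariant Cys.
# A zero-width lookahead lets finditer report overlapping candidate positions.
LIPOBOX = re.compile(r"(?=[LVI][ASTVI][GAS]C)")
POSITIVE = set("KR")   # positively charged residues counted in the n-region
CHARGED = set("DEKR")  # assumption: the h-region must contain none of these


def find_lipobox(sequence, window=40):
    """Return the 1-based position of the candidate lipid-modified Cys, or None.

    Illustrative rules: an n-region of 5 to 7 residues with at least two
    Lys/Arg, an uncharged h-region of 7 to 22 residues, then the lipobox.
    """
    head = sequence.upper()[:window]
    for match in LIPOBOX.finditer(head):
        box_start = match.start()
        for n_len in (5, 6, 7):
            n_region = head[:n_len]
            h_region = head[n_len:box_start]
            if sum(res in POSITIVE for res in n_region) < 2:
                continue
            if not 7 <= len(h_region) <= 22:
                continue
            if any(res in CHARGED for res in h_region):
                continue
            return box_start + 4  # 0-based box start + 3 residues + 1 for 1-based
    return None


# N-terminal residues of the E. coli Braun's lipoprotein (Lpp) precursor,
# used here only as an illustrative query.
lpp_cys_position = find_lipobox("MKATKLVLGAVILGSTLLAGCSSNAKIDQ")
```

For the Lpp query, the sketch reports the Cys immediately after the 20-residue signal peptide, i.e., the first residue of the mature protein.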
Predicted lipoproteins in the completely sequenced bacterial genomes. In the past few years, the genomic data available have increased enormously, and therefore one of the major updates in DOLOP is the inclusion of a list of predicted lipoproteins from 234 genomes. Since other lipoprotein-predicting tools have also been made available in the literature, we have included a comparative analysis and provided the data in a tabular form (Table 1). There is generally a fair agreement in the number of predicted lipoproteins in a genome between our method and LipoP, with LipoP predicting 20% more in general (it should be noted that our algorithm is more conservative in predicting the lipoprotein signal sequence in comparison to the Prosite pattern or LipoP). For genomes with more than 1,000 open reading frames (ORFs), it was interesting to note that the number of predicted lipoproteins varied enormously between the various bacteria: from as many as 223 lipoproteins for Bacteroides thetaiotaomicron VPI-5482 to as few as 8 to 9 in the case of Aquifex aeolicus VF5 and Prochlorococcus marinus subsp. pastoris CCMP 1378. In the case of smaller genomes, two species of Buchnera had no predicted lipoprotein and the third had only one. In others, the number varied from 2 to 180. The plot of the proteome size against the number of predicted lipoproteins revealed a weak linear correlation (Fig. 3). We also worked out another index of comparison, the percentage of the genome coding for lipoproteins, and found that there was no correlation between the proteome size and the fraction of the proteome consisting of lipoproteins. In fact, we observed that within the same genus, the fraction of the proteome consisting of lipoproteins was fairly conserved. For example, Mycoplasma penetrans showed the highest ratio of 5.79%, followed by Mycoplasma pneumoniae with 5.52%. The ratio of 4.67% is high in the case of Bacteroides, especially from the point of view of its large genome size (4,500 ORFs). 
For many, the ratio varied typically from 1 to 3%. In E. coli CFT073 and K-12, even though the former has about 1,000 additional genes compared to the latter, there were no additional lipoproteins. Both have 86 predicted lipoproteins. In the case of E. coli O157:H7 and O157:H7 EDL933, for the same genome size there were nine additional lipoproteins. Rhodopirellula baltica is one of the bigger genomes (7,325 ORFs) but contains only 46 lipoproteins.
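The percentage index used above is simple arithmetic, sketched here for one genome whose counts are given in the text. The function name `lipoprotein_fraction` is hypothetical and is not part of the DOLOP scripts.

```python
def lipoprotein_fraction(n_lipoproteins, n_orfs):
    """Percentage of ORFs in a genome that are predicted lipoproteins."""
    if n_orfs <= 0:
        raise ValueError("number of ORFs must be positive")
    return 100.0 * n_lipoproteins / n_orfs


# Rhodopirellula baltica: 46 predicted lipoproteins among 7,325 ORFs (from the text).
rhodopirellula_percent = round(lipoprotein_fraction(46, 7325), 2)  # 0.63
```

At roughly 0.6%, Rhodopirellula baltica falls below the typical 1 to 3% range quoted for most genomes.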
FIG. 3. Plot of the proteome size against the number of predicted lipoproteins for the 234 completely sequenced bacterial genomes used in our analysis. Note that there is a positive correlation between the genome size and the number of lipoproteins encoded. Organisms whose predicted number of lipoproteins falls well above or below the linear trend fitted for the observed data are marked on the graph. The large number of lipoproteins seen in Bacteroides corresponds, in large part, to a lineage-specific expansion of predicted lipoproteins with an N-terminal beta-propeller domain, which may form a specialized adhesion module. In Bdellovibrio, several lipoproteins appear to belong to an expansion of peptidases.
In the SCOP classification scheme, proteins are split into domains as minimum functional and evolutionary units, i.e., all domains are observed either on their own or in combination with more than one different partner. The superfamily level of classification groups domains for which there is structural, sequence, and/or functional evidence for a common evolutionary ancestor. The expertly built HMMs in the SUPERFAMILY library are able to detect remote homologies, and they assign known structural domains to half of the total lipoprotein sequence.
The information provided by this analysis reveals the composition of domains, which evolution has selected for use in lipoproteins, and the architectures show how these domain units have been shuffled and recombined to form the larger, more complicated multidomain proteins.
In the example shown in Fig. 4A, we show a predicted lipoprotein represented by its domain architecture as determined above. The individual domains, which go to make up the whole protein, are each independent units, which have been combined in this particular order during evolution, and selected for, to carry out the function of the complete protein. For this particular example shown, there are ten such proteins in the database, all with the same architecture, all in the set of "predicted" lipoproteins. This particular architecture is detected in every staphylococcal genome only once, which suggests that it could be an essential protein with a specific functional role.
FIG. 4. (A) Domain architecture for the protein gi 21284057 gb NP_647145.1 from Staphylococcus aureus MW2. This architecture contains two domains: a periplasmic metal-binding protein domain and a lipocalin fold metal-binding domain (in that order). In this case the assignments span the entire protein and provide a complete picture of the protein. (B) Screen shot of the output from PSATool. The program calculates molecular weight, amino acid frequency, composition, and weight composition and displays charge distribution and the nature of the sequence. This tool is available for predicted and verified lipoproteins.
To highlight how one can gain a better understanding about which lipoproteins are differentially expressed in bacteria during the different conditions, we performed the following calculation. (i) Using our method, we first identified the predicted list of Pseudomonas aeruginosa proteins that could potentially be lipid modified. (ii) Next, we identified up-regulated and down-regulated genes in P. aeruginosa under quorum-sensing conditions using the data set that was previously published (49). In their study, Schuster et al. obtained the set of differentially expressed genes under quorum-sensing conditions using microarrays. (iii) By integrating the above two lists of proteins, we predict that at least 10 lipoproteins are up-regulated preferentially under quorum-sensing conditions (Pseudomonas aeruginosa gene identifiers: PA1324, PA1664, PA1666, PA1745, PA1888, PA2414, PA3677, PA3692, PA4208, and PA4876). Since quorum sensing has been shown to be important for the formation of biofilms (10), and hence important during the course of infection in the case of Pseudomonas (8), studying these up-regulated lipoproteins can help us understand the process of biofilm formation much better, and it may eventually lead to a better understanding of the whole process of infection.
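The integration step described in (iii) amounts to intersecting two gene lists. The sketch below assumes both lists use the same gene identifiers; the function name `lipoproteins_by_regulation` and the `toy_gene_*` entries are hypothetical and stand in for the full predicted and microarray-derived lists, which are not reproduced here.

```python
def lipoproteins_by_regulation(predicted_lipoproteins, up_regulated, down_regulated):
    """Split predicted lipoproteins into up- and down-regulated subsets.

    All three arguments are iterables of gene identifiers (e.g., "PA1324").
    """
    predicted = set(predicted_lipoproteins)
    return {
        "up": sorted(predicted & set(up_regulated)),
        "down": sorted(predicted & set(down_regulated)),
    }


# Toy inputs: PA1324 and PA1664 are identifiers quoted in the text;
# the toy_gene_* entries are placeholders, not real loci.
predicted = ["PA1324", "PA1664", "toy_gene_A", "toy_gene_B"]
up = ["PA1324", "PA1664", "toy_gene_C"]
down = ["toy_gene_B", "toy_gene_D"]
result = lipoproteins_by_regulation(predicted, up, down)
```

With the real lists, the "up" entry would correspond to the lipoproteins predicted to be up-regulated under quorum-sensing conditions.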
Features of the database: genome-wide predicted lipoproteins are useful in proteomics. The number of current, characteristic lipoproteins has gone up from 199 in the previous version (34) to 278 in this version. Compared to the increase in the number of lipoproteins reported as well as predicted from the genome data, this increase in unique lipoproteins is not high. To make the database functionally relevant, these have been classified as in the previous version according to the information gained from the literature into antigens, adhesins, binding proteins, enzymes, transporters, toxins, surface proteins, interesting factors, and hypothetical. We performed several analyses, one of which was to refine the rule to predict which proteins can be lipid modified. Using this rule, we predicted potential lipoproteins for the 234 completely sequenced bacterial organisms, many of which are important pathogens. When we applied the current DOLOP prediction algorithm to the 81 experimentally verified lipoproteins from E. coli K-12, published by Gonnet et al. (19), 71 were predicted correctly (the number cited by the authors, however, is 51 even though 60 can be readily counted from the data provided in their table and another 11 are predicted correctly when we performed the analysis). Many of the 10 that were not predicted were missed because of the stringent cutoff we apply at the -2 and -3 positions to reduce false positives, as defined previously. Thus, inclusion of minor amino acids like M and A in these positions improved prediction to near 100%, except for one protein in which the lipobox was more internal (51 amino acids inside). The fact that it is an experimentally verified lipoprotein and that such internalized lipoboxes were found to be modified in the early investigations does suggest the relevance of increasing the length of the N-terminal sequence used for the query. But, for the sake of keeping the false positives low, we maintain it at 40 residues. 
The same analysis with a gram-positive database of experimentally verified lipoproteins reported by Juncker et al. predicted 26 out of 32, and by introducing M and A in the -3 and -2 positions, all were predicted correctly. With such refinements, the new predictive rule used in the current version of DOLOP should be able to predict to an extent comparable to that of the other available algorithms. Though taxon-specific algorithms are obviously the best way to approach prediction, they would require structural data from many lipoproteins belonging to individual taxa, which is a farfetched proposition and defeats the purpose of prediction. Therefore a reasonably accurate predictive algorithm, as presented here, that can handle sequence data from a variety of different bacteria is a good first-level bioinformatic tool.
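The recovery rates implied by these counts can be worked out directly. The short sketch below uses only the numbers quoted above; the function name `sensitivity` is an assumption introduced for illustration.

```python
def sensitivity(correctly_predicted, total_verified):
    """Fraction of experimentally verified lipoproteins that were predicted."""
    if total_verified <= 0:
        raise ValueError("total_verified must be positive")
    return correctly_predicted / total_verified


# Counts quoted in the text, before relaxing the -3 and -2 positions.
ecoli_k12_percent = round(100 * sensitivity(71, 81), 1)      # 87.7
gram_positive_percent = round(100 * sensitivity(26, 32), 1)  # 81.2
```

With the stringent lipobox rule, the algorithm therefore recovers roughly 88% of the verified E. coli K-12 set and roughly 81% of the gram-positive set.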
Our analysis shows that there are a large number of uncharacterized lipoproteins even in thoroughly studied bacterial systems. Our results on the comparison of genome size against the predicted number of lipoproteins show that there is a weak positive correlation, indicating that organisms have evolved their own set of lipoproteins to meet their needs. In the case of pathogenic variants, the number could be more or less, but their pathogenic association gives another dimension and a reason to look at them more carefully, as whatever cases have been characterized showed that they were essential for pathogenesis. As illustrated by an example in Results, using comparative proteomics in silico by integrating information about the predicted lipoproteins contained in DOLOP for an organism with other external data, such as gene expression by microarray analysis, one can come up with meaningful predictions. In this regard, the superfamily domain prediction would further aid in short-listing those activities related to the pathogenic aspect being studied.
Features of the database: domain predictions help in functional assignments. Though lipid modification of proteins is an essential function, not much is known about individual lipoproteins in bacteria in terms of biochemical functions, and their proteome is not adequately investigated. To enhance the utility of the database in terms of functional correlation, a link to the SUPERFAMILY structural domain assignment prediction tool has been provided for each predicted lipoprotein. Information about a protein domain directly provides clues about the actual molecular function and also helps in identifying functionally important residues involved in performing the function. Thus, this feature should help at the first level in obtaining useful information for a suspected biochemical function that may account for an observed phenotype or function or for planning mutation experiments to define the roles. For researchers interested in obtaining basic properties of the predicted lipoprotein, a link to PSAtool has also been provided, which provides information like molecular weight, amino acid composition, and charge distribution for a given sequence (Fig. 4B). This feature, we believe, will help experimental biologists in designing experiments to purify proteins of interest.
Extended structure-function relationship of lipoprotein signal sequences. Previous detailed site-directed mutagenesis studies of residues in the lipoprotein signal sequence have already led to the elucidation of the roles of individual regions as well as of the amino acids in the modification. The positive charge at the N-terminal region was found to be important in phospholipid-signal sequence interaction, leading to a complex that is important for the recognition and transport across the inner membrane of gram-negative bacteria (61). Replacement of Gly at the 14 position (inside the h-region) in the murein lipoprotein signal sequence with Asp, Glu, or Arg underlined the importance of the uncharged nature of the h-region (27). The -1 position tolerated Ala as well as Gly. Substitution by Ser slowed down lipid modification, and Thr set the limit (42). In this context, the presence of 16% of lipoproteins in our data set with Ser at the -1 position may be relevant to the homeostasis of lipid modification in bacteria. The -2 position is the most variable among the lipobox sequences. However, inclusion of charged residues in this region has resulted in deficient lipid modification. In certain mutation studies, it has been found that the unmodified prolipoprotein has been transported and even processed by signal peptidase I, which is specific for nonlipoprotein signal sequences (17). In certain instances wherein DOLOP has given false-positive results, a signal peptidase I cleavage sequence was found to lie in the vicinity of the lipobox. As pointed out earlier, the structural determinants required for inner and outer membrane targeting have not yet been fully understood, and it is firmly believed that such signals come from the mature sequence in the vicinity of the cleavage site. 
It is also quite possible that distant primary and secondary structure elements might have a role, as the transport across the two membranes in gram-negative bacteria requires protein machinery and additional protein-protein interactions between the machinery and the lipoprotein. The large set of lipoprotein signal sequences and the genome-wide mature sequence information available in DOLOP should provide a good data set for future analysis.
We see several ways in which our results can be helpful to experimental biologists for carrying out novel research and for prioritizing their experiments. A few instances where our results can be useful include (i) identification of lipoproteins unique to a particular strain; (ii) identification of lipoproteins present in a particular group of pathogens, or organisms which colonize the same ecological niche; (iii) designing microarray experiments focusing on lipoprotein gene expression during different stages of infection; (iv) rapid identification of lipoproteins from two-dimensional gel experiments and mass spectrometric studies; and (v) identification of novel virulence factors.
In conclusion, there is still a huge untapped potential and tremendous scope for analysis and characterization of lipoproteins, and we believe that the results presented here and the database with the various features will serve as useful resources for experimental biologists to address some important questions. In addition, we also offer the possibility for researchers to submit information about newly characterized lipoproteins to our database. This feature also allows researchers to exchange information with the scientific community.
We thank the anonymous referees for helpful comments.
These authors contributed equally.