Journal of Bacteriology, January 2007, p. 377-387, Vol. 189, No. 2
0021-9193/07/$08.00+0 doi:10.1128/JB.00999-06
Copyright © 2007, American Society for Microbiology. All Rights Reserved.
CIRAD, UMR PVBMT, Saint Pierre, La Réunion F-97410, France,1 CNRS-INRA, UMR LIPM, Castanet Tolosan F-31326, France2
Received 7 July 2006/ Accepted 16 October 2006
| ABSTRACT |
|---|
| INTRODUCTION |
|---|
The use of comparative genome hybridization is particularly fruitful in the case of Ralstonia solanacearum to tentatively define the evolutionary scenario based on the distribution of the variable genes. R. solanacearum is a gram-negative β-proteobacterium that is the causative agent of bacterial wilt, one of the most severe and devastating vascular plant diseases in the world. This bacterium is characterized by a high level of phenotypic and genotypic diversity. Based on nucleotide sequence analysis of four genes, four monophyletic groups of strains, termed phylotypes, have been distinguished (13). These phylotypes correlate with the geographical origin of the strains: phylotype I includes strains originating primarily from Asia, phylotype II from America, phylotype III from Africa and surrounding islands in the Indian Ocean, and phylotype IV from Indonesia (13, 34). Studies of DNA-DNA hybridization have revealed that the identity between R. solanacearum genomes is often less than the 70% threshold level commonly expected within a bacterial species (31). This high genetic variation between isolates was used to define R. solanacearum as a "species complex," a term first used by Gillings and Fahy (18). Taghavi et al. (39) then expanded the concept of the R. solanacearum species complex by including two closely related species from Indonesia, Ralstonia syzygii (a pathogen from cloves) and the agent of blood disease of banana, as both of these organisms were found to fall within the phylotype IV of R. solanacearum as defined by 16S rRNA gene sequence analysis. However, the nature of the genes responsible for such divergence is still mostly unknown, as are the molecular mechanisms which generated such diversity, which is often associated with high variation of phenotypic traits.
One characteristic of R. solanacearum that could explain this high level of diversity is its ability to naturally develop a state of competence and to exchange genetic material by horizontal gene transfer during the infection process (2, 3). The acquisition of numerous genes by horizontal gene transfer was further supported by the complete genome sequence analysis of the R. solanacearum GMI1000 strain (36). This analysis revealed the mosaic structure of both the 3.7-megabase chromosome and the 2.1-megabase megaplasmid that constitute the bacterial genome. On these two replicons, genomic regions with a biased G+C composition and alternative codon usage regions (ACURs) are dispersed within regions of standard composition. These ACURs encompass 7% of the genome and are evenly distributed over the two replicons (36).
In the present study, using strain GMI1000 as the reference strain, we establish the repertoire of genes that constitute the core genome. Based on the distribution of the variable genes among a collection of strains representative of R. solanacearum biodiversity, we also propose a tentative scenario for evolution of this species, with special attention to pathogenicity determinants.
| MATERIALS AND METHODS |
|---|
DNA labeling and hybridization. Genomic DNA was extracted from fresh bacterial cultures as described by Chen and Kuo (7) and labeled with either Cy3 or Cy5 fluorescent dye (Amersham Biosciences) by using the BioPrime DNA labeling system kit (Invitrogen) according to the manufacturer's recommendations. For a 50-µl reaction mixture, 2 µg of genomic DNA in 23 µl of sterile water was heated at 95°C for 10 min, combined with 20 µl of 2.5× random primers solution, heated again at 95°C for 5 min, and chilled on ice. Remaining components were added to the following final concentrations: 0.12 mM dATP, dGTP, and dTTP; 0.06 mM dCTP; 0.02 mM Cy3- or Cy5-dCTP (Amersham Biosciences); 1 mM Tris-HCl (pH 8.0); 0.1 mM EDTA; and 40 units of Klenow fragment (Invitrogen). The solution was incubated at 37°C for 2 h before the reaction was stopped by adding EDTA (pH 8.0) to a final concentration of 45 mM. The fluorescence-labeled DNA was purified using the CyScribe GFX purification kit (Amersham Biosciences) and dissolved in 60 µl of elution buffer.
Hybridizations were carried out using a Lucidea automated slide processor (Amersham Pharmacia Biotech). Each experiment was run as a competitive hybridization by using Cy3-labeled DNA from one of the 18 strains of interest and Cy5-labeled DNA from GMI1000. No dye swapping was performed, since preliminary experiments had demonstrated that this had no significant impact on the final results. Microarrays were prehybridized for 1 h at 42°C in Dig Easy buffer (Roche) containing 385 µg ml⁻¹ of salmon sperm DNA. Hybridization was done for 15 h under the same conditions after 1 µg each of Cy3- and Cy5-labeled DNA were added. Following hybridization, microarrays were washed in 1× SSC (0.15 M NaCl plus 0.015 M sodium citrate)-0.1% sodium dodecyl sulfate for 5 min at 60°C and then in 0.1× SSC for 5 min at room temperature, dried at 37°C for 5 min, and then washed by immersion in isopropanol and dried again at 37°C for 5 min. Hybridizations were systematically duplicated.
Array scanning and analysis. Hybridized microarrays were scanned using a GenePix 4000A dual-channel (635 nm and 532 nm) confocal laser scanner (Axon Instruments) with a resolution of 10 µm per pixel. The laser power was set at 100, and the photomultiplier voltage was adjusted to between 680 and 800 V according to the average intensity of the hybridization of each slide in order to optimize the dynamic range of measurements. Quantification of the signals from individual arrays was done using ImaGene 5.6.1 software and analyzed using Genesight 3.5.2 software (BioDiscovery, Inc.). Empty spots and spots with impurities, high local background fluorescence, or weak intensity compared to the signal observed for hybridization to the negative controls were excluded from analysis. For each spot, the ratio of the hybridization signal of the tested strain to that of the reference strain GMI1000 was calculated and log2 transformed, and the values thus obtained were normalized by subtracting the mean log2 ratio value calculated on a set of 1,144 conserved genes in R. solanacearum strains. These conserved genes were identified from Blastp results between the amino acid sequence of each individual gene from the reference strain GMI1000 and the genome draft of the phylogenetically distant strain IPO1609. Conserved genes were selected as having a Blastp hit covering 100% of the query with at least 90% identity. Finally the average log2 ratio of the four spots representing each gene (two slides with two spots for each gene) was calculated and used for further interpretations. Lists of the GMI1000 genes that are conserved in each tested strain were established by selecting the genes for which the average log2 ratio value thus calculated was above −2 (in other words, by excluding the genes for which the hybridization signal of the tested strain was at least four times weaker than the hybridization signal with the reference strain GMI1000).
This cutoff value was chosen based on empirical optimization (see Results).
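The normalization and presence call described above can be sketched as follows. This is only an illustration, not the ImaGene/Genesight workflow used in the study: the function names, the toy signal values, and the data layout (one dictionary of normalized ratios per spot) are assumptions, and the cutoff is taken as a log2 ratio of −2, i.e., the fourfold-weaker signal mentioned in the text.

```python
import math
from statistics import mean

CUTOFF = -2.0  # assumed: log2 ratio corresponding to a fourfold-weaker test signal


def normalized_log2_ratios(test_signal, ref_signal, conserved_genes):
    """Return {gene: log2(test/ref)} centered on the mean of the conserved genes."""
    raw = {g: math.log2(test_signal[g] / ref_signal[g]) for g in test_signal}
    offset = mean(raw[g] for g in conserved_genes if g in raw)
    return {g: value - offset for g, value in raw.items()}


def call_present(spot_ratios, cutoff=CUTOFF):
    """Average the normalized ratios of all replicate spots and apply the cutoff."""
    genes = spot_ratios[0].keys()
    return {g: mean(spots[g] for spots in spot_ratios) > cutoff for g in genes}


# Toy data (hypothetical): the test channel is globally twofold weaker,
# and geneB is additionally tenfold weaker than the conserved genes.
ref = {"geneA": 1000.0, "geneB": 1000.0, "geneC": 2000.0}
test = {"geneA": 500.0, "geneB": 50.0, "geneC": 1000.0}

# Four spots per gene (two slides, two spots each); identical here for brevity.
spots = [normalized_log2_ratios(test, ref, ["geneA", "geneC"])] * 4
calls = call_present(spots)
```

Centering on the conserved genes removes the global twofold difference between channels, so only geneB falls below the cutoff in this toy case.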
Hierarchical clustering. Hierarchical clustering was performed with the final data set consisting of three different values (0, absent; 1, present; or ?, missing data), using Genesight 3.5.2 software (BioDiscovery, Inc.). The Ward technique was used for cluster linkage and the Euclidean method for the distance metric. Phylogenetic trees were built using DARwin 4.0.290 software (32). Genetic distances were calculated based on the Sokal-Michener index: D(i,j) = u/(m + u), where m is the number of genes with the same status (present or absent) in strains i and j and u is the number of genes with different status in strains i and j. The distance matrix thus generated was used to build unweighted neighbor-joining trees, and 1,000 bootstrap replicates were performed.
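As an illustration of the distance D(i,j) = u/(m + u), the following minimal sketch computes it from presence/absence profiles. The function names and the toy profiles are hypothetical, and skipping positions with missing data (coded here as None) is an assumption; the text does not state how DARwin treated them.

```python
def sokal_michener_distance(profile_i, profile_j):
    """D(i,j) = u / (m + u) over genes scored in both strains."""
    m = u = 0
    for a, b in zip(profile_i, profile_j):
        if a is None or b is None:  # missing data: position ignored (assumption)
            continue
        if a == b:
            m += 1  # same status (both present or both absent)
        else:
            u += 1  # different status
    return u / (m + u) if (m + u) else float("nan")


def distance_matrix(profiles):
    """Pairwise distances for a {strain: profile} mapping."""
    names = list(profiles)
    return {(i, j): sokal_michener_distance(profiles[i], profiles[j])
            for i in names for j in names}


# Toy profiles (hypothetical): 1 = present, 0 = absent, None = missing.
profiles = {
    "strainX": [1, 1, 0, 0, None],
    "strainY": [1, 0, 0, 1, 1],
}
matrix = distance_matrix(profiles)
```

In this toy case two genes agree and two differ among the four scored positions, giving a distance of 0.5.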
In silico genome comparisons. In the comparisons of the R. solanacearum proteome (or proteome subsets) with other proteomes, an R. solanacearum gene was considered to have an ortholog in a test genome when the two proteins were best reciprocal hits in Blastp alignments, with an expected value below 10⁻⁶ and an alignment covering at least 50% of the length of both proteins.
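The best-reciprocal-hit criterion can be sketched as below. This is a minimal illustration under stated assumptions: hits are represented as dictionaries with hypothetical keys (query, subject, evalue, bitscore, query_cov, subject_cov), the best hit is chosen by bit score, and the thresholds are applied before the best hit is selected. None of these details are specified in the text.

```python
def best_hits(hits, max_evalue=1e-6, min_coverage=0.5):
    """Map each query to its best-scoring subject among hits passing the thresholds."""
    best = {}
    for h in hits:
        if h["evalue"] >= max_evalue:
            continue
        if h["query_cov"] < min_coverage or h["subject_cov"] < min_coverage:
            continue
        current = best.get(h["query"])
        if current is None or h["bitscore"] > current["bitscore"]:
            best[h["query"]] = h
    return {query: h["subject"] for query, h in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return the (query, subject) pairs that are each other's best hit."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {(q, s) for q, s in forward.items() if reverse.get(s) == q}


def hit(query, subject, evalue, bitscore, query_cov, subject_cov):
    """Small helper to build a toy hit record (hypothetical layout)."""
    return dict(query=query, subject=subject, evalue=evalue, bitscore=bitscore,
                query_cov=query_cov, subject_cov=subject_cov)


# Toy hits (hypothetical identifiers): r1 <-> t1 is reciprocal; r2 -> t1 is not.
forward = [hit("r1", "t1", 1e-50, 200, 0.9, 0.9),
           hit("r2", "t1", 1e-20, 90, 0.8, 0.7),
           hit("r3", "t3", 1e-3, 30, 0.9, 0.9)]   # fails the E-value threshold
reverse = [hit("t1", "r1", 1e-50, 200, 0.9, 0.9),
           hit("t3", "r3", 1e-3, 30, 0.9, 0.9)]
orthologs = reciprocal_best_hits(forward, reverse)
```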
Microarray data accession numbers. All primary data from microarray experiments as well as experimental protocols used are available from the ArrayExpress depository (accession numbers A-MEXP-152 and E-MEXP-851 at http://www.ebi.ac.uk/arrayexpress/).
| RESULTS |
|---|
In the first step, we determined the list of genes that are conserved in the two strains. This was based on Blastp comparison of the amino acid sequence of each gene from strain GMI1000 with the genome sequence of IPO1609. A list of 2,963 conserved genes was defined, for which the Blastp hit covered at least 80% of the length of the query sequence with at least 80% identity at the amino acid level between the two strains. We also established a list of 488 genes from GMI1000 that are absent from IPO1609, for which the corresponding best Blastp hit covered less than 2% of the query sequence.
In the second step, we established a list of the oligonucleotides designed from strain GMI1000 that share identity with strain IPO1609. This was conducted based on Blastn comparison of the sequence of each oligonucleotide with the genome draft of IPO1609. The score for the best hit of each oligonucleotide was defined as the sum of a +1 value for each base match and a −1 value for each mismatch. This identified a list of 3,463 oligonucleotides having a minimal score of 45 (ranging from 84% identity over the entire oligonucleotide length to 92% identity over 53 consecutive base pairs) that were considered to be sufficiently conserved to give a positive signal when the microarray was hybridized with genomic DNA from strain IPO1609.
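The scoring rule (+1 per match, −1 per mismatch, minimal score of 45) can be illustrated with a short sketch. It assumes an ungapped alignment supplied as two equal-length strings; the function names and the toy sequences are hypothetical and do not reproduce the Blastn post-processing actually used.

```python
MIN_SCORE = 45  # minimal score reported in the text


def alignment_score(oligo_segment, target_segment):
    """Sum +1 for each matching base and -1 for each mismatch (ungapped)."""
    if len(oligo_segment) != len(target_segment):
        raise ValueError("aligned segments must have the same length")
    return sum(1 if a == b else -1
               for a, b in zip(oligo_segment, target_segment))


def is_conserved(oligo_segment, target_segment, min_score=MIN_SCORE):
    """True when the best-hit score reaches the minimal score."""
    return alignment_score(oligo_segment, target_segment) >= min_score


# Toy example: a 53-base aligned stretch with 4 mismatches
# (49 matches, about 92% identity) scores 49 - 4 = 45.
oligo = "A" * 53
target = "C" * 4 + "A" * 49
score = alignment_score(oligo, target)
```

This toy case corresponds to the lower bound quoted in the text: about 92% identity over 53 consecutive base pairs just reaches the score of 45.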
In the last step, we determined the intersections between the different lists thus generated. The list of the 3,463 oligonucleotides sharing identity with IPO1609 overlaps with 2,828 genes out of the 2,963 highly conserved genes, indicating that the oligonucleotides present on the microarray are potentially suitable for the detection of over 95% of the orthologous genes. The list of conserved oligonucleotides also overlaps with two genes out of the 488 GMI1000 genes known to be absent from strain IPO1609, therefore providing an estimation of 0.4% for the frequency of oligonucleotides that could lead to potential false-positive detection of a gene. The remaining 633 oligonucleotides correspond to genes that are present in strain IPO1609 although they are more divergent.
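The figures quoted in this step follow directly from the list sizes. The short calculation below only restates the numbers given in the text; the variable names are illustrative, and the last line assumes that each conserved oligonucleotide maps to exactly one gene.

```python
conserved_genes = 2963        # genes conserved between GMI1000 and IPO1609
conserved_with_oligo = 2828   # of these, genes matched by a conserved oligonucleotide
absent_genes = 488            # GMI1000 genes absent from IPO1609
absent_with_oligo = 2         # of these, genes matched by a conserved oligonucleotide
conserved_oligos = 3463       # oligonucleotides with a score of at least 45

representativity = conserved_with_oligo / conserved_genes   # about 0.954
false_positive_rate = absent_with_oligo / absent_genes      # about 0.004

# Oligonucleotides matching neither list (more divergent but present genes),
# assuming one oligonucleotide per gene.
remaining_oligos = conserved_oligos - conserved_with_oligo - absent_with_oligo
```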
In conclusion, with 95% representativity and 0.4% lack of specificity, the GMI1000 microarray is well suited to investigate the distribution of the GMI1000 orthologous genes in a distant R. solanacearum strain such as IPO1609.
Calibration of the CGH methodology used in this study. Investigation of the presence of a specific gene in a test strain by using hybridization on microarrays is based on the comparison of the intensity of the hybridization signal obtained with the genomic DNA of the tested strain to that of the hybridization signal obtained with the genomic DNA of the reference strain. For this analysis it is thus essential to define a relative cutoff value under which a particular gene will be classified as absent (or sufficiently divergent). Based on genome sequence comparisons, absolute lists of conserved and absent genes can easily be established and compared to experimental lists drawn from hybridization experiments using different cutoff values in order to maximize the detection of conserved genes while maintaining an acceptable number of false positives. For that purpose, we performed comparative genomic hybridization with genomic DNAs of strains GMI1000 and IPO1609. The lists of "detected" and "nondetected" genes in strain IPO1609 were established based on three different cutoff values for the log2 ratio of hybridization signals (−1.5, −2, and −2.5). Each list of "detected" genes thus obtained was compared with the list of the 2,828 "conserved" genes previously identified based on in silico analysis and properly represented by an oligonucleotide. Results of these comparisons are shown in Table 2. Similarly, the proportion of false positives (genes that are not present in IPO1609 but that give a positive hybridization signal with IPO1609) was estimated by comparing the lists of the "detected" genes with the list of "absent" genes identified based on Blastp analysis. Results of these comparisons are also shown in Table 2. Together, these comparisons led to the choice of the cutoff value of −2 to be used in further experiments, since this value provides 95% detection for conserved genes while the proportion of "false-present" genes remains close to 5%.
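The calibration logic can be sketched as a comparison of each candidate cutoff against the in silico reference lists. The sketch below is an assumption-laden illustration: the function name and the toy ratios are hypothetical, the candidate cutoffs are taken as negative log2 ratios, and the false-present proportion is computed relative to the list of absent genes, a denominator the text does not specify.

```python
def evaluate_cutoff(ratios, conserved, absent, cutoff):
    """Return (fraction of conserved genes detected, fraction of absent genes detected)."""
    detected = {gene for gene, ratio in ratios.items() if ratio > cutoff}
    detection_rate = len(detected & conserved) / len(conserved)
    false_present_rate = len(detected & absent) / len(absent)
    return detection_rate, false_present_rate


# Toy normalized log2 ratios (hypothetical values).
ratios = {"c1": 0.1, "c2": -1.7, "c3": -2.2, "c4": -0.3,
          "a1": -3.5, "a2": -1.8, "a3": -4.0, "a4": -2.4}
conserved = {"c1", "c2", "c3", "c4"}   # present according to in silico analysis
absent = {"a1", "a2", "a3", "a4"}      # absent according to in silico analysis

results = {cutoff: evaluate_cutoff(ratios, conserved, absent, cutoff)
           for cutoff in (-1.5, -2.0, -2.5)}
```

A more permissive (more negative) cutoff detects more of the conserved genes but also more of the absent ones, which is the trade-off the chosen value is meant to balance.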
γ-proteobacterium Escherichia coli K-12 (accession no. NC_000913), and in three plant pathogens representative of the major groups of plant-pathogenic gram-negative bacteria, i.e., Erwinia carotovora (accession no. NC_004547), Pseudomonas syringae (accession no. NC_004632, NC_4633, and NC_84578), and Xanthomonas campestris (accession no. NC_7086). This comparison identified a first set of 677 genes that are conserved in the eight organisms (see Table S2A in the supplemental material). The vast majority of these genes encode basic cell constituents, machineries, and metabolic pathways and therefore correspond to essential housekeeping genes. Among the remaining ones, 809 (referred to as β specific) were found only in the β-proteobacteria (see Table S2B in the supplemental material). Close to 60% of the β-specific genes code for uncharacterized regulators, for transporters, or for proteins of unknown functions and are therefore most likely involved in adaptation of bacteria to specific ecological niches; 202 of them are conserved in the three β-proteobacteria tested. With regard to pathogenicity, 152 genes from the R. solanacearum core genome have an ortholog in at least one of the other plant pathogens and are absent from E. coli and R. eutropha (see Table S2C in the supplemental material). These 152 genes include a set of established pathogenicity determinants such as those encoding constituents of the type III secretion machinery, a plant cell wall-degrading enzyme, and 10 genes known to be under the control of the hrpB or hrpG pathogenicity regulon (27, 41).
Variable genome definition and analysis. A total of 2,338 genes representing 46% of the GMI1000 genome are absent (or too divergent to be detected) in at least one of the tested R. solanacearum strains (see Table S3 in the supplemental material). This corresponds to an approximation of the set of variable genes that are present in any particular strain from this species. Among these genes, 95% of the genes encoding elements of external origin (genes of class V) and 94% of the ACUR genes detected in the GMI1000 genome are represented (Fig. 2). About 30% of the genes predicted to encode proteins that fall into the functional categories I to IV were also classified in the variable genome. A large proportion (55%) of the variable genes encode hypothetical proteins (genes of class VI) (Fig. 2).
In fact, variations in the distribution of known or suspected pathogenicity determinants were observed for two classes of genes: those encoding several hemagglutinin-related proteins (a class of surface proteins reported to be important for adhesion to plant surfaces) (35) and those encoding type III secretion system (TTSS) effectors. For example, the distribution of several hemagglutinin-encoding genes (RSc0887, RSc3188, RSp1073, and RSp1545) appears to be variable even in phylotype I strains, which are closely related to GMI1000. Our analysis also clearly reveals the existence of either an important variation in type III effector gene content or gene sequence divergence among taxonomically close strains belonging to the same phylotype. This is illustrated in Table 3 for the five strains from phylotype III tested in this study, but it is also observed between strains grouped in the other phylotypes. Table 3 shows that at least 35 effector genes out of the 80 candidates described in the species (10, 25, 27) show a variable distribution pattern in only five taxonomically related strains originating from the same geographical area. Interestingly, only 9 out of these 35 effector genes appear to belong to ACURs, thus suggesting that a significant part of the effector set may be commonly subjected to acquisition/deletion events or may be fast-evolving genes with high intragenic sequence divergence.
-proteobacteria revealed that an ortholog of one of them, an AvrE/DspA family member (RSp1281), can be found in all these species. This ubiquitous TTSS effector has been shown to be critical for pathogenicity in Erwinia amylovora and Pseudomonas syringae (4, 11).
Variable gene distribution correlates with R. solanacearum phylogeny. Analysis of the distribution of the GMI1000 genes among the other strains included in our experiments demonstrates that CFBP2968 is the most similar to GMI1000, with 98% of GMI1000 genes conserved (Fig. 1). CFBP2968 belongs to phylotype I as does GMI1000. Consistently, the two other test strains from phylotype I, MAFF211266 and PSS190, are also the most similar to GMI1000, with about 90% of GMI1000 genes conserved. Strains from phylotype II are found to be the most distant from GMI1000, with only about 70% of GMI1000 genes conserved. Finally, in R. syzygii and R. pickettii, 69% and 46% of GMI1000 genes are conserved, respectively. To further evaluate the relationships among all 19 strains, we performed a hierarchical clustering based on genes that were detected in each strain (Fig. 1). Surprisingly, this clustering is fully consistent with the classification into four phylotypes previously established based on nucleotide sequence analysis of the internal transcribed spacer region and of the hrpB, mutS, and eglA genes (13, 33). Four clusters are distinguished, corresponding to the four phylotypes. R. pickettii is located outside of the R. solanacearum group, and, as previously reported, R. syzygii appears within phylotype IV (Fig. 1). The same data set was further used to build neighbor-joining trees and to calculate bootstrap values. The same consistency is observed between trees based upon analysis either of partial mutS gene sequences or of the presence/absence of variable genes (Fig. 5a and b).
| DISCUSSION |
|---|
High genomic variability within the species. We found that only 2,690 (53%) of the 5,074 GMI1000 genes spotted on the array yielded a positive signal for the 17 R. solanacearum strains examined and thus represent the core genetic content of the species. This percentage is close to that obtained for Salmonella enterica, in which the core genetic content represented 54% of the 4,169 open reading frames of the reference genome when 24 strains were examined (6). However, this proportion appears rather low compared to the 93% obtained for the opportunistic pathogen P. aeruginosa, which has a similar genome size (5,549 open reading frames) and a similar ability to thrive in a broad range of environments (42). The other GMI1000 genes (2,338 genes) represent part of the variable genes within the R. solanacearum species. This is of course an underrepresentation of the overall repertoire of potential variable genes that can be found in this species, and the number of genes in this class will increase with the sequencing of additional strains (15; our unpublished data).
It is interesting to note the existence of a bias in the distribution of core genes on the two GMI1000 replicons, with a clear overrepresentation of these genes on the chromosome. This observation, together with the strong bias for housekeeping genes among core genes, supports our previous hypothesis that the chromosome is the ancestral replicon (36).
A large proportion of variable genes is organized in genomic islands which are dispersed over the two replicons, a situation that confirms the mosaic structure of the GMI1000 genome (17, 22, 36). We could distinguish two types of genomic islands of variable genes. The first type, the most frequent, includes genomic islands that are often flanked with mobile genetic elements and that either (i) have a GC content of 55% or less with no counterparts in the core genome or (ii) are included in ACURs. These genomic islands could originate from acquisition of foreign genes through lateral gene transfers. The second type corresponds to genes that do not significantly differ from the core genes in terms of base composition. These second blocks of variable genes either could be ancestral genes that were simultaneously lost by deletion from a particular phylum during evolution or could originate from acquisition by horizontal gene transfers.
Gene distribution and phylogeny. A major conclusion of the present study is the demonstration of congruence between the pattern of distribution of variable genes and the phylogenetic position of strains previously established (33) based on the nucleotide sequences of four markers that the present study identifies as belonging to the core genome. This is also true for the R. syzygii strain included in this study, which is classified in phylotype IV, therefore confirming the close relationship between the two species (13).
The present data also demonstrate the congruence of the two phylogenetic trees independently constructed based on distribution of variable genes on the chromosome and the megaplasmid. In addition to the presence of essential genes on the megaplasmid and a similar average codon usage with the rest of the genome, as previously established (36), this result strongly supports our previous hypothesis of a long coevolution for these two replicons.
The same phylogeny is found when the clustering is restricted to the distribution of variable genes encoded within ACURs. This was rather surprising considering that many of these genes were probably acquired through lateral gene transfers and might be expected to destructure populations. The fact that such a destructuring is not observed is an indication that these genes must have been acquired by ancestral strains and were then transmitted vertically within phylotypes. The same situation has been observed in γ-proteobacteria by assessing the history of every gene family. It has been shown that gene acquisition is a major factor contributing to genomic diversity of these bacteria, but paradoxically, once acquired, these genes are rarely transferred among lineages (23).
In contrast, the lack of congruence between the distribution of prophages/insertion sequences and phylogeny is an indication that these genetic elements are still active and that they probably still spread horizontally within populations.
Evolution of pathogenicity determinants. Results from our study indicate that a large majority of the genes encoding pathogenicity functions are part of the core genome, a status that is in agreement with their base composition and codon usage, which fit the general pattern of characteristics of the species (36). This strongly suggests that pathogenicity is an ancestral trait in R. solanacearum. Besides the basal set of core pathogenicity genes, two sets of candidate pathogenicity genes are variable from strain to strain. These correspond to genes encoding hemagglutinin-related proteins and a subclass of TTSS-dependent effectors. Both groups appear to constitute a dynamic population of genes which are predominantly heterogeneously distributed in the species and/or subjected to diversifying selection, resulting in sufficient sequence divergence to avoid detection through microarray hybridization. Interestingly, the phylogenetic analysis based on the distribution of the TTSS effector genes revealed a remarkable degree of congruence with the rest of the genome. This distribution suggests that these genes might have two different origins: either (i) they are ancestral or ancestrally acquired pathogenicity determinants that follow the same evolution pattern as other genes or (ii) they were independently acquired in the different phylotypes during evolution and were never exchanged between phylotypes.
Our analysis suggests that members of the filamentous hemagglutinin family of adhesins and TTSS effector pools represent prime candidates for identifying determinants controlling host specificity, since most of the phyla in the R. solanacearum species can be distinguished on the basis of that trait, varying from having relatively narrow host ranges (race 2 and 3 strains) to a wide range of hosts which can overlap several botanical families (race 1 strains). TTSS effectors identified as "avirulence" factors are known to restrict the host ranges of plant pathogens (1, 24), and it is also plausible that the collective effect of certain TTSS effector genes could lead them to overcome the plant defense reactions on a given host(s). Hemagglutinin-related proteins could also account for host specificity, since bacterial attachment to host tissues by adhesins is a first step in pathogenesis and it is conceivable that a certain degree of specificity may exist between various adhesins and different host cell surface structures. However, the identification of host specificity factors is hampered by three limitations: (i) the number of strains tested is not yet sufficient and their host specificity is not yet well enough defined to establish robust correlations between presence/absence of candidate genes and host specificity, (ii) the in silico versus microarray hybridization comparison of absent/present genes in strains GMI1000 and IPO1609 provided evidence that some oligonucleotides are not suited for detection of orthologous genes in other strains, and (iii) this approach enables detection only of genes that are present in the reference strain GMI1000. The completion of the R. solanacearum microarray with oligonucleotides representative of the novel gene sequences identified in the recently sequenced race 3 strains IPO1609 and UW551 (15; our unpublished data) will therefore improve such analyses.
| ACKNOWLEDGMENTS |
|---|
We express our thanks to Xavier Nesme, who participated in the phylogenetic analysis of our microarray data, and to Lionel Gagnevin for phylogenetic tree construction. We also acknowledge Jerome Gouzy for providing help in the preparation of the figures and Nemo Peeters and Vincent Daubin for helpful comments.
| FOOTNOTES |
|---|
Published ahead of print on 3 November 2006.
Supplemental material for this article may be found at http://jb.asm.org/.
| REFERENCES |
|---|