Journal of Bacteriology, January 2009, p. 32-41, Vol. 191, No. 1
0021-9193/09/$08.00+0 doi:10.1128/JB.01084-08
Copyright © 2009, American Society for Microbiology. All Rights Reserved.

Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
Received 5 August 2008/ Accepted 1 October 2008
But how are we to discover functional novelty in the exponentially increasing amounts of sequenced genes and habitats (Fig. 1)? The naïve method, which is to search for homology to known molecules and mark everything else as novel, is prone to errors due to the existence of paralogous sequences, i.e., homologs with likely different functionalities, as well as paralogous domains within an otherwise homologous sequence that may lead to divergent function (17). To address these challenges, three major, nonexclusive concepts have been successfully used to establish functional similarity and, conversely, to identify functional novelty: (i) operons and conserved gene neighborhoods, (ii) protein domain architectures, and (iii) protein subfamilies. Operon and gene neighborhood methods assume that if multiple genes are adjacent on a chromosome or contig, they are more likely to participate in the same cellular function (19, 48). The neighborhood approach is especially suitable when homology-based methods fail to detect sequences below the threshold for similarity (30). Domain-based methods infer the functions of similar segments within otherwise different sequences and are currently utilized by curated databases of known domains (see, e.g., reference 18). This approach is useful for the analysis of multidomain proteins that evolve in a modular fashion such that each domain may have high sequence similarity to a different gene and its evolution cannot be traced by homology alone (55). Finally, subfamilies can be identified within a family of homologous sequences by abstracting the information from the family's multiple sequence alignment into a generalized statistical profile (e.g., using hidden Markov models or support vector machines) (11) and then searching for shared properties (e.g., amino acids and hydrophobicity). This technique has been successful at identifying novel biological function (2, 29, 36) and even novel species (13).
FIG. 1. Trends in the increase of genomic data and represented habitats. The number of sequenced ORFs continues to increase exponentially, accompanied by an increase in the number and complexity of represented habitats. In 1995, two sequenced organisms (Haemophilus influenzae and Mycoplasma pneumoniae) contributed just a few thousand genes to the public databases and represented a single habitat (organism associated). By 2008, well over 10 million genes from over 150 distinct habitats have been sequenced. Raw habitat and sequence data were collected from the Genomes Online database (43), and habitats were classified into the categories mentioned above using the Habitat-Lite terms of Environment Ontology (32). When an organism was reported to have multiple habitats, the primary one was used. "Extreme environment" corresponds to the Environment Ontology categories "hot spring," "hydrothermal vent," and "extreme environment." "Sediment/sludge" corresponds to the Environment Ontology categories "sediment," "sludge," and "biofilm." Note that (i) dates reported for each sequence are publication dates, even if the genome was released to the public earlier in database form, and (ii) the numbers for 2008 represent the available data until June 2008.
In addition to problems with the size and nature of metagenomic data, computational tools must be adapted for reproducibly handling gigabytes or terabytes of data, leading to constraints in memory, central processing unit (CPU), and network bandwidth at every level of analysis. Tools must be adapted to process, filter, assemble, and align the sequence data; identify genes; annotate the genes with function; map genes or sequences to taxonomy; estimate species evenness and richness; construct phylogenetic trees; perform multivariate analyses against ecological metrics; build and validate population or metabolic models where time series data are available; and visualize the results. Even as computational biologists adapt standard tools to complete these tasks, mathematicians and statisticians must rigorously reassess the suitability of different methods for large data sets, identify sources of analytical and numerical errors, and revise estimates of sensitivity and specificity.
Even if novel techniques such as single-cell sequencing reduce some of the above-described problems in the future, certain challenges will remain unless our entire planet has been genetically explored in sufficient depth. This is because the remaining challenges are conceptual by nature. For example, the identification of orthology is already extremely difficult with complete genomes in hand due to chromosomal inversions, gene fusions, alternative splicing, retrotranscription, and a variety of genetic processes that dilute the necessary information. This genetic uncertainty is matched by a functional one: because the term "function" remains in use with an operational rather than absolute definition (8, 9, 30), annotation processes will remain of insufficient depth for quite some time.
Despite these limitations, the benefits of "bioprospecting" for natural and naturally derived products are considerable, with potential to cure genetic and infectious diseases, arrest environmental destruction, and offset global energy shortages. Here, we hope to raise awareness of the potentials and pitfalls of using environmental sequence data to discover novelty and illustrate the promise of our methods to discover novelty in light-mediated microbial pathways functioning in sensing, repair, and adaptation.
Calculating homologs and protein abundances in metagenomes. Metagenome sequence data from five metagenomes (6,109,937 ORFs from surface seawater from the Global Ocean Survey [70] including the Sargasso Sea [64], 46,771 ORFs from northern California acidic mine drainage [63], 121,927 ORFs from deep-sea Pacific whalefall [62], 183,159 ORFs from Minnesota farm soil [62], and 135,756 ORFs from a Mexican hypersaline microbial mat [39]) were BLASTed against the entire set of 1,510,991 proteins (representing 373 sequenced organisms) in the STRING 7.0 database (67) using wu-blastall with the following parameters: -a 1 -p blastp -mformat 2 -filter seg -E 1 -V 17000000 -B 17000000. From this data set, the number of hits for each of the 20 query light-mediated proteins (Table 1) was counted, discarding hits of less than 60 bits, a cutoff that has been previously estimated to correspond roughly to an E value of <10^-8 (30). The abundances of genome orthologs were counted based on the size of clusters of orthologous groups and nonsupervised orthologous groups previously identified by the STRING database. Because each data set has a different total number of ORFs, protein abundances were normalized with Matlab (The Mathworks, Natick, MA) as follows. For each row (genome or metagenome) of Fig. 3, the absolute number of hits to each query protein was divided by the total number of predicted ORFs in that data set. This percentage is reported in Fig. 3B.
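The counting and normalization steps described above can be sketched in a few lines. The following Python fragment is an illustration only and is not the Matlab code used in the study; the function names and the (query, bit score) representation of a hit are assumptions made for this example, whereas the 60-bit cutoff and the ORF total are taken from the text.

```python
from collections import Counter

MIN_BITS = 60.0  # hits scoring below this cutoff are discarded, as in the text


def count_hits(hits, min_bits=MIN_BITS):
    """Count hits per query protein, keeping only hits at or above min_bits.

    `hits` is an iterable of (query_name, bit_score) pairs. This layout is a
    hypothetical stand-in for parsed BLAST output.
    """
    return Counter(query for query, bits in hits if bits >= min_bits)


def normalize(counts, total_orfs):
    """Express each count as a percentage of the predicted ORFs in a data set."""
    return {query: 100.0 * n / total_orfs for query, n in counts.items()}


# Toy input: two of the three "bluf" hits pass the 60-bit cutoff.
toy_hits = [("bluf", 75.2), ("bluf", 58.9), ("bluf", 61.0), ("cry", 120.4)]
counts = count_hits(toy_hits)
percent = normalize(counts, total_orfs=6_109_937)  # surface seawater ORF total
```

Applying `normalize` row by row, with each data set's own ORF total, gives the kind of per-environment percentages plotted in Fig. 3B.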
TABLE 1. Query genes used in metagenome searches
FIG. 3. Abundances of light-sensing proteins in metagenomes. (A) Total number of proteins orthologous to 20 query proteins (columns) in 373 sequenced genomes (top row) and five metagenomes (remaining rows). (B) Number of proteins as a percentage of the total number of predicted proteins per environment. Rows labeled in gray are subsamples. Columns are labeled as follows (see Table 1 for details): psa, photosystem I subunits ABC; psb, photosystem II subunits ABDEHIJKLF; pet, photosynthetic electron transfer subunits A123; apc, allophycocyanin; cpc, phycocyanin; kaiAB, circadian clock regulators; bluf, blue-light flavin adenine dinucleotide-binding domain-containing proteins; slr0359/plpA, blue-light-absorbing phototropins; cph1 and cph2, red- and far-red-absorbing phytochromes; taxD1, photoreceptor for phototaxis; cry, DNA photolyase and cryptochrome families; carot, water-soluble carotenoids as intracellular UV sunscreen; scyto, scytonemin as extracellular sunscreen; taxP1, phototaxis putative regulatory element; taxY1, phototaxis CheY-like protein; taxAY1, phototaxis histidine kinase; rcaE, complementary chromatic adaptation protein. Growth and repair proteins are more abundant in high-light environments than in variable-light ones, whereas sensing and adaptation proteins are more abundant in variable-light environments than in high-light ones. In particular, photolyase DNA repair proteins are overrepresented in the high-UV environment of surface seawater compared to all other environments. BLUF domain blue-light-sensing proteins are extremely rare in both genomes and environments, although the majority are found in surface rather than deep water. The red-light sensors Cph1 and Cph2 are overrepresented in deep water rather than primarily blue surface water. RcaE chromatic adaptation proteins are overrepresented in variable-light environments, such as the deep sea and lower (darker) layers of the microbial mat.
Search for neighborhoods and domains. Gene neighborhoods for each of the metagenome hits to the 20 query proteins were calculated as previously described (30), counting genes as neighbors only if they were adjacent on the contig in the same transcription direction. We used cotranscribed gene neighbors (as opposed to bidirectional or convergently transcribed gene neighbors) because their existence was previously established to be most predictive of related function (38). Protein domains for metagenome hits were obtained by searching against the SMART, version 5, database (41) with default parameters.
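The neighbor criterion stated above (directly adjacent on the contig and transcribed in the same direction) can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the `Gene` record, the function name, and the example gene names are hypothetical, and the published procedure (30) may apply further criteria that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # start coordinate on the contig
    strand: str  # "+" or "-"


def cotranscribed_neighbors(genes, hit_name):
    """Return the genes directly adjacent to the hit that share its strand.

    `genes` lists all genes predicted on one contig. "Adjacent" here means
    immediately before or after the hit in coordinate order.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    idx = next((i for i, g in enumerate(ordered) if g.name == hit_name), None)
    if idx is None:
        return []  # the hit is not on this contig
    hit = ordered[idx]
    neighbors = []
    for j in (idx - 1, idx + 1):
        if 0 <= j < len(ordered) and ordered[j].strand == hit.strand:
            neighbors.append(ordered[j])
    return neighbors


# Hypothetical contig: one upstream gene on the same strand as the hit,
# one downstream gene on the opposite strand.
contig = [Gene("geneA", 100, "+"), Gene("hit", 900, "+"), Gene("geneB", 1500, "-")]
neighbors = cotranscribed_neighbors(contig, "hit")
```

In this toy contig only `geneA` is retained, because `geneB` lies on the opposite strand and would therefore count as a convergently or divergently transcribed neighbor.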
All analyses were carried out on a dedicated 256-node supercomputing cluster with 1,320 CPU cores communicating via a Gigabit-Ethernet network, each running a 64-bit Linux operating system with 1 GB of memory.
FIG. 2. Overview of light-mediated processes in biology. Organisms sense visible and UV light that they use for growth, adaptation, and defense/repair. Light sensing is carried out by antenna molecules with a photoactive pigment, such as carotenoids, phycocyanin, phycoerythrin, or rhodopsins. Photosynthetic bacteria can process the light energy through a reaction center and store it as ATP via a proton gradient. Bacteria living in high-light environments must also protect against and repair UV damage. Extracellular and intracellular UV-absorbing compounds such as scytonemin and mycosporine-like amino acids act as a natural sunscreen, while photolyase enzymes reverse point mutations in UV-damaged DNA by using a photon of blue light to catalyze the repair reaction. Finally, bacteria living in variable-light environments can adapt to the changing light conditions in a number of ways, e.g., by moving to a more favorable environment via phototaxis, reconfiguring the wavelength specificity of light-sensing antennae via adaptation proteins, or providing their own light via luminescence.
Novel light-mediated sensing. (i) Neighborhood approach. For a candidate sensing process mediated by light, we chose proteins containing the blue-light flavin adenine dinucleotide binding (BLUF) domain (26, 45), as it is rather rare in genomes, and we expected a limited variety in operon organization. BLUF domain proteins are part of the larger family of blue-light photosensors that use flavin chromophores, which together with the phytochromes, rhodopsins, and UV receptors make up the four major classes of bacterial light-sensing proteins (14). Proteins containing a BLUF domain have been shown to function as sensors upstream of phototaxis (24), nucleotide metabolism (35), and repression of anoxygenic photosynthesis (28). The domain is extremely well conserved among the Proteobacteria and Cyanobacteria, absent from the Archaea, and absent from eukaryotes except for the protist Euglena gracilis. As expected, BLUF domain-containing proteins are relatively rare not only in the genomes (34 instances, or 0.002% of all proteins) (Fig. 3) but also in the metagenomes (73 instances, 46 of which are from surface seawater, accounting for 0.0008% of that data set) (Fig. 3).
The vast majority of BLUF-containing proteins in the metagenomes do not contain additional domains, which precludes a domain-based analysis as described above. Furthermore, the BLUF domain is short (98 amino acids) and highly conserved in sequence (70% identity of the multiple sequence alignment) so that constructing phylogenetic trees with robust statistical support is practically impossible. Thus, a tree- or subfamily-based analysis is ruled out as well. However, since BLUF domain proteins are known to function in sensing and the stress response, we surmised that either their expression or the expression of their functional partners would be inducible and thus correlated with the expression of nearby genes on the chromosome. This made it a good candidate for gene neighborhood analysis.
For the 73 environmental BLUF domain proteins, we identified 36 functionally characterizable neighborhoods (32 neighborhoods from surface seawater and 4 from deep-sea whalefall) (Fig. 4). We rediscovered the known functions of BLUF in phototaxis (two neighborhoods), nucleotide metabolism (five neighborhoods), and the repression of anoxygenic photosynthesis (five neighborhoods). Interestingly, we also discovered neighborhoods of BLUF with novel function, including luciferase synthesis (four neighborhoods), nitrate metabolism (three neighborhoods), and quorum sensing (three neighborhoods). These neighbors are promising candidates for the experimental elucidation of BLUF's cellular role.
FIG. 4. BLUF operons from genomes and metagenomes. BLUF domain proteins are shown in blue (center), with none containing additional domains. Genome neighbors include genes that function in phototaxis, nucleotide metabolism, repression of anoxygenic photosynthesis, and virulence, primarily from the Alphaproteobacteria (Rhodopseudomonas and Rhodobacter), Betaproteobacteria (Ralstonia and Chromobacterium), and Gammaproteobacteria (Shewanella and Psychrobacter). Novel metagenome neighbors include genes that function in luciferase synthesis, nitrate metabolism, and quorum sensing, primarily from Rhodopseudomonas and Comamonaceae.
FIG. 5. RcaE domain variations in sequenced bacteria and plants. Whereas the majority of bacterial proteins have the conserved domain architecture of GAF-PAS-PAC-HisKA-HATPase-REC, many additional architectures with different signal transduction domains, multiple sensing domains, and multiple receiver domains exist.
This sample of 650 environmental sequences contained 50 unique domain arrangements, 16 of which were novel, i.e., not seen before in any genome. All 16 novel arrangements preserve the pattern of "specific sensing domain(s)-PAS-PAC-kinase-receiver" but vary in the numbers and types of domain repeats. Two arrangements include repeats in PBPb (periplasmic solute binding) as one of the sensing domains and another has PBPb with PAS repeats without a PAC domain. Another seven arrangements include three to six repeats of PAS-PAC; three arrangements have duplicated REC domains. This is a surprising result because domain repeats are generally less common in bacteria than in eukaryotes, where they are thought to encode increased variability to compensate for longer eukaryotic generation times (4). However, the conservation of the overall domain pattern of the protein, together with the remarkable number of PAS-PAC repeats, allows us to speculate that this domain architecture provides increased substrate affinity and a tuning switch for the sensitivity of the response.
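The bookkeeping behind such a comparison can be sketched as a set difference over ordered domain lists. The Python fragment below is a hypothetical illustration, not the pipeline used in the study: the function names and the two example architectures are assumptions, with domain names written as they appear in the text.

```python
def novel_architectures(metagenome_archs, genome_archs):
    """Return (unique, novel) sets of domain arrangements.

    Each architecture is an ordered sequence of domain names. `unique` holds
    every distinct metagenome arrangement; `novel` holds those that do not
    occur in any sequenced genome.
    """
    known = {tuple(a) for a in genome_archs}
    unique = {tuple(a) for a in metagenome_archs}
    return unique, unique - known


def count_tandem_pairs(arch, pair=("PAS", "PAC")):
    """Count how often the two domains in `pair` occur back to back."""
    return sum(1 for a, b in zip(arch, arch[1:]) if (a, b) == pair)


# Hypothetical example: one conserved arrangement and one with three
# PAS-PAC repeats that is absent from the genome set.
conserved = ("GAF", "PAS", "PAC", "HisKA", "HATPase", "REC")
repeated = ("GAF", "PAS", "PAC", "PAS", "PAC", "PAS", "PAC",
            "HisKA", "HATPase", "REC")

unique, novel = novel_architectures([conserved, repeated, conserved], [conserved])
```

Treating an architecture as an ordered tuple means that arrangements differing only in the number of repeats count as distinct, which matches the way repeat variants are tallied above.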
(iii) Subfamily approach. For a candidate repair process mediated by light, we focused on photolyases, an intriguing family of light-activated DNA repair enzymes that are virtually ubiquitous in bacterial species. Photolyases reverse T<>T cyclobutane dipyrimidine dimers (CPDs) formed by UV damage to DNA using a photon of light to transfer electrons from a catalytic flavin chromophore to the damaged DNA (53). While the structure and function of photolyases were being characterized, an additional family of homologs, the cryptochromes, were discovered (12, 42, 53, 56). Cryptochromes are similar to photolyases in sequence and three-dimensional structure but lack catalytic activity for DNA repair and have unclear function. To date, two kinds of photolyases (CPD-I and CPD-II) and three kinds of cryptochromes have been identified (plant cryptochromes, animal cryptochromes, and CRY-DASH proteins, which are named after the representative four genera in which they were identified, Drosophila, Arabidopsis, Synechocystis, and Homo). Thus, the photolyase-cryptochrome family in sequenced genomes is quite large, spanning the inclusive gene family COG0415 (328 proteins in 209 species) but also including COG3046 (56 genes in 53 species), COG4338 (35 genes in 33 species), and NOG16378 (22 proteins in 19 species). In the metagenomes, the photolyase-cryptochrome homologs are overrepresented in surface seawater (9,703 proteins, or 0.1% of the total) and the top two layers of the microbial mat (8 proteins, or 0.6% of the total) compared to all other environments together (84 proteins). This is consistent with the large amount of UV radiation incident on surface waters of the open ocean or the top layers of the microbial mat but does not account for the other possible functions of cryptochromes in remaining niches.
To tease apart the functional diversity of this protein family, we undertook a subfamily analysis by constructing high-quality alignments, feeding them into a hidden Markov model, and using the resulting hidden Markov model profile to refine the alignment and construct a phylogenetic tree. Although this approach is now standard practice when small gene families are analyzed, it foundered when fed with roughly 10,000 sequences, and our phylogenetic tools of choice (phyml [27] and tree-puzzle [54]) often took weeks to estimate a tree when running on dedicated supercomputing clusters, even without statistical bootstraps. We therefore added several filtering steps to our protocol. First, we removed sequences shorter than 250 amino acids as well as sequences that were >80% identical to any other sequence in the data set. This approximately halved the number of photolyase hits from 9,703 to 4,828. Next, we constructed phylogenetic trees by randomly subsampling the 4,828 sequences in batches of 1,000 sequences and compared the resulting trees for topology and grouping. Finally, we combined the genomic and metagenomic sequences, constructed trees again, and checked whether the same groupings resulted.
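The first two filtering steps can be sketched as follows. This Python fragment is a minimal illustration under several assumptions that the text does not specify: one representative of each group of near-identical sequences is retained, identity is supplied by the caller (the `positional_identity` function is only a toy stand-in for an alignment-based measure), and the batches are a random partition of the filtered set rather than independent draws. All function names are hypothetical.

```python
import random


def positional_identity(a, b):
    """Toy identity: fraction of matching positions (no alignment is done)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def filter_sequences(seqs, identity, min_len=250, max_identity=0.80):
    """Drop short sequences, then greedily drop near-duplicates.

    `seqs` maps sequence IDs to amino acid strings. A sequence is skipped if
    it is shorter than `min_len` or more than `max_identity` identical to a
    sequence that has already been kept.
    """
    kept = {}
    for seq_id, seq in seqs.items():
        if len(seq) < min_len:
            continue
        if any(identity(seq, other) > max_identity for other in kept.values()):
            continue
        kept[seq_id] = seq
    return kept


def random_batches(ids, batch_size=1000, seed=0):
    """Shuffle the IDs and split them into batches of at most `batch_size`."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


# Toy input: "b" duplicates "a", and "d" is too short.
toy = {"a": "M" * 300, "b": "M" * 300, "c": "A" * 300, "d": "M" * 100}
kept = filter_sequences(toy, positional_identity)
batches = random_batches(range(4828))
```

With 4,828 sequences and a batch size of 1,000, this partition yields four full batches and one batch of 828; a tree would then be built per batch and the topologies compared.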
Because bootstraps on the photolyase trees could not be calculated for the entire data set of approximately 10,000 sequences, we report here 1,196 photolyase-cryptochrome orthologs from the sequenced genomes and four metagenomes: surface seawater from Sargasso sea samples 1 to 4, farm soil, acidic mine runoff, and deep-sea whalefall (Fig. 6). Although the bootstrap at the deeper branches is somewhat low (<25%), it is consistently high near the leaves (>80%), indicating that the relationships between the subfamilies are poorly resolved but that the clustering within subfamilies is strong. Most notably, our tree recovers the four known groups of photolyases (CPD-I, CPD-II, DASH cryptochromes, and animal cryptochromes) and additionally identifies two novel deep-branching groups of photolyases/cryptochromes. The deepest-branching "novel family I" represents a new family of 34 photolyases/cryptochromes of unknown function never seen before in the genomes. Because the tree covers photolyases from all known species from all three domains of life, the novel family must include newly detected enzymes of unknown function related to the cryptochrome superfamily. The taxonomic origins of these enzymes are a mixture of Pelagibacter/SAR11-like species and other Alphaproteobacteria (69%) and Cyanobacteria dominated by Prochlorococcus (31%). "Novel family II," which is clearly grouped between CPD-II photolyases and animal DASH cryptochromes, is an additional uncharacterized diverse subfamily with 54 sequences from Alphaproteobacteria (80%) and Cyanobacteria (20%). The species compositions are as expected, since both Alphaproteobacteria and Cyanobacteria are the dominant marine microbial species. Both newly discovered photolyase/cryptochrome families are exciting candidates for further computational and experimental characterization.
FIG. 6. Photolyase/cryptochrome subfamilies representing 1,196 sequences from sequenced genomes and four metagenomes. The tree recovers the known groupings of CPD-I and CPD-II photolyases and animal and DASH cryptochromes (plant cryptochromes, the fifth known group, are not shown here). In addition, we discovered two novel subfamilies of photolyases with 38 (family I) and 54 members (family II) that appear to originate from diverse members of the Alphaproteobacteria and Cyanobacteria.
While these results serve as a proof of principle for the possibility to infer novel functionality by using the three different concepts described above and represent the opportunities inherent in those huge data sets, they also implicitly illustrate the challenges of mining environmental sequence data to discover novel functions. The difficulty is due to the nature of the data itself (vast amount, fragmented, uniform coverage, and shotgun sequence); the lack of appropriate methods and analysis tools together with bottlenecks in CPU, memory, and network bandwidth; and ongoing conceptual difficulties with defining homology/paralogy and novel function. Indeed, while the sequencing of environment after environment continues to generate gigabytes of data, there has been little corresponding investment in the analysis of these data, pointing to an urgent and immediate need for methods and tool development. For example, we would have been unable to derive bootstrap values for some of the phylogenetic trees had we included more environments, not to mention the enormous challenges for the CPU to compute all the data. Our previous work demonstrated that even a slightly better function assignment protocol could lead to a near doubling of the number of functional annotations for gene fragments, from 40% to 70% (30), suggesting that with improved analysis, perhaps only half the sequence data are really needed. The saved effort could be redirected at gathering time series and spatial data, which would help to interpret functional novelty and allow the development of dynamic models to explore larger concepts in ecology and evolution such as species succession, pathway evolution, or metabolic flux.
In summary, we have demonstrated the use of computational analysis techniques for discovering molecular functional novelty in environmental snapshots of bacterial communities. Our results indicate that information on gene neighborhood, protein domains, and subfamilies can be successfully used to discover functional novelty, although various challenges hamper the analysis considerably and will continue to do so as more data are generated in the future.
Published ahead of print on 10 October 2008.