Journal of Bacteriology, December 2006, p. 7999-8004, Vol. 188, No. 23
0021-9193/06/$08.00+0     doi:10.1128/JB.01195-06
Copyright © 2006, American Society for Microbiology. All Rights Reserved.

GUEST COMMENTARY

Bacterial Postgenomics: the Promise and Peril of Systems Biology▿†

Garret Suen,¹ Jimmy S. Jakobsen,²‡ Barry S. Goldman,³ Mitchell Singer,⁴ Anthony G. Garza,¹ and Roy D. Welch¹*

Department of Biology, Syracuse University, Syracuse, New York 13244¹; Departments of Biochemistry and Developmental Biology, Stanford University, Stanford, California 94305²; Monsanto Company, St. Louis, Missouri 63167³; and Section of Microbiology and Center for Genetics and Development, University of California, Davis, California 95616⁴

A postulate of systems biology is that, within the sum total of its genomic data, a network map of all functional interactions exists for each organism (5), and this "interaction map" can be assembled from that data agglomeration through the application of integration algorithms (9, 10, 22). The current iteration of these algorithms focuses on probabilistic formulations for integrating multiple genomic data types; a recent example succeeded in recapitulating the galactose utilization pathway in Saccharomyces cerevisiae by combining no fewer than 18 different genomic datasets (12, 13).

To achieve this result, fundamentally different types of genomic data must be constrained within a unified contextual framework. Any perceived benefit is based on the assumption that, although each type of data can be used independently to predict an interaction map, a more accurate map can be derived from a unified set of predictions. The legitimacy of this assumption is difficult to gauge because its effect must be considered along with the effects of systemic bias prevalent in many genomic datasets (18) as well as the unresolved issues of reliability (19, 23) and reproducibility (3, 17) that underlie high-throughput data. For example, a recent study (4) of microarray experiments demonstrated that the reproducibility of results across laboratories is generally poor, with correlation scores as low as 0.11, even for experiments performed using the same RNA sample. Much of the publicly available microarray data are generated under less rigorous conditions, and "real world" reproducibility is almost certainly lower.

To make biological interactions amenable to applications of network theory, most systems biology presumes a formalized definition of the interaction map as a modular network (11) composed of pairwise protein-protein interactions. Each specific protein-protein interaction within this map is considered a functional interaction, and, as a result, the nuance and subtlety of each interaction is ignored; the discussion is forced into a binary language where each pair of proteins either interacts (1) or does not (0). Within this context, an interaction map becomes a clearly defined mathematical construct: it is a stable, nonredundant subset of all possible pairwise protein-protein interactions that exist within a cell. Temporarily setting aside its obvious flaws, this definition is statistically rigorous and its formalized structure facilitates the application of predictive algorithms.

What relationship exists between this definition and real functional interactions? Testing the accuracy of an interaction map is nontrivial, as no complete, correct, and verifiable interaction map exists as a positive control. The most common measure of predictive accuracy, recapitulating known pathways, tests only a small subset of the predictions. One study that tested an entire set of predictions compared two independently derived yeast two-hybrid interaction sets for S. cerevisiae, revealing an overlap of only 20%, which is alarmingly small considering that both studies analyzed the same type of data (14). Although this study was completed several years ago, no conclusive explanation for the disparity has been offered, and it remains a "puzzle" within the proteomics field (24).

Although we cannot test the accuracy of all the predictions, we can test the definition of an interaction map and some of its requisite assumptions. Specifically, by dividing it into two testable hypotheses, we can test the assumption that more genomic data will improve the correctness of a predicted interaction map (8). First, as the quantity of genomic data approaches saturation, two independently predicted interaction maps will converge upon a stable set, presumably the "correct" interaction map. The second hypothesis is implicit in the first: as the quantity of genomic data approaches saturation, a predicted interaction map based on a single type of genomic data will also approach a stable set, and this set will have the highest degree of overlap with the "correct" interaction map.

To test these hypotheses, we performed a meta-analysis on four prokaryotic model organisms, Escherichia coli, Mycobacterium tuberculosis, Myxococcus xanthus, and Streptomyces coelicolor, by constructing predictive interaction maps based on the two most common forms of genomic data: microarray and genome sequence data. We observed the ability of these interaction maps to converge upon a stable set as the amount of genomic data incorporated into each map was increased.

Constructing predictive interaction maps.

For each organism studied, we constructed two independent predictive interaction maps using microarray and genome sequence data (see the supplemental material). Microarray data were used to construct gene expression interaction maps (16), which predict functional interactions based on the principle of coexpression; genes that share similar expression patterns are considered likely to be functionally linked. Using microarray data from each organism, we computed an intensity ratio matrix and applied Spearman's rank correlation to rank all pairwise interactions, as shown in Fig. 1a and b. Each gene expression interaction map was constructed by retaining the top 50 nonredundant correlates for each gene in the genome (Fig. 1b, inset). Similarly, genome sequence data were used to construct phylogenomic interaction maps (20), which predict functional interactions based on the principle of coinheritance; genes that are consistently coinherited across evolutionary lines are considered likely to be functionally linked. Using the genome of each organism, we performed a sequence alignment of each protein against a local database of publicly available sequenced bacterial genomes using BLASTP (1). From these data, we computed a raw bit score matrix for each organism and used it to construct a phylogenomic interaction map by the same method used for the gene expression interaction maps. The resulting gene expression and phylogenomic interaction maps are comparable, as shown in Fig. 1c.
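
The ranking and retention steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the supplemental material: the function names (`ranks`, `spearman`, `build_interaction_map`) and the toy profiles are hypothetical, the signed correlation is used for ranking, tied values receive their average rank, and "nonredundant" is read here as storing each unordered pair once.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def build_interaction_map(profiles, top_k=50):
    """Keep the top_k highest-scoring partners of every gene.

    profiles maps a gene name to its row of the data matrix (assumed layout).
    The result is a set of unordered pairs, so each interaction appears once.
    """
    genes = sorted(profiles)
    score = {}
    for a, b in combinations(genes, 2):
        score[(a, b)] = score[(b, a)] = spearman(profiles[a], profiles[b])
    edges = set()
    for g in genes:
        partners = sorted(
            (h for h in genes if h != g),
            key=lambda h: score[(g, h)],
            reverse=True,
        )
        for h in partners[:top_k]:
            edges.add(frozenset((g, h)))
    return edges


# Toy data: four hypothetical genes measured in four experiments.
toy_profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 9],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 2, 4],
}
toy_map = build_interaction_map(toy_profiles, top_k=1)
print(sorted(sorted(pair) for pair in toy_map))
```

The same function could be applied to rows of a bit score matrix instead of expression ratios, which mirrors the statement that both map types are built by one method.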


FIG. 1. Schematic representation for the construction of predictive interaction maps. Microarray and genome sequence data (a) are used to generate raw data matrices which are processed using Spearman's rank correlation to generate similarity matrices (b). The top 50 closest correlates for each protein in the genome are retained, as shown in the inset table (b). This retention produces gene expression and phylogenomic interaction maps, which can be represented as a network (c). Interactions predicted by both gene expression and phylogenomic interaction maps are shown in green, interactions predicted by gene expression alone are represented in orange, and interactions predicted by phylogenomics are represented in purple. In this figure, the construction of gene expression and phylogenomic interaction maps for M. xanthus is shown.

Measuring stability in interaction maps: retention and convergence.

We simulated the approach toward data saturation (i.e., the "evolution" of each map) by repeatedly compiling each predicted interaction map with increasing amounts of data. We added additional microarray experiments to each iteration of the gene expression maps and additional genomes to each iteration of the phylogenomic maps. We employed two metrics, "retention" and "convergence," to quantify the changes within and between interaction maps. Retention is the rate at which a predictive interaction map, based on a single genomic data type, moves toward a stable set; convergence is the rate at which two or more independently derived predictive interaction maps move toward the same set, presumably the "correct" interaction map. To quantify retention and convergence, we used Jaccard's coefficient of similarity, expressed as |A ∩ B|/|A ∪ B|. The Jaccard coefficient measures similarity by dividing the number of predicted interactions shared by two sets (A ∩ B) by the number of predicted interactions that exist in either set (A ∪ B) and expressing this ratio as a number between 0 (no similarity between the sets) and 1 (sets are identical). All ratios were compared to those of randomized controls (see the supplemental material).
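
As a concrete illustration of the Jaccard coefficient, the following minimal Python sketch treats each interaction map as a set of unordered protein pairs. The helper names (`jaccard`, `edge`) and the protein labels are hypothetical, and returning 0.0 for two empty maps is a convention chosen here rather than something the text specifies.

```python
def jaccard(map_a, map_b):
    """Jaccard coefficient: shared interactions divided by all interactions."""
    union = map_a | map_b
    if not union:
        return 0.0  # assumed convention for two empty maps
    return len(map_a & map_b) / len(union)


def edge(p, q):
    """An unordered pairwise interaction, so edge(p, q) == edge(q, p)."""
    return frozenset((p, q))


map_a = {edge("p1", "p2"), edge("p2", "p3"), edge("p3", "p4")}
map_b = {edge("p2", "p1"), edge("p3", "p4"), edge("p4", "p5"), edge("p1", "p5")}

# Two interactions are shared and five exist in either map, so 2 / 5.
print(jaccard(map_a, map_b))  # 0.4
```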

Retention within genomic data.

For both gene expression and phylogenomic data, we constructed four iterations of each map, incorporating 25, 50, 100, and 200 randomly selected microarray experiments or genomes, respectively. We compared subsequent iterations (25 to 50, 50 to 100, and 100 to 200) by calculating the similarity between pairs, as shown in Fig. 2. For both gene expression and phylogenomic maps, comparisons between subsequent iterations showed an increase in the Jaccard coefficient, indicating that each iteration is moving the interaction map toward a stable set. This observed retention was consistent for all organisms in this study.
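
The retention comparison can be sketched as follows. This is a hedged illustration only: `retention_series` and its parameters are hypothetical names, and the sketch assumes that each larger iteration contains the datasets of the smaller one (nested random subsets), which the text does not state. The map-building step is passed in as `build_map` so that any construction procedure can be substituted.

```python
import random


def jaccard(map_a, map_b):
    """Jaccard coefficient between two sets of predicted interactions."""
    union = map_a | map_b
    return len(map_a & map_b) / len(union) if union else 0.0


def retention_series(datasets, build_map, sizes=(25, 50, 100, 200), seed=0):
    """Compare successive iterations of a map built from growing data subsets.

    datasets  : iterable of microarray experiments or genomes (any objects)
    build_map : callable turning a list of datasets into a set of interactions
    Returns [((n_small, n_large), jaccard), ...] for each successive pair.
    """
    rng = random.Random(seed)
    shuffled = list(datasets)
    rng.shuffle(shuffled)  # random selection; nested subsets are an assumption
    maps = [build_map(shuffled[:n]) for n in sizes]
    return [
        ((sizes[i], sizes[i + 1]), jaccard(maps[i], maps[i + 1]))
        for i in range(len(sizes) - 1)
    ]


# Toy stand-in for map construction: the "map" is simply the set of dataset
# identifiers, so every doubling retains exactly half of the larger set.
for pair, value in retention_series(range(200), build_map=set):
    print(pair, value)
```

With a real `build_map`, an increase in the coefficient from one pair to the next would correspond to the retention trend described above; the toy stand-in is constant by construction and only demonstrates the bookkeeping.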


FIG. 2. Similarity within evolving sets of predictive interaction maps. Two independent predictive interaction maps, gene expression (GE) and phylogenomics (PG), were compared across four different prokaryotes. Four iterations of each predictive interaction map were constructed by incorporating 25, 50, 100, and 200 randomly selected datasets. The similarity between each pair of doubled sets was determined by calculating the Jaccard coefficient. A similarity randomization control (average of 10 trials) for both gene expression (SR-GE) and phylogenomics (SR-PG) is also shown (for a more detailed explanation of this control, please see the supplemental material).

Convergence between genomic data.

According to the first hypothesis, gene expression and phylogenomic maps should approach the same "correct" interaction map as the amount of incorporated data approaches saturation. To test this hypothesis, we measured the convergence between iterations of gene expression and phylogenomic maps for each organism, as shown in Fig. 3. We observed an extremely small increase in convergence for all organisms but M. xanthus, which showed almost no change.


FIG. 3. Observed convergence between predictive interaction maps. Four iterations of phylogenomic (PG) and gene expression (GE) predictive interaction maps were constructed using 25, 50, 100, and 200 randomly selected datasets for each prokaryote in this study. Similarity comparisons were made by calculating the Jaccard coefficient between each iteration of both predictive interaction maps (PG versus GE). A similarity randomization (SR) control (average of 10 trials) is also shown (for a more detailed explanation of this control, please see the supplemental material).

Contributors to convergence.

We then tested the second hypothesis, that retention measures a predicted interaction map's movement toward the "correct" interaction map, by examining the relationship between retention and convergence. If this hypothesis is correct, then the predicted interactions that contribute to retention within a map should also contribute to an increase in convergence between two independently derived maps. We subdivided each iterated pair of interaction maps in Fig. 2 into two categories: interactions that are retained as the amount of incorporated genomic data increased and those that are not. We then calculated the convergence between gene expression and phylogenomic retained and nonretained interaction sets as shown in Fig. 4. For each organism, we observed an increase in convergence for retained interactions and a corresponding decrease in convergence between nonretained interactions. Therefore, retained interactions are the major contributors to the increased convergence observed between evolving gene expression and phylogenomic predicted interaction maps. This provides strong evidence that retention measures a predicted interaction map's movement toward a single "correct" interaction map.
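
The retained and nonretained comparison can be illustrated with a small Python sketch. It rests on one reading of the text that should be treated as an assumption: for a pair of iterations, "retained" is taken to be the interactions present in both iterations, and "nonretained" is everything else in the pair (the symmetric difference). The function names and the toy maps are hypothetical.

```python
def jaccard(map_a, map_b):
    """Jaccard coefficient between two sets of predicted interactions."""
    union = map_a | map_b
    return len(map_a & map_b) / len(union) if union else 0.0


def split_retained(smaller_map, larger_map):
    """Split an iterated pair into (retained, nonretained) interaction sets."""
    retained = smaller_map & larger_map
    nonretained = smaller_map ^ larger_map  # assumed reading of "nonretained"
    return retained, nonretained


def edge(p, q):
    """An unordered pairwise interaction."""
    return frozenset((p, q))


# Toy maps for one pair of iterations (25 and 50 datasets) of each data type.
ge_25 = {edge("a", "b"), edge("b", "c"), edge("c", "d")}
ge_50 = {edge("a", "b"), edge("b", "c"), edge("d", "e")}
pg_25 = {edge("a", "b"), edge("c", "e"), edge("b", "c")}
pg_50 = {edge("a", "b"), edge("a", "e"), edge("d", "e")}

ge_retained, ge_nonretained = split_retained(ge_25, ge_50)
pg_retained, pg_nonretained = split_retained(pg_25, pg_50)

# Convergence is measured separately within each category.
print(jaccard(ge_retained, pg_retained))        # 0.5
print(jaccard(ge_nonretained, pg_nonretained))  # 0.2
```

The toy numbers are contrived and carry no biological meaning; they only show how the two convergence values are obtained for one pair of iterations.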


FIG. 4. Similarity between retained and nonretained linkages between evolving predictive interaction maps. Four iterations of phylogenomic (PG) and gene expression (GE) predictive interaction maps were constructed using 25, 50, 100, and 200 randomly selected datasets for each prokaryote used in this study. For each pair of iterations, the retained (R) and nonretained (NR) links were derived and the similarities between these sets were computed by calculating the Jaccard coefficient. A similarity randomization (SR) control (average of 10 trials) for both retained and nonretained is also shown (for a more detailed explanation of this control, please see the supplemental material).

Discussion.

Interaction maps often incorporate multiple types of genomic data generated by researchers from several laboratories using different experimental conditions and protocols. Our purpose was to examine the most common types of data used in systems biology meta-analysis: genome sequence and microarray gene expression data. We incorporated all available prokaryotic genome sequences, disregarding the existing sequencing bias toward certain phylogenetic groups, and we combined microarray data from multiple sources (see Table S1 in the supplemental material) using their respective normalization methods, rather than reapplying a standard normalization algorithm to the raw data. We did not employ any probabilistic integration algorithms in the construction of interaction maps, as doing so would not address the relevant issue: either the genomic data being generated by the scientific community are predicting a stable set of interactions or they are not.

At the point of data saturation, a map presumably specifies a set of interactions that remains stable, even as more data are incorporated. The retention metric quantifies this type of stability. As shown in Fig. 2, for all four organisms, both gene expression and phylogenomic retention metrics show a significant increase between 25 and 200 microarrays and genomes, respectively. Given the rapid increase in both the use of microarrays and the sequencing of prokaryotic genomes, it is entirely possible that, for some model organisms, sufficient data for near saturation of gene expression and phylogenomic interaction maps could exist within a reasonable time frame.

As shown in Fig. 3, the convergence metric is always exceedingly small; at its current rate, it is unlikely to increase significantly, even as the retention metric approaches saturation. However, the results in Fig. 4 show that the small increase in convergence is based largely on the subset of retained predictions, indicating that the intersection of maps selects for stable interactions; this provides support for the use of probabilistic integration algorithms to accentuate these intersections. Nevertheless, the universally small convergence metric raises serious doubts regarding the concept of genomic data integration and, by logical extension, the very definition of a "functional interaction."

Glaringly absent in systems biology is a consensus regarding a rigorous definition for the term "functional interaction." Does a functional interaction include all of the proteins that participate in a specific response or pathway? If so, then how broadly do we define a specific response? For example, do all of the proteins that participate in a stress response share a functional interaction or only the subset that physically interacts? How about two proteins that pass signals without physically interacting? Do they share a functional interaction? Ontological frameworks such as gene ontology (2), clusters of orthologous groups (21), KEGG (15), and Pfam (7) all utilize different schema in an attempt to address this issue; however, none yet represent a standard.

A true understanding of a functional interaction requires genomic data to be translated back into the "real world," where interactions may not be pairwise or binary and where spatial and temporal variables contribute significantly. Ultimately, the data that are removed to manifest an interaction map must later be reacquired to experimentally verify its predictions, one interaction at a time. Although this complicates the interpretation of systems biology data, it does not negate the validity of the process, as the repetitive nature of molecular biology lends itself to systematic analysis. While some systems biologists argue that experimental research on single genes will soon be "road kill on the systems biology highway (25)," others now openly question whether high-throughput experiments are worth continuing due to their high probability of error (6). The future lies somewhere between these extremes: molecular and systems biologists must work together to define the interaction maps of model organisms and thereby transform the life sciences into a more integrative discipline.

ACKNOWLEDGMENTS

We thank R. Alves, B. Arshinoff, D. Kaiser, H.-J. Kim, M. Savageau, R. Raina, R. Taylor, L. Welch, and members of the Welch Lab for helpful discussions. We thank the Monsanto Company and The Institute for Genomic Research for providing access to the genome sequence of M. xanthus DK1622. We thank N. Caberoy, M. Diodati, and I. Jose for help in performing M. xanthus DK1622 microarray experiments.

This work was supported in part by grant GM54592 to M.S. from the National Institutes of Health and NSF grant MCB-0444154 to A.G.G.

G.S. and R.D.W. conceived the idea and wrote the manuscript. G.S. performed all computational experiments. R.D.W., J.S.J., and B.S.G. were involved in the sequencing and annotation of the M. xanthus DK1622 genome. R.D.W. and J.S.J. constructed the microarray for M. xanthus DK1622. A.G.G. and M.S. performed all microarray experiments for M. xanthus DK1622. All of the authors contributed to the editing of the manuscript.


FOOTNOTES

* Corresponding author. Mailing address: Syracuse University, Department of Biology, 130 College Place, Syracuse, NY 13244. Phone: (1) 315 443 2159. Fax: (1) 315 443 2012. E-mail: rowelch@syr.edu.

The views expressed in this Commentary do not necessarily reflect the views of the journal or of ASM.

† Supplemental material for this article may be found at http://jb.asm.org/.

‡ Present address: MPI für terrestrische Mikrobiologie, Karl-von-Frisch-Straße, D-35043 Marburg, Germany.

▿ Published ahead of print on 22 September 2006.

REFERENCES

  1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
  2. Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25:25-29.
  3. Bader, G. D., and C. W. Hogue. 2002. Analyzing yeast protein-protein interaction data obtained from different sources. Nat. Biotechnol. 20:991-997.
  4. Bammler, T., R. P. Beyer, S. Bhattacharya, G. A. Boorman, A. Boyles, B. U. Bradford, R. E. Bumgarner, P. R. Bushel, K. Chaturvedi, D. Choi, M. L. Cunningham, S. Deng, H. K. Dressman, R. D. Fannin, F. M. Farin, J. H. Freedman, R. C. Fry, A. Harper, M. C. Humble, P. Hurban, T. J. Kavanagh, W. K. Kaufmann, K. F. Kerr, L. Jing, J. A. Lapidus, M. R. Lasarev, J. Li, Y. J. Li, E. K. Lobenhofer, X. Lu, R. L. Malek, S. Milton, S. R. Nagalla, P. O'Malley, J., V. S. Palmer, P. Pattee, R. S. Paules, C. M. Perou, K. Phillips, L. X. Qin, Y. Qiu, S. D. Quigley, M. Rodland, I. Rusyn, L. D. Samson, D. A. Schwartz, Y. Shi, J. L. Shin, S. O. Sieber, S. Slifer, M. C. Speer, P. S. Spencer, D. I. Sproles, J. A. Swenberg, W. A. Suk, R. C. Sullivan, R. Tian, R. W. Tennant, S. A. Todd, C. J. Tucker, B. Van Houten, B. K. Weis, S. Xuan, and H. Zarbl. 2005. Standardizing global gene expression analysis between laboratories and across platforms. Nat. Methods 2:351-356.
  5. Cusick, M. E., N. Klitgord, M. Vidal, and D. E. Hill. 2005. Interactome: gateway into systems biology. Hum. Mol. Genet. 14(Spec. No. 2):R171-R181.
  6. Fields, S. 2005. High-throughput two-hybrid analysis. The promise and the peril. FEBS J. 272:5391-5399.
  7. Finn, R. D., J. Mistry, B. Schuster-Bockler, S. Griffiths-Jones, V. Hollich, T. Lassmann, S. Moxon, M. Marshall, A. Khanna, R. Durbin, S. R. Eddy, E. L. Sonnhammer, and A. Bateman. 2006. Pfam: clans, web tools and services. Nucleic Acids Res. 34:D247-D251.
  8. Fraser, A. G., and E. M. Marcotte. 2004. A probabilistic view of gene function. Nat. Genet. 36:559-564.
  9. Ge, H., A. J. Walhout, and M. Vidal. 2003. Integrating "omic" information: a bridge between genomics and systems biology. Trends Genet. 19:551-560.
  10. Gerstein, M., N. Lan, and R. Jansen. 2002. Proteomics. Integrating interactomes. Science 295:284-287.
  11. Hartwell, L. H., J. J. Hopfield, S. Leibler, and A. W. Murray. 1999. From molecular to modular cell biology. Nature 402:C47-C52.
  12. Hwang, D., A. G. Rust, S. Ramsey, J. J. Smith, D. M. Leslie, A. D. Weston, P. de Atauri, J. D. Aitchison, L. Hood, A. F. Siegel, and H. Bolouri. 2005. A data integration methodology for systems biology. Proc. Natl. Acad. Sci. USA 102:17296-17301.
  13. Hwang, D., J. J. Smith, D. M. Leslie, A. D. Weston, A. G. Rust, S. Ramsey, P. de Atauri, A. F. Siegel, H. Bolouri, J. D. Aitchison, and L. Hood. 2005. A data integration methodology for systems biology: experimental verification. Proc. Natl. Acad. Sci. USA 102:17302-17307.
  14. Ito, T., T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98:4569-4574.
  15. Kanehisa, M., S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32:D277-D280.
  16. Kim, S. K., J. Lund, M. Kiraly, K. Duke, M. Jiang, J. M. Stuart, A. Eizinger, B. N. Wylie, and G. S. Davidson. 2001. A gene expression map for Caenorhabditis elegans. Science 293:2087-2092.
  17. Larkin, J. E., B. C. Frank, H. Gavras, R. Sultana, and J. Quackenbush. 2005. Independence and reproducibility across microarray platforms. Nat. Methods 2:337-344.
  18. Mrowka, R., A. Patzak, and H. Herzel. 2001. Is there a bias in proteome research? Genome Res. 11:1971-1973.
  19. Sprinzak, E., S. Sattath, and H. Margalit. 2003. How reliable are experimental protein-protein interaction data? J. Mol. Biol. 327:919-923.
  20. Srinivasan, B. S., N. B. Caberoy, G. Suen, R. G. Taylor, R. Shah, F. Tengra, B. S. Goldman, A. G. Garza, and R. D. Welch. 2005. Functional genome annotation through phylogenomic mapping. Nat. Biotechnol. 23:691-698.
  21. Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-637.
  22. Vidal, M. 2001. A biological atlas of functional maps. Cell 104:333-339.
  23. von Mering, C., R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399-403.
  24. Werner-Washburne, M., B. Wylie, K. Boyack, E. Fuge, J. Galbraith, J. Weber, and G. Davidson. 2002. Comparative analysis of multiple genome-scale data sets. Genome Res. 12:1564-1573.
  25. Werner, E. 2005. Meeting report: the future and limits of systems biology. Sci. STKE 2005:pe16.

