**DOI:**10.1128/JB.185.9.2692-2699.2003

The recent flood of genomic (13), transcriptomic (29), and other high-throughput data (21, 33, 45) makes the need to interpret this information in a systemic fashion increasingly pressing. The construction of in silico models represents a way to interpret these data and place them in the context of cellular physiology. A variety of in silico modeling approaches in biology have been developed, including detailed kinetic models, cybernetic models, stochastic models, metabolic control analysis, biochemical systems theory, and constraint-based methods. Many of these methodologies have been reviewed elsewhere (5, 20, 25, 40, 46, 52, 62).

Modern modeling approaches in biology need to be easily scalable and able to integrate available “-omics” data (38) that may contain tens of thousands of measurements. A constraint-based modeling approach (5, 14, 62) meets these criteria and at present is the only methodology by which genome-scale models have been constructed. The few parameters used in a constraint-based framework enable models to be built quickly and to encompass a larger portion of biochemical reaction networks than the portion currently encompassed by other modeling methodologies. To date, constraint-based models account for the largest metabolic models in terms of numbers of genes and reactions and have proven to be predictive of some types of data, including phenomic data (15, 26, 63), qualitative transcriptomic data (9), and gene knockout data (16, 54).

*Escherichia coli* is a well-studied organism, and much is known about its metabolism, regulation, and physiology. Constraint-based models of *E. coli* have been under development for the past 13 years. The continual growth in the size and scope of constraint-based *E. coli* models, as shown in Fig. 1, illustrates the iterative nature of in silico model building and how such models expand in scope and completeness over time. While many modeling approaches have been used to study *E. coli* (2, 12, 57, 65), in this minireview we focus on the development of successive constraint-based models that have been formulated to describe the metabolic network of *E. coli* and summarize the models' abilities to predict or explain phenotypic behavior. The principles that have been developed and the experiences that have been gained from modeling *E. coli* can be directly applied to modeling other organisms; this process has begun for *Haemophilus influenzae* (17), *Helicobacter pylori* (48), *Saccharomyces cerevisiae* (8, 23), and *Methylobacterium extorquens* (58), and more models are expected to emerge soon. The scope of constraint-based in silico models should continue to grow, and these models are likely to have a variety of uses in the near future.

## PRINCIPLES OF CONSTRAINT-BASED MODELING

Reviews that describe the constraint-based modeling procedure have appeared previously (5, 14, 19, 62). While kinetic models may ultimately provide a detailed understanding of integrated cellular functions, they are limited by the current availability of the information needed to construct them and by the fact that kinetic constants can vary across a population and change over time through evolution. The constraint-based modeling procedure does not strive to find a single solution but rather finds a collection of all allowable solutions to the governing equations that can be defined. Solutions that violate any of the imposed constraints are excluded from the collection, which mathematically is called a solution space. The subsequent application of additional constraints further reduces the solution space and, consequently, reduces the number of allowable solutions that a cell can utilize. The constraints that have been used in the first generation of constraint-based models include stoichiometric constraints (mass balance), thermodynamic constraints (regarding the reversibility of a reaction), and enzymatic capacity constraints (using an appropriate *V*_{max} value).

Stoichiometric constraints can be represented by the matrix equation ** Sv** = 0, where

**is the stoichiometric matrix describing all the reactions in the network and**

*S**v*is a vector describing the fluxes through each of the reactions. Each column of

**corresponds to an individual reaction, and the rows of**

*S***correspond to the different metabolites. The stoichiometric coefficients of a reaction are then represented as elements in column (i.e.,**

*S**S*corresponds to the stoichiometric coefficient of the

_{ij}*i*th metabolite in the

*j*th reaction). The equation

**= 0 imposes the restriction that the total rate of production for any metabolite must equal the total rate of consumption for that metabolite. In addition to stoichiometric or mass balance constraints, thermodynamic constraints and enzyme capacity constraints place limits on the range of values for individual fluxes (**

*Sv**v*) in the network. Enzyme capacity constraints place an upper limit on the values that a given flux can take. Application of thermodynamic constraints further restricts the range of flux values. If a reaction is irreversible, the corresponding flux must be greater than or equal to zero; however, reversible reactions can have positive or negative flux values. A more systematic representation of thermodynamic constraints has appeared (3, 44). Stoichiometric, enzyme capacity, and thermodynamic constraints represent hard inviolable physicochemical constraints that cells must abide by.

_{j}Given the governing constraints, the next step involves characterizing the allowable solution space and predicting which solution a cell is likely to use. Different techniques exist under the constraint-based modeling framework, including extreme pathway analysis (50), elementary mode analysis (52, 53), flux balance analysis (FBA) (5, 14, 19, 62), and minimization of metabolic adjustments (54), which aid in this process. Figure 2 illustrates the different constraint-based modeling techniques used to characterize the solution space defined by the network and the applied constraints.

Characterizing the solution space.Extreme pathways and elementary modes represent sets of vectors that describe the solution space and are themselves biochemically valid flux distributions through a defined metabolic network. Calculation of elementary modes and extreme pathways has been described elsewhere (50, 53). Elementary modes are unique vectors that characterize the solution space. An elementary mode is defined as a “minimal set of enzymes that could operate at steady-state with all irreversible reactions proceeding in the appropriate direction” (52). Extreme pathways are related to elementary modes and correspond directly to the edges of the convex solution space. Both approaches have been applied to core metabolic networks of *E. coli* whose sizes are listed in Table 1. Positive linear combinations of these vectors can be used to generate any valid steady-state flux solution under the governing constraints. These analysis methods are useful for characterizing the solution space, and the next step is to try to determine what solution in the solution space the cell actually chooses to use.

Calculating optimal phenotypes.Linear optimization has been used to find a particular solution within the allowable solution space that maximizes or minimizes a particular objective function. The objective function can be used to explore the content of a solution space, to determine capabilities (such as optimal metabolic by-product yields), or to guess at likely phenotypic functions. This approach is referred to as FBA and has been reviewed elsewhere (5, 14, 19, 62). Some commonly used objective functions include production of ATP, production of a desired by-product, or growth rate (as defined by the weighted consumption of metabolites needed to make biomass). The size of the metabolic networks used to model *E. coli* metabolism by FBA has expanded over time from 14 to 929 metabolic reactions, as shown in Table 2.

Construction of constraint-based models is iterative in nature, and while correct predictions are desirable, the incorrect predictions are the ones that lead to an iterative refinement of the models (37). Over the past 13 years the constraint-based modeling approach has been used to study, explain, and predict the phenotypic behavior of *E. coli*. The individual models can be characterized by the type of analysis performed and by the time frame in which they were developed.

## ELEMENTARY MODE AND EXTREME PATHWAY ANALYSES

A representation of *E. coli*'s central metabolic network was constructed and analyzed by using elementary mode analysis (31). For this study modes that produce 3-deoxy-d-arabinoheptulosonate 7-phosphate, a precursor of the aromatic amino acids, and/or ATP were calculated with xylose or glucose as the carbon source. When glucose was the carbon source, all possible steady-state flux distributions (i.e., solutions which obey stoichiometric and thermodynamic constraints) could be written as the sum of 11 calculated elementary modes. Examination of the properties of the calculated elementary modes, such as 3-deoxy-d-arabinoheptulosonate 7-phosphate yields, provided insight into experimental results (31).

The extreme pathways for a larger network containing 78 reactions and 53 metabolites were calculated for two different carbon sources, glucose and succinate. The extreme pathway vectors calculated for the conditions when the biomass precursors were included in the network as a growth reaction were correlated with results from FBA when growth was used as the objective function (49). More recently, the elementary modes in a larger representation of *E. coli*'s metabolic network (containing 110 reactions and 89 metabolites) were calculated and analyzed (56). In this study five different carbon sources were used, and again a reaction representing the drain of metabolic precursors needed for growth was added to the network. Elementary mode analysis was used to determine which of 90 genes were essential for growth; a reported 90% of the predictions agreed with experimental data when phenotypes were classified as either growth or no growth. The number of elementary modes varied depending on the carbon source and ranged from 598 when acetate was the carbon source to 27,099 for glucose. The modes were further analyzed to identify which enzymes would most likely be regulated for changing growth conditions (i.e., different carbon sources). A good correlation between regulatory predictions and measured mRNA expression data was found (56).

The metabolic networks studied by elementary mode or extreme pathway analysis are smaller than the networks studied by FBA. This limitation is due to the computational complexity associated with calculating these vectors and not from limitations on known reaction stoichiometry (28).

Direct determination of optimality properties can be accomplished by using optimization procedures that circumvent the exhaustive enumeration of extreme pathway analysis or elementary mode analysis. Linear programming has been used extensively to determine the optimality properties of reconstructed *E. coli* networks.

## PREGENOME ERA MODELS

Initial model of core metabolism.In 1990, Majewski and Domach formulated an FBA model that included 14 metabolic reactions (tricarboxylic acid [TCA] cycle, partial glycolysis, and respiratory chain) and four load fluxes that served as drains for the metabolic precursors needed for cell growth (oxaloacetate, α-ketoglutarate [αKG], pyruvate, and acetyl coenzyme A) (32). In their analysis production of high-energy phosphate bonds on ATP and GTP was used as the objective function.

Majewski and Domach studied the optimal behavior of the metabolic network under two different types of constraints, enzymatic capacity constraints and electron transport chain constraints. Both types of constraints placed an upper limit on the value for fluxes through different reactions in the network. By maximizing the utilization of the network for the production of high-energy phosphate bonds under either an enzymatic constraint (αKG dehydrogenase) or an electron transport chain capacity constraint, equations describing the onset of acetate overflow and the rate of acetate production could be derived. The experimentally determined secretion patterns for acetate overflow in *E. coli* agreed with the network operating under enzymatic capacity constraints (32). The model led to the conclusion that the electron transport chain capacity is a constraint only when the growth rate approaches the maximum achievable growth rate (32).

Model built on Neidhardt's compendium.Varma and Palsson's reconstruction of *E. coli*'s metabolic network contained both anabolic and catabolic reactions based on previously published information (34, 35). The model contained 53 catabolic reactions and 94 biosynthetic reactions that produce the amino acids, nucleic acids, and cell membrane and cell wall constituents found in cell biomass (59, 60). Several network properties were calculated from this model, such as optimal production of cofactors (ATP, NADPH, NADH) (60), optimal production of metabolic precursors (such as pyruvate or succinate) (60), maximal theoretical yields for amino acid and nucleic acid production (59), and evaluation of constraints (energy, redox, or stoichiometric) that restrict production of metabolites or cofactors (59, 60), as well as optimal flux distributions for biomass production (61). Model predictions agreed with experimental results when *E. coli* was grown on glucose minimal medium under aerobic and anaerobic conditions and if the metabolic network was optimized for the production of biomass constituents (63). Sensitivity analysis of the predicted growth rate with respect to the biomass composition was conducted, and it was found that the sensitivity of the biomass yield to changes in metabolite requirements was low (61).

Effects of growth rate-dependent biomass composition.Pramanik and Keasling's metabolic network was an expansion of Varma and Palsson's model and consisted of 300 reactions and 289 metabolites (an additional 17 reactions and 16 metabolites were added later) (42, 43). The biomass composition of *E. coli* varies with the carbon source and the growth rate (36). Varma and Palsson's model had used a fixed biomass composition based on previously published data, while Pramanik and Keasling's model used derived equations relating growth rate to biomass requirements, allowing for changes in biomass composition in a growth rate-dependent manner (43).

To analyze the accuracy of the model, 13 experimentally measured flux values (64) were compared to fluxes calculated by using Pramanik and Keasling's model (43). Three experimental conditions were examined: anaerobic growth on glucose, aerobic growth on acetate, and aerobic growth on acetate plus glucose. For aerobic growth on acetate plus glucose the average difference between experimental flux measurements and model predictions was 16%. Similar results were found for aerobic growth on acetate (17%). No experimental flux values were available for comparison for anaerobic glucose, but a branched TCA cycle (with an oxidative branch producing αKG and a reductive branch producing succinyl coenzyme A) had been observed experimentally; the model's prediction agreed with this observation. Further analysis of the model also indicated that the predicted flux values were sensitive to biomass composition (42, 43). Other studies have been conducted by using slight modifications of Pramanik and Keasling's model to investigate the effects of large-scale gene deletions or additions (6) and to calculate a minimal gene set (7).

## POSTGENOME ERA MODELS

Finally, a genome-scale model.The complete genome of *E. coli* K-12 strain MG1655 was sequenced and annotated in 1997 (4). Genomic data allow systematic assessment of the capabilities of the organism; in addition, genes encoding metabolic enzymes that have not yet been biochemically characterized can be included in a model based on sequence similarity. For less-studied organisms the genome plays a more significant role in network reconstruction, and many of the enzymes are assigned based on sequence homology and await biochemical characterization. In the case of *E. coli*, the genomic information, coupled with physiological and biochemical data (10), enabled reconstruction of a genome-scale metabolic model describing the *E. coli* metabolic network (16).

Edwards and Palsson's model included 720 reactions and 436 metabolites involved in glycolysis, the TCA cycle, the pentose phosphate pathway, respiration, anaplerotic reactions, fermentative reactions, amino acid biosynthesis and degradation, nucleotide biosynthesis and interconversions, fatty acid biosynthesis and degradation, phospholipid biosynthesis, cofactor biosynthesis, and metabolite transport. This model was validated with mutant data (16), was used to design quantitative experiments (15), and was found to predict the outcome of adaptive evolution (26).

The results of in silico gene deletion studies were compared with growth data obtained with known mutants. The in vivo growth characteristics of a series of *E. coli* mutants on several different carbon sources were examined and compared to the in silico deletion results. In this analysis, 68 of 79 (86%) of the in silico predictions were consistent with the experimental observations (16). The predictions of the in silico *E. coli* model were highly consistent with phenotypes of known mutants and knockouts.

Phenotypic phase plane analysis was developed (18) and applied to Edwards and Palsson's *E. coli* model. Quantitative predictions regarding optimal usage of a carbon source and oxygen to maximize the growth rate were made and tested for a variety of substrates. Growth on M9 minimal medium with acetate, malate, or succinate as the primary carbon source agreed with the computational hypothesis (15, 26); however, glycerol supported only suboptimal growth of *E. coli*. Repeated exponential balanced growth batch cultures on glycerol were then incubated for 60 days (around 900 generations), and the cells reproducibly evolved towards the a priori predicted optimal growth behavior (26).

Incorporating effects of transcriptional regulation.A shortcoming of the purely stoichiometric metabolic models is that they do not account for transcriptional regulation, so all the gene products are assumed to be available to the cell to optimize its performance in a defined environment. This assumption is based on the rationale that *E. coli* would have evolved its regulatory network to allow optimal growth under conditions to which the microorganism was already adapted. Some instances where this assumption might not be true are for *E. coli* mutants or for growth on multiple carbon sources. It has also been found that some carbon sources do no initially support optimal behavior, although limited adaptive evolution data do suggest that over time *E. coli* adjusts its metabolic fluxes to find the optimal solution (26; S. S. Fong and B. O. Palsson, unpublished data).

To address this issue, the effects of transcriptional regulation have recently been included in a constraint-based model of *E. coli* central metabolism (9). The method for modeling transcriptional regulation is based on Boolean logic, where the genes can be either on or off, and their status is evaluated based on conditional if statements. The regulatory network has been reconstructed for central metabolism in *E. coli* by using this method. Covert and Palsson's metabolic and regulatory network includes 149 genes (16 are regulatory genes), which take part in 113 metabolic reactions, 45 of which are regulated by 16 regulatory proteins (9).

Covert and Palsson's regulated model has been used to compare regulated model predictions of gene deletions with mutant data and predictions from an unregulated model (9). The unregulated network correctly predicted 97 of 116 cases correctly (83.6%), while the regulated network predicted 106 of 116 cases correctly (91.4%); thus, incorporating regulation improved the accuracy of in silico knockout predictions. The model was also used to calculate time courses of batch growth. The in silico predictions were in agreement with the experimental by-product (acetate, formate, and ethanol) secretion patterns, as well as the glucose uptake and biomass production patterns (9).

## FUTURE DIRECTIONS

Expanding current models.Annotation updates to the *E. coli* genome (55), coupled with previously published data and organism-specific databases, including EcoCyc (27), have enabled expansion of Edwards and Palsson's metabolic model. So far, an additional 242 genes have been added to Edwards and Palsson's model, and the metabolic network now contains 929 different metabolic reactions (J. L. Reed and B. O. Palsson, unpublished data). Genome-scale maps of *E. coli* metabolism have been constructed and are available (http://gcrg.ucsd.edu/organisms/ecoli/html
). Work is now focused on filling in the gaps in the metabolic network and validating the model. This model should continue to grow as more metabolic genes are identified and characterized.

Recent work has also been done to include the regulatory constraint-based model of *E. coli* (9). The expansion of this regulatory model to include the effects of regulation on larger genome-scale models of metabolism is one of the next foreseeable steps in building more accurate constraint-based models of *E. coli*. Tools that should aid in this effort include RegulonDB, a database containing information about gene regulation in *E. coli* that is publicly available (47), and methods that extract gene regulatory networks from transcriptomic data (24).

Finding objective functions.There are many issues that remain involving the selection of an objective function. Biomass compositions have often been used to compute the basis for optimal growth objective function. The effects of growth rate-dependent changes in biomass composition have already been accounted for (43). In addition to these advances, work is being done to back-calculate objective functions based on measured flux data, so that utilization of a calculated objective function yields a solution that minimizes the error between predicted and experimentally measured fluxes (6a).

Alternative solutions.A single linear optimization identifies one solution in the solution space; however, alternate optimal solutions can exist in the allowable solution space. These equivalent solutions can be calculated by using a variety of techniques, such as mixed-integer linear programming (30, 41), extreme pathway analysis (39), and elementary mode analysis (52); which optimal solution is actually used by the cell is still not known.

The in silico finding that the same phenotype can be attained in more than one way with the same underlying network gives rise to the possibility that it may be difficult to determine the true state of a cell. Preliminary experimental data support this expectation since strains that evolve to have the same growth phenotype (26) are not identical (Fong and Palsson, unpublished). Further, evolution of phosphotransferase system knockouts in *E. coli* also support this expectation (22).

Gene knockouts.The in silico representation of biological associations among genes, proteins, and reactions is important when the effects of gene deletions are modeled. Enzyme subunits and enzyme complexes need to be taken into account when associations among genes, enzymes, and reactions are made (Fig. 3). Deleting a gene in constraint-based models results in removing the reactions associated with the protein from the network, unless other isozymes are present. Removal of reactions changes the solution space, and some wild-type solutions might be eliminated (Fig. 2). Knocking out essential genes from the model produces no solutions which allow for cellular growth under the governing constraints.

In previous studies researchers have focused on determining if genes or reactions are essential using FBA (16, 48), elementary mode analysis (56), and extreme pathway analysis (51). Methods for predicting suboptimal solutions in gene knockout studies are being developed because an optimal solution might not be biologically achievable due to regulatory or kinetic effects. One such method, diagrammed in Fig. 2, has already been developed and finds the solution in the knockout solution space by minimizing the difference in fluxes between the wild-type optimal solution and a solution residing in the knockout solution space (54).

Application of additional constraints.Significant progress has been made in the last 13 years towards building constraint-based models of *E. coli*; they are now genome-scale models. Enhancing the predictive capabilities of these models in the future should be accomplished by broadening the scope of the models (including other cellular processes), as well as exploring the use of additional constraints. The utilization of other physicochemical constraints, such as the conservation of energy, kinetic constraints, osmotic balances, or electroneutrality, should further reduce the allowable solution space, resulting in more accurate predictions. A framework for implementing energy balance constraints (3, 44) has already been developed and applied to the central metabolic network of *E. coli* (3).

Integrated models.Other cellular processes can be described in a constraint-based modeling framework based on the genome sequence, such as transcription (1), translation (1), and DNA replication. These processes place direct metabolite and energy demands (i.e., through the objective function) on the metabolic network. These processes are coupled, and metabolism affects the rates of transcription, translation, and replication; these processes, in return, direct metabolism (Fig. 4). The development of constraint-based models that include metabolism, regulation, and protein synthesis should allow simultaneous reconciliation of diverse “-omics” data (such as proteomic, metabolomic, transcriptomic, and phenomic data) and back-calculation of biological parameters (such as promoter strengths).

## CONCLUSION

A shift in biology from a component-based perspective to a systems view of the cell is occurring as a result of high-throughput data generation. This shift in thinking requires construction of in silico models in order to understand systemic behavior of complete biological processes. Genome-scale models need to be built to evaluate and study the intrinsic biological properties that emerge from the system as a whole. The challenge is to develop methodologies to construct and study such models. Constraint-based models are the first step towards achieving this goal. Within this approach, models can be easily scaled up, and “-omics” data can be integrated. *E. coli* has proven to be a useful biological model organism, and constraint-based models have been developed for this organism over the last 13 years. The current *E. coli* models already include almost 1,000 genes, and 2,000-gene models are within reach (Fig. 1). An integrated 2,000-gene model (Fig. 4), accounting for the core functions of metabolism, transcription, translation, and regulation, can serve as a foundation on which other biological processes, such as growth and motility, can be simulated. In the foreseeable future, such constraint-based in silico models are expected to be commonly used in microbiology for hypothesis building and testing, driving both in silico and in vivo experimentation.

## ACKNOWLEDGMENTS

We acknowledge the support of grants from NSF (grant BES 01-20363) and NIH (grant GM 57089).

- Copyright © 2003 American Society for Microbiology