Journal of Bacteriology, January 2004, p. 267-269, Vol. 186, No. 2
0021-9193/04/$08.00+0 DOI: 10.1128/JB.186.2.267-269.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
Digging with Experimental Pick and Computational Shovel: a New Addition to the Histidine Kinase Superfamily
Igor B. Zhulin*
School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30332-0230

INTRODUCTION
Experimental microbiology in the postgenomic era has survived
the first wave of uncertainty: the prospect of being replaced
by in silico microbiology. Although it is clear now that experimental
science is irreplaceable, we are facing yet another wave: the
frustration of having a half million microbial genes in databases
but lacking simple ways of getting biologically relevant information.
Therefore, it is reassuring to see some clear water between
the waves: examples of experimental research that takes advantage
of a comparative genomic approach one step at a time. The paper
by Karniol and Vierstra in this issue (10) describing a new
family of histidine kinases shows how the use of simple bioinformatics
tools by microbiologists can (i) identify new targets for experimental
work and (ii) provide important feedback for improving these
tools.

HISTIDINE KINASES: MAJOR ENVIRONMENTAL SENSORS IN BACTERIA
Environmental sensing by histidine kinases is a fundamental
property of a microbial cell. Extensive research on this topic
during the last fifteen years has been recently summarized in
two books (7, 8), a major review (15), and dozens of more-specialized
reviews. Histidine kinases appear to be the major class of environmental
sensors in bacteria. A histidine kinase is a perfect sensor
because of its modular architecture. The sensing capabilities
lie within an input module (14), which contains one or more
sensory domains that detect various physicochemical parameters.
The input module communicates the information to a transmitter
module within the same protein (14), which in turn sends the
signal in the form of a phosphoryl group to another protein,
a cognate response regulator. The activated response regulator
triggers a cellular response, usually on the level of transcription.
Figure 1 shows a domain representation of FixL from
Bradyrhizobium japonicum (5), a typical histidine kinase where the input module
contains two sensory PAS domains and the transmitter module
contains dimerization (HisKA) and ATP-binding (HATPase_c) domains.
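The modular architecture described above can be written down as a small data structure. The following is only a minimal sketch: the domain names are the SMART/Pfam labels used in the text, but the `Protein` class, the helper functions, and the rule that a protein "looks like" a histidine kinase when both transmitter domains are present are illustrative assumptions, not a description of how any annotation tool works.

```python
from dataclasses import dataclass

# Domain labels taken from the text; the grouping into two sets is an
# illustrative simplification (many other sensory domains exist).
SENSORY_DOMAINS = {"PAS"}
TRANSMITTER_DOMAINS = {"HisKA", "HATPase_c"}


@dataclass(frozen=True)
class Protein:
    name: str
    domains: tuple  # domain names in N- to C-terminal order


def input_module(protein):
    """Return the sensory domains, in order of appearance."""
    return [d for d in protein.domains if d in SENSORY_DOMAINS]


def transmitter_module(protein):
    """Return the transmitter domains, in order of appearance."""
    return [d for d in protein.domains if d in TRANSMITTER_DOMAINS]


def looks_like_histidine_kinase(protein):
    """Hypothetical rule: both transmitter domains must be present."""
    return TRANSMITTER_DOMAINS.issubset(protein.domains)


# FixL as described in the text: two PAS domains, then HisKA and HATPase_c.
fixl = Protein("FixL", ("PAS", "PAS", "HisKA", "HATPase_c"))
```

Under this toy rule, a protein in which neither transmitter domain is detected would not be called a histidine kinase, whatever its true function.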

HISTIDINE KINASES: EASY TARGETS IN MICROBIAL GENOME ANNOTATION
Histidine kinases are easy targets for genome annotators because
of the significant sequence conservation within the dimerization
and ATP-binding domains. Profile hidden Markov models (HMMs)
were designed for these domains, enabling their rapid detection
and visualization in protein sequences in two primary domain
databases, Pfam (2) and SMART (11). Recent implementation of
Pfam and SMART domains into the conserved domain database (12)
and InterPro (13) tools results in identification of histidine
kinases in any routine similarity search of primary protein
sequence databases, nr-NCBI and SWISS-PROT. Using SMART and
Pfam domain models and conventional BLAST (1) searches, hundreds
of histidine kinases have been identified in completely sequenced
prokaryotic genomes (6). I (as well as many other experimental
and computational scientists) was under the impression that if you
have a protein sequence, you will know in a few seconds whether
or not it is a histidine kinase.

THE HWE HISTIDINE KINASE FAMILY: A DIFFICULT CASE
The paper by Karniol and Vierstra (10) describes a new family
of histidine kinases exemplified by a well-studied protein,
the BphP2 light sensor histidine kinase from
Agrobacterium tumefaciens (9). The family is named HWE after uniquely conserved histidine,
tryptophan, and glutamate residues. The family consists of dozens
of homologs, and many of them have no obvious characteristics
of histidine kinases. Figure 1 shows that scanning the protein
SMa2063 from Sinorhizobium meliloti, a member of the newly identified
HWE family (10), against the SMART database results in a prediction
of "protein of unknown function" because no known domain can
be detected. SMART is a professionally curated domain database
specialized in signal transduction; therefore, its opinion regarding
this protein can be viewed as an expert one. For example, SMART
easily detects all domains, not only those that are well conserved,
such as HATPase_c, but also those that are poorly conserved,
such as HisKA and PAS, in the experimentally characterized histidine
kinase FixL (Fig. 1). Searching the Pfam database with the SMa2063
sequence gives ambiguous results: the HATPase_c domain
is detected at the borderline of statistical significance and
overlaps with another, unrelated domain predicted with a
similarly unreliable statistical score. Both SMART and Pfam utilize
the HMMer program for domain detection (3). Changing the program
to reverse-position-specific BLAST (1), a domain search tool
implemented in the conserved domain database (12), does not
improve the situation: no conserved SMART or Pfam domains can
be detected in SMa2063. Thus, it comes as no surprise that although
37 histidine kinases were annotated in the completely sequenced
genome of S. meliloti (4), the SMa2063 protein was not among
them; it received the familiar label of "hypothetical protein."
Nevertheless, Karniol and Vierstra (10) were able to convincingly predict that SMa2063 is a histidine kinase. This was done by detecting the SMa2063 protein in BLASTP searches (1) initiated with the ATP-binding domain of the BphP2 light sensor histidine kinase from A. tumefaciens (9), followed by a thorough analysis of its alignment with homologous domains. Special attention was paid to conserved motifs that have a functional role in histidine kinase activity. The reader is referred to the paper for details of this analysis and for experimental results demonstrating that SMa2063 and other proteins similar to BphP2 are indeed histidine kinases.
The importance of the results obtained by Karniol and Vierstra for microbial signal transduction is obvious. The discovery of a new family of histidine kinases, which includes many previously unrecognizable members, is a big step forward. These sensor molecules initiate important regulatory cascades, and their discovery will facilitate experimental research aimed at understanding cellular properties that are controlled by two-component systems in a given microbial species. New findings also prompt new questions for experimental research. How significant is the deviation of the kinase module in terms of structure and function? Do the cognate response regulators have any distinct features? The impact of this finding on bioinformatics is less obvious but just as important. It shows that current domain detection tools need significant improvement when it comes to signal transduction proteins. The failure of SMART and Pfam to recognize a version of a conserved domain, such as HATPase_c, clearly calls for adjustments to current HMMs. New (for the HWE family) and improved (for the entire superfamily) models will ensure better automated detection of histidine kinases in all available and newly sequenced genomes.
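The annotation failure described in this section can be illustrated with a toy model of how domain hits turn into a label. This is a sketch under stated assumptions, not the logic of SMART, Pfam, or any genome annotation pipeline: the `DomainHit` class, the E-value cutoff of 1e-3, the overlap rule, and all coordinates and scores below are hypothetical and are not real search results for FixL or SMa2063.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    domain: str
    start: int      # first residue covered by the hit
    end: int        # last residue covered by the hit
    evalue: float   # smaller means more significant


def overlaps(a, b):
    """True when two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def confident_hits(hits, max_evalue=1e-3):
    """Keep hits passing the cutoff; on overlap, the lower E-value wins."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            continue  # borderline or insignificant hit is discarded
        if any(overlaps(hit, k) for k in kept):
            continue  # a more significant hit already covers this region
        kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


def annotate(hits, max_evalue=1e-3):
    """Hypothetical rule: a confident HATPase_c hit gives the kinase label."""
    names = {h.domain for h in confident_hits(hits, max_evalue)}
    return "histidine kinase" if "HATPase_c" in names else "hypothetical protein"


# Hypothetical hits for a well-conserved kinase (values invented for illustration).
well_conserved = [
    DomainHit("PAS", 10, 110, 1e-8),
    DomainHit("HisKA", 150, 210, 1e-5),
    DomainHit("HATPase_c", 250, 360, 1e-30),
]

# Hypothetical hits for a divergent kinase: a borderline HATPase_c hit
# overlapping an unrelated domain with a similarly weak score.
divergent = [
    DomainHit("HATPase_c", 250, 340, 0.5),
    DomainHit("UnrelatedDomain", 240, 330, 0.8),
]
```

With these invented inputs, `annotate(well_conserved)` gives the kinase label, while `annotate(divergent)` falls through to "hypothetical protein" because neither weak hit survives the cutoff. Loosening the cutoff would rescue the divergent hit in this toy case, but a better-fitting model for the divergent family is the remedy the text argues for.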

HOW TO STORE KNOWLEDGE IN THE SILICON AGE
The main difference between biological research in the pre-
and postgenomic eras is in the numbers. Traditionally, an exciting
finding of a novel function for a given protein meant that it
was for this protein only. Nowadays, this finding can be extrapolated
to numerous homologs that have never been (and most of them
will never be) in the hands of experimentalists and exist mainly
in the virtual realm of databases. It is vitally important for
the future of biology to make sure that such extrapolation is
(i) applied and (ii) applied correctly. The second point is
a subject of serious discussions and debates; however, even
the first one is not very clear. Postgenomic biology is experiencing
the usual growing pains, where experimental and genomic information
exist in parallel, rarely interacting worlds (Fig. 2). The scientific
community has a traditional way of learning from reports published
in peer-reviewed journals. Searching genomic databases with
a BLAST program (1) became the second way of learning for many
biologists (the BLAST paper has been cited 10,000 times, and
many more publications refer to BLAST without citation). However,
the real problem is that much (if not most) biological information
in the databases is not only poorly peer reviewed but also annotated
by people who cannot possibly be experts in all areas of biological
research or are not biologists at all. Experimental scientists
have little control over this process. For example, how will
the important finding reported by Karniol and Vierstra (10)
make it into the primary databases? One way is that curators
at the National Center for Biotechnology Information, Swiss-Prot,
Pfam, and SMART will find the paper and take time to connect
each relevant record in the database to the publication (current
automated tools usually fail in doing so) or convert printed
alignments into models and make appropriate descriptions for
domain databases. What if they miss the paper or are not willing
to deal with the alignments from scratch (currently, most curators
will ask for a ready-to-go alignment file)? Well, then SMa2063
and dozens of other proteins in these databases will remain
hypothetical. There is a way out of this situation. For example,
authors who wish to publish a new protein family can be asked
to submit their alignments and descriptions that they feel are
appropriate to primary domain databases if the paper is accepted
for publication. This will ensure that the new finding, which
has been peer reviewed, finds its place in the database and
that its description will be provided by experimentalists themselves
and not by database curators.
Another attempt to bridge experimental and genomic information
is the development of specialized knowledge databases rather
than sequence databases. At this time, most such databases
are organism oriented. For example, EcoCyc (http://ecocyc.org/)
and CYORF (http://cyano.genome.ad.jp/) serve the
Escherichia coli and cyanobacterial research communities, respectively.
There is an urgent need, however, for the creation of databases
of functions that span many different organisms. Figure 2 illustrates
the idea of such an interactive knowledge environment for those
interested in signal transduction. Having such databases would
ensure that all relevant biological discoveries and their extrapolation
to the genomic data are stored, managed, and available to the
scientific community in a peer-reviewed, user-friendly form.
Wouldn't it be nice to do less BLASTing and more reading?

FOOTNOTES
* Mailing address: School of Biology, Georgia Institute of Technology, 310 Ferst Dr., Atlanta, GA 30332-0230. Phone: (404) 385-2224. Fax: (404) 894-0519. E-mail: igor.zhulin{at}biology.gatech.edu.

The views expressed in this Commentary do not necessarily reflect the views of the journal or of ASM.

REFERENCES
1 - Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
2 - Bateman, A., E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S. R. Eddy, S. Griffiths-Jones, K. L. Howe, M. Marshall, and E. L. Sonnhammer. 2002. The Pfam protein families database. Nucleic Acids Res. 30:276-280.
3 - Eddy, S. R. 1998. Profile hidden Markov models. Bioinformatics 14:755-763.
4 - Galibert, F., T. M. Finan, S. R. Long, A. Puhler, P. Abola, F. Ampe, F. Barloy-Hubler, M. J. Barnett, A. Becker, P. Boistard, G. Bothe, M. Boutry, L. Bowser, J. Buhrmester, E. Cadieu, D. Capela, P. Chain, A. Cowie, R. W. Davis, S. Dreano, N. A. Federspiel, R. F. Fisher, S. Gloux, T. Godrie, A. Goffeau, B. Golding, J. Gouzy, M. Gurjal, I. Hernandez-Lucas, A. Hong, L. Huizar, R. W. Hyman, T. Jones, D. Kahn, M. L. Kahn, S. Kalman, D. H. Keating, E. Kiss, C. Komp, V. Lelaure, D. Masuy, C. Palm, M. C. Peck, T. M. Pohl, D. Portetelle, B. Purnelle, U. Ramsperger, R. Surzycki, P. Thebault, M. Vandenbol, F. J. Vorholter, S. Weidner, D. H. Wells, K. Wong, K. C. Yeh, and J. Batut. 2001. The composite genome of the legume symbiont Sinorhizobium meliloti. Science 293:668-672.
5 - Gilles-Gonzalez, M. A. 2001. Oxygen signal transduction. IUBMB Life 51:165-173.
6 - Grebe, T. W., and J. B. Stock. 1999. The histidine protein kinase superfamily. Adv. Microb. Physiol. 41:139-227.
7 - Hoch, J. A., and T. J. Silhavy (ed.). 1995. Two-component signal transduction. ASM Press, Washington, D.C.
8 - Inouye, M., and R. Dutta (ed.). 2003. Histidine kinases in signal transduction. Academic Press, New York, N.Y.
9 - Karniol, B., and R. D. Vierstra. 2003. The pair of bacteriophytochromes from Agrobacterium tumefaciens are histidine kinases with opposing photobiological properties. Proc. Natl. Acad. Sci. USA 100:2807-2812.
10 - Karniol, B., and R. D. Vierstra. 2004. The HWE histidine kinases, a new family of two-component sensor kinases with potentially diverse roles in environmental signaling. J. Bacteriol. 186:445-452.
11 - Letunic, I., L. Goodstadt, N. J. Dickens, T. Doerks, J. Schultz, R. Mott, F. Ciccarelli, R. R. Copley, C. P. Ponting, and P. Bork. 2002. Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 30:242-244.
12 - Marchler-Bauer, A., J. B. Anderson, C. DeWeese-Scott, N. D. Fedorova, L. Y. Geer, S. He, D. I. Hurwitz, J. D. Jackson, A. R. Jacobs, C. J. Lanczycki, C. A. Liebert, C. Liu, T. Madej, G. H. Marchler, R. Mazumder, A. N. Nikolskaya, A. R. Panchenko, B. S. Rao, B. A. Shoemaker, V. Simonyan, J. S. Song, P. A. Thiessen, S. Vasudevan, Y. Wang, R. A. Yamashita, J. J. Yin, and S. H. Bryant. 2003. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31:383-387.
13 - Mulder, N. J., R. Apweiler, T. K. Attwood, A. Bairoch, D. Barrell, A. Bateman, D. Binns, M. Biswas, P. Bradley, P. Bork, P. Bucher, R. R. Copley, E. Courcelle, U. Das, R. Durbin, L. Falquet, W. Fleischmann, S. Griffiths-Jones, D. Haft, N. Harte, N. Hulo, D. Kahn, A. Kanapin, M. Krestyaninova, R. Lopez, I. Letunic, D. Lonsdale, V. Silventoinen, S. E. Orchard, M. Pagni, D. Peyruc, C. P. Ponting, J. D. Selengut, F. Servant, C. J. Sigrist, R. Vaughan, and E. M. Zdobnov. 2003. The InterPro database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31:315-318.
14 - Parkinson, J. S., and E. C. Kofoid. 1992. Communication modules in bacterial signaling proteins. Annu. Rev. Genet. 26:71-112.
15 - Stock, A. M., V. L. Robinson, and P. N. Goudreau. 2000. Two-component signal transduction. Annu. Rev. Biochem. 69:183-215.