Transcription factors (TFs) are important components of gene regulatory networks. They bind to short, degenerate DNA motifs and activate or inhibit their target genes by either recruiting the transcriptional machinery or blocking it. The general principles of transcriptional regulation were first deduced by Jacob and Monod from studies of the lac operon in E. coli. They showed that the lac operon has cis-acting regulatory elements in its promoter region, and that trans-acting proteins bind to these elements to either activate or repress transcription.
TFs make up a significant fraction of the genes in the human genome, about 6 to 10%. However, only a small percentage of the DNA motifs that these TFs bind have been determined. TFs also play an important role in disease. For instance, TFs are overrepresented among oncogenes, and a third of the genes linked to birth defects in OMIM encode transcription factors.
There are many different types of transcription factors that utilize a wide variety of protein folds to bind DNA and recognize specific sites. The two largest transcription factor families in most metazoan genomes are the C2H2 zinc finger (hereafter referred to as ZF) family and the homeodomain transcription factor family.
Both the homeodomain and ZF families use a relatively small set of key residues to bind DNA specifically, and both can recognize a wide variety of binding sites. It has been proposed that this ability to bind a wide array of DNA motifs is why these families have expanded so quickly. The reasoning is that if only a few mutations at key residues can produce a wide variety of binding specificities, then a TF family should be able to expand successfully through duplication.
The homeodomain and the ZF are both relatively small domains, approximately 60 and 30 amino acids long, respectively. Both families use an alpha helix, the recognition helix, to bind in the major groove. Homeodomain proteins also bind in the minor groove via an N-terminal arm. ZF proteins generally contain tandem repeats of ZF domains that bind to overlapping subsites. The homeodomain family is a subfamily of the helix-turn-helix (HTH) class of proteins, which are very abundant in prokaryotic genomes.
Insights into the mechanisms of sequence-specific DNA binding by homeodomains have been provided by the three-dimensional structures of individual protein-DNA complexes coupled with directed mutagenesis and biochemical analysis. The homeodomain consists of approximately 60 amino acids that fold into a stable 3-helix bundle preceded by a flexible N-terminal arm. Interactions with a 5 to 7 base pair DNA binding site are formed by positioning a single “recognition” helix in the major groove and the N-terminal arm in the minor groove. Despite a common DNA-binding architecture, there is significant variation in sequence composition within the homeodomain family. For example, the two superclasses of homeodomains, denoted as typical and atypical, share low sequence identity and recognize substantially different DNA sequences, yet their docking with the DNA is nearly identical. This conserved binding geometry allows differences in amino acid sequence and DNA-binding specificity among homeodomains to be interpreted within a common structural framework. Residues at positions 2, 3 and 5-8 on the N-terminal arm, as well as residues at positions 47, 50, 51, 54 and 55 on the recognition helix, can all contribute to DNA-binding specificity.
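Because the docking geometry is conserved, the positions listed above can be read off any homeodomain sequence that is numbered 1 to 60. The following Python sketch illustrates that bookkeeping only. The function names and the placeholder sequences are assumptions made for illustration, the sequences are not real proteins, and the code assumes an ungapped 60-residue domain.

```python
# Homeodomain positions (1-based) named in the text as potential
# specificity determinants.
N_TERMINAL_ARM = [2, 3, 5, 6, 7, 8]
RECOGNITION_HELIX = [47, 50, 51, 54, 55]
SPECIFICITY_POSITIONS = N_TERMINAL_ARM + RECOGNITION_HELIX


def specificity_residues(homeodomain):
    """Map each listed position to the residue found at that position.

    Assumes `homeodomain` is an ungapped sequence numbered from 1.
    """
    if len(homeodomain) < max(SPECIFICITY_POSITIONS):
        raise ValueError("sequence is too short to cover position 55")
    return {pos: homeodomain[pos - 1] for pos in SPECIFICITY_POSITIONS}


def differing_positions(domain_a, domain_b):
    """Return the listed positions at which two domains differ."""
    res_a = specificity_residues(domain_a)
    res_b = specificity_residues(domain_b)
    return {pos: (res_a[pos], res_b[pos])
            for pos in SPECIFICITY_POSITIONS
            if res_a[pos] != res_b[pos]}


# Placeholder sequences, NOT real proteins: the 20 amino acid letters
# repeated three times, and a copy with a lysine placed at position 50.
placeholder_a = "ACDEFGHIKLMNPQRSTVWY" * 3
placeholder_b = placeholder_a[:49] + "K" + placeholder_a[50:]

residues = specificity_residues(placeholder_a)
differences = differing_positions(placeholder_a, placeholder_b)
print(residues)
print(differences)
```

With real data, the input would first need to be aligned to the standard homeodomain numbering, which this sketch does not attempt.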
How specific sequence variations between homeodomains lead to different recognition preferences has been defined in several cases. Seminal experiments demonstrated that Lys50 promotes recognition of TAATCC by the Bicoid class of homeodomains instead of the TAAT(T/G)(A/G) recognized by the Gln50-containing Antp and En classes. Beachy and colleagues mapped differences in binding site position 2 specificity for the posterior HOX protein AbdB (TTATGG) and more anterior HOX family members (TAATGG) to amino acids at positions 3, 6 and 7 in the N-terminal arm. Interestingly, substitutions at amino acids that overlap with these positions (6-8) are sufficient to switch the specificity of an NK-2 type homeodomain (CAAGTG) to the specificity of an Antp-type homeodomain (TAAGTG) at the neighboring base, binding site position 1. This complexity is not limited to the N-terminal arm, as residues at different amino acid positions, such as 47 and 54, can potentially contact the same base pair. This diversity in potential recognition contacts has hindered efforts to globally reengineer homeodomain specificity. Consequently, a comprehensive description of the determinants of homeodomain DNA-binding specificity remains an important goal.
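The degenerate consensus sites quoted above can be written as regular expressions, with (T/G) and (A/G) becoming character classes. The Python sketch below scans a sequence for each site. The names `CONSENSUS`, `find_sites` and `classify_sites` and the test sequence are assumptions made for illustration. It performs exact consensus matching on the given strand only, which is a simplification and not a model of binding affinity.

```python
import re

# Consensus sites quoted in the text, written as regular expressions.
CONSENSUS = {
    "Bicoid class (Lys50)": "TAATCC",
    "Antp/En classes (Gln50)": "TAAT[TG][AG]",
    "AbdB": "TTATGG",
    "NK-2 type": "CAAGTG",
}


def find_sites(sequence, pattern):
    """Return 0-based start positions of every match on the given strand.

    The lookahead makes overlapping matches visible. The reverse
    complement strand is not searched.
    """
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]


def classify_sites(sequence):
    """Report where each consensus site occurs in `sequence`."""
    return {name: find_sites(sequence, pattern)
            for name, pattern in CONSENSUS.items()}


# Hypothetical test sequence built by joining the sites named in the text.
example = "GGTAATCCAATTAATGGCTTATGGACAAGTG"
hits = classify_sites(example)
for name, positions in hits.items():
    print(f"{name}: {positions}")
```

In this test sequence, the TAATGG site of the anterior HOX proteins is reported under the Antp/En pattern, since TAATGG is one instance of TAAT(T/G)(A/G).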