Guidelines for Designing Protein Transformers

Researchers have developed an artificial intelligence model named ProDomino. By "learning" from nature's evolutionary wisdom, it can accurately predict the optimal "modification sites" on protein molecules.

Proteins are the cornerstone of life, acting as tireless nano-sized molecular machines. Some are responsible for catalyzing biochemical reactions, others for transmitting signals, while some provide the framework for our bodies. However, these natural molecular machines do not always fully meet our needs. In the era of synthetic biology and precision medicine, we aspire to control these machines at will, allowing them to respond to specific signals and perform specific tasks at specific times and locations—as if equipped with a "switch."

Proteins in which a "sensing" module regulates the activity of a "functional" module are called allosteric protein switches. Designing these "smart" proteins, particularly deciding where to insert the "sensing" module into the "functional" module, has long been a significant challenge. It is like adding a new cog to a precision watch, where a slight misplacement can halt the entire mechanism. Traditional methods rely on extensive trial-and-error screening, which is time-consuming and labor-intensive and has a low success rate: the proverbial search for a needle in a haystack.

A study published in "Nature Methods" titled "Rational engineering of allosteric protein switches by in silico prediction of domain insertion sites" offers a way out of this bottleneck. The researchers developed an AI model named ProDomino, which learns from nature’s evolutionary wisdom to accurately predict the best "modification sites" on protein molecules, moving the design of smart proteins from a brute-force era of "finding a needle in a haystack" to an intelligent era of "following the map."

The "Mix & Match" World of Proteins and the Engineer’s Dilemma

To comprehend the ingenuity of this research, let's first examine the construction philosophy of proteins. Many proteins aren’t a complete whole but are constructed from multiple relatively independent structural and functional units—domains—pieced together like LEGO bricks. Over hundreds of millions of years of evolution, nature, the greatest engineer, excels at "mixing and matching" these domains to create functionally diverse new proteins.

One powerful "mix & match" strategy is domain insertion. This isn’t just about linking two LEGO bricks end-to-end, but embedding one domain (insert domain) completely within another (parent domain). This close coupling enables a profound structural and functional interdependence between the two domains. When the conformation of the insert domain changes (e.g., due to binding to a small molecule or sensing light), this change can be mechanically transmitted to the parent domain, like a domino effect, switching its function on or off.
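At the sequence level, a domain insertion amounts to splicing one amino acid sequence into another. The following Python sketch illustrates this with toy sequences; the function names, the optional linker, and the example strings are hypothetical and do not come from the study.

```python
def insert_domain(parent: str, insert: str, after_residue: int, linker: str = "") -> str:
    """Return the fusion sequence with `insert` placed after a 1-based parent residue.

    The optional linker is added on both sides of the insert (an illustrative choice).
    """
    if not 1 <= after_residue < len(parent):
        raise ValueError("insertion must fall strictly inside the parent sequence")
    head, tail = parent[:after_residue], parent[after_residue:]
    return head + linker + insert + linker + tail


def all_insertion_variants(parent: str, insert: str) -> dict:
    """Build one fusion sequence per internal insertion position."""
    return {pos: insert_domain(parent, insert, pos) for pos in range(1, len(parent))}


# Toy sequences for illustration only, not real protein domains.
parent = "MKTAYIAK"
insert = "GSGS"

fusion = insert_domain(parent, insert, after_residue=3)
print(fusion)  # MKTGSGSAYIAK

variants = all_insertion_variants(parent, insert)
print(len(variants))  # 7 internal positions for an 8-residue parent
```

A parent of N residues has N - 1 internal positions, which is why testing every candidate site experimentally becomes costly for large proteins.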

This is the golden strategy for designing allosteric protein switches. Theoretically, by finding a suitable sensing domain (e.g., a light-sensitive domain) and a functional domain we wish to control (e.g., an enzyme or gene editor), and inserting the former into a key location within the latter, a light-activated or drug-controlled molecular switch can be created.

However, while the ideal is enticing, reality is challenging. Where exactly is this "key location"? Protein chains range from dozens to thousands of amino acids, so a large protein can present thousands of possible insertion sites. Most insertions would directly disrupt the folding and function of the parent protein, leading to its complete inactivation. Only a few "fortunate" sites can accommodate a foreign domain and effectively transmit conformational changes, achieving allosteric regulation. Finding these rare "allosteric hotspots" is the core challenge facing protein engineers. Previous studies have shown that even surface-exposed, seemingly flexible loop regions can accommodate insertions at only a few sites. Traditional bioinformatics methods lack a comprehensive picture of protein conformational dynamics and therefore predict poorly, so cumbersome laboratory screening has remained necessary.

Fig. 1. Intradomain insertions are common in natural proteins.

From "Evolutionary Loophole" to "Design Bible": The Birth of ProDomino

Faced with this dilemma, researchers shifted their approach: why not learn from nature, given the difficulty of artificial design?

They hypothesized that naturally occurring domain insertion events, although they may look like "bugs" in evolution, serve as ideal "engineering manuals" that show which proteins and which positions can successfully accommodate an inserted domain.

Thus, a grand plan emerged: create a large-scale database of natural domain insertion events and train a machine learning model on this data. Starting from extensive databases like InterPro and CATH, they screened 174,872 unique cases of natural domain insertion across the tree of life, covering 202 insert domain superfamilies and 168 different parent domain types.

Analyzing this dataset yielded fascinating insights. For example, the most "promiscuous" parent domain is the P-loop NTPase, capable of pairing with 13 different insert domains, while the most versatile insert domain is the PDZ domain, insertable into 11 different parents. More intriguingly, these natural insertion events appeared ubiquitous rather than restricted to specific protein types, indicating that underlying physical and chemical rules are universal.

With this "design Bible," the next step was to train AI. Researchers employed ESM-2, one of the most advanced protein language models. ESM-2 can transform a protein's amino acid sequence into a mathematically rich embedding containing structural and functional information, providing far more information than traditional one-hot encoding.
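To see why a learned embedding can carry more information than one-hot encoding, it helps to look at what one-hot encoding actually produces. The sketch below is a plain-Python illustration with a toy sequence; it does not call ESM-2, and the helper name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence: str) -> list:
    """Encode each residue as a 20-dimensional vector with a single 1."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows


# Toy sequence for illustration only.
encoded = one_hot("MKTAYIAK")

# The two alanines (4th and 7th residues) receive identical vectors,
# because one-hot encoding ignores the surrounding sequence.
print(encoded[3] == encoded[6])  # True
```

Because a one-hot vector depends only on the residue's identity, identical residues look the same wherever they occur. A language model embedding, in contrast, is computed from the whole sequence, so the same residue type can receive different vectors at different positions.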

These high-quality embedding vectors were fed into a relatively simple neural network tasked with predicting whether each amino acid position in a protein sequence is an "insertion-tolerant" site. The researchers ingeniously employed a positional masking strategy during training, focusing the model's learning on distinguishing nuanced differences between "good" and "bad" sites.
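The article does not spell out how the positional masking works. One common way to restrict training to selected positions, shown here purely as an assumption and not as the authors' method, is to compute a per-residue binary cross-entropy loss only where a mask is set. All names and numbers below are illustrative.

```python
import math


def masked_bce(probs, labels, mask):
    """Mean binary cross-entropy over the positions where mask is 1."""
    eps = 1e-12  # guards against log(0)
    total, count = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue  # masked-out positions contribute nothing to the loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    if count == 0:
        raise ValueError("mask excludes every position")
    return total / count


# Toy per-residue predictions: probability that each site tolerates insertion.
probs = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 0, 0]   # hypothetical labels: 1 = known insertion site
mask = [1, 1, 0, 1]     # hypothetical mask: the third position is ignored

loss = masked_bce(probs, labels, mask)
print(round(loss, 4))
```

With such a mask, changing the prediction at an ignored position leaves the loss unchanged, so the model is only pushed to separate the positions the mask selects.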

Through this meticulous design, a powerful predictor for protein domain insertion sites—ProDomino (Protein Domain Insertion Optimizer)—was born.

Pilot Test: Can AI Locate Treasure on a Known Map?

With a new model in hand, the researchers first tested it on a well-studied protein—bacterial transcription factor AraC—making it ProDomino's first examination. Previous studies mapped nearly all AraC insertion sites through exhaustive experimental screening.

Could ProDomino find the "treasure" on this map without seeing it?

The results were encouraging. ProDomino's score curve along the AraC sequence showed several distinct peaks, and the highest peaks corresponded precisely to two sites experimentally shown to give strong allosteric regulation: I113 and S170. Measured by the stricter AUROC metric, the model scored 0.84, indicating that ProDomino can not only identify "insertable" sites but also has the potential to locate the "golden" sites for functional switches. This pilot test effectively demonstrated ProDomino's predictive capability.
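AUROC (area under the receiver operating characteristic curve) can be read as the probability that a randomly chosen true site receives a higher score than a randomly chosen non-site. The short Python sketch below computes it that way for toy data; the scores and labels are invented for illustration and are not the AraC measurements.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy per-site scores and experimental labels (1 = functional site).
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 0, 0, 1, 0, 0]

print(auroc(scores, labels))  # 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, so the values of 0.84 and 0.71 quoted in this article indicate rankings well above chance.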

Fig. 2. ProDomino informs the engineering of light-controlled antibiotic resistances.

Light and Shadow Magic: Using AI to Illuminate Antibiotic "Switches" for Precise Spatiotemporal Control

Finding treasure on a known map is gratifying, but exploring unknown worlds truly tests ProDomino. Researchers applied ProDomino to novel proteins, designing light-controlled "molecular switches."

They selected two common antibiotic resistance enzymes: Puromycin N-acetyltransferase (PAC) and Chloramphenicol Acetyltransferase (CAT). These enzymes enable cells to resist the toxicity of puromycin and chloramphenicol, respectively. The goal was to achieve light control by inserting a light-sensitive domain, AsLOV2, which undergoes conformational changes under blue light, rendering these resistance enzymes "light-controlled"—active in darkness and inactive under blue light.

ProDomino swiftly provided predictions: near E83 in PAC and at K136 in CAT, both located on α-β connecting loops on the surface of the proteins.

Following the AI design blueprint, researchers inserted the AsLOV2 domain at the predicted sites. The experimental results were perfect:

In human cells expressing light-controlled PAC, the cells resisted puromycin under darkness but died within 48 hours upon blue light exposure as PAC inactivated. E. coli expressing light-controlled CAT grew normally in darkness but halted growth under blue light, showing a nearly 20-fold OD value difference after 7 hours.

A spatial control experiment showed how precise these AI-designed molecular switches can be. Researchers spread bacteria expressing light-controlled CAT evenly on a medium and illuminated it from below through a patterned light mask. Bacteria grew only in the dark regions shielded by the mask, faithfully reproducing the pattern and demonstrating control in both space and time.

For a comprehensive evaluation of ProDomino's reliability, researchers tested several high-scoring and low-scoring predicted sites and reported a predictive success rate of 78%. High-scoring sites mostly yielded active fusion proteins, while low-scoring sites yielded inactive ones, supporting ProDomino's score as a reliable guide for experiments.
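A success rate of this kind can be understood as the fraction of tested sites whose experimental outcome matches the prediction. The sketch below shows that calculation on invented toy values; the 0.5 threshold, the function name, and the data are assumptions for illustration and do not reproduce the study's evaluation.

```python
def success_rate(predicted_scores, observed_active, threshold=0.5):
    """Fraction of sites where 'score >= threshold' agrees with observed activity."""
    if not predicted_scores:
        raise ValueError("no sites to evaluate")
    hits = sum(
        (score >= threshold) == bool(active)
        for score, active in zip(predicted_scores, observed_active)
    )
    return hits / len(predicted_scores)


# Toy data: three high-scoring and two low-scoring sites.
scores = [0.92, 0.81, 0.75, 0.12, 0.08]
active = [True, True, False, False, False]  # hypothetical experimental outcome

print(success_rate(scores, active))  # 0.8
```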

The Ultimate Testing Ground: Taming the "Gene Editing Scissors" CRISPR, from "Light-Control" to "Drug-Control"

Designing switches for single-domain enzymes is challenging, but modifying complex multi-domain molecular machines like CRISPR-Cas is an even greater challenge. If a switch could be created for CRISPR, the star tool in gene editing, to achieve precise control, it would enhance safety in gene therapy significantly.

Researchers first took on the best-known system, SpCas9. Previous research had mapped Cas9 insertion sites through extensive transposon screening experiments. ProDomino's predictions were consistent with this map (AUROC of 0.71) but also suggested several potentially high-scoring insertion sites that had not been discovered experimentally.

Researchers selected four of these newly predicted sites, inserted the AsLOV2 light-sensitive domain into each, and fused the construct to a transcription activation domain, VPR, to build a light-controlled gene activation tool (dCas9-VPR-LOV2). The experimental results again validated the AI's foresight. All four novel light-controlled Cas9 variants showed excellent insertion tolerance, and three exhibited strong light responsivity: they effectively activated downstream gene expression in darkness, with activity reduced 8- to 14-fold under light exposure.
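The switch responses in this article are quoted either as fold changes or as a remaining fraction of activity. The two readings are reciprocal, as the small Python sketch below shows; the function names and the activity values are illustrative assumptions in arbitrary units, not measurements from the study.

```python
def fold_reduction(dark_activity: float, light_activity: float) -> float:
    """How many times lower the activity is in light than in darkness."""
    if light_activity <= 0:
        raise ValueError("light activity must be positive to form a ratio")
    return dark_activity / light_activity


def residual_fraction(fold: float) -> float:
    """Fraction of the original activity that remains after an n-fold reduction."""
    return 1.0 / fold


# Toy reporter readings in arbitrary units.
print(fold_reduction(120.0, 10.0))      # 12.0
print(round(residual_fraction(3.0), 3))  # 0.333
```

Read this way, a reduction "to one-third" is a 3-fold reduction, while an 8-fold reduction leaves 12.5% of the original activity.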

Modifying Cas9 was a notable success, but it was the Cas12a system that truly demonstrated ProDomino's potential to tackle "from scratch" problems. Cas12a is another critical gene-editing tool, and its single-chain structure makes switch design even more challenging; few prior attempts had succeeded.

ProDomino's prediction map for MbCas12a, a close homolog of Cas12a, showed multiple peaks, suggesting enhanced insertion tolerance. Insertions at both high-scoring and low-scoring sites were then tested, and the outcomes confirmed the reliability of ProDomino's predictions.

They advanced to final design rounds, aiming to achieve not only "light control" but also "drug control."

Light-controlled Cas12a: They chose the most active site, N1153, inserting AsLOV2. The resulting Cas12a-LOV2 hybrid protein showed significant light dependence, with gene-editing activity reduced to one-third in light conditions, achieving light-controlled gene editing.

Drug-controlled Cas12a: This was the study's highlight. Researchers replaced the light-sensitive AsLOV2 domain with a ligand-binding domain from human glucocorticoid receptor 2 (GR2), which remains loose without its ligand, cortisol, but becomes compact upon ligand binding. Researchers hoped this "drug-induced folding" property could achieve activation control over Cas12a. GR2 was inserted at sites K487 and N1153.

The experimental results were remarkable. In particular, the Cas12a-GR2 variant with the insertion at N1153 showed almost perfect switch characteristics. Without cortisol, gene-editing activity was almost negligible and close to the detection limit, while adding cortisol boosted activity dramatically, reaching 50-70% of the efficiency of wild-type Cas12a across multiple endogenous gene targets. The result is an efficient, tightly controlled "safety lock" version of the gene-editing scissors that can be operated with a clinical drug.

Fig. 3. ProDomino confidently predicts potent opto- and chemogenetic Cas9 and Cas12a variants.

From "Finding a Needle in a Haystack" to "Following the Map": Redefining Protein Engineering's Game Rules

ProDomino's success marks a paradigm shift in protein engineering. It moves the design of protein switches away from reliance on intuition and extensive screening, akin to "crafting in workshops," toward data-driven, model-guided "smart manufacturing."

This research's value extends beyond creating novel molecular tools.

Speed and Efficiency: The researchers noted that the entire experimental process, from cloning all candidate proteins to performing the various tests, took only about six months, a significant leap compared with the years such projects previously needed to succeed.

Universality and Accessibility: ProDomino's success is not limited to one type of protein. It worked across diverse proteins (transcription factors, enzymes, gene editors), demonstrating its wide application potential. It can expedite the design of customized biosensors, controllable cell therapies, and cutting-edge research tools with improved efficiency and predictability.

Inspirational Thinking: ProDomino at its core learns from nature, proving that evolutionary information hidden in vast biological data is a goldmine for solving complex biological design problems. Artificial intelligence is the powerful tool for mining this gold.

Looking ahead, integrating ProDomino with structure-based prediction tools or with methods for designing new switch domains from scratch should further unleash its potential. Protein engineers will no longer be explorers groping in the dark but navigators heading confidently toward their next design goal with precise AI-drawn maps in hand.

The language of life's design is profound and complex. However, with AI as a translator skilled in evolutionary grammar, we are deciphering this cryptic text at unprecedented speed and beginning to write new chapters of our own.

Reference

Wolf, B., Shehu, P., Brenker, L., Von Bachmann, A., Kroell, A., Southern, N., Holderbach, S., Eigenmann, J., Aschenbrenner, S., Mathony, J., & Niopek, D. (2025). Rational engineering of allosteric protein switches by in silico prediction of domain insertion sites. Nature Methods, 22(8), 1698-1706. https://doi.org/10.1038/s41592-025-02741-z