Dr. Benedetta Bolognesi, co-corresponding author of the study and head of the Protein Phase Transitions in Health and Disease group at IBEC. Photo / IBEC
 02.05.2025

IBEC takes part in the development of CANYA, the Catalan AI that deciphers the secret language of “sticky” proteins

Researchers at the Institute for Bioengineering of Catalonia (IBEC), located in the Barcelona Science Park, in collaboration with the Centre for Genomic Regulation (CRG), have developed CANYA, an artificial intelligence tool that marks a major step towards decoding the language proteins use to determine whether they form sticky aggregates. Such aggregates are associated with Alzheimer’s and more than fifty other human diseases. The study, published in Science Advances, was made possible by the largest dataset on protein aggregation compiled to date. The work provides new insights into the molecular mechanisms behind aggregation, a process linked to diseases affecting 500 million people worldwide.

Protein clumping, or amyloid aggregation, is a health hazard that disrupts normal cell function. When certain patches in proteins stick to each other, the proteins grow into dense fibrous masses that have pathological consequences. While the study has some implications for accelerating research into neurodegenerative diseases, its more immediate impact will be in biotechnology. Many drugs are proteins, and they are often hampered by unwanted clumping.

“Protein aggregation is a major headache for pharmaceutical companies,” says Dr. Benedetta Bolognesi, co-corresponding author of the study and leader of the Protein Phase Transitions in Health and Disease group at IBEC.

“If a therapeutic protein starts aggregating, manufacturing batches can fail, costing time and money. CANYA can help guide efforts to engineer antibodies and enzymes that are less likely to stick together and reduce expensive setbacks in the process,” she adds.

Protein clumps are formed using a poorly understood language. Proteins are made of twenty different types of amino acids. Instead of the usual A, C, G, T letters that make up the language of DNA, a protein’s language has twenty different letters, different combinations of which form “words” or “motifs”.

Researchers have long sought to decipher which motifs cause clumping and which allow proteins to fold without error. Artificial intelligence tools that treat amino acids like the alphabet of a mysterious language could help identify the precise words or motifs responsible, but the data on protein aggregation needed to train such models have historically been scarce or restricted to very small protein fragments.

The study addressed this challenge by carrying out large-scale experiments. The authors created over 100,000 completely random protein fragments, each 20 amino acids long, from scratch. The ability of each synthetic fragment to clump was tested in living yeast cells. If a particular fragment triggered clump formation, the yeast cells would grow in a specific, measurable way, allowing the researchers to link each sequence to its effect.
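To give a feel for the scale of such a library, here is a minimal in-silico sketch. It is not the authors' protocol: the study built its fragments experimentally, whereas this example simply assumes every position is drawn uniformly from the 20 standard amino acids. The names `random_fragment` and `random_library` are hypothetical helpers introduced for illustration.

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FRAGMENT_LENGTH = 20


def random_fragment(rng, length=FRAGMENT_LENGTH):
    """Draw one fragment, sampling each position uniformly (an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def random_library(size, seed=0):
    """Build a set of `size` distinct random fragments (hypothetical helper)."""
    rng = random.Random(seed)
    library = set()
    while len(library) < size:
        library.add(random_fragment(rng))
    return library


library = random_library(1000)

# Number of distinct 20-residue sequences over a 20-letter alphabet: 20**20.
sequence_space = len(AMINO_ACIDS) ** FRAGMENT_LENGTH
print(f"{len(library)} fragments sampled out of {sequence_space:.3e} possible")
```

Even a library of 100,000 fragments covers only a vanishing fraction of the 20^20 (roughly 10^26) possible sequences, which is why random sampling reaches sequences that are not found in nature.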

Around one in every five protein fragments (21,936/100,000) caused clumping, while the rest did not. While previous studies might have tracked a handful of sequences, the new dataset captures a much bigger catalogue of the different protein variants which can cause amyloid aggregation.

“We created truly random protein fragments including many versions not found in nature. Evolution has explored only a fraction of all possible protein sequences, while our approach helps us peer into a much bigger galaxy of possibilities, providing lots of data points to help understand more general laws of aggregation behaviour,” explains Dr. Mike Thompson, first author of the study and postdoctoral researcher at the Centre for Genomic Regulation (CRG).

The vast amount of data generated from the experiments was used to train CANYA. The researchers decided to create it using the principles of “explainable AI”, making its decision-making processes transparent and understandable to humans. This meant sacrificing a little bit of its predictive power, which is usually higher in “black-box” AIs. Despite this, CANYA proved to be around 15% more accurate than existing models.

Specifically, CANYA is a convolution-attention model, a hybrid that borrows from two distinct corners of AI. Convolution models, like those used in image recognition, scan a photo for features such as an ear or a nose to identify a face. In the same way, CANYA scans along the protein chain to find meaningful features such as motifs or “words”.
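The scanning idea can be sketched in a few lines of plain Python. This is a toy illustration, not CANYA's architecture: in a trained model the filters are learned from data, whereas here the filter is hand-built to match one made-up motif. The sequence, the motif "VIVI", and the helper names are all assumptions chosen for the example, not findings from the study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each residue as a 20-element indicator vector."""
    return [[1.0 if aa == letter else 0.0 for letter in AMINO_ACIDS]
            for aa in sequence]


def conv1d_scan(sequence, motif_filter):
    """Slide a (width x 20) filter along the sequence, without padding.

    Returns one activation per window start position.
    """
    encoded = one_hot(sequence)
    width = len(motif_filter)
    activations = []
    for start in range(len(sequence) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(AMINO_ACIDS)):
                score += (motif_filter[offset][channel]
                          * encoded[start + offset][channel])
        activations.append(score)
    return activations


def filter_from_motif(motif):
    """Toy filter that adds 1 for every position matching `motif`."""
    return one_hot(motif)


# Made-up 20-residue fragment and motif, for illustration only.
sequence = "MKTAYVIVIGSDEQLRNPWH"
motif = "VIVI"

activations = conv1d_scan(sequence, filter_from_motif(motif))
best = max(range(len(activations)), key=activations.__getitem__)
print(f"strongest match starts at position {best}")
```

The filter responds most strongly where the window lines up with the motif, which is the sense in which a convolution layer "finds" a word in the chain.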

Amyloid aggregation inside cells marked using fluorescence techniques. Credit: Benedetta Bolognesi (IBEC)

Attention AI models are used by language translation tools to identify key phrases in a sentence before deciding on the best translation. The researchers incorporated this technique to help CANYA figure out which motifs matter most in the grand scheme of the entire protein.
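As a rough illustration of how attention weighs positions against each other, the sketch below implements a simplified single-query, scaled dot-product attention pooling. It is an assumption-laden toy, not CANYA's actual attention layer: the feature vectors and the query are invented numbers, and `attention_pool` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Score each position against `query`, then take a weighted average.

    `features` holds one vector per position; `query` has the same dimension.
    """
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(dim)
              for f in features]
    weights = softmax(scores)
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return weights, pooled


# Invented per-position features, e.g. activations of two motif detectors.
features = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1]]
query = [1.0, 1.0]  # made-up query vector; a real model would learn this

weights, pooled = attention_pool(features, query)
print([round(w, 3) for w in weights])
```

The position with the strongest features receives the largest weight, so it dominates the pooled summary; inspecting the weights is one way such a model can indicate which parts of the sequence mattered most.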

Together, these two approaches help CANYA see local motifs up close while also spotting their bigger-picture importance. The researchers could use this information to not just predict which motifs in the protein chain encourage clumping, block it, or something in between, but also understand why.

“There are 1024 quintillion ways of creating a protein fragment that is 20 amino acids long. So far, we’ve trained an AI with just 100,000 fragments. We want to improve it by making more and bigger fragments. This is just the first step, but our work shows it is possible to decipher the language of protein aggregation. This is incredibly important for our understanding of human disease but also to guide synthetic biology efforts,” concludes Dr. Bolognesi.

“This project is a great example of how combining large-scale data generation with AI can accelerate research. It’s also a very cost-effective method to generate data,” says ICREA Research Professor Ben Lehner, co-corresponding author and Group Leader at the Centre for Genomic Regulation (CRG) and the Wellcome Sanger Institute.

» Article of reference: Mike Thompson, Mariano Martín, Trinidad Sanmartín Olmo, Chandana Rajesh, Peter K. Koo, Benedetta Bolognesi, Ben Lehner. Massive experimental quantification allows interpretable deep learning of protein aggregation. Science Advances (2025). doi: 10.1126/sciadv.adt5111

» Link to the news: IBEC website