What to look out for with an AlphaFold2 model

There is nothing more disheartening than telling someone "Sorry, I cannot help you with your protein, because no homologue structures of your protein are solved and any model will be rubbish". Now, with AlphaFold2 proteome release this is no longer the case. Or mostly: in fact there are several pitfalls and issues that need to be looked at, because the algorithm does not account for three things: binding partners and ligands, oligomerisation and alternate conformations.

Gamechanger

Before starting, I ought to address whether AlphaFold2 is actually a gamechanger.

AlphaFold2 stunned the world with its accuracy in the CASP competition and now DeepMind have teamed up with EMBL EBI to provide structures for the whole proteome of key model organisms. So what does this mean for science? And am I out of a job now?

Prior to this, the PDB contains many protein, in different conformation and bound states, while Swiss-Model provides threaded models for parts of any human protein that are homologous to a solved structure. The former is actually rather tricky to navigate as there may be a given human protein domain in complex with two bovine protein, all expressed in E. coli, in two different states. Even if I am used to the PDB, I often look at the name route in Michelanglo because it tells me what is bound and what are the ranges. Swiss-Model has a very clean interface nowadays and does have many models, but it is not as a well-known resource as of the PDB or as of this week the EBI's AlphaFold2 repository. Therefore, it is clear why many are excited.

Model making: threading

Whereas the structures on AlphaFold2 are better than models from other online servers or databases, but there are some caveats.

There are several ways to make a model. A model can be a one-to-one threaded model, an ab initio model or a hybridisation of threaded/ab initio models.

In threading, a structure is taken and the target sequence is mapped on based on a pairwise alignment. ModBase (database), Swiss-Model (database), Phyre2 one-to-one threading (on request) and Rosetta's ThreadingMover (local calculations) are example of threading applications. This performs poorly when there sequences are divergent, but when they are close it has several benefits.

First, if one has a protein in two conformations both conformations can be threaded separately. For example, with a carrier protein one generally has an open and a close face on either side of the membrane, if there are two structures of close homologues (>30–50 sequence identity) in these two conformations, one can thread twice to get both states and better understand how the protein operates. Unfortunately, AlphaFold2 does not take into account that proteins can have alternate conformations, so if one state has a stronger evolutionary signal the Evoformer will disregard the evolutionary contacts that favour the other states.

Second, one can migrate the bound ligands and/or protein after threading from the template to the model. This is incredibly useful, but runs on the assumption that the target has the same binding partners as the template. Evolution often takes measures to prevent crosstalk resulting in divergent binding patterns, but it is a mostly safe assumption that similar binding partners bind in a similar manner except for the divergent parts. In the case of small molecules, I-Tasser, an ab initio + threading hybridisation online algorithm, suggests the possibility of migrating these from the templates. Unfortunately, AlphaFold2 does not work by templates, or at least it is unaware of what specific model was used as a template in its training, therefore it does not do any such suggestion, although one can find out what may be useful (see supplementary notes below).

Loops

Not a bowl of spaghetti,
but a poor Phyre2 model

In ab initio models and hybrids based on these, loops that do not have an intrinsic structure appear as spaghetti. These may not have been modelled correctly (cf. picture) or may be bound as a flexible peptide to a target protein or may wrap around DNA (cf. picture).

PDB:3L6X.
E-cadherin peptide (turquoise)
bound to catenin.

In a crystal these unbound disordered loops lack density (i.e. missing density) and are therefore absent from the solved structure. I blogged about their addition in a past post and enumerated many reasons why this operation can be a bad idea. Many human AlphaFold2 models present with long spaghetti loops. Let's take MEF2C as an example: https://alphafold.ebi.ac.uk/entry/Q06413. This is not a random choice, but a protein that I have worked before on and my threaded model looked very different: https://michelanglo.sgc.ox.ac.uk/r/mef2c. Spot the difference:

Three and a half things are to be noted.

My model is a dimer. Although in the Twittersphere there are already colabs notebooks that can model oligomers with AlphaFold2.
My model is DNA.
I do not have the loops with low pLDDT
The different in information one learns from them is staggering!

However, it should be said that low pLDDT loops are still likely to have a role. In fact, in this protein, if the C-terminal spaghetti were junk, there would be many homozygous truncations in gnomAD —yet there are none. Looking at PhosphoSitePlus, reveals that the spaghetti is peppered with post-translation modifications, including C-terminal phosphorylations (a common nuclear protein regulatory mechanism) and several lysine acetylations (affecting DNA binding). One residue is commonly found phosphorylated, Ser222. Googling this residue reveals it is a residue with a studied effect. It is part of a spaghetti loop on AlphaFold2 model, but in reality it must wrap around something, recruiting things. In reality having something bind DNA does not make the RNA polymerase magically appear next to it ready to transcribe but instead a complicated interplay of protein, involving not only the RNA polymerase complex, but also the mediator complex, needs to happen. Namely, the eukaryotic RNA polymerase is recruited to the transcription factor indirectly: first the transcription factor binds the DNA, then it binds a certain part of the mediator complex (evidence suggests different parts may be bound by different transcription factor tails (cf. this review), thus integrating the signal with other factors, which then recruits the RNA polymerase.

Beads on a string

Even when there are no spaghetti loops, some multidomain protein may have structured domains, which do their thing independently of the other domains, like beads on a string. Namely, they are flexible in solution, but have an arrangement when bound. For example, whereas it is easy to predict how a KH domain looks like, it is not easy to predict how several KH domains interact with each other. For example, KSRP in AlphaFold2 looks like:

In blue are the KH domains,
which have high pLDDT

The KH domains appear to take on a certain arrangement, but the loops in between have low confidence and actually there are no residues between the domains that are in contact. Looking further at the loops between the domains shows they are flexible glycine-proline-charged rich linkers. When unbound these will move about, while when bound to RNA these will align along it.

For visual purposes I personally prefer to have these protein in a line, like an illustration of the planets, which are actually very very rarely in sysygy —for a script that can do this, see my GitHub.

Transmembrane

A common problem with models from I-Tasser, Phyre2 and AlphaFold2 of transmembrane protein is that the bells-and-whistles parts that decorate the protein on either side of the membrane may be placed looping back into where one would picture the membrane, which needs correcting as the algorithms do not really see membrane as a plane to cross (see footnote about adding markers for the membrane). The membrane plane invading segment can be deleted or moved even with autosculpting feature in PyMOL discussed in the previous blog post about filling missing density the hacky way as the span will have low confidence anyway.

Docking

There is a real potential in discovering new drugs that bind to the AlphaFold2 models. However, it is not a simple task and worse it may backfire. One problem that is going to arise in my opinion from this release is a swathe of docking studies that will not really lead anywhere —pun indented. This is because docking is a complicated toolset that requires scepticism, human-attention-to-details and majorly validation of a large subset of targets.

Covid19 is a good example of the issues with docking. With Covid19 the Zhang group (I-Tasser) released models of the viral protein and shortly afterwards there were two effects. The milder effect was a deluge of questions in Stack Exchange (SE) Bioinformatics, SE Chemistry and even Stack Overflow by users trying their hand at docking clearly without understanding what they were doing and without an understanding the system at hand —e.g. setting the grid the size of the protein, not protonating catalytic residues correctly, improper data analysis etc. The more problematic effect was a deluge of papers in BioRxiv and ChemRxiv, which later translated to actual publications, claiming that a drug could be repurposed based solely on a docking screen. In the subsequent sporadic publications in which these were tested the compounds generally proved to be near ineffective, but enough for a publication with the word Covid19 in the title. This proved to be such a major drain on funding that apparently most funding bodies were refusing to consider grant related to drug discovery for Covid19...

Another usage is discovering what proteins are bound by certain drugs that cause off-target effects by doing a longitudinal docking study with a given compound and find its target protein amidst the human proteome. This is however problematic because a key element is determining which parts of a protein are of interest —say binding somewhere on the surface is unlikely to have any effect on an enzyme for example. There are several algorithms that are trained to spot active sites, so it is not an impossible task, but it is far from as trivial as it sounds.

A further usage, and one less riddled with issues, is that AlphaFold2 models could reveal the binding mechanism of certain drugs whose targets could not be crystallised. As a results better variants could be designed.

Interpretation

A reoccurring concept in this blog post has been "interpretation". A structure or model is a great piece of information that nevertheless needs to be interpreted. Triosephosphateisomerase and alanine racemase are both TIM barrels, but catalyse completely different reactions. Therefore, an extra step is required to understand how the protein functions, and if required design an inhibitor or engineer it.

A protein in Michelanglo.
The green links are "prolinks", which change
the protein representation.

In order to help in the dissemination of protein information, I developed Michelanglo, a tool to share webpages with a user-written descriptions that can control an interactive protein. All too often when making a page for me to share, I realise something about the protein when annotating it!

A paper about a crystal structure generally has a lot of useful information and is not simply an inane description of what the authors can see. Okay, some papers are unfortunately like that, but often there are insightful connections that aren't obvious and biochemical assays probing the function.

Therefore, the Alphafold2 release is a great tool for biochemists and actually gives them more work and not less, because a lot of work is needed to interpret this treasure trove of data and convert it into knowledge.

The AlphaFold2 models do not replace crystallographers and actually will be used to solve crystals by molecular replacement without the need for the many tricks that have popped up along the years, allowing them to crystallise the protein in different states with different ligands. The phase problem is a problem, to the point that in the past decade, several structures, such as M-PMV retroviral protease, had to be solved by citizen scientists with FoldIt.

In fact, it is rather misleading hailing AlphaFold2 as a gamechanger for medicine (i.e. benefitting humans) more than hailing it as a gamechanger for our understanding of biochemistry (i.e. fundamental science), even if a better understanding of how proteins function ultimately is translated into medicine. AlphaFold2 does not circumvents the fundamental science part.

Supplementary notes

Finding partners

To migrate binding partners or ligands, one must know what the donor structures are.
To find out what is similar, go to NCBI Blast Protein and choose PDB as the database in "Choose search set".

The results should containt PDB 4 letter codes with an underscore chain letter.
If no results are found, a two step process will help. First, a PSI-Blast search ("Program selection") against RefSeq, preferably with the exclusion of over-represented taxa (with an included row with All taxon to avoid an occasional glitch) for a few iterations need to be done, then the PSSM matrix needs exporting (old view, export PSSM matrix), lastly a new search this time against the PDB dataset and with the downloaded PSSM matrix uploaded in algorithm parameters.

The candidate donors can be inspected in the PDB (https://www.rcsb.org/) to see what else is there. Do note that DNA is not marked in the description (e.g. https://www.rcsb.org/structure/1N6J).

Also note that the numbering in NCBI corresponds to the position in the sequence in the SEQRES record of a PDB, not the PDB residue index, what is called the pose index in Rosetta or the residue position in the canonical Uniprot isoform.

Migrating partners

First determine what part of the donor protein is homologous to your AlphaFold2 model. Then in PyMOL align this to your model. Assuming you want to move your donor onto the model, the command align donor, alphafold may be too broad, so try align donor and chain 👽 and resi 👹-👹, alphafold and resi 👾-👾, where 👽 is the chain letter that is homologous and 👹-👹 is the range of PDB residue indices (see note from previous section). If this does not work due to excessive sequence divergence, cealign alphafold and resi 👾-👾, donor and chain 👽 and resi 👹-👹 (inverted synthax) will do it. Then make sure no chains that are to be migrated are chain A (if they are do alter donor and chain A, chain='X' and then sort. Then do create combo, alphafold or (donor and not chain 👽) to make a new combined structure.

Membrane

A special case is adding membranes. This can be done in two ways:

adding an actual membrane. CHARMM GUI (https://www.charmm-gui.org/) is a great tool for input generation for MD and can add a customly defined membrane.
adding dots as a membrane marker. The site OPM (orientation of protein in membrane, https://opm.phar.umich.edu/) is a great resource which contains membrane structures oriented with a membrane marked as two leaves of dummy atoms (DUM residue).

The former is nice, but does make for a very heavy and rather tiresome to display model, the latter is nice and simple, but does have the issue that some viewers assume the dummy atoms are proximity bonded to the structure. Given the choice, I'd go with the latter.

The art of blowing up protein

Pages

Tuesday, 27 July 2021