There is nothing more disheartening than telling someone "Sorry, I cannot help you with your protein, because no homologue structures of your protein are solved and any model will be rubbish". Now, with AlphaFold2 proteome release this is no longer the case. Or mostly: in fact there are several pitfalls and issues that need to be looked at, because the algorithm does not account for three things: binding partners and ligands, oligomerisation and alternate conformations.
Gamechanger
Model making: threading
Whereas the structures on AlphaFold2 are better than models from other online servers or databases, but there are some caveats.
There are several ways to make a model. A model can be a one-to-one threaded model, an ab initio model or a hybridisation of threaded/ab initio models.
In threading, a structure is taken and the target sequence is mapped on based on a pairwise alignment. ModBase (database), Swiss-Model (database), Phyre2 one-to-one threading (on request) and Rosetta's ThreadingMover (local calculations) are example of threading applications. This performs poorly when there sequences are divergent, but when they are close it has several benefits.
First, if one has a protein in two conformations both conformations can be threaded separately. For example, with a carrier protein one generally has an open and a close face on either side of the membrane, if there are two structures of close homologues (>30–50 sequence identity) in these two conformations, one can thread twice to get both states and better understand how the protein operates. Unfortunately, AlphaFold2 does not take into account that proteins can have alternate conformations, so if one state has a stronger evolutionary signal the Evoformer will disregard the evolutionary contacts that favour the other states.
Second, one can migrate the bound ligands and/or protein after threading from the template to the model. This is incredibly useful, but runs on the assumption that the target has the same binding partners as the template. Evolution often takes measures to prevent crosstalk resulting in divergent binding patterns, but it is a mostly safe assumption that similar binding partners bind in a similar manner except for the divergent parts. In the case of small molecules, I-Tasser, an ab initio + threading hybridisation online algorithm, suggests the possibility of migrating these from the templates. Unfortunately, AlphaFold2 does not work by templates, or at least it is unaware of what specific model was used as a template in its training, therefore it does not do any such suggestion, although one can find out what may be useful (see supplementary notes below).
Loops
Not a bowl of spaghetti, but a poor Phyre2 model |
PDB:3L6X. E-cadherin peptide (turquoise) bound to catenin. |
In a crystal these unbound disordered loops lack density (i.e. missing density) and are therefore absent from the solved structure. I blogged about their addition in a past post and enumerated many reasons why this operation can be a bad idea. Many human AlphaFold2 models present with long spaghetti loops. Let's take MEF2C as an example: https://alphafold.ebi.ac.uk/entry/Q06413. This is not a random choice, but a protein that I have worked before on and my threaded model looked very different: https://michelanglo.sgc.ox.ac.uk/r/mef2c. Spot the difference:
Three and a half things are to be noted.
- My model is a dimer. Although in the Twittersphere there are already colabs notebooks that can model oligomers with AlphaFold2.
- My model is DNA.
- I do not have the loops with low pLDDT
- The different in information one learns from them is staggering!
Beads on a string
Even when there are no spaghetti loops, some multidomain protein may have structured domains, which do their thing independently of the other domains, like beads on a string. Namely, they are flexible in solution, but have an arrangement when bound. For example, whereas it is easy to predict how a KH domain looks like, it is not easy to predict how several KH domains interact with each other. For example, KSRP in AlphaFold2 looks like:
In blue are the KH domains, which have high pLDDT |
Transmembrane
Docking
There is a real potential in discovering new drugs that bind to the AlphaFold2 models. However, it is not a simple task and worse it may backfire. One problem that is going to arise in my opinion from this release is a swathe of docking studies that will not really lead anywhere —pun indented. This is because docking is a complicated toolset that requires scepticism, human-attention-to-details and majorly validation of a large subset of targets.
Covid19 is a good example of the issues with docking. With Covid19 the Zhang group (I-Tasser) released models of the viral protein and shortly afterwards there were two effects. The milder effect was a deluge of questions in Stack Exchange (SE) Bioinformatics, SE Chemistry and even Stack Overflow by users trying their hand at docking clearly without understanding what they were doing and without an understanding the system at hand —e.g. setting the grid the size of the protein, not protonating catalytic residues correctly, improper data analysis etc. The more problematic effect was a deluge of papers in BioRxiv and ChemRxiv, which later translated to actual publications, claiming that a drug could be repurposed based solely on a docking screen. In the subsequent sporadic publications in which these were tested the compounds generally proved to be near ineffective, but enough for a publication with the word Covid19 in the title. This proved to be such a major drain on funding that apparently most funding bodies were refusing to consider grant related to drug discovery for Covid19...
Another usage is discovering what proteins are bound by certain drugs that cause off-target effects by doing a longitudinal docking study with a given compound and find its target protein amidst the human proteome. This is however problematic because a key element is determining which parts of a protein are of interest —say binding somewhere on the surface is unlikely to have any effect on an enzyme for example. There are several algorithms that are trained to spot active sites, so it is not an impossible task, but it is far from as trivial as it sounds.
A further usage, and one less riddled with issues, is that AlphaFold2 models could reveal the binding mechanism of certain drugs whose targets could not be crystallised. As a results better variants could be designed.
Interpretation
A reoccurring concept in this blog post has been "interpretation". A structure or model is a great piece of information that nevertheless needs to be interpreted. Triosephosphateisomerase and alanine racemase are both TIM barrels, but catalyse completely different reactions. Therefore, an extra step is required to understand how the protein functions, and if required design an inhibitor or engineer it.
A protein in Michelanglo. The green links are "prolinks", which change the protein representation. |
In order to help in the dissemination of protein information, I developed Michelanglo, a tool to share webpages with a user-written descriptions that can control an interactive protein. All too often when making a page for me to share, I realise something about the protein when annotating it!
A paper about a crystal structure generally has a lot of useful information and is not simply an inane description of what the authors can see. Okay, some papers are unfortunately like that, but often there are insightful connections that aren't obvious and biochemical assays probing the function.
Therefore, the Alphafold2 release is a great tool for biochemists and actually gives them more work and not less, because a lot of work is needed to interpret this treasure trove of data and convert it into knowledge.
The AlphaFold2 models do not replace crystallographers and actually will be used to solve crystals by molecular replacement without the need for the many tricks that have popped up along the years, allowing them to crystallise the protein in different states with different ligands. The phase problem is a problem, to the point that in the past decade, several structures, such as M-PMV retroviral protease, had to be solved by citizen scientists with FoldIt.
In fact, it is rather misleading hailing AlphaFold2 as a gamechanger for medicine (i.e. benefitting humans) more than hailing it as a gamechanger for our understanding of biochemistry (i.e. fundamental science), even if a better understanding of how proteins function ultimately is translated into medicine. AlphaFold2 does not circumvents the fundamental science part.
Supplementary notes
Finding partners
To find out what is similar, go to NCBI Blast Protein and choose PDB as the database in "Choose search set".
If no results are found, a two step process will help. First, a PSI-Blast search ("Program selection") against RefSeq, preferably with the exclusion of over-represented taxa (with an included row with All taxon to avoid an occasional glitch) for a few iterations need to be done, then the PSSM matrix needs exporting (old view, export PSSM matrix), lastly a new search this time against the PDB dataset and with the downloaded PSSM matrix uploaded in algorithm parameters.
Migrating partners
align donor, alphafold
may be too broad, so try align donor and chain 👽 and resi 👹-👹, alphafold and resi 👾-👾
, where 👽 is the chain letter that is homologous and 👹-👹 is the range of PDB residue indices (see note from previous section). If this does not work due to excessive sequence divergence, cealign alphafold and resi 👾-👾, donor and chain 👽 and resi 👹-👹
(inverted synthax) will do it. Then make sure no chains that are to be migrated are chain A (if they are do alter donor and chain A, chain='X'
and then sort
. Then do create combo, alphafold or (donor and not chain 👽)
to make a new combined structure.Membrane
- adding an actual membrane. CHARMM GUI (https://www.charmm-gui.org/) is a great tool for input generation for MD and can add a customly defined membrane.
- adding dots as a membrane marker. The site OPM (orientation of protein in membrane, https://opm.phar.umich.edu/) is a great resource which contains membrane structures oriented with a membrane marked as two leaves of dummy atoms (DUM residue).
No comments:
Post a Comment