The art of blowing up protein: protein

Showing posts with label protein.

Sunday, 31 December 2023

A possible BioB bypass route

Nearly a decade ago something was bugging me, and I believe I have finally figured it out, although it's pointless now. Namely: is there another way of making biotin without using BioB (biotin synthase), an incredibly slow multistep radical SAM enzyme?

Move aside coIP Westerns, ColabFold has got this!

Recently AlphaFold2 released a new batch of models, this time covering all of the TrEMBL sequences in UniProt, resulting in a huge number, which got hashtag-academic-twitter and some news editors very excited about the stamp-collecting feat. Personally, I find it annoying, not because it's pointless, but because, as of writing, any search for a target by name is swamped by irrelevant sequences.
However, AlphaFold is great for other feats.
I have blogged about it a few times (e.g. link), which gives away my positive view of it! It can predict oligomers with a lot more precision and confidence than docking. It does not always work, either technically or in supporting the hypothesis. I did a long series of experiments with a hypothesis in mind that turned out not to be valid (here), but they revealed novel science, took a few minutes to set up and a few hours to run, and would have taken years if done by Western blot of a co-immunoprecipitation or by cross-linking mass-spec.

Top 10 silliest PDB residue names for ligands!

UPDATE: The PDB will run out of 3-letter chemical component IDs sometime before 2024, at which point it will switch to 5-letter codes, which will be usable solely in the CIF format: https://www.wwpdb.org/news/news?year=2022#630fee4cebdf34532a949c34

In some situations it is handy, in an in silico experiment, to use a 3-letter residue name that is not taken in the PDB. For example, PyRosetta has a system of pregenerated topologies for PDB components, which can cause issues when a ligand is loaded: the movers may use the pregenerated topology over an incorrectly provided residue type / params file, resulting in a blown-up, misshapen ligand (an overly common incident*). As a result, having a list of what is taken to hand is helpful. Herein are some silly observations about what the taken and untaken names are, but not ranked as a top 10, because this is a science blog, not my local newspaper.
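As a minimal sketch of how such a list can be put to use, the snippet below enumerates 3-character codes that are absent from a set of taken names. The function name `untaken_names` and the tiny `taken` set are hypothetical stand-ins: in practice `taken` would be filled from the full list of chemical component IDs.

```python
from itertools import product
from string import ascii_uppercase, digits


def untaken_names(taken, alphabet=ascii_uppercase + digits):
    """Yield 3-character codes that are absent from `taken`.

    Codes are generated in the order of `alphabet` (A-Z, then 0-9 by default).
    """
    taken = {name.upper() for name in taken}
    for letters in product(alphabet, repeat=3):
        name = ''.join(letters)
        if name not in taken:
            yield name


# Stand-in for the real list of taken chemical component IDs (assumption).
taken = {'AAA', 'AAB', 'HEM', 'GOL'}
first_free = next(untaken_names(taken))
```

Because the function is a generator, `next(...)` stops at the first free code instead of building all 36 ** 3 combinations.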

Show neighbours in nglview

Nglview is a really nice Python library that provides a widget showing an NGL viewport; NGL is a JS 3D protein viewer used until recently by the PDB. One annoying feature is that one cannot select neighbours as easily as with, say, PyMOL's "select byres HEM around 3". But it is possible, and here is how.
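One way to get there, sketched here under stated assumptions rather than as the post's own method, is to compute the neighbouring residues in plain Python and hand NGL a selection string of `resi:chain` terms. The atom tuples and the function name `neighbour_selection` are hypothetical; in practice the coordinates would come from whichever structure parser is in use.

```python
from math import dist


def neighbour_selection(atoms, ligand_resn, cutoff=3.0):
    """Return an NGL selection string for residues within `cutoff` Å of a ligand.

    `atoms` is an iterable of (resn, resi, chain, (x, y, z)) tuples.
    The ligand itself is left out, and whole residues are selected.
    """
    atoms = list(atoms)
    ligand_xyz = [xyz for resn, _, _, xyz in atoms if resn == ligand_resn]
    close = set()
    for resn, resi, chain, xyz in atoms:
        if resn == ligand_resn:
            continue
        # one atom within the cutoff is enough to include the whole residue
        if any(dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            close.add((chain, resi))
    return ' or '.join(f'{resi}:{chain}' for chain, resi in sorted(close))


# Toy coordinates for illustration only.
atoms = [('HEM', 201, 'A', (0.0, 0.0, 0.0)),
         ('HIS', 93, 'A', (2.1, 0.0, 0.0)),
         ('HIS', 93, 'A', (3.5, 0.0, 0.0)),
         ('LEU', 10, 'A', (9.0, 0.0, 0.0)),
         ('PHE', 43, 'A', (0.0, 2.9, 0.0))]
selection = neighbour_selection(atoms, 'HEM')
```

The resulting string can then be passed to the widget, e.g. `view.add_representation('licorice', selection=selection)`.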

Multiple sequence alignments

A sequence alignment is a rather important tool.

Sequence conservation is a key ingredient in most nucleotide mutation severity predictors.
The covariance within it powers the AlphaFold2 Evoformer and other de novo structure predictors.
The phylogeny extracted from it tells the evolutionary tale of the protein.

However, on the very basic level, i.e. getting a nice figure, far from the world of covariance matrices, it is a slight nuisance.

Therefore I would like to share some pointers on choosing species and two Python operations, namely getting the equivalent residue in a homologue and making a figure in Plotly. Just like with docking, where careful and diligent human choices make all the difference, rational choices help greatly with clarity for sequence alignments.
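For the first of the two operations, a minimal sketch is given below: given two rows of an alignment (gaps as `-`), walk along the columns counting residues in each row until the query position is reached. The function name `equivalent_position` and the toy sequences are hypothetical, chosen only for illustration.

```python
def equivalent_position(aligned_query, aligned_target, query_position):
    """Map a 1-based residue position in the query onto the target.

    Both arguments are gapped rows of the same alignment.
    Returns None if the query residue is aligned to a gap in the target.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError('alignment rows must have the same length')
    query_i = target_i = 0
    for q, t in zip(aligned_query, aligned_target):
        # count only real residues, not gaps
        query_i += q != '-'
        target_i += t != '-'
        if q != '-' and query_i == query_position:
            return target_i if t != '-' else None
    raise ValueError(f'position {query_position} is beyond the query sequence')


# Toy alignment rows (not real proteins).
human = 'MKT-AYIAK'
mouse = 'M-TGAYLAK'
same_column = equivalent_position(human, mouse, 3)   # T3 in human is T2 in mouse
in_gap = equivalent_position(human, mouse, 2)        # K2 in human faces a gap
```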

Filling missing loops by cannibalising AlphaFold2

I could not resist this Photoshop.
But the process is not as dramatic
and the results not as bad as Temple of Doom...
If done right.

AlphaFold2 models cover the complete sequence, whereas the crystal structure of the protein, although better for innumerable reasons, may have missing spans. As a result one may want, for illustrative purposes only, to rip out the required parts from the AlphaFold2 model (as fragments) and have them built into the target structure. Here is how to do it by threading.

What to look out for with an AlphaFold2 model

There is nothing more disheartening than telling someone "Sorry, I cannot help you with your protein, because no homologue structures of your protein are solved and any model will be rubbish". Now, with the AlphaFold2 proteome release, this is no longer the case. Or mostly: there are in fact several pitfalls and issues that need to be looked at, because the algorithm does not account for three things: binding partners and ligands, oligomerisation, and alternate conformations.

Per residue RMSD

Recently I calculated the local RMSD caused by each residue and I thought I'd share the method I used with PyRosetta. It is nothing at all novel, but I could not find a suitable implementation. The task is simple: given two poses, find out which residue's backbone changes the most by scanning along the sequence and comparing a short peptide window from each pose.
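To illustrate the windowed scan without any PyRosetta dependency, here is a rough sketch that swaps in a related measure: a distance-RMSD (dRMSD) over CA-CA distances within each window, which needs no superposition. This is not the PyRosetta implementation, and the function name `local_drmsd` and the toy coordinates are hypothetical.

```python
from math import dist, sqrt


def local_drmsd(coords_a, coords_b, half_window=2):
    """Per-residue distance-RMSD between two conformations of the same chain.

    `coords_a` and `coords_b` are equal-length lists of (x, y, z) CA coordinates.
    For each residue, the CA-CA distances inside a window centred on it are
    compared between the two conformations.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError('both conformations need the same number of residues')
    n = len(coords_a)
    scores = []
    for centre in range(n):
        lo, hi = max(0, centre - half_window), min(n, centre + half_window + 1)
        squared = [(dist(coords_a[i], coords_a[j]) - dist(coords_b[i], coords_b[j])) ** 2
                   for i in range(lo, hi) for j in range(i + 1, hi)]
        scores.append(sqrt(sum(squared) / len(squared)) if squared else 0.0)
    return scores


# Toy example: a straight six-residue chain, and a copy with the last residue kinked.
straight = [(3.8 * i, 0.0, 0.0) for i in range(6)]
kinked = straight[:5] + [(3.8 * 4, 3.8, 0.0)]
scores = local_drmsd(straight, kinked)
```

Note that a single moved residue raises the score of every window that contains it, so the peak marks a neighbourhood rather than a single position.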

Remodel in Pyrosetta

The Rosetta binary Remodel is a great tool as it allows interesting designs to be made. However, it is rather incompatible with Rosetta Scripts and PyRosetta, as it depends heavily on command line options for customisation and repeats some of the processes internally. Despite this, it can be coerced rather effectively to work in PyRosetta with some convenience, and this is how.

Multiple poses in NGLView

As mentioned previously, most of my PyRosetta operations are done in a Jupyter notebook run on a cluster node. As a result, I am heavily dependent on NGLView, an IPython widget that uses NGL.js. This is nice for some quick tasks, although admittedly more limited than the PyMOL mover, which however requires another ssh to forward another port. My Michelanglo webapp uses NGL.js, so I cannot but say good things of NGL.js. However, one or two things in the Python module NGLView are not immediately clear, so I'll quickly cover dealing with multiple poses here.

Switching ligand in a PDB with Fragmenstein

For the Covid Moonshot project, one question by Prof. Frank von Delft of Diamond XChem led to a series of events that culminated in Fragmenstein, a module to do fragment mergers where the follow-up is as faithful to the starting crystal hits as possible. Even if its intended use is the hit-to-lead process, there is a nice use that makes it rather handy for computational biochemistry in general: switching the ligand in a PDB to another in an energy-minimised fashion that obeys the original ligand.

Filling missing loops —the proper way

Previously, I posted about how to join proteins and add missing loops the shoddy way. Now I'll address how to do it correctly, using Rosetta or PyRosetta. I am sorry this has been so long overdue.
Since posting this, I realised one can do it even faster by hijacking the threading algorithm, which, albeit not its intended purpose, works fine for fixing a structure without supervision, which the methods discussed below do need.

How to set up an electron density scorefunction in Pyrosetta

Energy minimising structures in Rosetta/PyRosetta is essential to avoid artifactual results. Say a mutation is introduced and in the protocol the neighbourhood is repacked: if the structure is not energy minimised properly, the neighbourhood repacking step will spuriously award the mutation a very negative ∆∆G. One worry is that the energy minimisation is not faithful to the crystal structure. This argument has two sides: on the one hand, the fudgey force fields in Rosetta do not truly model the chemical interactions, while on the other, crystal packing may be unnatural. Both points have merit. After all, Rosetta does use implicit water, which does not behave like the stripped crystallographic waters, and some residues may have non-standard protonations etc. But if one wants, one can use a scorefunction that is weighted by the electron density map, and here is how.

Atom names purely in RDKit

For some applications, such as PyMOL scripts or Rosetta, atom names are really important: say, CA is the standard name for the α-carbon. Example uses of atom names in Rosetta/PyRosetta include setting constraints, using a params file for a custom ligand and so forth. However, RDKit is a bit of a nuisance with atom names, as they are not a central feature but one added for PDB files that is not too well documented.
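One detail that commonly trips people up is that the PDB format stores atom names in a fixed 4-character field, where names of one-letter elements conventionally start in the second column (so the α-carbon is ` CA ` whereas calcium is `CA  `). The helper below is a small sketch of that padding convention; the function name `pdb_atom_name` is hypothetical and not part of RDKit.

```python
def pdb_atom_name(name, element):
    """Pad an atom name to the 4-character field used in PDB files.

    Names of two-letter elements (e.g. Fe, Ca) and 4-character names start in
    the first column; all other names start in the second column.
    """
    name = name.strip()
    if not 1 <= len(name) <= 4:
        raise ValueError('PDB atom names are 1 to 4 characters long')
    if len(name) == 4 or len(element) == 2:
        return name.ljust(4)
    return (' ' + name).ljust(4)


alpha_carbon = pdb_atom_name('CA', 'C')    # ' CA '
calcium = pdb_atom_name('CA', 'Ca')        # 'CA  '
```

In RDKit the name lives in the atom's PDB residue info (`Chem.AtomPDBResidueInfo`, set with `SetName` and attached with `atom.SetMonomerInfo`), where, as far as I can tell, the padded form is the one to use.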

Go away glycerol!!

Due to the nature of crystallisation, additives are often found in PDB structures. These are generally unwelcome, especially if you want to extract ligands. In fact, I have heard only once someone talk excitedly about their crystallisation reagent in their structure, but only because they were trying to flog it off as an allosteric binding site. Generally, they are just annoying. Luckily you don't need to reinvent the wheel, as a list or two already exist!
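Once such a list is in hand, stripping the offenders from a PDB file is a matter of filtering HETATM records by residue name (columns 18-20). The sketch below uses a deliberately short, illustrative set of common additives (glycerol, ethylene glycol, PEG fragment, DMSO, sulfate, acetate); it is an assumption for the example, not one of the existing lists mentioned above, and the function name `strip_additives` is hypothetical.

```python
# Illustrative only: a handful of common crystallisation additives.
ADDITIVES = {'GOL', 'EDO', 'PEG', 'DMS', 'SO4', 'ACT'}


def strip_additives(pdb_block, unwanted=ADDITIVES):
    """Drop HETATM records whose residue name is in `unwanted`."""
    kept = []
    for line in pdb_block.splitlines():
        # residue name occupies columns 18-20 of a PDB record (0-based 17:20)
        if line.startswith('HETATM') and line[17:20].strip() in unwanted:
            continue
        kept.append(line)
    return '\n'.join(kept)


pdb_block = '\n'.join([
    'ATOM      3  CA  ALA A   1      11.000  12.000  13.000  1.00 20.00           C',
    'HETATM    1  C1  GOL A 301      10.000  10.000  10.000  1.00 20.00           C',
    'HETATM    2 FE   HEM A 302      15.000  15.000  15.000  1.00 20.00          FE',
])
cleaned = strip_additives(pdb_block)
```

Here the glycerol atom is dropped while the protein atom and the haem iron are kept.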

PDB numbering rollercoaster

The position in a crystal structure and the position in the protein sequence rarely match. In fact, there are four different start and end positions to keep track of:

position in whole protein,
position in extracted sequence,
position in residues stated in the PDB/mmCIF structure and
position which actually has coordinates.

When will the PDB run out of 4-letter codes?

The PDB ids are really nice and short: 4 letter codes. But when will all the combinations run out? Actually, not for a long long time.
The current total is 155,618 structures and new ones are added at a rate of 12,000 structures per year, which means that, assuming constant growth, in about 127 years, i.e. (36 ^ 4 - 155,618) / 12,000, the PDB will run out of codes to allocate.
That is the mid-2140s, a few years after the setting of Kim Stanley Robinson's New York 2140, where New York is a flooded super-Venice, so I am guessing the RCSB PDB, in San Diego, will have long been flooded and a lack of 4-letter codes will not be top of their concerns.
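The estimate can be reproduced in a couple of lines. Following the text, the sketch assumes that all 36 ^ 4 alphanumeric combinations are available and that the deposition rate stays constant; the variable names are just for illustration.

```python
total_codes = 36 ** 4    # 26 letters + 10 digits in each of 4 positions
deposited = 155_618      # structures at the time of writing (from the text)
per_year = 12_000        # deposition rate (from the text)

years_left = (total_codes - deposited) / per_year
print(f'{total_codes:,} codes, about {years_left:.0f} years left')
```

The expression evaluates to roughly 127 years.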

A note on the Linux PyMOL C01 atom oddity

This weird bug has been haunting me for ages. The builders in PyMOL 1.8 (not 2 in Win or Mac) and in Linux PyMOL 2 create residues with a Cα called C01 as opposed to CA. If any operation is done to these (e.g. Rosetta Relax), they will be discarded during the reading of the file. That is, they will not be fixed and, worse, if Rosetta Remodel is used, it will assume that the residue never existed, because Remodel annoyingly does not understand PDB numbering. Simply substituting all 'C01' with 'CA' fixes the problem.
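A blind find-and-replace can shift the fixed-width columns of a PDB file, so a slightly safer sketch is to rewrite only the atom name field (columns 13-16). The function name `fix_c01` is hypothetical, and the sketch deliberately touches only ATOM records, on the assumption that a ligand in a HETATM record may legitimately have an atom called C01.

```python
def fix_c01(pdb_block):
    """Rename atoms called C01 to CA in ATOM records, keeping column alignment."""
    fixed = []
    for line in pdb_block.splitlines():
        # atom name occupies columns 13-16 of a PDB record (0-based 12:16)
        if line.startswith('ATOM') and line[12:16].strip() == 'C01':
            line = line[:12] + ' CA ' + line[16:]
        fixed.append(line)
    return '\n'.join(fixed)


broken = '\n'.join([
    'ATOM      1  N   ALA A   1      10.000  10.000  10.000  1.00  0.00           N',
    'ATOM      2  C01 ALA A   1      11.458  10.000  10.000  1.00  0.00           C',
])
repaired = fix_c01(broken)
```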

Phosphorylated PDB files

Sometimes a residue in a human protein is phosphorylated, yet the model one gets from I-TASSER, Phyre etc. or the actual PDB structure lacks the phosphate. Here is how to add phosphorylations easily and quickly with Rosetta.

Everything you wanted to know about isopeptide bonds in Rosetta, but were too afraid to ask

Rosetta is great at predicting (with some accuracy) the energies of variant proteins. However, to make the most out of it with proteins with internal isopeptide bonds, a few considerations are needed.

Sunday, 31 December 2023

Saturday, 1 October 2022

Sunday, 19 June 2022

Tuesday, 10 May 2022

Sunday, 31 October 2021

Sunday, 17 October 2021

Tuesday, 27 July 2021

Wednesday, 7 July 2021

Monday, 26 April 2021

Monday, 22 February 2021

Tuesday, 21 July 2020

Saturday, 4 July 2020

Sunday, 19 April 2020

Wednesday, 18 March 2020

Thursday, 7 November 2019

Wednesday, 4 September 2019

Saturday, 3 August 2019

Friday, 31 May 2019

Tuesday, 15 January 2019

Tuesday, 18 September 2018