Tweaking AlphaFold2 models with PyRosetta

Monday 23 August 2021

Tweaking AlphaFold2 models with PyRosetta

In a previous post I explored the pitfalls of an AlphaFold2 model from EBI. Here I thought I'd share some PyRosetta methods that may be handy to use with AlphaFold2 models.

Overview

This is meant to be a handy collection of function, wherein one can search for what one wants.

It is not meant to be a primer for PyRosetta, but I hope these notes may be helpful also for those who have never used PyRosetta, but would like a taster as it is a framework well worth learning to use! As a result I will try and be as comprehensive as is reasonable, so I apologise if any parts are too basic for advanced PyRosetta users.

If you want to use Colabs, here is a Colabs notebook with PyRosetta

Download or make

There are two options for an AlphaFold2 model, one is to download ofd the EBI and the other is to make it from scratch with relevant mods on a Colabs notebook. The Twittersphere contains lots of useful information about doing the later. The original Colab notebook now has a few siblings, with different functionalities, but the best are in the ColabFold GitHub repo by Sergey "Krypton" Ovchinnikov (@sokrypton, a PI at Harvard). Whereas the EBI downloads provide a single model, creating the model oneself allows an output of multiple models, 5 by default, and these often represent different conformations. And these can be oligomers and protein-protein complexes, etc.
But unless you are using the ruse of high-throughput to not polish, generally more work is required, such as adding ligands, stretching, mutagenesis etc. etc.

Helper module

Shameless advertisement. I keep putting general snippets of PyRosetta code in my pyrosetta_help module, which consists of a collection of functions that do diverse things, and the functions I mention here are there too —so no need to copy-paste.

For example, for starting Pyrosetta I like to initialise like so:
import pyrosetta
from pyrosetta_help import make_option_string, configure_logger, get_log_entries

# capture to log
logger = configure_logger()
# give CLI attributes in a civilised way
pyrosetta.distributed.maybe_init(extra_options=make_option_string(no_optH=False,
                                                ex1=None,
                                                ex2=None,
                                                mute='all',
                                                ignore_unrecognized_res=False,
                                                load_PDB_components=False,
                                                ignore_waters=False)
                               )

Model

Download

The most obvious function is a AF2-EBI downloader

import requests
import pyrosetta

def pose_from_alphafold2(uniprot:str) -> pyrosetta.Pose:
    reply = requests.get(f'https://alphafold.ebi.ac.uk/files/AF-{uniprot}-F1-model_v1.pdb')
    assert reply.status_code == 200
    pdbblock = reply.text
    pose = pyrosetta.Pose()
    pyrosetta.rosetta.core.import_pose.pose_from_pdbstring(pose, pdbblock)
    return pose

As of writing, only one model can be downloaded and there is only version 1. The address accepts a Uniprot accession, which is a letter plus digits, e.g. P01112, and should not be confused with a Uniprot name, RASH_HUMAN. Understandably only key species have their proteome modelled. Here is a dictionary of species available and their NCBI taxon identifiers (taxids).

alphamodels = { # name: taxon id
               'Arabidopsis thaliana': 3702,
               'Caenorhabditis elegans': 6239,
               'Candida albicans': 5476,
               'Danio rerio': 7955,
               'Dictyostelium discoideum': 44689,
               'Drosophila melanogaster': 7227,
               'Escherichia coli': 562,
               'Glycine max': 3847,
               'Homo sapiens': 9606,
               'Leishmania infantum': 5671,
               'Methanocaldococcus jannaschii': 243232,
               'Mus musculus': 10090,
               'Mycobacterium tuberculosis': 1773,
               'Oryza sativa': 4530,
               'Plasmodium falciparum': 5833,
               'Rattus norvegicus': 10116,
               'Saccharomyces cerevisiae': 4932,
               'Schizosaccharomyces pombe': 4896,
               'Staphylococcus aureus': 1280,
               'Trypanosoma cruzi': 5693}

Visualise

The next requirement is to visualise it. The PyMOLMover is great, but I generally work on a remote notebook (as discussed here and here) so I prefer to use NGL, which is embedded in the Jupyter notebook as a widget. And alternative plugin is py3Dmol (as used by colabfold notebooks as the former does not work in Colabs), but I am more familar with NGL so I use that!
Here is a class that adds a add_rosetta bound method as discussed in the Multiple poses in NGLView post, but colouring by b-factor by default, and a method to show a pose in salmon (#F8766D) and the original pose in turquoise (#00B4C4).

import nglview as nv
from io import StringIO
import pyrosetta

class ModNGLWidget(nv.NGLWidget):
    def add_rosetta(self, pose: pyrosetta.Pose, color_by_bfactor:bool=True):
        buffer = pyrosetta.rosetta.std.stringbuf()
        pose.dump_pdb(pyrosetta.rosetta.std.ostream(buffer))
        fh = StringIO(buffer.str())
        c = self.add_component(fh, ext='pdb')
        if color_by_bfactor:
            c.update_cartoon(color='bfactor')
        return c

  def make_pose_comparison(self, pose: pyrosetta.Pose, original: pyrosetta.Pose) -> ModNGLWidget:
      c0 = self.add_rosetta(original)
      c0.update_cartoon(color='#00B4C4', smoothSheet=True)
      c1 = self.add_rosetta(pose)
      c1.update_cartoon(color='#F8766D', smoothSheet=True)
    return view

Note how the bfactor is a colour (component.update_cartoon(color='bfactor')), this is a special case, similar to the colour "atomic" in PyMOL (colorValue argument sets carbon only in NGL). Parenthesis: in my pyrosetta_help module I have these methods, but as a monkey patch to the NGLWidget class. So importing pyrosetta_help alters nglview —I should state that adding methods to a class in a different module (monkeypatching) is heavily frowned upon.

Error

However, a problem are the spaghetti loops, that is to say not all residues are as good. Luckily, EBI or colabfold provide with two useful pieces of information, pLDDT and the distance errors in Ångström for each residue pair. The AF2-EBI site explains these very well, so I will not go into detail, but pLDDT is a 0-100 value stored as the isotropic temperature factor in the PDB files that represents the confidence (high is good) of the residue, while the per-residue errors are the errors associated with each pair of atoms. The latter is stored in a JSON as a list whose sole element list is a dictionary with keys 'residue1', 'residue2' and 'distance'.

Download

The dictionary format is a bit naff, hence, why the following function, get_alphafold2_error returns a normal numpy n × n matrix, after calling reshape_errors.

import requests
from typing import *
import numpy as np

def get_alphafold2_error(uniprot:str, reshaped=True) -> Union[np.ndarray, list]:
    """
    Returns the distances errors either as numpy matrix (``reshaped=True``)
    or as the weird format from AF2-EBI —see ``help(reshape_errors)`` for more.

    Remember that the matrix is zero indexed and that these values are in Ångström
    and are not pLDDT, which are stored as b-factors.
    """
    # https://alphafold.ebi.ac.uk/files/AF-Q00341-F1-predicted_aligned_error_v1.json
    reply = requests.get(f'https://alphafold.ebi.ac.uk/files/AF-{uniprot}-F1-predicted_aligned_error_v1.json')
    assert reply.status_code == 200
    errors = reply.json()
    if reshaped:
        return reshape_errors(errors)
    else:
        return errors

def reshape_errors(errors: List[Dict[str, list]]) -> np.array:
    """
    The JSON from AF2 has a single element list.
    the sole element is a dictionary with keys 'residue1', 'residue2' and 'distance'.
    This method returns a matrix of distances reshaped based on the stated residue indices.
    This is rather unlikely to differ from a regular reshape...
    but idiotically I am not taking changes assuming it is always sorted.
    """
    n_residues = int(np.sqrt(len(errors[0]['distance'])))
    error_matrix = np.zeros((n_residues, n_residues)) * np.nan
    for i,d in enumerate(errors[0]['distance']):
        error_matrix[errors[0]['residue1'][i] - 1, errors[0]['residue2'][i] - 1] = d
    return error_matrix

If using a local saved JSON:

filename = f'{folder}/rank_{rank}_model_{model}_ptm_seed_{seed}_pae.json'
with open(filename, 'r') as fh:
    errors = reshape_errors(json.load(fh))

It is always wise to check that there is no misunderstanding involved in the downloaded data, so here is a method that returns a Plotly figure, that is a PAE in the same colour scale —but using a nice interactive graphing library.

def make_pae_plot(errors: np.ndarray) -> go.Figure:
    """
    Make AlphaFold2-EBI like PAE plot
    """
    fig = px.imshow(errors, color_continuous_scale=[(0,'green'), (1, 'white')])
    return fig

Constrain

For a crystal structure, minimising against the ccp4 map is a wise call (see past post), so likewise these errors provide a similar function as one can make a constraint* set with them, thus preventing nice parts from"blowing up" to use the Gromacs term for it.

(* In Rosetta a "constraint" is a crystallography refinement "restraint").

def constrain_distances(pose: pyrosetta.Pose, 
                    errors:np.ndarray, 
                    cutoff:float=5, 
                    weight:float=1, 
                    blank:bool=True):
"""
Add constrains to the pose based on the errors matrix.
NB. this matrix is a reshaped version of what AF2 returns.

A harmonic function is added to CA atoms that are in residues with the error under a specified cutoff.
The mu is the current distance and the standard deviation of the harmonic is the error times ``weight``.

``blank`` as False keeps the current constaints.

To find out how many were added:

>>> len(pose.constraint_set().get_all_constraints())
"""
get_ca = lambda r, i: pyrosetta.AtomID(atomno_in=r.atom_index('CA'), rsd_in=i)
HarmonicFunc = pyrosetta.rosetta.core.scoring.func.HarmonicFunc
AtomPairConstraint = pyrosetta.rosetta.core.scoring.constraints.AtomPairConstraint
cs = pyrosetta.rosetta.core.scoring.constraints.ConstraintSet()
if not blank:
    previous = pyrosetta.rosetta.core.scoring.constraints.ConstraintSet().get_all_constraints()
    for con in previous:
        cs.add_constraint(con)
for r1_idx, r2_idx in np.argwhere(errors < cutoff):
    d_error = errors[r1_idx, r2_idx]
    residue1 = pose.residue(r1_idx + 1)
    ca1_atom = get_ca(residue1, r1_idx + 1)
    residue2 = pose.residue(r2_idx + 1)
    ca2_atom = get_ca(residue2, r2_idx + 1)
    ca1_xyz = residue1.xyz(ca1_atom.atomno())
    ca2_xyz = residue2.xyz(ca2_atom.atomno())
    d = (ca1_xyz - ca2_xyz).norm()
    apc = AtomPairConstraint(ca1_atom, ca2_atom, HarmonicFunc(x0_in=d, sd_in=d_error * weight))
    cs.add_constraint(apc)
setup = pyrosetta.rosetta.protocols.constraint_movers.ConstraintSetMover()
setup.constraint_set(cs)
setup.apply(pose)
return cs

In the above the pose's constraint set gets replaced, this is not really required and is simply playing it safe as the ConstraintSet returned by pose.constraint_set() is not a copy so the method add_constraint will work fine —unlike pose.constraint_set().get_all_constraints().append, which adds to a vector1-derived objected that does not modify the set.

To see how many constraints a pose has, one can do len(pose.constraint_set().get_all_constraints()).

Also, adding constraints to a pose is only half the job, as one needs to set the weight on the scorefunction for that constraint.

scorefxn = pyrosetta.get_fa_scorefxn()  # default is ref2015
ap_st = pyrosetta.rosetta.core.scoring.ScoreType.atom_pair_constraint
scorefxn.set_weight(ap_st, 1)

To check how much do the constraints contribute to the scorefunction, one can use the print_constraint_scores or constraints2pandas in my pyrosetta_help module (see code). Also to visualise the constraints in a pose, check out the add_constraints in the monkeypatched NGLView (may crash the browser for thousands of constraints):



As mentioned this is to prevent odd things such as residues moving apart due to a clash etc. in a first instance. After which, it is best to remove them, or the operations will lag.

Long distance

Traditionally, a multidomain protein is cut up and solved in chunks, leaving a problem of whether the domains interact. With AlphaFold2, this is not the case as long distance interactions are revealed. To find the longest distance of interacting residues, one can cheat and simply use the error matrix. To do so two rounds of np.argwhere (one for a error-distance cutoff, the second for a residue separation), will do the trick:

import numpy as np
distance_cutoff = 5 # 5Å
residue_cutoff = 170
closer = np.argwhere(error<distance_cutoff)  # x,y indices less than ``distance_cutoff``
resi_distances = np.abs(closer[:,0] - closer[:,1])
# the following +1 is because indices are C style, while residue index is human/Fortran style
distant_pairs = np.squeeze(closer[np.argwhere(resi_distances > residue_cutoff)] + 1)

Alternatively, to see the most sequence-distant spacially close (according to error) residue pairs

print(f'{np.max(resi_distances)} residues apart:')
closer[np.reshape(np.argwhere(resi_distances == np.max(resi_distances)), -1)] + 1

(If fairly unfamiliar with numpy, don't worry about the np.squeeze or np.reshape(array, -1) calls: they simply remove singleton dimensions and flatten to a vector respectively, because np.argwhere, the command to get the indices where a condition is met, likes to return extra dimensions).

In the above I should emphasise I am using the errors as a proxy for distance. In reality the cartesian distance in the model may differ. To get a matrix of that:

import numpy as np

distances = np.zeros((pose.total_residue(),pose.total_residue()))
ca_xyzs = [pose.residue(r).xyz('CA') for r in range(1, pose.total_residue()+1)]
for i in range(len(ca_xyzs)):
    for j in range(i):
        d = ca_xyzs[i].distance(ca_xyzs[j])
        distances[i, j] = d
        distances[j, i] = d

Using the same numpy operation will give residues that are close. The difference between error and distance matrices will be how much the position is misleading.

I like to stretch models to avoid confusion, which is discussed in the "Stretch" section below.

Minimisation

In order to cover the basics, I ought to mention that to get a ∆G for a model —say to compare to the same sequence folded different— one needs to minimise, i.e. make minor changes to the pose to make it more energetically favourable according to the forcefield.

cycles = 3  # equivalent to quick mode in the binary
cycles = 15  # equivalent to thorough mode in the binary
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3) 
relax.apply(pose)

Now, a thing to note is that scorefxn is the one from above and has atom_pair_constraint set to a non-zero value and the pose has many constraints.

ap_st = pyrosetta.rosetta.core.scoring.ScoreType.atom_pair_constraint
print('Weight:', scorefxn.get_weight(ap_st))
print('N_constraints:', len(pose.constraint_set().get_all_constraints())

This is unsuitable for comparing one model with another even if they contain the same residues. For that a scorefunction without the atom_pair_constraint weight is required.

vanilla_scorefxn = pyrosetta.get_fa_scorefxn()
dG = vanilla_scorefxn(pose)
print(f'{dG} kcal/mol')

Obviously, to dump a scored PDB, the unweighted one would make more sense.

pose.dump_scored_pdb('foo.pdb', vanilla_scorefxn)

If you want to see where the models differ, you may want to check out my other post about per residue RMS using PyRosetta.

If you want to analyse the energetic contribution of the residues of a minimised model, you may want to check out the pose per residue energy scores to pandas function I wrote:

from pyrosetta_help import pose2pandas
scores = pose2pandas(pose)
# example: get the worst
scores.loc[scores.total_score > 5][['residue', 'total_score']]

pLDDT

Another useful value is the pLDDT stored as b-factors in the PDB file. These are not b-factors as the many warnings for molecular replacement remind us. These are a percentage value of how good the residue location is, where 100 is perfect and 0 is aweful —so the inverse of b-factors.

superimposition

To superimpose two structures ("align" to use a nasty PyMOL term), one would preferably want to superimpose the non-spaghetti parts, so this function, superimpose_by_pLDDT, (modified from this github gist) can help:

import difflib
import pyrosetta
from pyrosetta.rosetta.std import map_core_id_AtomID_core_id_AtomID
from pyrosetta.rosetta.core.id import AtomID


def common_residue_indices(a: pyrosetta.Pose, b: pyrosetta.Pose):
    """Get paired indicies of common residues in two structures."""
    aseq = a.sequence()
    bseq = b.sequence()
    astart, bstart, align_len = difflib.SequenceMatcher(a=aseq, b=bseq).find_longest_match(0, len(aseq), 0, len(bseq))
    return [(astart + i, bstart + i) for i in range(align_len)]


def superimpose_by_pLDDT(pose: pyrosetta.Pose,
                         original: pyrosetta.Pose,
                         cutoff=70,
                         pose_range=None) -> map_core_id_AtomID_core_id_AtomID:
    """
    Superimpose two poses, based on residues with pLDDT above a given threshold.

    :param pose:
    :param original:
    :param cutoff: %
    :param pose_range: optional argument to subset (start:int, end:int)
    :return:
    """
    ca_map = map_core_id_AtomID_core_id_AtomID()
    for w, a in paired_residue_inds(pose, original):
        if original.pdb_info().bfactor(a + 1, 1) <= cutoff:
            continue
        if pose_range is not None and (w < pose_range[0] or w > pose_range[1]):
            continue
        ca_map[AtomID(pose.residue(w + 1).atom_index("CA"), w + 1)] = AtomID(
            original.residue(a + 1).atom_index("CA"), a + 1
        )
    assert len(ca_map), 'No atoms greater than cutoff'
    pyrosetta.rosetta.core.scoring.superimpose_pose(pose, original, ca_map)
    return ca_map

However, this will not work when there are independent domains. For which case, one would have to pick which domain to care about (pose_range argument).

Selector

Unfortunately, there is not a ResidueSelector that selects by b-factors, so one has to iterate across each the PDBInfo:

def get_bfactor_vector(pose: pyrosetta.Pose, cutoff: float, above=True) -> pyrosetta.rosetta.utility.vector1_bool:
    """
    Return a selection vector based on b-factors.
    above = get all above. So to select bad b-factors above is ``True``,
    but to select AF2 bad ones. above is ``False``
    """
    pdb_info = pose.pdb_info()
    vector = pyrosetta.rosetta.utility.vector1_bool(pose.total_residue())
    for r in range(1, pose.total_residue() + 1):
        try:
            atom_index = pose.residue(r).atom_index('CA')
        except AttributeError:
            atom_index = 1
        bfactor = pdb_info.bfactor(r, atom_index)
        if above and bfactor >= cutoff:
            vector[r] = True
        elif not above and bfactor <= cutoff:
            vector[r] = True
        else:
            pass  # vector[r] = False
    return vector

Obsolete pdb_info

One thing to watch out of is the pose.pdb_info().obsolete() value. This is set to true when certain operations are done and as a result of it, the PDBInfo gets ignored when dumping to file —setting it to true or adding the PDBInfo of an unmodified pose will circumvent this.

Stretch

syzygy

Independent domains should ideally be spaced out to avoid ambiguity to humans and they make for better pictures —like planets in syzygy as metioned in the previous post. As a result, adding a constrait that is not too strenous on the scorefunction would do it:

def add_stretch_constraint(pose: pyrosetta.Pose,
                           weight: float = 5,
                           slope_in: float = -0.05,
                           residue_index_A: int = 1,
                           residue_index_B: int = -1,
                           distance: Optional[
                               float] = None) -> pyrosetta.rosetta.core.scoring.constraints.AtomPairConstraint:
    """
    Add a constraint to "stretch out" the model, because ``slope_in`` is negative.

    :param pose: Pose to add constraint to
    :param weight: how strength of constraint (max of 0.5 for ``SigmoidFunc``)
    :param slope_in: negative number to stretch
    :param residue_index_A: first residue?
    :param residue_index_B: last residue is "-1"
    :param distance: if omitted, the midpoint of Sigmoid will be the current distance
    :return:
    """
    # get current length
    first_ca = pyrosetta.AtomID(atomno_in=pose.residue(residue_index_A).atom_index('CA'),
                                rsd_in=residue_index_A)
    if residue_index_B == -1:
        residue_index_B = pose.total_residue()
    last_ca = pyrosetta.AtomID(atomno_in=pose.residue(residue_index_B).atom_index('CA'),
                               rsd_in=residue_index_B)
    first_ca_xyz = pose.residue(1).xyz(first_ca.atomno())
    last_ca_xyz = pose.residue(pose.total_residue()).xyz(last_ca.atomno())
    if distance is None:
        distance = (first_ca_xyz - last_ca_xyz).norm()
    # make & add con
    sf = pyrosetta.rosetta.core.scoring.func
    AtomPairConstraint = pyrosetta.rosetta.core.scoring.constraints.AtomPairConstraint
    fun = sf.ScalarWeightedFunc(weight, sf.SigmoidFunc(x0_in=distance, slope_in=slope_in))
    con = AtomPairConstraint(first_ca, last_ca, fun)
    pose.constraint_set().add_constraint(con)
    return con

Note that the slope_in is negative. On many of the functions, including the harmonic function, setting a negative sigma/slope will result in repulsion and not attraction... Obviously, a negative sigma on a harmonic function (HarmonicFunc) is just an explosion, whereas a flat harmonic (FlatHarmonicFunc) is a better choice. A sigmoid (SigmoidFunc) with a negative slope as used above has the nice thing that below the distance x0 the score is positive (bad), while above it it is negative (good) and it plateaus at ±0.5, hence the additional ScalarWeightedFunc to beef it up.

NB. Setting constraints too high makes them dominant the score. Ideally, they should contribute less than 5% of the final score.

Let's see what the function looks like:

import plotly.express as px
import numpy as np

# example data
slope_in = -0.05
distance = 30
weight = 5

x = np.linspace(0, 200, 100)
# y = f(x) via PyRosetta
sf = pyrosetta.rosetta.core.scoring.func
fun = sf.ScalarWeightedFunc(weight, sf.SigmoidFunc(x0_in=distance, slope_in=slope_in))
# fun.func is a function that given a float returns another one. No extra values required.
y = np.vectorize(fun.func)(x)
# equivalent to:
# y  = (np.reciprocal(np.exp(( x-distance ) * (-slope_in)) + 1) - 0.5) * weight
# plot
fig = px.line(x=x, y=y, title=f'Sigmoid function x_0={distance}, m={slope_in}')
fig.add_vline(x=distance, line_color="gainsboro")
fig.update_layout(xaxis=dict(title='Distance [Å]'), yaxis=dict(title='Score'))
fig


Example

Let's take as example, spectrin-alpha. This protein forms scaffolding, whereas in EBI-AF2 the model is curled up on itself.

pose = ph.pose_from_alphafold2('Q13813')

import nglview as nv
view = nv.show_rosetta(pose)
view.update_cartoon(color='bfactor')
view

Then we could add the errors as contraints:

error = ph.get_alphafold2_error('Q13813')
ph.add_pae_constraints(pose, error)

Or check to see if there are any long distance interactions (viz. "Long distance" section above).

Having seen that we definetely want to stretch it isoenergetically:

ph.add_stretch_constraint(pose, 10)
# scorefxn is weighted for atom pairs
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3)
relax.apply(pose)

Often however, it may be necessary to remove the stretching constraint (or leave it) and add a new one because beyond a certain point, the slightest unfavourable Ramachandran angle may disfavour the streching constraint.

Parenthesis about stretching with FastRelax

Now, FastRelax is not really meant for this and may not be as fast as hoped. One could use FastRelax, but with only backbone movement allowed in the movemap. One could use a more basic mover as MonteCarlo trials. And one may be tempted to do something in centroid mode. The latter is not a great idea.

But first, taking a step back, I'd like to mention that whereas we thing of atoms moving in 3D cartesian space along a X, Y, Z coordinate, the majority of operations in FastRelax consist of rotating bonds, which may results in a large change in catersian space —exploited here. So just tweaking the settings of FastRelax makes sense:

We could minimise only the bad residues, using the bfactor selector mentioned above:

# select only the bad bits.
bv = ph.get_bfactor_vector(pose, 70, False)
movemap = pyrosetta.rosetta.core.kinematics.MoveMap()
movemap.set_chi(bv)
movemap.set_bb(bv)
# scorefxn is weighted for atom pairs
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3)
relax.set_movemap(movemap)
relax.apply(pose)

One could simply not touch the sidechains:

...
movemap.set_chi(False)
...

Regarding the other options mentioned. I'll just touch on them for completeness. The simple mover that does rotations is ShearMover, which applied as a MonteCarlo mover would look something like:

shear_mover = pyrosetta.rosetta.protocols.simple_moves.ShearMover(movemap_in=movemap,
                                                                  temperature_in=0.5,
                                                                  n_moves=6)
shear_mover.angle_max(30) # degrees
# set up the monteCarlo sampler
scorefxn(pose)
monégasque = pyrosetta.MonteCarlo(pose, scorefxn, kT)
# this is pyrosetta.rosetta.protocols.moves.MonteCarlo not 
# pyrosetta.rosetta.protocols.monte_carlo.GenericMonteCarloMover,
# which has the loop/trial thing within and is fancier and faster
trial = rosetta.TrialMover(shear_mover, monégasque)
# the trial accepts/rejects a pose based upon the MonteCarlo object
for i in range(10):
    scorefxn(pose) # update energies
    monégasque.recover_low(pose)
    trial.apply(pose)

monégasque.show_counters()
monégasque.recover_low(pose)

Whereas the centroid option (coarse grain) revolves around replacing the sidechain with a CEN atom and recovering them after all is done.

pr_sm = pyrosetta.rosetta.protocols.simple_moves
original_pose = pose.clone()
cen_switch = pr_sm.SwitchResidueTypeSetMover("centroid")
cen_switch.apply(pose)
... # do something
recover_sidechains = pr_sm.ReturnSidechainMover(pose)
recover_sidechains.apply(pose)


NB. That the switch happens in place, so it is best to avoid mark in the variable name if it is centroid or not.

Centroid is good for some applications, but it not really intended for this. Also requires a centroid scorefunction and it's own variant of ClassicRelax, CentroidRelax.

print(cen_relax.get_scorefxn().get_name()) # 'cen_std'
pyrosetta.create_score_function('cen_std_smooth')
cen_relax = pyrosetta.rosetta.protocols.relax.CentroidRelax()
#cen_relax.set_movemap(movemap)
cen_relax.set_scorefxn(scorefxn_smooth)
#cen_relax.set_rounds(5)
# cen_relax.min_type() # 'lbfgs_armijo_nonmonotone'
cen_relax.apply(pose)

This occasionally actually may crash CEN atoms are too close without recovering the best pose when giving the errors "NAN occurred in H-bonding calculations!" or "AtomTree torsion_angle_dof_id angle range error".

Best of five

So far, I have discussed methods to analyse a specific model. However, AlphaFold2 returns multiple and it is nice to check all. Here are some methods I used to look at the interface.

Exploring the folder of goodies

From a ColabFold notebook run one gets the following files:

  • msa.pickle
  • msa_coverage.png
  • rank_{👾}model{🗿}_ptmseed{🌰}.png
  • rank_{👾}model{🗿}_ptmseed{🌰}_pae.json
  • rank_{👾}model{🗿}_ptmseed{🌰}_relaxed.pdb
  • rank_1model{🗿}_ptmseed{🌰}_unrelaxed.pdb
  • settings.txt

Where 👾 is the rank (value between 1 and N, where N=5 is the default) and 🗿 is the model number and 🌰 is the seed value, default is zero. The JSON is the same as downloaded (see above for reshaping function). Settings.txt contains the summary scores in addition to the settings.

Rank 1 is the best according to the chosen metric, which can be misleading, especially if the top are close.

Pandas

To keep the data in a tidy way, I used pandas via the following method (which does not require pyrosetta):

def make_AF2_dataframe(folder:str) -> pd.DataFrame:
    """
    Given a folder from ColabFold return a pandas dataframe
    with key rank index
    and value a dictionary of details

    This is convoluted, but it may have been altered by a human.
    """
    filenames = os.listdir(folder)
    # group files
    ranked_filenames = {}
    for filename in filenames:
        if not re.match('rank_\d+.*\.pdb', filename):
            continue
        rex = re.match('rank_(\d+)_model_(\d+).*seed_(\d+)(.*)\.pdb', filename)
        rank = int(rex.group(1))
        model = int(rex.group(2))
        seed = int(rex.group(3))
        other = rex.group(4)
        if rank in ranked_filenames and ranked_filenames[rank]['relaxed'] == True:
            continue
        data = dict(name=filename,
                   path=os.path.join(folder, filename),
                  rank=rank,
                  model=model,
                   seed=seed,
                  relaxed='_relaxed' in other
                 )
        ranked_filenames[rank] = data
    # make dataframe
    details = pd.DataFrame(list(ranked_filenames.values()))
    # add data from settings.
    pLDDTs = {}
    pTMscores = {}
    with open(os.path.join(folder, 'settings.txt'), 'r') as fh:
        for match in re.findall('rank_(\d+).*pLDDT\:(\d+\.\d+)\ pTMscore:(\d+\.\d+)', fh.read()):
            pLDDTs[int(match[0])] = float(match[1])
            pTMscores[int(match[0])] = float(match[2])
    details['pLDDT'] = details['rank'].map(pLDDTs)
    details['pTMscore'] = details['rank'].map(pTMscores)
    return details

So running scores = make_AF2_dataframe(folder = 'prediction_38cac') gives me:

rank model seed relaxed pLDDT pTMscore
1 5 0 True 78.21 0.6023
2 4 0 False 77.82 0.5788
3 2 0 False 77.33 0.5772
4 1 0 False 77.05 0.5866
5 3 0 False 76.87 0.5711

Then I fetch the errors and poses, but without putting them in the table:

import json

def get_folder_errors(scores: pd.DataFrame, path_column:str='path') -> dict:
    errors = dict()
    for i, row in scores.iterrows():
        filename = row[path_column].replace('.pdb', '.json')\
                                   .replace('_unrelaxed', '_pae')\
                                   .replace('_relaxed', '_pae')
        with open(filename, 'r') as fh:
            errors[row['rank']] = ph.reshape_errors(json.load(fh))
    return errors

poses = get_folder_poses(scores)
errors = get_folder_errors(scores)

def get_folder_poses(df: pd.DataFrame, path_column:str='path') -> pyrosetta.rosetta.utility.Dict[int, pyrosetta.Pose]:
    poses = dict() # Not pyrosetta.rosetta.utility.vector1_core_pose_Pose(len(df))
    for i, row in df.iterrows():
        poses[row['rank']] = pyrosetta.pose_from_file(row[path_column])
    return poses

In an earlier implementation I was using pyrosetta.rosetta.utility.vector1_core_pose_Pose as opposed to a dictionary, which actually returns a clone when an array item was retrieved.

Having done that I got some measurements of the complexes:

import json
from functools import partial
import pyrosetta_help as ph
pr_rs = pyrosetta.rosetta.core.select.residue_selector

def _get_interaction_vectors(pose:pyrosetta.Pose, chain_id:int, threshold:float = 3.) -> pr_rs.ResidueVector:
    """
    For ``analyse_complex``.
    """
    assert pose.num_chains() > 1, 'Single chain!'
    chain_sele = pr_rs.ChainSelector(chain_id)
    other_chains_sele =pr_rs.NotResidueSelector(chain_sele)
    cc_sele = pr_rs.CloseContactResidueSelector()
    cc_sele.central_residue_group_selector(other_chains_sele)
    cc_sele.threshold(float(threshold))
    other_cc_sele = pr_rs.AndResidueSelector(chain_sele, cc_sele)
    return pr_rs.ResidueVector(other_cc_sele.apply(pose))

def _get_errors(row, i):
    """
    For ``analyse_complex``.
    """
    residues = row[f'interchain_residues_{i}']
    error = errors[row['rank']]
    return ([np.round(np.min(error[r-1,:]), 1) for r in residues])

def analyse_complex(poses, details: pd.DataFrame):
    details['interchain_residues_1'] = details['rank'].apply(lambda rank: _get_interactions(poses[rank], 1)) 
    details['interchain_residues_2'] = details['rank'].apply(lambda rank: _get_interactions(poses[rank], 2)) 
    details['N_interchain_residues_1'] = details['interchain_residues_1'].apply(len)
    details['N_interchain_residues_2'] = details['interchain_residues_2'].apply(len)
    details['errors_interchain_residues_1'] = details.apply(partial(_get_errors, i=1), 1)
    details['errors_interchain_residues_2'] = details.apply(partial(_get_errors, i=2), 1)

Namely analyse_complex adds a column to the dataframe for the interface residues in chain 1, one for those in chain 2 and two columns for the number of these.

Note that the residue indices are pose residue indices, not PDB residue numbers, so chain B will continue on from the numbering of chain A.

The median pLDDT for the interface residues can be calculated:

def get_median_interface_bfactor(pose, residues):
    pbd_info = pose.pdb_info()
    bfactors = [pbd_info.bfactor(r, pose.residue(r).atom_index('CA')) for r in residues]
    return np.array(bfactors).median()

def get_ibf(row):
    residues = list(row.interchain_residues_1) + list(row.interchain_residues_2)
    pose = poses[row['rank']] # ``row.rank`` is a function, just like ``row.name``.
    return get_median_interface_bfactor(pose, residues)

# run:
details['interface_pLDDT'] = details.apply(get_ibf ,1)

For the interface strength, I need to minimise first:

scorefxn = pyrosetta.get_fa_scorefxn()
ap_st = pyrosetta.rosetta.core.scoring.ScoreType.atom_pair_constraint
scorefxn.set_weight(ap_st, 1)  
for pose, error in zip(poses, errors):
    ph.constrain_distances(pose, error)
    relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3)
    relax.apply(pose)

# run:
details['dG'] = details['rank'].apply(lambda rank: scorefxn(poses[rank]))

Then I can calculate the interface strength:

def calculate_interface(details, interface = 'A_B'):
    newdata = []
    for rank in scores['rank']:
        ia = pyrosetta.rosetta.protocols.analysis.InterfaceAnalyzerMover(interface)
        ia.apply(poses[rank])
newdata.append({'complex_energy': ia.get_complex_energy(), 'separated_interface_energy': ia.get_separated_interface_energy(), 'complexed_sasa': ia.get_complexed_sasa(), 'crossterm_interface_energy': ia.get_crossterm_interface_energy(), 'interface_dG': ia.get_interface_dG(), 'interface_delta_sasa': ia.get_interface_delta_sasa()}) # adding multiple columns, hence why not apply route. # the order is not chanced, so all good newdata = pd.DataFrame(newdata) for column in newdata.columns: details[column] = newdata[column]

1 comment:

  1. Hi, there.
    I'm a newbie to pyrosetta and struggling to learn pyrosetta. I found your blog when I goggled 'pyrosetta'. May I ask you a question? I cannot find answer...
    I want to use 'MPIJobDistributor' to utilize my multiple cpu. But I cannot find the information how to use this function.

    In detail, I tried to replace 'PyJobDistributor' with 'MPIJobDistributor' refering to the pyrosetta tutorial (https://nbviewer.org/github/RosettaCommons/PyRosetta.notebooks/blob/master/notebooks/07.02-Docking-Moves-in-Rosetta.ipynb).
    I tried the following: 'jd = MPIJobDistributor("output", 10, scorefxn_low)', or'jd = MPIJobDistributor(10, socrefxn_low)' It read "name 'logger' is not defined". Then I tried 'pyrosetta.mpi.mpi_init(10)', but it didn't work...
    Could you give some help about it..?
    Any advices would be greatly appreciated.

    ReplyDelete