Filling missing loops by cannibalising AlphaFold2

I could not resist this Photoshop.
But the process is not as dramatic
and the results not as bad as Temple of Doom...
If done right.

AlphaFold2 models have a complete sequence, but for innumerable reasons the crystal structure of the protein is better, but may have missing spans. As a result one may want, for illustrative purposes only, to rip out the required parts from the AlphaFold2 models (as fragments) and have them built into the target structure. Here is how to do it by threading.

Cannibalistic threading

Note: I have made a Colab notebook that does the operations described below (here).

In two previous blog posts, I went through how to add missing loops in Rosetta or PyRosetta and how to add missing loops hackishly in PyMOL. Here I'll go through how one can add them by stealing off other structures by threading and using the fragments of the cannibalised structure to fill missing parts.

Given a structure of a homologue (the template) and a target sequence one can "thread" the positions of the latter onto the former. There are many tools out there, here I will use PyRosetta with the RosettaCM threading mover. As threading is actually meant for homologues, I will cover those too.

The reason, why I am misusing it for filling missing loops is that the infrastructure is already there: a good coder is a lazy one! Using other crystal structures as the fragment donors and I would strongly suggest this (instead of, or in addition to, the AlphaFold2 model). The AlphaFold2 model is mentioned not because of the hashtag-trendiness, but because these have a complete sequence, albeit with spaghetti loops of low quality as previously discussed, and do not blow as SwissModels do (discussed here). A nice benefit is that the number offset is also corrected.

In my pyrosetta_help module, there are several simple functions to deal with making a threaded model.
This includes a function threaded_pose, threader, unaltered = thread(unaligned_target_sequence, template_pose), which does all the operations below (alignment, threading, sidechains and ligand theft), but most times there is something technical in the way, hence why I will go through it. Otherwise, here is how to do it cryptically with that module:

uniprot = 'P08684'
template_pose = ph.parameterised_pose_from_file(ph.download_pdb('5A1R'), overriding_params=['HEM.params'])
af_pose = ph.pose_from_alphafold2(uniprot)
fragsets = ph.make_fragment_sets(af_pose)
threaded, threader, threadites = ph.thread(target_sequence=af_pose.sequence(),
                                  template_pose=template_pose,
                                  fragment_sets=fragsets)
ph.steal_ligands(template_pose, threaded)
threaded.dump_pdb('threaded.pdb')

Where P08684 is the Uniprot sequence and 5A1R which is the PDB. This has one oddity in that the PDB provided SMILES for the residue HEM (CC1=C(CCC(O)=O)C2=Cc3n4[Fe]5|6|N2=C1C=c7n5c(=CC8=N|6C(=Cc4c(C)c3CCC(O)=O)C(=C8C=C)C)c(C)c7C=C, fails to parse, so I provided the HEM.params file found in one of the Rosetta tests (it lacks the inter-residue axial donor bonds with his/cys and an oxygen, but haem is a nightmare for another time). This illustrates that a one solution fits rarely works. Anyway this is the result of the above:

Before the pitchforks come out, I should say that (a) yes, the image is from PyMOL because it's prettier, but for checking stuff in a Jupyter notebook I always opt for NGLView, and (b) the missing bits are added from fragments hence why there is not a 1:1 match, which is a good thing as otherwise there would be nastily closed loops.

Steps explained

Alignment

An important feature is having a decent pairwise alignment. In Biopython, there is a handy, albeit quirky pairwise alignment set of functions, in the module pairwise2.

from Bio import pairwise2
alignments = pairwise2.align.globalxs(target,
                                      template,
                                      -1,   # open
                                      -0.1  # extend
                                      )

Refreshingly target and template are strings not Bio.Seq.Seq instances, but there are a few oddities:

the global and local functions are not overloaded, but different mode are controlled by two letters (above xs does not mean extra-small),
the functions do not accept named arguments making them confusing ten minutes after use
Non-amino acid letters are aligned with no penalty — w, X and Z will align happily to the twenty AAs
The output alignments above is a list of tuples of different meaings. Here is an example making it more sensible:

alignments = [dict(zip(['target', 'template', 'score', 'begin', 'end'], alignment)) for alignment in alignments]

Parenthetically, there is a nice function (format_alignment) to add dots for matches etc. but if one is using a Jupyter notebook, to display the output in a sensible way a scroll bar is a must:

from Bio import pairwise2
formatted:str = pairwise2.format_alignment(*alignment)
a, gap, b, score = formatted.strip().split('\n')
gap = ''.join(['.' if c == '|' else '*' for c in gap]).replace(" ", "*")

from IPython.display import display, HTML
display(HTML(f'<div style="display: inline-block; font-family: monospace; white-space: nowrap;">'+
             f'{a}<br />{gap}<br />{b}<br />{score}</div>'))

In the case of distant protein, it may be best to align multiple sequences and trim down. Multiple sequence alignments are not possible with biopython, but there are several tools, clustal was the original kid on the block, but Muscle is more accurate according to benchmarks.

For the RosettaCM framework, alignments are stored in the "grishin format". In my aforementioned helper module, there is a function to write a grishin file, so here is an alignment with pairwise2 to a grishin file.


import pyrosetta_help as ph
alignment = ph.get_alignment(target_sequence, template_pose.sequence())
aln_file = f'{target_name}.aln'
ph.write_grishin(target_name,
                  alignment['target'],
                  template_name,
                  alignment['template'],
                  aln_file)

Movers

Obviously, if a Muscle alignment was done, the first step is different. Once a Grishin alignment file has been written, the fun can start:

align = pyrosetta.rosetta.core.sequence.read_aln(format='grishin', filename=aln_file)
threader = pyrosetta.rosetta.protocols.comparative_modeling.ThreadingMover(align=align[1],
                                                                           template_pose=template_pose)
                                                                           target_pose = pyrosetta.Pose()
pyrosetta.rosetta.core.pose.make_pose_from_sequence(target_pose, target_sequence, 'fa_standard')
threader.apply(target_pose)

This has four things to note:

If the template has termini these will be maintained
Mismatching residues and gaps will results in some residues not being placed
Sidechains have not been mimicked
Ligands have not been stolen and should not have been present in the template to start with (even for the alignment step)

Termini

A peptide's N-terminus is protonated, while the C-terminus has an oxygen called OXT. In Rosetta termini are a patch, like phosphorylations etc. To remove these:

clean_template_pose = template_pose.clone()
pyrosetta.rosetta.core.pose.remove_nonprotein_residues(clean_template_pose)
### find
lowers = pyrosetta.rosetta.utility.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
uppers = pyrosetta.rosetta.utility.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
pyrosetta.rosetta.protocols.sic_dock.get_termini_from_pose(clean_template_pose, lowers, uppers)
### remove
rm_upper = pyrosetta.rosetta.core.conformation.remove_upper_terminus_type_from_conformation_residue
rm_lower = pyrosetta.rosetta.core.conformation.remove_lower_terminus_type_from_conformation_residue
for upper, _ in uppers:
    rm_upper(clean_template_pose.conformation(), upper)
for lower, _ in lowers:
    rm_lower(clean_template_pose.conformation(), lower)

Fragments

The first means that these residues will be in their original places. So if an AlphaFold2 pose was given as the target these would be comically left behind.

In the case of creating a pose from a sequence, these will be in the weird 12Å ring structure centred around the origin, which actually is nicer to work with.

Parenthetically, if for an other application one is making a pose from sequence, but is in need of a specific secondary structure, iterating across the residues and chaining the ψ and φ angles is the way to do it —In pyrosetta_help the functions make_ss (ψ and φ to be specified), make_alpha_helical (φ=-57.8°, ψ=-47.0°), make_310_helical (φ=-74.0°, ψ=-4.0°), make_pi_helical (φ=-57.1°, ψ=-69.7°), make_sheet (φ=-139°, ψ=+135°) do this.

To solve the unthreaded bits, one can use fragments stolen from other poses.

confragset = pyrosetta.rosetta.core.fragment.ConstantLengthFragSet(3)
pyrosetta.rosetta.core.fragment.steal_constant_length_frag_set_from_pose(cannibilised_pose, confragset)  # does not blank pre-existing values

fragsets = pyrosetta.rosetta.utility.vector1_std_shared_ptr_core_fragment_FragSet_t(1)  # or however many
fragsets[1] = confragset   # .append(confragset) to add to the end, i.e. zero in the above.

The vector of sets is because one may want different constant lengths (eg. 3 & 9) or different files were read etc. Here cannibilised_pose is the pose that is cannibilised for fragments. It can even have ligands. And can be an AlphaFold2 model, which has a full sequence —AlphaFold2 models do not blow up, unlike SwissModel models, so the backbone angles must be decent enough.

cannibilised_pose = ph.pose_from_alphafold2('👾👾👾')

These can be saved too, although this is not required for threading.

fragio = pyrosetta.rosetta.core.fragment.FragmentIO()
fragio.write_data('👾👾👾.frags', confragset)

To use in threading, these need to be passed to the threading mover before applying to the pose.

threader.build_loops(True)
threader.randomize_loop_coords(True) # default
threader.frag_libs(fragsets)

Sidechains

The next issue is that the sidechains need stealing, which is were StealSideChainsMover comes in. First, this needs to be told the equivalence of the two poses. The threader mover has an object called qt_mapping, an instance of pyrosetta.rosetta.core.id.SequenceMapping, which is different that the element in the aligment vector from reading the grishin file (pyrosetta.rosetta.utility.vector1_core_sequence_SequenceAlignment) as the former maps the good pairs and not everything including the rubbish ones. I mention this as it can be handy to be know what residues are a perfect vs. imperfect match.

qt:  = threader.get_qt_mapping(target_pose)
steal = pyrosetta.rosetta.protocols.comparative_modeling.StealSideChainsMover(template_pose, qt)
steal.apply(target_pose)

mapping : pyrosetta.rosetta.utility.vector1_unsigned_long = qt.mapping()
vector = pyrosetta.rosetta.utility.vector1_bool(target_pose.total_residue())
for r in range(1, target_pose.total_residue() + 1):
    if mapping[r] != 0:  # match!
        vector[r] = 1

Ligands

The next part is stealing ligands.

The pose.sequence() outputs non-polymer residues too (w, X, Z), which PairWise2 aligns with disastrous consequences (missing span). There many useful functions, in the pose submodule, which aren't methods of Pose, one of these is:

pyrosetta.rosetta.core.pose.remove_nonprotein_residues(pose)

This operates in place, so obviously one would do it to a clone or else there'd not be anything to steal.

There is a function, StealLigandMover with the following arguments:

pyrosetta.rosetta.protocols.comparative_modeling.StealLigandMover(source: pyrosetta.rosetta.core.pose.Pose, 
                                                                  anchor_atom_dest: pyrosetta.rosetta.core.id.NamedAtomID, 
                                                                  anchor_atom_source: pyrosetta.rosetta.core.id.NamedAtomID, 
                                                                  ligand_indices: pyrosetta.rosetta.utility.vector1_core_id_NamedAtomID)

To be honest I don't quite get the benefit in this as it makes everything more complicated, but briefly the anchor business is for coordinate referencing (not connections or similar), while the ligand indices is vector1 of NamedAtomID, with one atom per residue. A NamedAtomID can be converted from an AtomID, via pyrosetta.rosetta.core.conformation.atom_id_to_named_atom_id, but it requires a residue instance (pose.residue(res_index)) anyway. The major issue is that one has to know what the residue indices of interest are anyway. If that is the case simply using pyrosetta.rosetta.core.pose.append_subpose_to_pose is quicker.

As a results, one way to do this is by taking blindly all non-protein residues:

PROTEIN =  pyrosetta.rosetta.core.chemical.ResidueProperty.PROTEIN
prot_sele = pyrosetta.rosetta.core.select.residue_selector.ResiduePropertySelector(PROTEIN)
not_sele = pyrosetta.rosetta.core.select.residue_selector.NotResidueSelector(prot_sele)
rv = pyrosetta.rosetta.core.select.residue_selector.ResidueVector(not_sele.apply(donor_pose))
for res in rv:
    pyrosetta.rosetta.core.pose.append_subpose_to_pose(acceptor_pose, donor_pose, res, res, True)

I should mention that append_subpose_to_pose unlike pyrosetta.Pose(donor_pose, start_res, end_res) copies the residue types so is works fine, but does not like copying more than one ligand. For example doing:

for from_res, to_res in ph.rangify(rv):
    pyrosetta.rosetta.core.pose.append_subpose_to_pose(acceptor_pose, donor_pose, from_res, to_res, True)

Will give the error Can't create a polymer bond after residue n due to incompatible type: XXX. Assuming one is certain the residues are there, one could use the ResidueNameSelector (if the ligand residue does not exist it will raise an error):

resn_sele = pyrosetta.rosetta.core.select.residue_selector.ResidueNameSelector()
            resn_sele.set_residue_name3(','.join(wanted_ligands))

Huston, we have a problem



In some cases, an error may have occurred but there is a glitched outputted structure.

Generally it is wiser to try to figure out the why than hack it, but if one were in a pinch one could simply align and steal the coordinates. For example, say we know everything after 498 is bad, then we could align the backbone of this and copy over:
# Viewer discretion adviced
mapping = pyrosetta.rosetta.std.map_core_id_AtomID_core_id_AtomID()
r = 498
assert threaded.residue(r).natoms() == af_pose.residue(r).natoms()
for a in threaded.residue(r).mainchain_atoms():  # should always be 1,2,3
    mapping[pyrosetta.AtomID(a, r)] = pyrosetta.AtomID(a, r)
pyrosetta.rosetta.core.scoring.superimpose_pose(af_pose, threaded, mapping)
for r in range(r, af_pose.total_residue()+1):
    assert threaded.residue(r).natoms() == af_pose.residue(r).natoms()
    print(r)
    for a in range(1, threaded.residue(r).natoms() + 1):
        threaded.residue(r).set_xyz(a, af_pose.residue(r).xyz(a))
If an in-between bit is the problem one would have to close the gap as discussed here.

The art of blowing up protein

Pages

Sunday 17 October 2021