Filling missing loops —the proper way

Saturday 4 July 2020

Filling missing loops —the proper way

Previously, I posted about how to join proteins and add missing loops the shoddy way. Now I'll address how to do it correctly, using Rosetta or Pyrosetta —I am sorry this has been so long overdue.
Since posting this, I realised one can do it even faster by hijacking the threading algorithm, which albeit not it's intended purpose works fine for fixing a structure without supervision —which the following discussed methods do.

When to do it

First, I should stress that placing missing loops ought to be done to prevent errors in certain applications and potentially some back of the envelope calculations, while it should not really be done for ligand or protein docking unless the loop is small for the same reasons why docking against an ab initio model is a bad idea. I am on the fence about doing it for aesthetical reasons, as on one hand, it is made up by the computer and less reliable than a crystal structure, especially if joining protein domains, while on the other hand discontinuities very often confuse the readers or viewers. PyMOL adds a nice dotted line, while NGL does not, making me realise how useful the dotted line across discontinuities is.

I much prefer adding loops and joining domains in Rosetta, but it does require some coding, hence if you rather do it more manually, see my older post.

Side note: Coot

Adding or correcting a loop in Coot is most correct way if there is density there. Most structures are refined in Coot, so there is no chance that the author of the PDB structure did not already try this. Coot is a visual refinement program that allows one to fit residues based on a density map. However, if there is no density, a loop can totally be added but it would be a total guess and actually the forcefield based approach using Rosetta will be more accurate than fitting to tenuous density in Coot, which could be mislead by water chains and relies only on geometry. If one wanted to do so, Coot is standalone or included in the CCP4 package. Open in Coot the PDB and the reciprocal map (mtz file) —Coot solves the phases (absent in reciprocal space) based on the PDB. Look at the last present residue and check that the carbonyl or sidechain aren't in the backbone mesh —use the Refine Zone tool to fix it. Then press Calculate>Fit loop, add the missing sequence and accept the best loop. However, if there is no loop modelled in a deposited structure it means that there is no electron density, so using Rosetta is a far better option. O

How to do it: Remodel

I am a big fan of Rosetta. It is a very flexible toolkit. And there are multiple ways to add loops. Without and with Python.


Rosetta Remodel is a standalone application that takes modifies a protein based upon a blueprint file, which is a list of residue index, name, secondary structure and a command instructing what to do there. It is designed for protein design and is a really great tool —I made my Christmas-tree protein with it. For example:
Will do two mutations M1R and either A2E or A2D  depending what has the lowest Gibbs. Residues to be left alone take the form 1 M ., whereas 3 S EMPTY NC HSR will change residue 3 to the non-canonical amino acid specified —assuming its topology file has been specified for both full atom (-extra_res_fa) and centroid (-extra_res_cen). omitting a residue (and allowing the nearby ones to wiggle to compensate) will delete it, while 0 X L PIKAA N will add a asparagine (again neighbours permitting). The neighbours note is because any mutation needs the nearby residues to be repacked (without redesigning) (10 W NATAA) or else it would be impossible to close the cutpoint, especially for gap closing as will become clear in the pyrosetta section of this post.

Remodelling loops

To model loops, one simply gets the blueprint file generated via the Rosetta perl script (or via four lines of python) and adds the missing residues lines via the 0 X L PIKAA N style syntax and changes the preceding and succeeding residues to 10 S L NATAA syntax or 10 S L PIKAA S style syntax.
Remodel used a centroid step (coarse grain) using a generic amino acid, the gets a bunch of fragments (automatically!) that best suit and then does full atom step. The generic amino acid is by default valine, which is great for sheets and helices, but not so much for loops. So using the flag generic_aa G is a good option as this tells the default rough fitting (centroid mode) to use glycine and not valine as generic residue.
The top pose will actually not have a superb score, so a 3 (regular) or more cycles (15 thorough) with Rosetta Relax application either of the whole protein or with a movemap specific to the linker, may be done.


Pose numbering and single chain

The PDB residues and pose residues must match, i.e. the first residue is 1 and the whole protein should be on a single chain —the chain can have cutpoints, so it is a simply a question of changing it in PyMOL, say, using the command alter chain B, resv+=100, alter chain B, chain='A' etc. including shifting the residues after the gap. Also, ligands may crash it. Non-canonical amino acids are fine as proven by my other post about Rosetta and isopeptide bonds.

In-frame deletion

Here I have been talking about loop modelling for aesthetics, if you are doing it for science, then you need to do a lot of replicates, with proper controls. For example, if you want to calculate the difference in Gibbs free energy from an in-frame deletion, you need to run  remodel with a high number of replicates for both with both the deletion and without, both with the same neighbouring residues that are recalcuated. After a local FastRelax, if the score of the latter converges within 1-2 kcal/mol to the original, then the number of replicates is satisfactory. In terms of the score, a fair comparison is getting the difference between the free energy of the common residues.

How to do it: Pyrosetta

There are many movers (=samplers) in Rosetta that can model loops. But these work with a similar principle, namely you need two anchoring points and a cutpoint. The loop gets wiggled around from the anchors and the cutpoint is the cut end that gets wiggled around to try and meet the next residue with one of many fancy algorithms —the robotics-inspired kinematic closure (KIC) is by far the coolest sampler name in computational biochemistry.

Get your pose

Like for remodel, ligands are not okay and will cause a segmentation fault (if you don't want your main process to die, check out my post about processes and segmentation faults). So do not add any ligands.
import pyrosetta
pyrosetta.init(extra_options='-no_optH false -mute all -ignore_unrecognized_res true -load_PDB_components false')
pose = pyrosetta.rosetta.core.pose.Pose()
#params_paths = pyrosetta.rosetta.utility.vector1_string()
#params_paths.extend(['E4S.params']) # No! This is forbidden!
#resiset = pyrosetta.generate_nonstandard_residue_set(pose, params_paths)
pyrosetta.rosetta.core.import_pose.pose_from_file(pose, 'pdb6bq1.ent')
# or use pose_from_pdbstring or pose_from_file etc. or the toolbox module...

Add missing residues

Unlike Remodel missing residues are not added and has to be done manually. If you were to make a pose from sequence and graft the pose in the pose you like this:
# create the insert with nice PDB numbering
insert = pyrosetta.pose_from_sequence(sequence)
for i in range(1, insert.total_residue() + 1):
    insert.pdb_info().chain(i, your_chain)
    insert.pdb_info().number(i, i + first_gap_pdb_resi - 1)
# make a fold tree
ft = pyrosetta.FoldTree()
# say i is the pose residue position after the gap
ft.new_jump(i -1,i ,i -1)
assert ft.check_fold_tree(), f'Error in foldtree {ft}'
# add pose into pose
n = ft.num_jump()
pose.append_pose_by_jump(gap_pose, n)
This will result in the insert being it totally the wrong place, namely it has two cutpoints, which is not what we want. So to make this approach work, you'd need to add an overlapping residue and align from there —there are many nice functions in the grafting module that do this.
Instead it is way easier to extend the residues one by one. In PyMOL the fab command is fabulous at making new residues and one can control chain, residue and it will be extended from there, whereas pyrosetta.pose_from_sequence is a bit more limited. So the residues need to be added one at the time, via a slightly convoluted way of making new residues.
 # get residue
# first, get the relevant set of residue types.
# In the pose import there was residuetype set commented out: resiset = pyrosetta.generate_nonstandard_residue_set(pose, params_paths)
chm = pyrosetta.rosetta.core.chemical.ChemicalManager.get_instance()
resiset = chm.residue_type_set( 'fa_standard' )
res_type = resiset.get_representative_type_name1(single_letter_name) #e.g. A
residue = pyrosetta.rosetta.core.conformation.ResidueFactory.create_residue(res_type)
pose.append_polymer_residue_after_seqpos(residue, previous + 1, True)
The residue type could have been retrieved with resiset.name_map( three_letter_name )or even more manually:

v = pyrosetta.rosetta.core.chemical.ResidueTypeFinder(resiset).get_all_possible_residue_types()
ligtype = [vv for vv in v if vv.name3() == name][0] # or whatever property!

This is not a terminal

One thing to be aware of is the ends of the residues may be termini —if you run the code you end up with two protons on the backbone nitrogen and possibly with a OXT atom. Consequently remove the terminus type. I personally get too confused trying to remember which is upper and which lower, so why not just clear both terminus patches (no segmentation fault) and not worry which side is which (for the record it's Cprevious ← N:UPPER_CONNECT — CA — C:LOWER_CONNECT → Nnext).

rm_upper = pyrosetta.rosetta.core.conformation.remove_upper_terminus_type_from_conformation_residue
rm_lower = pyrosetta.rosetta.core.conformation.remove_lower_terminus_type_from_conformation_residue
rm_upper(pose.conformation(), i)
rm_lower(pose.conformation(), i)

Define the loop

Now the loop needs to be closed. If using a method that requires jumps or one that requires loops, one thing to make sure is that the cutpoint is correct. The cutpoint position is the position of the residue before the discontinuity, so in our case the last added residue.

# close loop using loops
loops = pyrosetta.rosetta.protocols.loops.Loops()
# say previous is the residue taht was preset before the gap.
loop = pyrosetta.rosetta.protocols.loops.Loop(previous - 1, 
                                              cutpoint + 2, 
# cutpoint cannot be autochosen, it has to be the last residue added!
# print(loops) gives a nice output
Done. That was easy —once done right! Now, let's fix it.

Loop modelling

This step is amply described online for each of the various methods.
lm = pyrosetta.rosetta.protocols.loop_modeler.LoopModeler()
# these are enabled by default. here for quick changing.
This worked well, other movers work in a similar way with the same caveats about which is the cutpoint and the lack of termini. Say:
kic_mover = pyrosetta.rosetta.protocols.loops.loop_closure.kinematic_closure.KinematicMover()
kic_mover.set_pivots(18, 19, 20)
In the adding of the residue was an incorrect approach that had a foldtree. A foldtree is a bit more complicated, but has various way to set up (some curious, say FoldTree().split_existing_edge_at_residue(19), which does not add a jump), but are explained in detail in the official documentation.

Fragment based

A worry with ab initio loops is that they might have torsions that aren't seen in nature despite the forcefield's best intentions (depending on the paper read though). As a consequence, one may opt for fragment based approaches. To do this, LoopModeler mover does have the method setup_kic_with_fragments_config, but requires a command line option specifying the fragment files (loops:frag_files), either during initialisation or after with pyrosetta.rosetta.basic.options.set_file_vector_option('loops::frag_files', vector) where "vector" is a pyrosetta.rosetta.utility.vector1_std_string. These can be generated and written as described here or a general version found in the examples of Rosetta.

Full house

We can go one step further and determine where the gaps are thanks to pose.pdb_info().pose2pdb

A step further could be using an electron density weighted scorefunction —as discussed in this other post— but that is a bit unnecessary as there is no density if the loop is missing... (see Coot note)


Here is an example of a python class with a class method, fix_pose that will identify and fill missing loops given a pose and a list of dictionaries with the fields
  •  chains with value the chains (e.g. ('A','B') or 'AB')
  • sequence which requires the PDB to be not offset —but that is a discussion for another time.

1 comment:

  1. I stumbled on this while struggling through how to perform loop remodeling, and I couldn't understand why Rosetta's remodel kept failing on my multi-chain .pdb. I'm very happy I found your explanation about renaming everything as a single chain - saved me a load of time!!