I could not resist this Photoshop. But the process is not as dramatic and the results not as bad as Temple of Doom... If done right. |
Cannibalistic threading
In my pyrosetta_help module, there are several simple functions to deal with making a threaded model.
This includes a function
threaded_pose, threader, unaltered = thread(unaligned_target_sequence, template_pose)
, which does all the operations below (alignment, threading, sidechains and ligand theft), but most times there is something technical in the way, hence why I will go through it. Otherwise, here is how to do it cryptically with that module:uniprot = 'P08684'
template_pose = ph.parameterised_pose_from_file(ph.download_pdb('5A1R'), overriding_params=['HEM.params'])
af_pose = ph.pose_from_alphafold2(uniprot)
fragsets = ph.make_fragment_sets(af_pose)
threaded, threader, threadites = ph.thread(target_sequence=af_pose.sequence(),
template_pose=template_pose,
fragment_sets=fragsets)
ph.steal_ligands(template_pose, threaded)
threaded.dump_pdb('threaded.pdb')
Where P08684 is the Uniprot sequence and 5A1R which is the PDB. This has one oddity in that the PDB provided SMILES for the residue HEM (CC1=C(CCC(O)=O)C2=Cc3n4[Fe]5|6|N2=C1C=c7n5c(=CC8=N|6C(=Cc4c(C)c3CCC(O)=O)C(=C8C=C)C)c(C)c7C=C
, fails to parse, so I provided the HEM.params file found in one of the Rosetta tests (it lacks the inter-residue axial donor bonds with his/cys and an oxygen, but haem is a nightmare for another time). This illustrates that a one solution fits rarely works. Anyway this is the result of the above:
Before the pitchforks come out, I should say that (a) yes, the image is from PyMOL because it's prettier, but for checking stuff in a Jupyter notebook I always opt for NGLView, and (b) the missing bits are added from fragments hence why there is not a 1:1 match, which is a good thing as otherwise there would be nastily closed loops.
Steps explained
Alignment
from Bio import pairwise2
alignments = pairwise2.align.globalxs(target,
template,
-1, # open
-0.1 # extend
)
Refreshingly target and template are strings not Bio.Seq.Seq
instances, but there are a few oddities:
- the global and local functions are not overloaded, but different mode are controlled by two letters (above xs does not mean extra-small),
- the functions do not accept named arguments making them confusing ten minutes after use
- Non-amino acid letters are aligned with no penalty — w, X and Z will align happily to the twenty AAs
- The output
alignments
above is a list of tuples of different meaings. Here is an example making it more sensible:
alignments = [dict(zip(['target', 'template', 'score', 'begin', 'end'], alignment)) for alignment in alignments]
Parenthetically, there is a nice function (format_alignment
) to add dots for matches etc. but if one is using a Jupyter notebook, to display the output in a sensible way a scroll bar is a must:
from Bio import pairwise2
formatted:str = pairwise2.format_alignment(*alignment)
a, gap, b, score = formatted.strip().split('\n')
gap = ''.join(['.' if c == '|' else '*' for c in gap]).replace(" ", "*")
from IPython.display import display, HTML
display(HTML(f'<div style="display: inline-block; font-family: monospace; white-space: nowrap;">'+
f'{a}<br />{gap}<br />{b}<br />{score}</div>'))
In the case of distant protein, it may be best to align multiple sequences and trim down. Multiple sequence alignments are not possible with biopython, but there are several tools, clustal was the original kid on the block, but Muscle is more accurate according to benchmarks.
For the RosettaCM framework, alignments are stored in the "grishin format". In my aforementioned helper module, there is a function to write a grishin file, so here is an alignment with pairwise2 to a grishin file.
import pyrosetta_help as ph
alignment = ph.get_alignment(target_sequence, template_pose.sequence())
aln_file = f'{target_name}.aln'
ph.write_grishin(target_name,
alignment['target'],
template_name,
alignment['template'],
aln_file)
Movers
Obviously, if a Muscle alignment was done, the first step is different. Once a Grishin alignment file has been written, the fun can start:
align = pyrosetta.rosetta.core.sequence.read_aln(format='grishin', filename=aln_file)
threader = pyrosetta.rosetta.protocols.comparative_modeling.ThreadingMover(align=align[1],
template_pose=template_pose)
target_pose = pyrosetta.Pose()
pyrosetta.rosetta.core.pose.make_pose_from_sequence(target_pose, target_sequence, 'fa_standard')
threader.apply(target_pose)
This has four things to note:
- If the template has termini these will be maintained
- Mismatching residues and gaps will results in some residues not being placed
- Sidechains have not been mimicked
- Ligands have not been stolen and should not have been present in the template to start with (even for the alignment step)
Termini
A peptide's N-terminus is protonated, while the C-terminus has an oxygen called OXT
. In Rosetta termini are a patch, like phosphorylations etc. To remove these:
clean_template_pose = template_pose.clone()
pyrosetta.rosetta.core.pose.remove_nonprotein_residues(clean_template_pose)
### find
lowers = pyrosetta.rosetta.utility.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
uppers = pyrosetta.rosetta.utility.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
pyrosetta.rosetta.protocols.sic_dock.get_termini_from_pose(clean_template_pose, lowers, uppers)
### remove
rm_upper = pyrosetta.rosetta.core.conformation.remove_upper_terminus_type_from_conformation_residue
rm_lower = pyrosetta.rosetta.core.conformation.remove_lower_terminus_type_from_conformation_residue
for upper, _ in uppers:
rm_upper(clean_template_pose.conformation(), upper)
for lower, _ in lowers:
rm_lower(clean_template_pose.conformation(), lower)
Fragments
The first means that these residues will be in their original places. So if an AlphaFold2 pose was given as the target these would be comically left behind.
Parenthetically, if for an other application one is making a pose from sequence, but is in need of a specific secondary structure, iterating across the residues and chaining the ψ and φ angles is the way to do it —In pyrosetta_help
the functions make_ss
(ψ and φ to be specified), make_alpha_helical
(φ=-57.8°, ψ=-47.0°), make_310_helical
(φ=-74.0°, ψ=-4.0°), make_pi_helical
(φ=-57.1°, ψ=-69.7°), make_sheet
(φ=-139°, ψ=+135°) do this.
To solve the unthreaded bits, one can use fragments stolen from other poses.
confragset = pyrosetta.rosetta.core.fragment.ConstantLengthFragSet(3)
pyrosetta.rosetta.core.fragment.steal_constant_length_frag_set_from_pose(cannibilised_pose, confragset) # does not blank pre-existing values
fragsets = pyrosetta.rosetta.utility.vector1_std_shared_ptr_core_fragment_FragSet_t(1) # or however many
fragsets[1] = confragset # .append(confragset) to add to the end, i.e. zero in the above.
The vector of sets is because one may want different constant lengths (eg. 3 & 9) or different files were read etc. Here cannibilised_pose
is the pose that is cannibilised for fragments. It can even have ligands. And can be an AlphaFold2 model, which has a full sequence —AlphaFold2 models do not blow up, unlike SwissModel models, so the backbone angles must be decent enough.
cannibilised_pose = ph.pose_from_alphafold2('👾👾👾')
These can be saved too, although this is not required for threading.
fragio = pyrosetta.rosetta.core.fragment.FragmentIO()
fragio.write_data('👾👾
👾.frags', confragset)
To use in threading, these need to be passed to the threading mover before applying to the pose.
threader.build_loops(True)
threader.randomize_loop_coords(True) # default
threader.frag_libs(fragsets)
Sidechains
StealSideChainsMover
comes in. First, this needs to be told the equivalence of the two poses. The threader mover has an object called qt_mapping
, an instance of pyrosetta.rosetta.core.id.SequenceMapping
, which is different that the element in the aligment vector from reading the grishin file (pyrosetta.rosetta.utility.vector1_core_sequence_SequenceAlignment
) as the former maps the good pairs and not everything including the rubbish ones. I mention this as it can be handy to be know what residues are a perfect vs. imperfect match.qt: = threader.get_qt_mapping(target_pose)
steal = pyrosetta.rosetta.protocols.comparative_modeling.StealSideChainsMover(template_pose, qt)
steal.apply(target_pose)
mapping : pyrosetta.rosetta.utility.vector1_unsigned_long = qt.mapping()
vector = pyrosetta.rosetta.utility.vector1_bool(target_pose.total_residue())
for r in range(1, target_pose.total_residue() + 1):
if mapping[r] != 0: # match!
vector[r] = 1
Ligands
The next part is stealing ligands.
The pose.sequence()
outputs non-polymer residues too (w, X, Z), which PairWise2 aligns with disastrous consequences (missing span). There many useful functions, in the pose
submodule, which aren't methods of Pose
, one of these is:
pyrosetta.rosetta.core.pose.remove_nonprotein_residues(pose)
This operates in place, so obviously one would do it to a clone or else there'd not be anything to steal.
There is a function, StealLigandMover
with the following arguments:
pyrosetta.rosetta.protocols.comparative_modeling.StealLigandMover(source: pyrosetta.rosetta.core.pose.Pose,
anchor_atom_dest: pyrosetta.rosetta.core.id.NamedAtomID,
anchor_atom_source: pyrosetta.rosetta.core.id.NamedAtomID,
ligand_indices: pyrosetta.rosetta.utility.vector1_core_id_NamedAtomID)
To be honest I don't quite get the benefit in this as it makes everything more complicated, but briefly the anchor business is for coordinate referencing (not connections or similar), while the ligand indices is vector1 of NamedAtomID, with one atom per residue. A NamedAtomID can be converted from an AtomID, via pyrosetta.rosetta.core.conformation.atom_id_to_named_atom_id
, but it requires a residue instance (pose.residue(res_index)
) anyway. The major issue is that one has to know what the residue indices of interest are anyway. If that is the case simply using pyrosetta.rosetta.core.pose.append_subpose_to_pose
is quicker.
As a results, one way to do this is by taking blindly all non-protein residues:
PROTEIN = pyrosetta.rosetta.core.chemical.ResidueProperty.PROTEIN
prot_sele = pyrosetta.rosetta.core.select.residue_selector.ResiduePropertySelector(PROTEIN)
not_sele = pyrosetta.rosetta.core.select.residue_selector.NotResidueSelector(prot_sele)
rv = pyrosetta.rosetta.core.select.residue_selector.ResidueVector(not_sele.apply(donor_pose))
for res in rv:
pyrosetta.rosetta.core.pose.append_subpose_to_pose(acceptor_pose, donor_pose, res, res, True)
I should mention that append_subpose_to_pose
unlike pyrosetta.Pose(donor_pose, start_res, end_res)
copies the residue types so is works fine, but does not like copying more than one ligand. For example doing:
for from_res, to_res in ph.rangify(rv):
pyrosetta.rosetta.core.pose.append_subpose_to_pose(acceptor_pose, donor_pose, from_res, to_res, True)
Will give the error Can't create a polymer bond after residue n due to incompatible type: XXX
. Assuming one is certain the residues are there, one could use the ResidueNameSelector
(if the ligand residue does not exist it will raise an error):
resn_sele = pyrosetta.rosetta.core.select.residue_selector.ResidueNameSelector()
resn_sele.set_residue_name3(','.join(wanted_ligands))
Huston, we have a problem
In some cases, an error may have occurred but there is a glitched outputted structure.
Generally it is wiser to try to figure out the why than hack it, but if one were in a pinch one could simply align and steal the coordinates. For example, say we know everything after 498 is bad, then we could align the backbone of this and copy over:# Viewer discretion adviced
mapping = pyrosetta.rosetta.std.map_core_id_AtomID_core_id_AtomID()
r = 498
assert threaded.residue(r).natoms() == af_pose.residue(r).natoms()
for a in threaded.residue(r).mainchain_atoms(): # should always be 1,2,3
mapping[pyrosetta.AtomID(a, r)] = pyrosetta.AtomID(a, r)
pyrosetta.rosetta.core.scoring.superimpose_pose(af_pose, threaded, mapping)
for r in range(r, af_pose.total_residue()+1):
assert threaded.residue(r).natoms() == af_pose.residue(r).natoms()
print(r)
for a in range(1, threaded.residue(r).natoms() + 1):
threaded.residue(r).set_xyz(a, af_pose.residue(r).xyz(a))
If an in-between bit is the problem one would have to close the gap as discussed here.
No comments:
Post a Comment