XML to Pyrosetta: EvolutionaryDynamicsMover as an example

In the previous post I discuss the strategies to use a Pyrosetta class when the documentation lets you down. One topic discussed was the conversion of a Rosetta XML script to Pyrosetta. Here is an example, namely using the EvolutionaryDynamics mover as an example.

Evolution

While looking for something I spotted a mover with an interesting name "EvolutionaryDynamicsMover" in the namespace protocols::evolution / protocols.evolution. I did my PhD in the lab of Prof. Wayne Patrick in New Zealand, who fucuses on enzyme evolution, enzyme promiscuity and primordial enzymes —therefore I could not leave this mover move by! So I looked into it and, now that I wanted a test example, I thought to dig it back out again.

The mover name suggests that the mover mutates the protein by following evolutionary rules, which is rather different that other design movers.

Some mutations that are neutral become fixed in the population by statistical chance.
Some mutations require compensating mutations to be effective (epistasis) and as result would arise in Nature only if one were of the two were neutral. whereas with a design mover they would be sampled toghether.
A mutations in a genome arise not with an equiprobable towards any amino acids, but based on a mutational spectrum on DNA.

In the wet lab, the latter hits hard when doing protein engineering via random mutagenesis as it is hard to control the spectrum of amino acid changes. Actually, I wrote/adopted a suite of tools (direvo.mutanalyst.com) for web lab scientists to aid in planning or assessing the outcome of a random mutagenesis experiment —for example, Mutanalysts calculates the mutational spectrum and load at the DNA level, PedelAA calculates the diversity of amino acid mutations.

Mechanism and aims

In the C++ source code I found the name of the main author and searched for any relevant papers, but could not find it. As a result I dropped an email asking about the application/rationale of the mover and got a very helpful reply back.

The unpublished module is intended to simulate evolutionary trajectories where the protein aquires mutations based on the DNA sequence and the sequence drifts unless the mutation brings the Gibbs free energy below a given threshold —which is very cool.

Differences

I should stress that there are many movers that mutate proteins intended for design, whereas here I am strictly talking about evolution. For example:

pyrosetta.rosetta.protocols.simple_moves.MutateResidue mutates a specified residue
pyrosetta.rosetta.protocols.protein_interface_design.movers.RandomMutation adds a random mutation (based on a packer task)
pyrosetta.rosetta.protocols.design_opt.GreedyOptMutationMover which adds sets of mutations iteritively (based on a packer task)
pyrosetta.rosetta.protocols.forge.remodel.RemodelMover adds mutations and indels based on a blueprint
pyrosetta.rosetta.protocols.minimization_packing.PackRotamersMover, pyrosetta.rosetta.protocols.relax.FastRelax etc. design based on a packer task and return the best scoring

XML to Pyrosetta

XML

In the test folder there is the following RosettaScripts XML:

<ROSETTASCRIPTS>
	<SCOREFXNS>
		<ScoreFunction name="beta_nov15_cart" weights="beta_nov15_cart"/>
	</SCOREFXNS>
	<TASKOPERATIONS/>
	<FILTERS>
		<ScoreType confidence="0" name="total_score" score_type="total_score" scorefxn="beta_nov15_cart" threshold="100000"/>
		<Sigmoid filter="total_score" name="stability" negate="0" offset="0" steepness="0.6"/>
	</FILTERS>
	<MOVERS>
		<NucleotideMutation bbnbrs="0" flexbb="0" name="mut" scorefxn="beta_nov15_cart"/>
		<EvolutionaryDynamics drift="1" filter_name="stability" mover_name="mut" name="evolve" population_size="10000" preapply="0" progress_file="trajectory_out.txt" recover_low="0" reset_baselines="0" scorefxn_name="beta_nov15_cart" trials="3"/>
	</MOVERS>
	<PROTOCOLS>
		<Add mover="evolve"/>
	</PROTOCOLS>
</ROSETTASCRIPTS>

As discussed in the previous post, let's check it works then start from the filters and then the movers.

Test

As mentioned, it is important to test the XML script works —it does.

before = pose.sequence()
xml_obj = pyrosetta.rosetta.protocols.rosetta_scripts.RosettaScriptsParser()
protocol = xml_obj.generate_mover_and_apply_to_pose(pose, 'evolve.xml')
protocol.apply(pose)
after = pose.sequence()
print(before, after)

Scorefunction


cart_scorefxn = pyrosetta.create_score_function('ref2015_cart')

Filters

There are two filters in the XML, the score filter gets applied to the sigmoid filter. But actually there is another filter in between as we will see. Also, remember to not call filters filter as that will shadow a built-in. Following the whole principle of converting the inner parts first, let's start with ScoreType, which is one of those annoying tags that lacks the type at the end in RosettaScripts (ScoreType => ScoreTypeFilter).

As discussed in the previous post, to find out how to set up a mover or filter, you can lazily see what the public methods are via dir(mover) and seeing the arguments of any of these methods (especially the constructor/__init__) by calling them with something incorrect like TypeError to trigger a TypeError. Alternatively, you can use the class pyrosetta_documentarian.Documentarian (GitHub) discussed in the previous post to get a table of the method outputs of the XML protocol mover compared to a blank.

# ScoreType name="total_score" scorefxn="beta_nov15_cart" score_type="total_score" confidence="0" threshold="100000"

total_score=pyrosetta.rosetta.core.scoring.ScoreType.total_score
subfilter = pyrosetta.rosetta.protocols.score_filters.ScoreTypeFilter(scorefxn=cart_scorefxn, 
                                                                      score_type=total_score, 
                                                                      score_type_threshold=1e5
                                                                     )

Testing if correct: subfilter.apply(pose)

This is a pretty standard filter, but the threshold is insanely high —score_type_threshold=scorefxn(pose) is a common-ish implementation, so will always return True, so it's a testing thing. However, what is this "confidence="0" attribute in the Tag? It is present in the RosettaScripts Tag but not in the Pyrosetta. Therefore, let's check the protocol generated before.

pm = protocol.get_mover(1)
pm = protocol.get_mover(1)
print(pm.mover_name()) #EvolutionaryDynamics
pf = pm.filters()[1]
pm.filters()[1]
print(pf.name()) # Sigmoid
pf2 = pf.filter()
print(pf2.name()) # StochasticFilter
pf3 = pf2.subfilter()
print(pf3.name()) # ScoreTypeFilter

So the ScoreTypeFilter is wrapped by a StochasticFilter.

stocha = pyrosetta.rosetta.protocols.filters.StochasticFilter(confidence=0.,
                                                              subfilter=subfilter,
                                                              run_subfilter_on=True)

Testing if correct: stocha.apply(pose)

Again, confidence=0. is for testing purposes.

Now the last outer filter:

# Sigmoid name="stability" filter="total_score" steepness="0.6" offset="0" negate="0
sigmoid = pyrosetta.rosetta.protocols.calc_taskop_filters.Sigmoid()
sigmoid.steepness(0.6) # 1.0 default
sigmoid.offset(0.0) #0.0 default
sigmoid.negate(False) # False default
sigmoid.threshold(0) # 0 threshold. values above this are true
sigmoid.filter(stocha)
# test it does not blow up:
sigmoid.score(pose), sigmoid.apply(pose)

Movers

There are two movers, one calls the other —if they were called sequentially they would be both present within the PROTOCOL tag.

# NucleotideMutation name="mut" flexbb="0" bbnbrs="0" scorefxn="beta_nov15_cart"
nmut = pyrosetta.rosetta.protocols.evolution.NucleotideMutation()
nmut.flexbb(False) # default True
nmut.bbnbrs(0) # default zero
nmut.scorefxn(cart_scorefxn)

As written nmut.apply(pose) will segfault. This is because the TaskFactory has to be specified or it segfaults as discussed in the previous post.


tf = pyrosetta.rosetta.core.pack.task.TaskFactory()
nmut.task_factory(tf) # segfaults otherwise

However, now nmut.apply(pose) says it does not have a sequence. Using my class pyrosetta_documentarian.Documentarian to compare the protocol version and a blank one reveals some other settings. However, there appear two methods related to sequence: .init_sequence(seq) and .add_nt_seq_to_pose(pose). Trying them manually reveals that only the first is required. Which results in the following:

# NucleotideMutation name="mut" flexbb="0" bbnbrs="0" scorefxn="beta_nov15_cart"
nmut = pyrosetta.rosetta.protocols.evolution.NucleotideMutation()
nmut.flexbb(False) # default True
nmut.bbnbrs(0) # default zero
nmut.scorefxn(cart_scorefxn)
nmut.cache_task(False)
seq = ''.join([pyrosetta.rosetta.core.chemical.NucleotideTools.aa2randomCodon(aa) for aa in pose.sequence()])
nmut.init_sequence(seq)
# no to nmut.add_nt_seq_to_pose(pose)
tf = pyrosetta.rosetta.core.pack.task.TaskFactory()
nmut.task_factory(tf) # segfaults otherwise

This time, nmut.apply(pose), works and it reveals that FastRelax is being called behind the scenes.

nmut.compute_folding_energies(fa_scorefxn=cart_scorefxn, 
                              pose=pose, 
                              is_flexible=pyrosetta.rosetta.utility.vector1_bool(pose.total_residue()),
                              is_mutpos=pyrosetta.rosetta.utility.vector1_bool(pose.total_residue()),
                              bbnbrs=nmut.bbnbrs(),
                              flexbb=nmut.flexbb()
                             )

The class also picks a random residue, but based on codons. Sometimes, one has very technical questions and the quickest thing is to simply look at the C++ code. I had the following questions for example:

What is the mutational spectrum used? Equiprobable, i.e. each nucleotide has a the same probability to mutate to any other
Can one change the genetic code? It is fixed

Now we can convert the other mover. Everything goes fine, until one hits add_filter which appears to want a lot of details —namely,

add_filter(self, filter: pyrosetta.rosetta.protocols.filters.Filter, adaptive: bool, temp: float, sample_type: str, rank_by: bool)

. This adaptive business is because this mover is a GenericMonteCarloMover, which is an important mover to get to know, so do check out its official documentation —also to check what is inherited in a Python class the magic attribute .__mro__ is very handy.
Adaptive means that the filter settings change based on how the run is going, but by turning it off, it make it simply a mover that iteratively call another and accepts based on a filter. If one were scared of the GenericMonteCarloMover, but wanted simply a mover that applied another, but accepted it based on a filter, then pyrosetta.rosetta.protocols.simple_moves.ContingentAcceptMover wrapped by pyrosetta.rosetta.protocols.protein_interface_design.movers.LoopOver is the way forward, but has a lot less stuff. pyrosetta.rosetta.protocols.evolution.EvolutionaryDynamicsMover.__mro__ The EvolutionaryDynamicsMover mover iteratively calls NucleotideMutation and accepts the result if it passess the filter.
Now, let's add the settings as they appear in the tag.

# EvolutionaryDynamics name="evolve" reset_baselines="0" mover_name="mut" recover_low="0" preapply="0" drift="1" progress_file="trajectory_out.txt" filter_name="stability" trials="3" scorefxn_name="beta_nov15_cart" population_size="10000"
evo = pyrosetta.rosetta.protocols.evolution.EvolutionaryDynamicsMover()
evo.reset_baselines(False) # True by default
evo.set_recover_low(False) # True by default
evo.set_preapply(False) # True by default. setting to True seems to del pose.
evo.set_drift(True) # True by default
evo.set_keep_filters(True)
evo.set_mover(nmut)
evo.set_maxtrials(100) # 0 by default... `trials`. `.total_trials` Or is this `.set_maxtrials`?
evo.branch_length(1e200)
evo.add_filter(filter=sigmoid, 
               adaptive=False, 
               temp=0.0,
               sample_type='low',
               rank_by=False)
evo.mutation_rate(1e200)
evo.set_max_accepted_trials(3)
evo.set_scorefxn(cart_scorefxn)
evo.population_size(1e4) #1e6 default

Unfortunately, something has gone wrong when we do evo.apply(pose), i.e. there is no filter. Checking the presence of the filters shows it lost them: this appear to be one of those cases where one setting blanks another.

# EvolutionaryDynamics name="evolve" reset_baselines="0" mover_name="mut" recover_low="0" preapply="0" drift="1" progress_file="trajectory_out.txt" filter_name="stability" trials="3" scorefxn_name="beta_nov15_cart" population_size="10000"
evo = pyrosetta.rosetta.protocols.evolution.EvolutionaryDynamicsMover()
evo.reset_baselines(False) # True by default
evo.set_recover_low(False) # True by default
evo.set_preapply(False) # True by default. setting to True seems to del pose.
evo.set_drift(True) # True by default
evo.set_keep_filters(True)
evo.set_mover(nmut)
evo.set_maxtrials(100) # 0 by default... `trials`. `.total_trials` Or is this `.set_maxtrials`?
evo.population_size(1e4) #1e6 default
evo.set_scorefxn(cart_scorefxn)
evo.branch_length(1e200)
evo.set_max_accepted_trials(3) # this seems to blank the filters
evo.add_filter(filter=sigmoid, 
               adaptive=False, 
               temp=0.0,
               sample_type='low',
               rank_by=False)
evo.mutation_rate(1e200)

This now works perfectly! Success.

Other goodies

Coverting a RosettaScript to Pyrosetta is not the end though as there are many more things one would want to do that aren't covered, but can be used with the same tricks as above. Looking at the evolution submodule in fact shows that there is more.

fitnessCostPerATP

The pyrosetta.rosetta.protocols.evolution.AASynthesisFitnessCost filter assesses the variant based on amino acid cost as some amino acids, for example in most environments the aromatics and hydrophobics have a higher fitness cost than small amino acids. All filters, starting from pyrosetta.rosetta.protocols.filters.Filter have a .score(pose) method, which returns a calculated number, whereas .apply(pose) gets converts it into a boolean depending on some criterion, generally whether it is larger/smaller than the threshold value —most often called threshold() (overloaded getter/setter). In this case, .fitnessCostPerATP() (overloaded getter/setter) is used to calculate the fitness cost —if you divide the score by this you'll find out how many ATP your protein costs, which is a fun thing to find out.

AlignmentAAFinder and AlignmentGapInserter

There also two classes for filtering based upon an alignment file alignment_file (overloaded getter/setter), namely pyrosetta.rosetta.protocols.evolution.AlignmentAAFinder and pyrosetta.rosetta.protocols.evolution.AlignmentGapInserter. There is also the AlignmentCleanerTools namespace/module within the evolution namespace/module with several helper functions.
The comparative_modelling namespace/module (for ab initio/threading) has also got a lot using MSA files. It should be said these are filters and none of the others are not packers, so there is not a one line way to convert a MSA file into a resfile.

Generalisation

This example covered only one XML to Pyrosetta conversion, but the principle is the same —especially the typically tripping point of the task factory...

2 comments:

Unknown4 November 2020 at 01:00
Thank you for all your contributions in github and I am reading for the firts time your blog and it seems very interesting. Let me introduce myself. I am Blaise Mbunga Mputu, I am working as data scientist in heatlh domain. I am new in Pyrosetta and I would like to perform a docking with Main Proteinase and one of its potential inhibitor. Unfortunately i am experiencing issues when i use the scripts i found in this link https://github.com/RosettaCommons/PyRosetta.notebooks. I tried one of the protocol described in the github and I have this error message: File: /home/benchmark/rosetta/source/src/core/chemical/ResidueType.cc:2135
[ ERROR ] UtilityExitException
ERROR: One or more internal errors have occurred in residue type setup for LIG (LIG, Z)
In chi #24, the base of the fourth atom ( C33) is C32, rather than the third atom of the chi ( C34)
In chi #29, the base of the third atom ( C33) is C32, rather than the second atom of the chi ( C34)
In chi #29, the base of the fourth atom ( C32) is C31, rather than the third atom of the chi ( C33)
In chi #31, the base of the fourth atom ( C24) is N4 , rather than the third atom of the chi ( C25)
In chi #32, the base of the third atom ( C34) is C35, rather than the second atom of the chi ( C33)
In chi #32, the base of the fourth atom ( C35) is N6 , rather than the third atom of the chi ( C34)
In chi #33, the base of the third atom ( C32) is C31, rather than the second atom of the chi ( C33)
In chi #33, the base of the fourth atom ( C31) is N6 , rather than the third atom of the chi ( C32)
In chi #36, the base of the third atom ( C24) is N4 , rather than the second atom of the chi ( C25)
In chi #36, the base of the fourth atom ( N4 ) is C19, rather than the third atom of the chi ( C24)
In chi #41, the base of the third atom ( N4 ) is C19, rather than the second atom of the chi ( C24)
In chi #41, the base of the fourth atom ( C19) is C18, rather than the third atom of the chi ( N4 )
In chi #45, the base of the third atom ( C25) is N5 , rather than the second atom of the chi ( C24).
I do not understand what is the problem. Could you please help me? Do you think that the evolutionary dynamics mover could do well than flexible XML docking protocol. Once again thank you for the cointributions you made towards your github and the blog.
Best Regards

The art of blowing up protein

Pages

Saturday, 31 October 2020