Pyrosetta scripting without a manual

Tuesday, 27 October 2020

Pyrosetta scripting without a manual

I got recently asked how to figure out how to write a Pyrosetta script when there is no example. This is definitely the biggest weakness of Pyrosetta and Rosetta script, but it is not insurmountable. In fact, there is a wealth of information that is hidden that can be mined. Here is how and in the next post, I give an example.

Repository

In writing this post I put together a bunch of codeblocks into a class (Documentarian) that can be found in the GitHub repo Pyrosetta documentarian (any mention of this has green borders).

Resources

So let's start from the top. How to use Pyrosetta in the first place is documented in a few places.

Choosing a mover

The sometimes trickiest part is figuring out which mover to use. If there's a binary that you want emulate so that is great. The table of movers has the major ones, but there are many more. For something not listed that you suspect Rosetta/Pyrosetta can do the next best options are...
(a) ask and get lucky with a reply,
(b) Google until you get lucky or
(c) search/read the names of the protocols listed in the Pyrosetta documentation until you get lucky.

One thing to remember is that sometimes names are misleading. For example, there are a dozen methods called "ddG": some are for mutation scoring, other for interfaces or ligands. The NetCharge filter is pyrosetta.rosetta.protocols.simple_filters.NetChargeFilter and not pyrosetta.rosetta.core.scoring.netcharge_energy.NetChargeEnergy.

Another is not to give into the tyranny of sunken costs: if you gave it a go for more than an hour, let it go and try a different approach —for example, I wanted to get the pKA of a protein from model as opposed to from sequence, so I tried pyrosetta.rosetta.protocols.simple_moves.ReportEffectivePKA, which wanted LYS_D params file etc. only to encounter more issues, until I read the C++ code comments, which made me realise it wasn't what I wanted in the first place...

Using a mover

Given a mover there are two things we want to know, how to use it and what it does internally. Let's look at the first challenge.

GitHub

So there are a couple of tutorials online and searching for the name of the mover may reveal them. Search the mover name prefixed with an asterisk may reveal others. Alternatively search the mover without the word Mover at the end as some don't have that part in the C++ code. If there are many tagging the word GitHub can reveal some handy code snippets. Not all places are however googlable, for example there are some handy files in the Jens Meiler lab webpage but are zipped or as pdfs.

Converting a XML script

Alternatively convert an XML script.

Finding the XML script

In the Rosetta folder under the path main/source/src/main/tests/integration/tests, there are many XML scripts. Additionally, more can be found in various places mentioned above. However, the amount hiding in that path is staggering.

Prepare for conversion

There are many ways to do the XML Rosetta script to Pyrosetta, but for me, the best way is in a Jupyter notebook doing the following steps.
  • Copy-paste the XML in a markdown cell, but tab it to prevent rendering of the XML tags.
  • Start Pyrosetta preferably with some of the flags for the XML (a file called flags or commands.txt) —later you can use the Options function.
  • Import a pose, preferably the one with the XML.
  • Then copy each tag into a cell as a comment —<MoverName ...< is called a tag, which you see a lot in Pyrosetta methods.
  • And start re-writing (see below)
  • Testing each block as you go along (see below)

Converting tag by tag

Each tag starts with a name in CamelCase. The SCOREFXN will most likely require a particular set of weights, but you create that with pyrosetta.create_score_function.
The Pyrosetta equivalent of the filters, mover or tasks tags will have the same name, except that they may have an extra word "Mover" or "Filter" or "Task" lopped onto the end. Simply search the Pyrosetta documentation for them.
The argument names to initialise these will likely differ however. Many constructors (__init__ magic method in Pythonese) are overloaded methods, which means they can be initialised with different sets of arguments. The constructors of classes are not in the online documentation, but help(...) or trying to initialise with something clearly wrong, say Exception, will tell you what it expected —do pay attention to the typehinting as this will reveal what is meant by a given argument.
In most cases there will be setters, .set_👾👾() or if not, the method name that seems more similar will be overloaded to be a setter or a getter. An example of this is setting a scorefxn, which is via .scorefxn(scorefxn) in regular movers, but set_scorefxn in MCMC movers...
A quicker way that looking it up online is passing your mover (instance or class) to dir(...) in a new cell and reading the list.
In some cases vectors are required and sometimes a method accepts a blank vector and fills it as seen in the post about adding waters in Pyrosetta, where vectors are also discussed.

Testing

To use a score function simply call it with the pose as an argument (scorefxn(pose)), while for filters and movers use .apply(pose).
Leave the movers last and don't forget to give it the filters, taskFactory and scorefunctions. However, you might get a segmentation fault. This means that something was attempted to be accessed but was not there.
Even if the XML script does not explicitly set a TaskFactory or MoveMap or scorefxn, it does not mean that you don't have to. If a mover has one or more methods for these, chances are you will need to specify them. A blank one will do however:
tf = pyrosetta.rosetta.core.pack.task.TaskFactory()
mover.task_factory(tf) # segfaults otherwise

In this regard the pyrosetta.rosetta.protocols.filters.TrueFilte filter which always returns true, pyrosetta.rosetta.core.select.residue_selector.TrueResidueSelector selects everything and pyrosetta.rosetta.protocols.moves.NullMover mover whose apply does nothing (which differs from pyrosetta.rosetta.protocols.moves.Mover, which has a virtual apply as discussed below)

An handy thing to do is to check which the getters that the values match —some settings may blank others, so the order is sometimes important. However sometimes there is something completely unclear, so running the XML in Pyrosetta and comparing things is the best approach.

Running an XML script in Pyrosetta

To run a XML script in Pyrosetta you do the following:
xml_obj = pyrosetta.rosetta.protocols.rosetta_scripts.RosettaScriptsParser()
protocol = xml_obj.generate_mover_and_apply_to_pose(pose, "👾👾.xml")
protocol.apply(pose)
This is rather simplistic, as you may require some command line arguments, such as -parser:script_vars, which you can feed with the pyrosetta.init command. I discuss this further here. The protocol variable above is an instance of pyrosetta.rosetta.protocols.rosetta_scripts.ParsedProtocol, which we can inspect. An XML script has a MOVERS block with one or more mover tags. To see what the movers are within a ParsedProtocol object you can call .get_mover(i), which will be a normal mover... But filled correctly!

If you are using pyrosetta_documentarian.Documentarian, the method .compare() is very useful as it will give a Pandas table showing the differences with a blank object.

Inspecting the inheritance

All the movers are built upon another. So it can be highly useful to see which methods have been added. Also some methods are virtual only in the parent class (that is they are just a placeholder noot to be called, similar to methods of abstract base classes in Python.

If you are using pyrosetta_documentarian.Documentarian, the dynamic attribute .uninherited will tell you what isn't inherited.

The cogs of a mover

provide_citation_info method

If you want to know a technical details of a mover, like how it works, you can see what paper it is published in, by calling the bound method provide_citation_info(), which actually gives a rather meaningless return, which you need to finangle:
def get_citation(target_mover: pyrosetta.rosetta.protocols.moves.Mover) -> str:
        ci = target_mover.provide_citation_info()
        if len(ci) == 0:
            return ''
        else:
            cc = ci.pop()
            # print(cc.module_name(), cc.module_type())
            buffer = pyrosetta.rosetta.std.stringbuf()
            cc.get_citations_formatted(pyrosetta.rosetta.std.ostream(buffer))
            return buffer.str().strip()

If you are using pyrosetta_documentarian.Documentarian, the dynamic attribute .citation will do this

Author → Paper

Alternatively and often more fruitful is finding who the author is. In the C++ code it is the @author line. Unfortunately, the online Pyrosetta documentation does not show C++ class comments or the constructor (=__init__ in Python) comments, so looking at the actual C++ code is often helpful, but more on that later. Once you have an author name it is easy to figure out which paper it corresponds to.

If there is nothing at all and Googling the mover name draws blanks, emailing one of the authors is a good option. For example, curious about one mover I got a helpful reply that even included a paper draft.

Read the C++ code

This is really a last resort. However, you only need to know the bare basics of C++ to read it, so it is not that dire. Rosetta is written in C++ and Pyrosetta wraps this using Pybind11. Some parts of Pyrosetta are pure Python and you can read the code. The Pybind11 part however does not show the C++ (it's precompiled), so you'll have to download the regular Rosetta suite to read the C++. So let's see some basics in order to reverse engineer a class!

Header files

The main thing to understand is that C++ code is often spit into header files and implementation files. When looking a introductions to C++, it has several features in common with Python or JavaScript, but the header file business is totally alien. Basically a C++ class can be split into an empty shell that simply declares the methods within, which is in the header file, a file with the extension .h, and this shell gets filled in the implementation file, a file with extension .c or .cpp. The code in the header file looks more similar to Python, while the implementation file has many :: bits.

Class

A C++ class written in a single file can look like this:
class Animal {

  private:
    // Private attribute
    int age;

  public:
    // some
    void Eat() {
        std::cout << "Nom nom\n";
    }

    // Setter
    void setAge(int a) {
      age = a;
    }
    // Getter
    int getAge() {
      return age;
    }

};
There are a few things to note here:
  • public and private (access) are equivalent to Python single preceding underscore notation, but whereas it's bad practice to call a _method in Python outside of the class instance, you cannot call a private method in C++ outside of the instance.
  • Typehint came only recently to Python, but it's mighty useful. In C++ specifying types is compulsory and you cannot change the type of a variable. in the above there is an int and a void (=None). What the methods return is specified before the name
  • In python when you import a module you can get its contents with module.method kind of notation. In C++ it's similar, namely namespace::method. In implementation files, you'll see class::method.
  • There are no docstrings, but comments (// or /* ... */) above in the header files. These may feature @author, @description and other Doxygen-style comment block features. Not all of these went into the python docstrings and then the Sphynx documentation online for Pyrosetta

If you are using pyrosetta_documentarian.Documentarian, the dynamic attribute .comments will show all the comments.

No comments:

Post a comment