Friday, 21 February 2020

Working around segmentation faults of pyrosetta: threads & processes

Rosetta often does not die gracefully. Pyrosetta is the same. If the starting template is not great segmentation faults will result in the kernel issuing signal 11 to kill the process. The way around it is to spin it up as its own process via the multiprocessing module and not the threading module, because child threads use the same process.

Signal handling is too late

A Segmentation fault is not an error that gets raised is a signal given to the process to terminate it. As a result, in a Jupyter notebook you'll get a modal that informs you that the IPython kernel is restarting. signal.signal(signal.SIGSEGV, sigHandler) can block it in the main thread, but once it is blocked all child threads have terminated and there is only ruins of what was once a glorious process.

Testing code

From the technical point of view, it occurs when the wrong memory address is tried to be accessed. There are many ways that it can happen, including as a result of a stack overflow, for example. These two lines will raise a segmentation fault.
import ctypes
ctypes.string_at(0)
This happens because there is no string at address 0. When it is run in a script it terminate it. In IPython it does the same:
(base) matteo@brc10:~$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ctypes
>>> ctypes.string_at(0)
Segmentation fault (core dumped)
(base) matteo@brc10:~$ 
If this is excuted in a thread it will kill not only the thread but the process, therefore, it there if it happens in a job managed by the APScheduler the whole thing will crash. If it happens for a request in Pyramid or Flash, it will kill the application. Here is an example that terminates the script:
import threading
import ctypes
def subpro():
    ctypes.string_at(0)
t =threading.Thread(target=subpro)
t.start()
time.sleep(0.1)
print(t.is_alive())
In a Jupyter notebook, the IPython kernel will restart for that notebook, but not the others are unaffected. This is because they are different processes with different ids. Sister threads get run within the same core and share data, whereas processes may not.

Multiprocesses

The distinction between processes and threads is important. In fact, paralleling code with the threading module will be limited to one core and does not step over the Global Interpreter Lock, while multiprocessing is not limited. Threads are great for simple things like timeout timers for which there is in fact a inheriting class called Timer that does exactly that. For networking application asyncio is similar but works more like JavaScript and is more efficient at (a)waiting for the response to come back.
This partially why Pool is a class in multithreading and but not in threading. To get reponses back a subprocess requires one of various communication means, such as Pipe. Here is the above method, but with the twist that it dies on command. Between this Abraham-Isaac pair of methods communication is done via Pipe. However, if the process dies, the pipe receive command (parent_conn.recv()) will wait for eternity.
from multiprocessing import Process, Pipe
from threading import Thread
import ctypes, time

def subpro(must_die, conn):
    if must_die:
        ctypes.string_at(0)
    else:
        conn.send('Thanks you for not killing me.')

def hyperpro():
    parent_conn, child_conn = Pipe()
    for must_die in (False, True):
        p = Process(target=subpro, args=(must_die, child_conn))
        p.start()
        msg = parent_conn.recv()
        print('I asked for you to die?', must_die, 'Are you alive? ', p.is_alive(), 'Thoughts?', msg)

hyperpro()
To avoid this the reading must happen after that the subprocess ends with an error message queue up in case there is no message.
def hyperpro():
    parent_conn, child_conn = Pipe()
    for must_die in (False, True):
        p = Process(target=subpro, args=(must_die, child_conn))
        p.start()
        while 1:
            if not p.is_alive():
                child_conn.send('The dead do not speak')
                msg = parent_conn.recv()
                break   
        print('I asked for you to die?', must_die, 'Are you alive? ', p.is_alive(), 'Thoughts?', msg)
the whole pipeline does not come tumbling down! Hurray! Well... Unfortunately, if the data sent thorough the pipe is large, the pipe will not send all the data at once and terminate the child, but will kind-of-wait forever as it is still alive. This is the case with pyrosetta, but luckily there is another command pipe.poll(), which gives a true or false:
while 1:
    if parent_conn.poll():
        break
    elif not p.is_alive():
        child_conn.send({'error': 'segmentation fault'})
        break
    else:
        pass

Wednesday, 12 February 2020

Guess bond order in Rdkit by number of bound atoms

Some compChem/Biochem programs do not care about bond order and strip them, which is rather frustrating. Pre-made Rosetta params files for some non-canonical amino acids in the database folder are a classic example.
There is no single magic mol.CorrectBondOrder() command in Rdkit, but luckily there are some tricks that can be done. Here I will discuss finding out using the number of bound atoms.

Sunday, 15 December 2019

What-if: Biosynthesis of deazaguanine

7-deazaguanine is an analogue of guanine that has a carbon instead of a nitrogen. This molecule that is not made in nature, but many substituted versions, namely queine, archaeosine nucleobase and their precursors. These are made via a different route. However, it would be very feasible to make via straightforward enzymology.

Saturday, 30 November 2019

Convert Python docstrings to GitHub markdown readmes

The Greek philosopher Epictetus said that a day reverse engineering a piece of code saves you half an hour reading the documentation. A maxim still valid to this day. Nevertheless, documenting code is important. With PyCharm and the push towards typehinting in Python writing docstrings is fairly simple. However, getting docstrings into the readme.md of GitHub is not straightforward the first time round. Hence, I wrote this simple guide to doing so.

Thursday, 7 November 2019

Go away glycerol!!

Due to the nature of crystallisation additives are often found in PDB structures. These are generally unwelcome, especially if you want to extract ligands. In fact, I have heard only once someone talk excitedly about their crystallisation reagent in their structure, but only because they were trying to flog it off as an allosteric binding site. Generally, they are just annoying. Luckily you don't need reinvent the wheel as a list or two already exist!

Monday, 21 October 2019

RDKit for Rosetta: PLP ligand space as an example

Docking requires a molecule to dock. Preparing a ligand is often tricky, especially if the ligand is complicated, such as PLP. PLP is an interesting cofactor as it catalyses the reaction while the protein chooses the ligand. It binds tightly to the active site via its phosphate and its pyridine ring, while the metabolite to be transformed forms a Schiff base with it. Therefore, one would think that it makes easy to explore chemistry space with it. However, several technical hurdles are encountered, making it quite didactic.

Toasty CSS with BS4

In Bootstrap 4 you can have appear small alert-like rectangles, called toasts. However, getting these to work like notifications on top of the page in the top right is not trivial as it requires some CSS trickery. Here is what is required.