RDKit for Rosetta: PLP ligand space as an example

Monday 21 October 2019

RDKit for Rosetta: PLP ligand space as an example

Docking requires a molecule to dock. Preparing a ligand is often tricky, especially if the ligand is complicated, such as PLP. PLP is an interesting cofactor as it catalyses the reaction while the protein chooses the ligand. It binds tightly to the active site via its phosphate and its pyridine ring, while the metabolite to be transformed forms a Schiff base with it. Therefore, one would think that it makes easy to explore chemistry space with it. However, several technical hurdles are encountered, making it quite didactic.

Overview of PLP: know thy opponent

PLP stands for PyridoxaL 5'-Phosphate. The "L" has nothing to do with chirality. It is a key cofactor that could possibly be prebiotic (with or without one or two substituents).

PLP forms a Schiff base (secondary imine) with an amine group, often an amino acid. Thanks to its pyridine ring it can stabilise the deprotonated form (carbocation) of the ligand as shown in this figure from Wikipedia*. There are plenty of reviews and QM MD papers to oogle at so I won't go into details. But in brief, it is formed from a pyridine with four substituents:
  • formyl group — reversible substitution reaction with amine to form imines
  • a hydroxyl group — locks the nitrogen in place and helps with resonance
  • a methyl group — biosynthetic remnant**
*) Cool fact: I had drawn it and uploaded, but didn't recall having done so until I saw my Wiki username resulting in a weird déjà vu moment.
**) This is not important, but it is interesting as it is unavoidable in the R5P-dependent pathway as it is the phosphate-bound carbon (5) from ribose-5-phosphate. Glyoxalate could be used as opposed to pyruvate in the DXP-dependent pathway, but that is likely the newcomer pathway.


Atom names

The names of atoms in the structure is important if you want to do some manipulations and, well, talk about them. Hence why I mention them. As is typical, the pyridine ring atoms get numbered in order and all substituents get a prime. Of note is the N1 is the nitrogen of the heterocycle.

Hydrogens matter

Protons are visible only on neutron diffraction structures and NMR structures in deuterated water — PDB:6MOP is a really nice example. So in most structures (>1.2 Å), change and valence are not visible and automated pipelines make some errors. Therefore it is always important to get familiar with a ligand before working with it. This is true with active sites too: for sure surface glutamates are in the conjugate base form (negative, no proton), but active site ones might be protonated thanks to a hydrogen bond. This is true for example with the catalytic glutamic acid in the structure of SpyCatcher, discussed in my isopeptide bond post.
This example is nice, because it shows that an incorrectly declared hydrogen bond will be repulsed in force field calculations, meaning that correct protonation matters.

In PLP-derivatives there are two ultra-important protons.

  • The N1 nitrogen is protonated and positively charged. Kind of, there is a nice paper where they make a nitrogen-free analogue, deazapyridoxal, and discuss this. However, the amino acid it H-bonds to isn't canonically protonated, so making it protonated saves the hassle.
  • Either the nitrogen of the amino acid (N) is protonated or the O3 is. Technically, it is the nitrogen for the reaction (cf. QM MD papers). But a positively charged secondary imine is not a standard atom type in Rosetta so it will just cause grief for classical mechanics force field calculations.

Imine problem

A thing to note is that all structures with PLP with a bound amino acids have a non-productive analogue. Say the structures 1L6G and 1L6F contain a PLP bound to alanine in D and L enantiomers, but the bonding is wrong in that, for crystallisation purposes, the amino acid backbone nitrogen ("N") is a secondary amine and not a secondary imine. This means that the nitrogen and C-alpha are not coplanar as they should be. So they cannot be used out of the box, but are excellent for checking how the C' carboxyl binds.

Actually and not atypically, the ligand is not planar in these structures, but shows some bucking from imperfect fitting. Hence why the energy minimisation step is (generally) required for most force-field calculations.


Herein, I want explore ligand space for a give PLP enzyme with Rosetta. I am not restricted to Rosetta as there are may other tools, but I like Rosetta as it's not a one trick pony.

For this I have a few options:

  1. Rosetta ligand_dock with PLP-amino acid —but the PLP rolls about in place, so a few constraints will be needed.
  2. Rosetta ligand_dock with a covalent amino acid only —this requires a constraint file otherwise the N atom will flop around unnaturally with impunity (as seen my post about fluorescein conjugated ligand)
  3. Rosetta score with a PLP backbone only —score will see there are a few missing atoms and find the best one, a super-quick hackish trick also discussed in my post about phosphorylated residues.
  4. Rosetta script ligand docking is possible with a script, would be a re-write of above three options, but with better suited steps, so the decision is the same.
The last option is really nice, but has two caveats, first conformations need to present as score does not generate conformers and the second is that the side chains are not repacked to accommodate the ligand.


The first thing to do is generate the ligands to explore. The Python comp-chemist library of choice is RDKit, a powerful library with shockingly meagre documentation. RDKit allows the manipulation of chemical structures, say changing a aldehyde to a secondary imine.

Like PyMol, RDKit is best install with anaconda, actually whereas I have managed to install the former without conda, I have not managed with the latter in either Ubuntu or iOS.

The imports show how confusing it is:
from rdkit import Chem
from rdkit.Chem import AllChem
# if it's not a method of a Chem instance, it might be a Chem class method, if not it is in AllChem.
# You just try until you find it.
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules in Jupyter
from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions #Only needed if modifying defaults
from IPython import display
# With Jupyter notebooks the last line in a cell is secretly wrapped with a display,
# using display allows you to do this anywhere. Superhandy with pandas too. 
plp = Chem.MolFromSmiles('CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O') # Smiles copy pasted form PubChem
display(plp) ## it's a PLP structure!
A smiles string is a way of representing a molecule as a string. You can find these in Pubchem or in Wikipedia. NIH Cactus has an online Smiles to SDF file translator, so if you have only the one ligand and cannot find your structure in PubChem, but a close relative you can use this and ignore RDKit. A Smart string is a variant with some extra features, but also difference when it comes hydrogens. These are tailored to reactions. As shown in this (not important) example:
plp = Chem.MolFromSmiles('CC1=NC=C(C(=C1O)C=O)COP(=O)(O)O')
glu = Chem.MolFromSmiles('C(CC(=O)O)[C@@H](C(=O)O)N')
rxn = AllChem.ReactionFromSmarts('[R1:1][c:2]C=O.[N:0][CH]C(=O)O>>[R1:1][c:2]C=[N:0][CH]C(=O)O')
prods = rxn.RunReactants((plp,glu))
However, for the main part smart strings are the part that I will concentrate on. The RDKit tutorial is nearly all the documentation present, but covers a lot of the functionality as I recommend looking at it as I will only mention thing that are important to me.

Useful functions

These are the functions (it's very much a C++ library) that are worth keeping an eye out for:

Creates a Chem (chemical) object from a Smiles string. The index of the atoms is the same as that in which they appeared —this is true for all the other MolFrom functions (SDF, Mol2, FASTA and PDB blocks (str) or filenames to name some). Knowing the index of an atom is super important as will become clear in the next example

A list is returned (or an error) of the indices which correspond to the indices of the query compound, which generally is a small smile so say I want to find the amino nitrogen of an amino acid I would do:
my_amino_acid = Chem.MolFrom...whatever
backbone = Chem.MolFromSmiles('C(C(=O)[O-])N')
n = my_amino_acid.GetSubstructMatch(backbone)[4]
One thing to remember where they come from (smiles or smart strings say) as the hydrogens may or may not be required for the match. In the above the backbone has a [O-], but it could have been O or [OH]. There are pages and pages on the internet explaining how to write a match for zero or more hydrogens say, however, the flexibility is limited and generally switching between smarts and smiles will do the trick.

AllChem.ReplaceSubstructs(target, query, replacement, replacementConnectionPoint=index)
Target, query and replacement are all Chem objects, and the function returns a Chem instance where the query match in target is replaced with replacement. That makes sense, right? Well, this behaves slightly counter-intuitively as the replacement point isn't what you'd expect, but by the default the first atom in the replacement object. To give it the right point add the index (an integer) to replacementConnectionPoint.

Now that these three are covered we can write some handy functions for the task at hand:

def Lise(mol):
    query = Chem.MolFromSmiles('CC(C(=O)O)N')
    rep = Chem.MolFromSmiles('C[C@@H](C(=O)[O-])N')
    return AllChem.ReplaceSubstructs(mol, query, rep, replacementConnectionPoint=0)[0]

def Dise(mol):
    query = Chem.MolFromSmiles('CC(C(=O)O)N')
    rep = Chem.MolFromSmiles('C[C@H](C(=O)[O-])N')
    return AllChem.ReplaceSubstructs(mol, query, rep, replacementConnectionPoint=0)[0]

def imminise(mol):
    query = Chem.MolFromSmiles('CC(C(=O)O)N')
    rep = Chem.MolFromSmiles('CC(C(=O)[O-])(=N)')
    return AllChem.ReplaceSubstructs(mol, query, rep, replacementConnectionPoint=0)[0]
def get_alphaN(mol):
    bb = Chem.MolFromSmarts('NCC(=O)O') # N = 1
    mb = mol.GetSubstructMatch(bb)
    return mb[0]

def get_important_atoms(mol):
    bb = Chem.MolFromSmiles('Cc1ncc(COP(=O)(O)O)c(C=NCC(=O)(O))c1O')
    match = mol.GetSubstructMatch(bb)
    if match:
        c2prime, c2, n1, c6, c5, c5prime, op4, p, op1, op2, op3, c4, c4prime, n, α, c, o, oxt, c3, o3 = match
    else: ## its an quinone form
        bb = Chem.MolFromSmiles('C(O)(=O)C=NC=C1C(COP(=O)(O)O)=CNC(C)=C1O')
        c, oxt, o, α, n, c4prime, c4, c5, c5prime, op4, p, op1, op2, op3, c6, n1, c2, c2prime, c3, o3 = mol.GetSubstructMatch(bb)
    return {'CA': α, 
            'C': c,
            'O': o,
            'OXT': oxt,
            'N': n,
           'P': p,
           'OP1': op1,
           'OP2': op2,
           'OP3': op3,
           'OP4': op4,
           'N1': n1, #pyridoxal nitrogen
            'C2': c2,
            'C2\'': c2prime,
            'C3': c3,
            'O3': o3, #pyridoxal hydroxyl
            'C4': c4,
            'C4\'': c4prime,
            'C5': c5,
            'C5\'': c5prime,
            'C6': c6
The last method is a more elaborate version of the preceding one. It seems a bit overkill, but actually, it is needed for a hack that is coming up.
Namely, it is really handy to have consistent standardised atom names for both PyMOL and Rosetta. Hence why I went to the effort of naming all the atoms.

3D structure

So far I have been making 2D representations (display(my_mol) will be pretty), but what I want are 3D structures.

def tridimensionalise(mol,filename):
    # I might have changed some atoms around, so the charge or protonation may be wrong.
    # The error you get isn't helpful, so it always pays to do this before protonating
    mol = Chem.AddHs(mol) #protonate explicitly
    Chem.GetSSSR(mol) #not communists, but resonance fixing
    AllChem.EmbedMolecule(mol) #initialise for 3d.
    AllChem.UFFOptimizeMolecule(mol, maxIters=2000) #UFFO is an force field algorithm
    #AllChem.MMFFOptimizeMolecule(mol) #another one! But use UFFO. Or Gauss if your cluster has it!
    Chem.rdPartialCharges.ComputeGasteigerCharges(mol) #partial charges. utterly useless when saving as a sdf...
    return mol
Great. Except that some atoms need rotating, such as the N on C4'. Luckily, I don't need to do any selections as I have already got a handy method for that.

def fix(mol):
    atoms = get_important_atoms(mol)
    Chem.rdMolTransforms.SetDihedralRad(mol.GetConformer(), atoms['C3'], atoms['C4'], atoms['C4\''], atoms['N'], 0)
    return mol
When we are happy we can save the molecule: Chem.SDWriter(filename).write(mol) This seems madness. Mol/SDF files do not have atom names, while PDB or Mol2 files do.
However, RDKit cannot save MOL2 files and even if it could I cannot figure out what the attribute for names (if any) is called. So to give atoms names, they need to be renamed outside of RDKit.
The first step is to convert the Mol/SDF file to Mol2. The tool Open Babel is great for file conversion and much more. Then we can rename the atoms.
Annoyingly RDKit is a bit unpredictable with Mol2 files, so the mol file is used, but the atom order ought to be maintained, for this the following code can be used, or alternative check out my GitHub repo AtomicNamer.

def make_mol2_w_good_names(mol, filename, three_letter):
    tmpfile = filename.replace('.mol', '.bk.mol2')
    mol2file = filename.replace('.mol', '.mol2')
    if os.system(f"obabel -i mol {filename} -o mol2 -O {tmpfile} --confab"):
        raise SystemError('Babel failed.')
    # get important atoms.
    # invert
    mapping= get_important_atoms(mol) # CA: C12
    rename_map = {}
    definitions = False
    with open(mol2file, 'w') as w:
        with open(tmpfile) as r:
            counter = 20
            for line in r:
                if '@ATOM' in line:
                    definitions = True
                elif '@BOND' in line:
                    definitions = False
                elif definitions:
                    index, name, x, y, z, element, n, resn, charge = line.split()
                    i = int(index) - 1
                    if i in mapping.values():
                        newname = list(mapping.keys())[list(mapping.values()).index(i)]
                        newname = element[0]+str(counter)
                        counter += 1
                    new = f" {index:>6} {newname:<4 float="" x="">7}   {float(y):>7}   {float(z):>7} {element:<6 0="" charge="" float="" three_letter="">5}\n"
    return mol2file

We have solved two issues at once as now we have conformers too (confab flag). The Mol2 file (with conformers) is ready to be parameterised. In order to make stuff less weird, I ported the script molfile_to_params.py to Python 3 and actually wrote a method that allows it to be called as a normal module. To get this see https://github.com/matteoferla/mol_to_params.py
molfile_to_params.run(mol2file, name=three_letter, pdb=name, clobber=True, keep_names=True, conformers_in_one_file=True)
The keep_names argument keeps the atom names, not the molecule name. That is given by the pdb argument. The molecule in RDKit name can be changed at any point with mol.SetProp("_Name",name). But there is not really a need. One issue that can arise is if you have two amino acid cores, as cystathionine has. In which case, the laziest way is simply to protect the other amine group by replacing it at the SMILES string stage with bromine and replacing it back.

mol = AllChem.ReplaceSubstructs(mol, Chem.MolFromSmiles('Br'), Chem.MolFromSmiles('[NH3+]'))[0]
It is very lazy and incredibly much simpler than the alternative...


  1. Hi there, I found this very cool, thanks for sharing. I was wondering if you have something to fix corrupted sdf files (with wrong bond orders and charges) using a mol2 file containing the original molecules. As the output of a VS program named Pharmer, I have a large amount of compounds in an sdf file. The thing is that many of this molecules have mistakes in the conectivity and charges and since I have a mol2 file with the original molecules that I used for the screeining, I was hoping to be able to fix the sdf file molecules using the mol2 files. Do you know if rdkit could be useful to do so? Thanks a lot! excuse my ignorance. I am new in the field.

    1. Sorry for not replying, I get so much random spam here (hence the moderate comment set-up) that your comment fell throw the cracks. wrong bond orders can be fixed with the RDKit command `AssignBondOrdersFromTemplate`.

      from rdkit import Chem
      from rdkit.Chem import AllChem

      # let's make a fake acrylate. In reality this would come from a file
      faux_acrylate = Chem.MolFromSmiles('CCC(O)O')
      # let's load a SMILES that has the correct bond order and formal charge
      acrylate = Chem.MolFromSmiles('C=CC([O-])O')
      fixed_acrylate = AllChem.AssignBondOrdersFromTemplate(acrylate, faux_acrylate)

      The only caveat is that the numbers of atoms and needs to be of the same atomic number (i.e. no dummy atoms) and a bond needs to be there even if of the wrong order and not extra atoms or bonds.