The art of blowing up protein: Atom names purely in RDKit

For some applications, such as PyMOL scripts or Rosetta, atom names are really important, say CA is the standard name for the α-carbon. Example uses of atom names in Rosetta/pyrosetta include setting constraints, using a params file for a custom ligand and so forth. However, RDKit is a bit of a nuisance with atom names as it is not a central feature, but a feature added for PDB files that is not too well documented.

File formats

This problem stems from the fact that few formats handle atom names. For small ligands, a few file formats exist as evident from the different Chem.MolFromXXXX functions. A SMILES string contains no atom information (or its coordinates), bar the element and the index, which is the order in which the atoms appears. A mol files and a sdf file do not contain atom names and differ primarily in that the latter can contain multiple molecules (different or conformers) and extra data. A pdb file contains a lot more information, such as an atom index, element and name, residue name and index. A mol2 files contains similar information, plus partial charges. Technically, Autodock's pdbqt format is a pdb with partial charges, but is not a standard file format.

RDKit

In RDKit, an atom (i.e. rdkit.Chem.rdchem.Atom, retrieved say with mol.GetAtomWithIdx(19)) contains extra data for read/writing to PDB files. This extra data is not in a "Prop" property (e.g. found via atom.GetPropsAsDict), but in an rdkit.Chem.rdchem.AtomPDBResidueInfo object retrieved via atom.GetPDBResidueInfo and setable via atom.SetMonomerInfo. One property of the info is atom.GetPDBResidueInfo().GetName(), which is that in the PDB.
If an atom is from a molecules not from a PDB file/block atom.GetPDBResidueInfo will return None and when written to PDB will result in a HETATM line with atom name element symbol + Nth (starting from 1, not includes atoms with AtomPDBResidueInfo data, and present even when no other atoms of that element are present), residue name UNL and residue index 1.

AtomPDBResidueInfo

The constructor for AtomPDBResidueInfo accepts a few arguments:

__init__(_object*,
        std::__1::basic_string<char,
                               std::__1::char_traits<char>,
                               std::__1::allocator<char> > atomName,
        int serialNumber=1,
        std::__1::basic_string<char,
                               std::__1::char_traits<char>,
                               std::__1::allocator<char> > altLoc='',
        std::__1::basic_string<char,
                               std::__1::char_traits<char>,
                               std::__1::allocator<char> > residueName='',
        int residueNumber=0,
        std::__1::basic_string<char,
                               std::__1::char_traits<char>,
                               std::__1::allocator<char> > chainId='',
        std::__1::basic_string<char,
                               std::__1::char_traits<char>,
                               std::__1::allocator<char> > insertionCode='',
        double occupancy=1.0,
        double tempFactor=0.0,
        bool isHeteroAtom=False,
        unsigned int secondaryStructure=0,
        unsigned int segmentNumber=0)

This is very C++ish. So in Python this would be:

__init__(self,
        atomName:str,
        serialNumber:int=1,
        altLoc:str='',
        residueName:str='',
        residueNumber:int=0,
        chainId:str='',
        insertionCode:str='',
        occupancy:float=1.0,
        tempFactor:float=0.0,
        isHeteroAtom:bool=False,
        secondaryStructure:int=0,
        segmentNumber:int=0)

Note that saving a molecule without a PDB residue info, has slightly different defaults for four attributes:

Argument	Type	Default	To PDB
`atomName`	string	Blank	element+Nth
`residueName`	string	Blank	`UNK`
`residueNumber`	integer	`0`	`1`
`isHeteroAtom`	boolean	`False`	`True`

Caveats

When an operation on a molecule returns a new molecule, such as substitutions, the PDB data is not preserved, so the information needs to be added manually.
The spacing of the output is not corrected and the spacing on a PDB file is important on paper —although some programs do not care and split by spaces and have consequently trouble with CONECT lines between 4 digit atom indices. Therefore it is likely important to pad the strings: the atom name to be 4 chars long, the residue name 3 for ATOM and 4 for HETATM.

As a result of this, it may be easier sometimes to tinker with the files directly as I have done in a previous post about using RDKit with Rosetta or in a GitHub repo to rename atom names of a molecule based on a similar one with names (i.e. to trick Rosetta, e.g. making a GNP to a GTP).

Edit: Footnote

To save atom names in a mol or sdf a hack is to save them in molFileAlias property:

for a in mol.GetAtoms():
    name = a.GetPDBResidueInfo().GetName()
    a.SetProp('molFileAlias', name)

When saved as a mol or sdf these will appear as a block below the bond definitions (e.g. for an atom index zero named CB: A 1\nCB)

The art of blowing up protein

Pages

Wednesday, 18 March 2020

Atom names purely in RDKit

File formats

RDKit

AtomPDBResidueInfo

Caveats

Edit: Footnote

No comments:

Post a Comment