Reading a mmCIF from PyMOL in PyRosetta

The mmCIF (PDBx as in extended PDB) format is meant to replace PDB format. Soon the RCSB PDB will have to adopt 4-letter codes for novel chemical components, which will break the PDB format. PDBx format is space separate as opposed to the really annoying column position in the PDB format and in the PDBx format the metadata can be stored in a nearly sensible manner. However, PDBx is solely a deposition format, but it is not really used as analysis format regardless of what the PDB claims. I personally had to add support for it because a reviewer asked me to. This lack of adoption is often attributed to the "if it ain't broken don't fix it" principle. Although I personally would argue that it may due to how it's implemented: opening a PDBx from one program ought to work in another, but this is not often the case. An example of this is PyMOL files read in PyRosetta.

Format

In a PDBx format each line is a dictionary keys prefixed with an underscore followed by the space separated values (single quote marks allow spaces within a value). Whereas tables are declared with the loop_ directive followed by the various headers line by line, followed by the table, closed by... violating the formatting of the table, such as declaring a new entry or EOF.
The PDB ATOM entries become _atom_site loop_:

loop_
_atom_site.group_PDB 
_atom_site.id 
_atom_site.type_symbol 
_atom_site.label_atom_id 
_atom_site.label_alt_id 
_atom_site.label_comp_id 
_atom_site.label_asym_id 
_atom_site.label_entity_id 
_atom_site.label_seq_id 
_atom_site.pdbx_PDB_ins_code 
_atom_site.Cartn_x 
_atom_site.Cartn_y 
_atom_site.Cartn_z 
_atom_site.occupancy 
_atom_site.B_iso_or_equiv 
_atom_site.pdbx_formal_charge 
_atom_site.auth_seq_id 
_atom_site.auth_comp_id 
_atom_site.auth_asym_id 
_atom_site.auth_atom_id 
_atom_site.pdbx_PDB_model_num 
ATOM   1    N N   . VAL A 1 3   ? 16.783  48.812 26.447  1.00 30.15 ? 3   VAL A N   1 
ATOM   2    C CA  . VAL A 1 3   ? 17.591  48.101 25.416  1.00 27.93 ? 3   VAL A CA  1 
ATOM   3    C C   . VAL A 1 3   ? 16.643  47.160 24.676  1.00 25.52 ? 3   VAL A C   1

The dictionary-item definitions are non-trival to find, beyond the basics (which can be found in the pdb101 help page), which roughly covers the mandatory fields.
However, say I want to add a SMILES of a chemical component and add a post-it like comment for a future self, what would I do? Honestly, making a separate text file or adding a # comment would be as constructive as following the rules. There is no official _chem_comp subfield called ".smiles" or ".inchi" or ".systematic_name" and instead a loop_ table is used with _pdbx_chem_comp_descriptor.id for the three letter code, _pdbx_chem_comp_descriptor.type for the work SMILES or InChi or systematic name and _pdbx_chem_comp_descriptor.descriptor for the actual value. In terms of comments, a loop_ with _database_PDB_remark.text is AFAIK the best option, despite actually being a backwards compatibility hack (in a traditional PDB, REMARK are messily classified notes).

A nice addition is _pdbx_sifts_xref_db, which allows the addition of per-residue annotations such as PFam domains (not ranges, so it's a bit verbose).

Tools

For something simple, i.e. reading a file, there are a wealth of tools listed in the PDB help pages. This is normally a cause for alarm bells: the wheel reinvention happens because either the solution was hidden or it was broken. Signs point to the latter case: the provided C++ library, CIFPARSE-OBJ is rather cryptic and rough around the edges. PyRosetta uses it in reading mmCIF files and it carries over the glitch that if the entry ends in a loop_ the EOF does not trigger it to complete, but instead the table is lost.

A python library that is nice to use that does not suffer from it is pdbx.

with open(filename, 'r') as fh:
    dcs: List[pdbx.DataContainer] = pdbx.load(fh)

pdbblock: str = pdbx.dumps(dcs)

dc: pdbx.DataContainer = dcs[0]
items: List[str] = dc.get_object_name_list()
cat: pdbx.Category = dc.get_object('entity_poly')

import pandas as pd

print(cat.name, cat.row_count)
pd.DataFrame(cat.row_list, columns=cat.attribute_list)

\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n

	entity_id	type	nstd_linkage	nstd_monomer	pdbx_seq_one_letter_code	pdbx_seq_one_letter_code_can	pdbx_strand_id	pdbx_target_identifier
0	1	polypeptide(L)	no	no	AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS\\nGFRNSDRILYSSDWLIYKTTDHYQTFTKIR	AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS\\nGFRNSDRILYSSDWLIYKTTDHYQTFTKIR	A,B,C	None
1	2	polypeptide(L)	no	no	KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE\\nGADITIILS	KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE\\nGADITIILS	D,E,F	None

A file generally has a single entry (data_ in PDBx, MODEL in old PDB), but it can have multiple, hence the list business. The last step shows how useful the loop_ table can be when used in a civilised manner.

PyMOL

When saving a file from PyMOL most details bar for the atomic coordinates get lost, in both PDB and PDBx format. This includes the secondary structure information (HELIX & SHEET in PDB and _struct_conf in PDBx) and connections beyond the standard peptide backbone, such as isopeptide bonds, disulfide bonds and crosslinks to ligands (LINK & SSBOND in PDB and _struct_conn in PDBx). This is not great as one needs to append this information to make certain usages work (such as showing these in NGL).

This is an issue that combines with the former issue: when opening a PyMOL saved mmCIF file, PyRosetta will fail to load the coordinates and as suggested when the output is not muted and dummy extra item will fix it.


import pymol2

filename = '1brs.original.cif' # 1BRS is barnase-barnstar complex
roundname = filename.replace('.original', '.round')

with pymol2.PyMOL() as pymol:
    pymol.cmd.set("cif_keepinmemory") # does not affect save
    pymol.cmd.load(filename)
    pymol.cmd.save(roundname)

with open(roundname, 'a') as fh:
    fh.write('\n_citation.title  ""\n')
    
    
import pyrosetta

pyrosetta.init(extra_options='-mute all')
from types import ModuleType
prc: ModuleType = pyrosetta.rosetta.core

pose = pyrosetta.Pose()
prc.import_pose.pose_from_file(pose, filename, False, pyrosetta.rosetta.core.import_pose.FileType.CIF_file)
pose.sequence(), pose.total_residue()

In the docs searching for mmCIF may lead someone to find either

chemical.mmCIF.mmCIFParser()

(which is for chemical components) or


oc: pyrosetta.rosetta.utility.options.OptionCollection = pyrosetta.rosetta.basic.options.initialize()
sfr_opts = prc.io.StructFileReaderOptions(oc)
prc.io.mmcif.create_sfr_from_cif_file_op(❓, sfr_opts)

(which requires to be passed a C++ CIFfile object, which is not accessible AFAIK). Instead the regular pose import from file does require a wrapper to the init options, but passed to a different class, which from what I can tell does something on the lines of:

rts: prc.chemical.ResidueTypeSet = pose.residue_type_set_for_pose()
sfr_opts = prc.io.StructFileReaderOptions(oc)
builder = prc.io.pose_from_sfr.PoseFromSFRBuilder(rts=rts,
                                        options= sfr_opts)
with open(filename,'r') as fh:
    pdblock = fh.read()
# this is wrong for some reason and will segfault
filerep: prc.io.StructFileRep = prc.io.pdb.create_sfr_from_pdb_file_contents(pdblock, sfr_opts)
builder.build_pose(filerep, pyrosetta.Pose())

# then foldtree information is added
ipo = prc.import_pose.ImportPoseOptions(oc)
prc.import_pose.read_additional_pdb_data('1brs.cif', pose, ipo)

Which is rather convoluted and not actually viable as it will segfault (I am making a mistake at some step). The pdbinfo label + foldtree business is controlled in the more straightforward pose_from_file by the third argument, which I set to False, because the CIF was from PyMol or the PDB, so irrelevant. Given that the CIF reader will preserve non-peptide backbone connections there is not too much to be gained from this route.

Labels

There are some pieces of information one may want to associate with a residue, for example above I mentioned _pdbx_sifts_xref_db information, which is not always present anyway. Where one to want to annotate for selection certain residues one could use the residue labels in the pdbinfo, which is easy to set up:

pi.add_reslabel(1, 'foo')

print(pi.get_reslabels(1))  # ['foo']

res_sele: ModuleType = prc.select.residue_selector
pru: ModuleType = pyrosetta.rosetta.utility
v: pru.vector1_bool = pres_sele.ResiduePDBInfoHasLabelSelector('foo').apply(pose)
res_sele.ResidueVector(v) # 1

The art of blowing up protein

Pages

Sunday, 5 February 2023