The mmCIF (PDBx as in extended PDB) format is meant to replace PDB format. Soon the RCSB PDB will have to adopt 4-letter codes for novel chemical components, which will break the PDB format. PDBx format is space separate as opposed to the really annoying column position in the PDB format and in the PDBx format the metadata can be stored in a nearly sensible manner. However, PDBx is solely a deposition format, but it is not really used as analysis format regardless of what the PDB claims. I personally had to add support for it because a reviewer asked me to. This lack of adoption is often attributed to the "if it ain't broken don't fix it" principle. Although I personally would argue that it may due to how it's implemented: opening a PDBx from one program ought to work in another, but this is not often the case. An example of this is PyMOL files read in PyRosetta.
Format
In a PDBx format each line is a dictionary keys prefixed with an underscore followed by the space separated values (single quote marks allow spaces within a value). Whereas tables are declared with the loop_ directive followed by the various headers line by line, followed by the table, closed by... violating the formatting of the table, such as declaring a new entry or EOF.
The PDB ATOM
entries become _atom_site
loop_
:
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . VAL A 1 3 ? 16.783 48.812 26.447 1.00 30.15 ? 3 VAL A N 1
ATOM 2 C CA . VAL A 1 3 ? 17.591 48.101 25.416 1.00 27.93 ? 3 VAL A CA 1
ATOM 3 C C . VAL A 1 3 ? 16.643 47.160 24.676 1.00 25.52 ? 3 VAL A C 1
The dictionary-item definitions are non-trival to find, beyond the basics (which can be found in the pdb101 help page), which roughly covers the mandatory fields.
However, say I want to add a SMILES of a chemical component and add a post-it like comment for a future self, what would I do? Honestly, making a separate text file or adding a # comment would be as constructive as following the rules. There is no official _chem_comp
subfield called ".smiles" or ".inchi" or ".systematic_name" and instead a loop_
table is used with _pdbx_chem_comp_descriptor.id
for the three letter code, _pdbx_chem_comp_descriptor.type for the work SMILES or InChi or systematic name and _pdbx_chem_comp_descriptor.descriptor for the actual value. In terms of comments, a loop_
with _database_PDB_remark.text
is AFAIK the best option, despite actually being a backwards compatibility hack (in a traditional PDB, REMARK
are messily classified notes).
A nice addition is _pdbx_sifts_xref_db
, which allows the addition of per-residue annotations such as PFam domains (not ranges, so it's a bit verbose).
Tools
For something simple, i.e. reading a file, there are a wealth of tools listed in the PDB help pages. This is normally a cause for alarm bells: the wheel reinvention happens because either the solution was hidden or it was broken. Signs point to the latter case: the provided C++ library, CIFPARSE-OBJ
is rather cryptic and rough around the edges. PyRosetta uses it in reading mmCIF files and it carries over the glitch that if the entry ends in a loop_
the EOF does not trigger it to complete, but instead the table is lost.
A python library that is nice to use that does not suffer from it is pdbx
.
with open(filename, 'r') as fh:
dcs: List[pdbx.DataContainer] = pdbx.load(fh)
pdbblock: str = pdbx.dumps(dcs)
dc: pdbx.DataContainer = dcs[0]
items: List[str] = dc.get_object_name_list()
cat: pdbx.Category = dc.get_object('entity_poly')
import pandas as pd
print(cat.name, cat.row_count)
pd.DataFrame(cat.row_list, columns=cat.attribute_list)
\n | entity_id | \ntype | \nnstd_linkage | \nnstd_monomer | \npdbx_seq_one_letter_code | \npdbx_seq_one_letter_code_can | \npdbx_strand_id | \npdbx_target_identifier | \n
---|---|---|---|---|---|---|---|---|
0 | \n1 | \npolypeptide(L) | \nno | \nno | \nAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS\\nGFRNSDRILYSSDWLIYKTTDHYQTFTKIR | \nAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS\\nGFRNSDRILYSSDWLIYKTTDHYQTFTKIR | \nA,B,C | \nNone | \n
1 | \n2 | \npolypeptide(L) | \nno | \nno | \nKKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE\\nGADITIILS | \nKKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE\\nGADITIILS | \nD,E,F | \nNone | \n
A file generally has a single entry (data_
in PDBx, MODEL
in old PDB), but it can have multiple, hence the list business. The last step shows how useful the loop_
table can be when used in a civilised manner.
PyMOL
When saving a file from PyMOL most details bar for the atomic coordinates get lost, in both PDB and PDBx format. This includes the secondary structure information (HELIX
& SHEET
in PDB and _struct_conf
in PDBx) and connections beyond the standard peptide backbone, such as isopeptide bonds, disulfide bonds and crosslinks to ligands (LINK
& SSBOND
in PDB and _struct_conn
in PDBx). This is not great as one needs to append this information to make certain usages work (such as showing these in NGL).
This is an issue that combines with the former issue: when opening a PyMOL saved mmCIF file, PyRosetta will fail to load the coordinates and as suggested when the output is not muted and dummy extra item will fix it.
import pymol2
filename = '1brs.original.cif' # 1BRS is barnase-barnstar complex
roundname = filename.replace('.original', '.round')
with pymol2.PyMOL() as pymol:
pymol.cmd.set("cif_keepinmemory") # does not affect save
pymol.cmd.load(filename)
pymol.cmd.save(roundname)
with open(roundname, 'a') as fh:
fh.write('\n_citation.title ""\n')
import pyrosetta
pyrosetta.init(extra_options='-mute all')
from types import ModuleType
prc: ModuleType = pyrosetta.rosetta.core
pose = pyrosetta.Pose()
prc.import_pose.pose_from_file(pose, filename, False, pyrosetta.rosetta.core.import_pose.FileType.CIF_file)
pose.sequence(), pose.total_residue()
In the docs searching for mmCIF may lead someone to find either
chemical.mmCIF.mmCIFParser()
(which is for chemical components) or
oc: pyrosetta.rosetta.utility.options.OptionCollection = pyrosetta.rosetta.basic.options.initialize()
sfr_opts = prc.io.StructFileReaderOptions(oc)
prc.io.mmcif.create_sfr_from_cif_file_op(❓, sfr_opts)
(which requires to be passed a C++ CIFfile object, which is not accessible AFAIK). Instead the regular pose import from file does require a wrapper to the init options, but passed to a different class, which from what I can tell does something on the lines of:
rts: prc.chemical.ResidueTypeSet = pose.residue_type_set_for_pose()
sfr_opts = prc.io.StructFileReaderOptions(oc)
builder = prc.io.pose_from_sfr.PoseFromSFRBuilder(rts=rts,
options= sfr_opts)
with open(filename,'r') as fh:
pdblock = fh.read()
# this is wrong for some reason and will segfault
filerep: prc.io.StructFileRep = prc.io.pdb.create_sfr_from_pdb_file_contents(pdblock, sfr_opts)
builder.build_pose(filerep, pyrosetta.Pose())
# then foldtree information is added
ipo = prc.import_pose.ImportPoseOptions(oc)
prc.import_pose.read_additional_pdb_data('1brs.cif', pose, ipo)
Which is rather convoluted and not actually viable as it will segfault (I am making a mistake at some step). The pdbinfo label + foldtree business is controlled in the more straightforward pose_from_file
by the third argument, which I set to False, because the CIF was from PyMol or the PDB, so irrelevant. Given that the CIF reader will preserve non-peptide backbone connections there is not too much to be gained from this route.
Labels
There are some pieces of information one may want to associate with a residue, for example above I mentioned _pdbx_sifts_xref_db
information, which is not always present anyway. Where one to want to annotate for selection certain residues one could use the residue labels in the pdbinfo, which is easy to set up:
pi.add_reslabel(1, 'foo')
print(pi.get_reslabels(1)) # ['foo']
res_sele: ModuleType = prc.select.residue_selector
pru: ModuleType = pyrosetta.rosetta.utility
v: pru.vector1_bool = pres_sele.ResiduePDBInfoHasLabelSelector('foo').apply(pose)
res_sele.ResidueVector(v) # 1
No comments:
Post a Comment