PDB numbering rollercoaster

Residue positions in a crystal structure and in the protein sequence rarely match. In fact, there are four start-end pairs to keep track of:

position in whole protein,
position in extracted sequence,
position in residues stated in the PDB/mmCIF structure and
position which actually has coordinates.

I should preface that all indexing discussed here is Fortran-style (starting at 1), whereas a lot of code will deal with positions C-style (starting at 0), but that is an implementation detail.
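To make the off-by-one concrete, here is a minimal sketch of turning a 1-based, inclusive residue range into a Python slice. The function name and the toy sequence are mine, purely for illustration.

```python
def slice_1based(sequence: str, start: int, end: int) -> str:
    """Return residues start..end, where both are 1-based and inclusive."""
    if start < 1 or end < start:
        raise ValueError(f"Invalid 1-based range: {start}-{end}")
    # 1-based inclusive start -> 0-based start is start - 1;
    # the inclusive end is already the exclusive 0-based stop.
    return sequence[start - 1:end]


print(slice_1based("MGSAGT", 4, 6))  # AGT
```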

Why the mismatch?

A structure may be of a protein cloned from a larger sequence, often with N-terminal and C-terminal tags or a cloning scar.
In a PDB file, the first residue can be numbered 1 or any other number. Assuming a standard is a bad idea.
The first position pair to care about is the position within the protein as a whole. This datum is already present in Uniprot, but the other ones are not. Then comes the problem of where in the structure this residue is, i.e. where the cloning scar or tag ends. The chain ID is a letter and there is no issue there, although assuming it is A is a silly mistake that I seem to fall for.
Two different ways of representing the residue in the structure appear. One start position is the position within the extracted sequence, where the first residue in the sequence is 1, whereas in the PDB/mmCIF numbering it could be anything. Therefore the second start position is the ID that residue has within the PDB definition.
This PDB position is, however, a wee bit problematic as it may not have coordinates: these are the greyed-out residues in PyMOL. Namely, a PDB file has the SEQRES records, and an mmCIF file the _entity_poly_seq category, which list every residue of the protein that was crystallised, including those without coordinates (i.e. without ATOM lines), and often the termini cannot be solved.
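As a rough illustration of the SEQRES versus ATOM mismatch, the sketch below counts how many residues a chain declares in SEQRES and how many actually have ATOM lines. It relies on the fixed-column layout of the PDB format; the function name and the three-residue demo file are made up, and HETATM records (e.g. modified residues) are deliberately ignored.

```python
from typing import Set, Tuple


def seqres_vs_coords(pdb_text: str, chain: str) -> Tuple[int, int]:
    """Return (residues listed in SEQRES, residues with ATOM coordinates)."""
    n_seqres = 0
    with_coords: Set[Tuple[int, str]] = set()
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES") and line[11:12] == chain:
            # Residue names occupy columns 20-70 of a SEQRES record.
            n_seqres += len(line[19:70].split())
        elif line.startswith("ATOM") and line[21:22] == chain:
            resseq = int(line[22:26])     # columns 23-26: residue number
            icode = line[26:27].strip()   # column 27: insertion code
            with_coords.add((resseq, icode))
    return n_seqres, len(with_coords)


# Hypothetical toy file: six residues declared, only three with coordinates.
DEMO = "\n".join([
    "SEQRES   1 A    6  MET GLY SER ALA GLY THR",
    "ATOM      1  CA  SER A   3      11.000  12.000  13.000  1.00 20.00           C",
    "ATOM      2  CA  ALA A   4      14.000  12.000  13.000  1.00 20.00           C",
    "ATOM      3  CA  GLY A   5      17.000  12.000  13.000  1.00 20.00           C",
])

print(seqres_vs_coords(DEMO, "A"))  # (6, 3)
```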

Solutions

The RCSB PDB has some derived data, but none of it is useful for this purpose: they have the underlying data, but no decent API or downloadable dataset for it.

PDBe API

The PDBe has a really nice API that gives the protein names and gene names for each chain, along with positions in the extracted sequence. It is a bit satanic, but manageable (it actually reflects the mmCIF dictionaries). Unfortunately, the protein names are whatever the depositor wrote, say "Truncation a–actine (chain)". This is an exaggeration, but you do get extraneous words like truncation or chain, Greek letters as Latin letters or spelt out, encoding artefacts and, my favorite, names written in French. The gene names are more normal.

SIFTS

EBI's SIFTS dataset (https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html) maps the three numbering schemes (position in the extracted sequence, PDB residue number and Uniprot position) onto each other and gives the Uniprot (Swissprot) identifier, which is great*.
Let's take as examples the structures 2X5P and 2WMN.

PDB	CHAIN	SP_PRIMARY	RES_BEG	RES_END	PDB_BEG	PDB_END	SP_BEG	SP_END
2x5p	A	Q8G9G1	4	121	1	118	440
2wmn	A	Q9BZ29	1	428	None	None	1605	2069

Namely, the protein in the crystal structure 2X5P corresponds to residues 118-440 in the full-length Uniprot sequence (SP_BEG and SP_END). There is a cloning scar that was crystallised, and it makes up the first three residues. Yet the numbering on the PDB ATOM entries starts at -2 (albeit perfectly logical, this is actually unusual), so RES_BEG is 4 while PDB_BEG is 1. In the crystal structure 2WMN, we see a very common occurrence: the termini did not crystallise. Consequently, the sequence starts from 1, but residues 1-4 are not solved, hence the None in the PDB fields.
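Reading such a table is straightforward with the standard library. The sketch below assumes a tab-separated file with the header shown above and possibly a leading comment line starting with "#" (that is my assumption about the downloadable file). The class and function names are mine. The PDB_BEG and PDB_END fields are kept as strings because they can be negative, carry an insertion code, or be the literal None.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional


class SiftsSegment(NamedTuple):
    pdb: str
    chain: str
    uniprot: str
    res_beg: int            # position in the extracted sequence
    res_end: int
    pdb_beg: Optional[str]  # residue id in the ATOM records, if solved
    pdb_end: Optional[str]
    sp_beg: int             # position in the Uniprot sequence
    sp_end: int


def _none_or_text(value: str) -> Optional[str]:
    return None if value == "None" else value


def read_sifts(lines: Iterable[str]) -> Iterator[SiftsSegment]:
    """Yield one SiftsSegment per data row, skipping '#' comment lines."""
    rows = (line for line in lines if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        yield SiftsSegment(
            pdb=row["PDB"],
            chain=row["CHAIN"],
            uniprot=row["SP_PRIMARY"],
            res_beg=int(row["RES_BEG"]),
            res_end=int(row["RES_END"]),
            pdb_beg=_none_or_text(row["PDB_BEG"]),
            pdb_end=_none_or_text(row["PDB_END"]),
            sp_beg=int(row["SP_BEG"]),
            sp_end=int(row["SP_END"]),
        )


# The 2WMN row from the table above.
EXAMPLE = (
    "PDB\tCHAIN\tSP_PRIMARY\tRES_BEG\tRES_END\tPDB_BEG\tPDB_END\tSP_BEG\tSP_END\n"
    "2wmn\tA\tQ9BZ29\t1\t428\tNone\tNone\t1605\t2069\n"
)
segment = next(read_sifts(io.StringIO(EXAMPLE)))
print(segment.uniprot, segment.pdb_beg, segment.sp_beg)  # Q9BZ29 None 1605
```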

To get the first solved residue, the only way is to read the PDB/mmCIF file itself and take the first residue of that chain, assuming it is sorted. It ought to be: if it isn't, it is someone's weirdo structure that is not to be trusted. From there, shift SP_BEG and SP_END accordingly.
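The shift itself could look something like the sketch below. It assumes you already have the positions, within the extracted sequence, of the residues that have coordinates (in an mmCIF file, _atom_site.label_seq_id points into _entity_poly_seq, so it can serve for this), and that the SIFTS segment maps contiguously without gaps. The function name and the numbers in the usage lines are hypothetical. Taking the minimum and maximum sidesteps the question of whether the file is sorted.

```python
from typing import Iterable, Optional, Tuple


def solved_uniprot_range(
    solved_positions: Iterable[int],
    res_beg: int,
    res_end: int,
    sp_beg: int,
) -> Optional[Tuple[int, int]]:
    """Uniprot positions of the first and last residue with coordinates.

    solved_positions are 1-based positions in the extracted sequence.
    Returns None if no solved residue falls within RES_BEG..RES_END.
    """
    covered = [p for p in solved_positions if res_beg <= p <= res_end]
    if not covered:
        return None
    offset = sp_beg - res_beg  # constant shift, assuming a gapless segment
    return min(covered) + offset, max(covered) + offset


# Hypothetical segment: RES 4-121 maps to Uniprot from position 200,
# and only positions 5-100 of the extracted sequence have coordinates.
print(solved_uniprot_range(range(5, 101), res_beg=4, res_end=121, sp_beg=200))
# (201, 296)
```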

There are some nasty pitfalls that may cause issues. Chiefly, it is not safe to assume that the length is the last index minus the first index plus one. One reason for this is "insertion codes", letters that indicate where an extra amino acid is inserted, akin to house numbering (say, houses 11, 12A, 12B, 13 in a road), although these are sometimes (mis)used for alternative occupancy. Inserted tags may also have some odd numbering jumps. Luckily, these cases are quite uncommon. Proteopedia has a nice list of corner cases, which is very interesting as it shows why such complex cases are created.
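The house-numbering example can be sketched in a few lines: treat each residue ID as a (number, insertion code) pair and count the distinct pairs rather than subtracting indices. The helper names and the regular expression are mine, and the pattern only covers a signed integer followed by at most one letter.

```python
import re
from typing import List, Tuple

_RESIDUE_ID = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def parse_residue_id(label: str) -> Tuple[int, str]:
    """Split a label such as '12A' or '-2' into (number, insertion code)."""
    match = _RESIDUE_ID.match(label.strip())
    if match is None:
        raise ValueError(f"Not a residue id: {label!r}")
    return int(match.group(1)), match.group(2)


def count_residues(labels: List[str]) -> int:
    """Number of distinct residues, insertion codes included."""
    return len({parse_residue_id(label) for label in labels})


houses = ["11", "12A", "12B", "13"]
first = parse_residue_id(houses[0])[0]
last = parse_residue_id(houses[-1])[0]
print(last - first + 1)        # 3, the naive length
print(count_residues(houses))  # 4, the actual number of residues
```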

* Biopython does not have a Uniprot parser, but I have one (see my GitHub!).

The art of blowing up protein


Wednesday, 4 September 2019