Saturday, 3 August 2019

When will the PDB run out of 4-letter codes?

The PDB IDs are really nice and short: 4-letter codes. But when will all the combinations run out? Actually, not for a long, long time.
The current total is 155,618 structures and new ones are added at a rate of about 12,000 per year, which means that, assuming constant growth, the PDB will run out of codes to allocate in roughly 127 years: (36^4 - 155,618) / 12,000.
That is around 2146, a few years after the setting of Kim Stanley Robinson's New York 2140, where New York is a flooded super-Venice, so I am guessing the RCSB PDB, in San Diego, will have long been flooded and a lack of 4-letter codes will not be top of their concerns.
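
The back-of-the-envelope sum above can be reproduced in a few lines of Python. This is only a sketch under the post's own assumptions (every 4-character alphanumeric code is usable, and growth stays constant at 12,000 entries per year); the function and variable names are made up for illustration.

```python
# Naive estimate: all 36^4 alphanumeric codes are usable and the
# deposition rate stays constant (both are assumptions from the text).
ALPHABET_SIZE = 36   # digits 0-9 plus letters A-Z
CODE_LENGTH = 4


def years_until_exhaustion(total_codes, used, per_year):
    """Years until the remaining codes are used up at a constant rate."""
    return (total_codes - used) / per_year


total_codes = ALPHABET_SIZE ** CODE_LENGTH
years_left = years_until_exhaustion(total_codes, used=155_618, per_year=12_000)

print(f"{total_codes:,} possible codes")
print(f"about {years_left:.0f} years left, i.e. around {2019 + round(years_left)}")
```

With these inputs the code space is 1,679,616 codes and the formula works out to roughly 127 years.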

The date may be a bit sooner. There are no codes starting with zero (the first public entry is PDB:100D).
A fair chunk of codes are burnt because they correspond to predicted models that were removed, and although codes are otherwise random, a few older structures have non-random names, like 1UBQ and 1GFL. The highest code is PDB:9XIM, but the latest entries are in the 6Pxx range. 7 * 36^3 is about 327,000, which means that about half the codes up to that point are skipped or private, so a more accurate estimate is about 60 years' time.
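
One way to arrive at a figure in that region is sketched below. It is an assumption about the reasoning, not necessarily the exact calculation behind the number quoted: if only about half of the codes passed so far hold a public entry, codes are effectively consumed at about twice the deposition rate. The 7 * 36^3 count and the entry totals come from the text; the variable names are illustrative.

```python
# Adjusted estimate: codes are consumed faster than entries are released,
# because some codes are skipped, withdrawn or not public.
TOTAL_CODES = 36 ** 4        # naive code space, as in the first estimate
codes_passed = 7 * 36 ** 3   # codes up to the end of the 6xxx block (from the text)
public_entries = 155_618     # entries at the time of writing
entries_per_year = 12_000    # assumed constant

fill_fraction = public_entries / codes_passed      # share of passed codes in use
codes_per_year = entries_per_year / fill_fraction  # effective consumption rate
years_left = (TOTAL_CODES - codes_passed) / codes_per_year

print(f"{codes_passed:,} codes passed, {fill_fraction:.0%} of them public")
print(f"about {years_left:.0f} years left")
```

Under these assumptions about 48% of the codes passed are in use and the estimate lands in the mid-fifties, the same ballpark as simply halving the naive figure.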


However, what is highly problematic and much discussed in CCP4/PDB circles is running out of three-letter residue names. Every residue, canonical or non-canonical, and every ligand has a three-letter name, e.g. biotin is BTN. These will run out in a few years. That said, there are no strong safeguards against the same name being used twice (or against different atom names in the same named compound), so I strongly suspect that many names will be reused.

