Top 10 silliest PDB residue names for ligands!

Sunday, 19 June 2022

UPDATE: The PDB will run out of 3-letter chemical component IDs sometime before 2024, at which point it will switch to 5-letter codes, which will be usable only in CIF format: https://www.wwpdb.org/news/news?year=2022#630fee4cebdf34532a949c34

In some situations it is handy for an in silico experiment to use a 3-letter residue name that is not taken in the PDB. For example, PyRosetta has a system of pregenerated topologies for PDB components, which can cause issues when a ligand is loaded: the movers may use the pregenerated topology instead of an incorrectly provided residue type / params file, resulting in a blown-up, misshapen ligand, an overly common incident*. As a result, having a list of what is taken to hand is helpful. Herein are some silly observations about which names are taken and untaken, but not ranked as a top 10, because this is a science blog, not my local newspaper.

∗) This feature can be disabled with load_PDB_components, but often one may want to keep it on and not use ignore_unrecognized_res either.
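
As a rough illustration of the footnote, the sketch below assembles an option string and hands it to PyRosetta. It assumes the flag is written as `-load_PDB_components false` and passed through `pyrosetta.init(extra_options=...)`. The helper names `make_init_options` and `start_pyrosetta` and the file name `LIG.params` are made up for this example.

```python
from typing import Sequence


def make_init_options(params_files: Sequence[str] = (),
                      load_pdb_components: bool = True) -> str:
    """Assemble a Rosetta command-line style option string (hypothetical helper)."""
    flag = 'true' if load_pdb_components else 'false'
    options = [f'-load_PDB_components {flag}']
    if params_files:
        # -extra_res_fa takes one or more full-atom params files
        options.append('-extra_res_fa ' + ' '.join(params_files))
    return ' '.join(options)


def start_pyrosetta(options: str) -> None:
    """Initialise PyRosetta with the given options (requires pyrosetta installed)."""
    import pyrosetta  # imported lazily so the helper above works without it
    pyrosetta.init(extra_options=options)


# 'LIG.params' is a placeholder file name, not a file from this post.
options = make_init_options(['LIG.params'], load_pdb_components=False)
print(options)
```

Calling `start_pyrosetta(options)` would then start a session in which the pregenerated PDB components are not loaded, so a ligand named `LIG` would have to come from the params file.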

Untaken chemical components

The European PDB (PDBe) provides the chemical components in the database in several formats, one of which is just a list of entries (chem_comp.list). So let's extract these and see how many names are available:

import string
import itertools
from typing import List, Set, Tuple

# Read the taken residue names, one per line
with open('chem_comp.list') as fh:
    chem_resns: Tuple[str, ...] = tuple(map(str.strip, fh))

# A set makes the membership test below fast
takens: Set[str] = set(chem_resns)

# All 1-, 2- and 3-character names over digits and uppercase letters
possibles: Set[str] = set()
for repeat in (1, 2, 3):
    possibles.update(map(''.join, itertools.product(string.digits + string.ascii_uppercase, repeat=repeat)))
available: List[str] = sorted(possibles - takens)

The itertools.product was called with repeat 1, 2 and 3 because single- and double-character names are valid. For example, residue A is adenosine monophosphate, the polymeric nucleotide in RNA. Parenthetically, chemical component definitions are nominally in the unreacted/monomeric state with neutral protonation: for example, alanine in the PDB definition includes the OXT atom. However, confusingly, some covalent compounds may be submitted in the reacted form without a dummy atom (* in SMILES, R in drawn/mol files).

As of March 2022, of the possible 47,988 names, 36,757 are taken leaving 11,231 available (JSON of untaken 1–3 letter names).
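
The total of possible names can be checked by hand: with 36 characters (10 digits plus 26 uppercase letters) there are 36 + 36² + 36³ names of one to three characters. A minimal sketch of that arithmetic and of the taken/available split follows. `count_names` and `split_names` are hypothetical helper names, and `toy_taken` is a four-entry stand-in, not the real chem_comp.list.

```python
import itertools
import string
from typing import List, Set, Tuple

ALPHABET: str = string.digits + string.ascii_uppercase  # 36 characters


def count_names(max_len: int = 3) -> int:
    """Number of names of length 1..max_len over the 36-character alphabet."""
    return sum(len(ALPHABET) ** n for n in range(1, max_len + 1))


def split_names(taken: Set[str], max_len: int = 3) -> Tuple[List[str], List[str]]:
    """Return (taken, available) as sorted lists over the full name space."""
    possibles = {''.join(chars)
                 for n in range(1, max_len + 1)
                 for chars in itertools.product(ALPHABET, repeat=n)}
    return sorted(possibles & taken), sorted(possibles - taken)


total = count_names()                    # 36 + 1296 + 46656
toy_taken = {'A', 'ALA', 'ALF', 'AWS'}   # toy data for illustration only
used, free = split_names(toy_taken)
print(total, len(used), len(free))       # 47988 4 47984
```

With the real list in place of `toy_taken`, the last two numbers would be the taken and available counts quoted above.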

So what does the distribution of untaken names look like?

As expected few of the names starting with early letters are untaken, but the available ones of these appear random. For example for A, we have untaken: A0E, A0F, A0I, A0M, A0N, A0X, A10, A1D, A1W, A2A, A2S, A2U, A30, A3I, A3L, A3O, A3U, A3Z, A4U, A5S, A67, A6F, A6X, A7F, A7J, A7P, A7U, A7V, A7X, A8A, A8I, A8J, A8R, A8Y, A9A, A9I, A9X, AAJ, AAW, AC3, AD0, AEB, AEC, AGZ, AH5, AHJ, AHV, AI0, AI4, AI5, AI6, AID, AIE, AII, AIP, AIY, AJ0, AJ9, AJC, AJF, AJO, AJS, AJT, AJW, AK9, AKF, AKQ, AMK, AO0, AQF, AQI, AQL, AQR, AQX, ARY, AT0, ATB, ATN, AU0, AU9, AUA, AUM, AUS, AV8, AVS, AWI, AX9, AY3, AYF, AYP, AYY. In reality these are likely embargoed structures that never saw the light of day —there are many similarly missing PDB codes, but most of these are due to models that were initially allowed but then were removed. In the list above the AQX strikes me as rather cyberpunk. Of the taken letters in A, ALF is tetrafluoroaluminate, which is a pretty alien inorganic compound.
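
One way to tabulate such a distribution is to count the untaken names by their first character. The sketch below is not necessarily how the original plot was made. `first_char_distribution` is a hypothetical helper, and the sample holds only a handful of untaken names quoted in this post.

```python
from collections import Counter
from typing import Dict, Iterable


def first_char_distribution(names: Iterable[str]) -> Dict[str, int]:
    """Count names by their first character, ordered by that character."""
    counts = Counter(name[0] for name in names if name)
    return dict(sorted(counts.items()))


# A few untaken names mentioned in the text; the real input would be
# the full list of available names.
sample_untaken = ['A0E', 'A0F', 'AQX', 'AYY', 'SEX']
distribution = first_char_distribution(sample_untaken)
print(distribution)  # {'A': 4, 'S': 1}
```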

Residue AWS is not taken up by Amazon Web Services, but by a ligand from an Oxford SGC/Diamond fragment-based drug discovery screen (a PanDDA event). Given that accidents with AWS can be rather costly, it would have been nice if it were the most expensive compound in the PDB. Unfortunately, that would be near impossible to determine, so I do not know which compound that is, but it is probably some natural compound extracted from a tropical plant found only on an inaccessible island.

Barring odd spellings, the only giggle words in three letters I can think of are SEX, WEE, PEE, POO, CUM and POP. Possibly on purpose, the ligand SEX is untaken (sildenafil is VIA). PEE is a phospholipid. POO is a herpes C polymerase inhibitor, while POP is pyrophosphate, which also appears as PPV in a different protonation state (crystal structures generally do not resolve protons, so the chemical components are a mess). Odd spellings seem okay: FUK is ASTX660 (Astex), an apoptosis inhibition inhibitor targeted at cancers, so an unfortunate allocation from the PDB's system.
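
Checking a handful of candidate words like this is just a set-membership test. The sketch below assumes a tiny hand-typed `taken` set built only from codes this post describes as taken, not the real chem_comp.list. `is_taken_report` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Set


def is_taken_report(words: Iterable[str], taken: Set[str]) -> Dict[str, bool]:
    """Map each candidate word (upper-cased) to whether it is already a residue name."""
    return {word.upper(): word.upper() in taken for word in words}


# Toy stand-in for the real list: only codes mentioned above as taken.
taken = {'PEE', 'POO', 'POP', 'PPV', 'VIA', 'FUK', 'AWS', 'ALF'}
report = is_taken_report(['sex', 'pee', 'poo', 'pop'], taken)
free_words = [word for word, used in report.items() if not used]
print(free_words)  # ['SEX']
```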

Scrabble legal

I cross-referenced the taken names with Scrabble-legal words. As expected, Qi-like Scrabble words dominate.

import requests

scrabble: List[str] = requests.get('https://raw.githubusercontent.com/benjamincrom/scrabble/master/scrabble/dictionary.json').json()
# Keep only the Scrabble words that are also taken residue names
possible: List[str] = sorted(set(map(str.upper, scrabble)).intersection(chem_resns))
print(', '.join(possible))

AA, AAH, AAL, AAS, ABA, ABO, ABS, ABY, ACE, ACT, ADD, ADO, ADS, ADZ, AFF, AFT, AG, AGA, AGE, AGO, AHA, AI, AIL, AIM, AIN, AIR, AIS, AIT, AL, ALA, ALB, ALE, ALL, ALP, ALS, ALT, AM, AMA, AMI, AMP, AMU, ANA, AND, ANE, ANI, ANT, ANY, APE, APT, AR, ARB, ARC, ARE, ARF, ARK, ARM, ARS, ART, AS, ASH, ASK, ASP, ASS, ATE, ATT, AUK, AVA, AVE, AVO, AWA, AWE, AWL, AWN, AXE, AYE, AYS, AZO, BA, BAA, BAG, BAH, BAL, BAM, BAN, BAP, BAR, BAS, BAT, BAY, BED, BEE, BEG, BEL, BEN, BET, BEY, BIB, BID, BIG, BIN, BIO, BIS, BIT, BIZ, BOA, BOB, BOD, BOG, BOP, BOS, BOT, BOW, BOX, BOY, BRA, BRO, BRR, BUB, BUD, BUG, BUM, BUN, BUR, BUT, BUY, BYE, BYS, CAB, CAD, CAM, CAN, CAP, CAR, CAT, CAW, CAY, CEE, CEL, CEP, CHI, CIS, COB, COD, COG, COL, CON, COO, COP, COR, COS, COT, COW, COX, COY, COZ, CRY, CUB, CUD, CUE, CUM, CUP, CUR, CUT, CWM, DAB, DAD, DAG, DAH, DAK, DAL, DAM, DAP, DAW, DAY, DEB, DEE, DEL, DEN, DEV, DEW, DEX, DEY, DIB, DID, DIE, DIG, DIM, DIN, DIP, DIS, DIT, DOC, DOE, DOG, DOL, DOM, DON, DOR, DOS, DOT, DOW, DRY, DUB, DUD, DUE, DUG, DUN, DUO, DUP, EAR, EAT, EAU, EBB, ECU, EDH, EEL, EFF, EFS, EFT, EGG, EGO, EKE, EL, ELD, ELF, ELK, ELL, ELM, ELS, EME, EMF, EMS, EMU, END, ENG, ENS, EON, ERA, ERE, ERG, ERN, ERR, ERS, ESS, ET, ETA, ETH, EVE, EWE, EYE, FA, FAD, FAG, FAN, FAR, FAS, FAT, FAX, FAY, FED, FEE, FEH, FEM, FEN, FER, FEU, FEW, FEY, FEZ, FIB, FID, FIG, FIL, FIN, FIR, FIT, FIX, FLU, FLY, FOB, FOE, FOG, FOH, FON, FOP, FOR, FOU, FOX, FOY, FRO, FRY, FUB, FUD, FUG, FUN, FUR, GAB, GAD, GAE, GAG, GAL, GAM, GAN, GAP, GAR, GAS, GAT, GEE, GEL, GEM, GEN, GET, GEY, GIG, GIN, GIP, GIT, GNU, GOA, GOB, GOD, GOO, GOT, GOX, GOY, GUL, GUM, GUN, GUT, GUV, GUY, GYM, GYP, HAD, HAE, HAG, HAH, HAJ, HAM, HAO, HAP, HAS, HAT, HAW, HAY, HEH, HEM, HEN, HEP, HER, HES, HET, HEW, HEX, HEY, HIC, HID, HIE, HIN, HIP, HIS, HIT, HMM, HO, HOB, HOE, HOG, HOP, HOT, HOW, HUB, HUE, HUG, HUH, HUM, HUN, HUP, HUT, HYP, ICE, ICH, ICY, IDS, IFS, ILL, IMP, IN, INK, INN, INS, ION, IRE, IRK, ISM, ITS, JAB, JAG, JAR, JAW, JAY, JEE, JET, 
JEU, JEW, JIN, JOB, JOE, JOG, JOT, JOW, JOY, JUG, JUN, JUS, JUT, KAB, KAF, KAS, KAT, KAY, KEA, KEF, KEG, KEN, KEP, KEY, KIF, KIN, KIR, KOB, KOP, KOR, KOS, KUE, LA, LAB, LAC, LAD, LAG, LAM, LAP, LAR, LAS, LAT, LAV, LAX, LAY, LEA, LED, LEE, LEG, LEI, LEK, LET, LEU, LEV, LEX, LEY, LEZ, LI, LIB, LID, LIE, LIN, LIP, LIS, LIT, LOB, LOG, LOP, LOT, LOW, LOX, LUG, LUM, LUV, LUX, LYE, MA, MAC, MAD, MAE, MAG, MAN, MAP, MAR, MAS, MAT, MAW, MAX, MAY, MED, MEL, MEM, MEN, MET, MEW, MHO, MIB, MID, MIG, MIL, MIM, MIR, MIS, MIX, MO, MOA, MOB, MOC, MOD, MOG, MOL, MOM, MON, MOO, MOP, MOR, MOS, MOT, MOW, MUD, MUG, MUM, MUN, MUS, MUT, NA, NAB, NAE, NAG, NAH, NAM, NAN, NAP, NAW, NAY, NEB, NEE, NET, NEW, NIL, NIM, NIP, NIT, NIX, NO, NOB, NOD, NOG, NOH, NOM, NOO, NOR, NOS, NOT, NOW, NTH, NUB, NUT, OAF, OAK, OAR, OBE, OBI, OCA, ODD, ODE, ODS, OES, OFF, OFT, OH, OHM, OHO, OHS, OIL, OKA, OKE, OLD, OLE, OMS, ONE, ONS, OOH, OPE, OPS, OPT, ORA, ORB, ORC, ORE, ORS, ORT, OS, OSE, OUD, OUR, OUT, OVA, OWL, OWN, OX, OXO, OXY, PAC, PAD, PAH, PAL, PAM, PAN, PAP, PAR, PAS, PAT, PAW, PAX, PAY, PEA, PEC, PED, PEE, PEG, PEH, PEP, PER, PET, PEW, PHI, PHT, PI, PIA, PIC, PIE, PIG, PIN, PIP, PIS, PIT, PIU, PIX, PLY, POD, POH, POI, POL, POM, POP, POT, POW, POX, PRO, PRY, PSI, PUB, PUD, PUG, PUL, PUN, PUP, PUR, PUS, PUT, PYA, PYE, PYX, QAT, QUA, RAD, RAH, RAJ, RAM, RAN, RAP, RAS, RAT, RAW, RAX, RAY, RE, REB, REC, RED, REE, REF, REG, REI, REM, REP, RES, RET, REV, REX, RHO, RIA, RIB, RID, RIF, RIM, RIN, RIP, ROB, ROC, ROD, ROE, ROM, RUB, RUE, RUG, RUM, RUN, RUT, RYA, RYE, SAB, SAC, SAD, SAE, SAG, SAL, SAP, SAT, SAU, SAW, SAX, SAY, SEA, SEC, SEE, SEG, SEI, SEL, SEN, SER, SET, SHA, SHH, SHY, SIB, SIC, SIM, SIN, SIP, SIR, SIS, SIX, SKA, SKY, SLY, SOD, SOL, SON, SOP, SOS, SOT, SOX, SOY, SPA, SPY, SRI, STY, SUB, SUE, SUN, SUP, SUQ, SYN, TAB, TAD, TAE, TAG, TAJ, TAM, TAN, TAO, TAP, TAR, TAS, TAT, TAU, TAV, TAW, TAX, TEA, TED, TEE, TEG, TEL, TEN, TET, TEW, THE, THO, THY, TIC, TIL, TIN, TIS, TIT, TOD, TOE, TOG, TOM, TON, 
TOP, TOR, TOT, TOW, TOY, TRY, TSK, TUB, TUG, TUI, TUN, TUP, TUT, TUX, TWA, TWO, TYE, UDO, UGH, UKE, UMM, UMP, UNS, UPS, URB, URD, URN, USE, UTA, UTS, VAC, VAR, VAS, VAT, VAU, VAW, VEE, VEG, VET, VIA, VIG, VIS, VUG, WAD, WAN, WAS, WAY, WHA, WHY, WIN, WOE, WOG, WOO, WOP, WOS, WOT, WOW, WRY, WUD, WYE, XIS, YAK, YAM, YAP, YEN, YES, YET, YIN, YIP, YOD, YOK, YOM, YUK, YUM, YUP, ZAP, ZED, ZEE, ZIG, ZIN, ZIP, ZIT, ZOA, ZOO.
Most of these are synthetic compounds; few are natural products. Figuring out which are natural products is a rather convoluted process. Cofactors are easily spotted in the PDBe data via the 'mapping' API route, wherein they will be in the 'function' field:

import requests
import pandas as pd
from typing import Any, Dict
x = requests.post('https://www.ebi.ac.uk/pdbe/api/pdb/compound/mappings/', data=','.join(possibles)).json()
xref_dict: Dict[str, Any] = {n: l[0] if l else {} for n, l in x.items()}
xref = pd.DataFrame(xref_dict).transpose()

The 'mapping' PDBe API route gives a series of cross-reference keys, but they are not fully complete: only some records have ChEBI, ChEMBL or DrugBank IDs, and PubChem IDs are absent. ChEBI records whether a compound is a natural product (e.g. phaseolin, a bean pterocarpan, which is a fancy flavonoid).

Of the above list, I would say (non-empirically) that SIR is a curious one: the autogenerated model of the cobalt-bound porphyrin ring is completely and utterly wrong, because metal-coordinated compounds crash most computational chemistry tools.
