tag:blogger.com,1999:blog-90151742348714422372024-03-17T00:33:11.561-07:00The art of blowing up proteinA segfault and NaN driven series of disconnected ideas, analyses and just plain silly posts about computational biochemistry, synthetic biology and microbiology.Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.comBlogger126125tag:blogger.com,1999:blog-9015174234871442237.post-52860385956364807342024-03-10T07:12:00.000-07:002024-03-10T07:12:43.185-07:00Crossposting<p> I have been rather quiet here, on my personal blog, for a variety of reasons, mainly because of past collaborations encroaching into the spare after-hours I would otherwise dedicate to it (<i>i.e.</i> weekend early mornings). I have been involved in various projects, many of which I would like to write blog posts about or people would like me to write about, so I am well behind on what I would like to post. However, as part of OPIG, I have written a few posts there (<a href="https://www.blopig.com/blog/author/nuben/" target="_blank">Blopig</a>), some on request (who would willingly write about fixing <a href="https://www.blopig.com/blog/2024/01/tip-and-tricks-to-correct-a-cuda-toolkit-installation-in-conda/" target="_blank">CUDA installations</a> or exposing <a href="https://www.blopig.com/blog/2023/10/ssh-the-boss-fight-level-jupyter-notebooks-from-compute-nodes/" target="_blank">Jupyter notebooks in a compute node via reverse port forwarding</a>?) 
and some out of personal choice.</p><span><a name='more'></a></span><p>As of February 2024, these are the comp chem/biochem ones:</p><div><ul><li><a href="https://www.blopig.com/blog/2023/11/the-workings-of-fragmensteins-rdkit-neighbour-aware-minimisation/" target="_blank">The workings of Fragmenstein’s RDKit neighbour-aware minimisation</a></li><li><a href="https://www.blopig.com/blog/2023/11/demystifying-the-thermodynamics-of-ligand-binding/" target="_blank">Demystifying the thermodynamics of ligand binding</a></li><li><a href="https://www.blopig.com/blog/2023/08/placeholder-compounds-distraction-vs-accuracy/" target="_blank">Placeholder compounds: distraction vs. accuracy</a></li><li><a href="https://www.blopig.com/blog/2023/06/customising-mcs-mapping-in-rdkit/" target="_blank">Customising MCS mapping in RDKit</a></li></ul><div>I have a few upcoming posts I am slowly writing, namely about</div><div><ul><li>Escape from Flatland, and into Cthulhu's realm</li><li>Compound picking: balancing predicted affinity, risk, and cost while sampling novel interactions at the Pareto frontier </li><li>Primer on pathogenic protein variants beyond simple destabilisation</li><li>Programmatically working with XChem data and the multilayered nature of non-binders</li><li>Cranking up the amount of Rosetta analyses in RFdiffusion</li></ul><div>If you'd like a sneak peek, feel free to email me at matteo.ferla (gmail or stats.ox.ac.uk). 
I get one or two out-of-the-blue emails a week and I do tend to reply to them.</div><div><br /></div><div>Lastly, I ought to mention that I started posting <a href="https://gist.github.com/matteoferla/2001c5578c24651bb04ef910004c7d52" target="_blank">my coding-problem solutions as comments in a Gist</a>, so if you, like someone who emailed me the other day, found a coding solution of mine but could not find it again, try there!</div><div><br /></div></div><div><br /></div></div>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-48976160723490685642024-01-22T10:08:00.000-08:002024-03-10T07:16:10.876-07:00Custom carbon colours in py3Dmol<p> NGLView (the Python module) is frozen on an older IPywidget version, which breaks Colab, and the major change in the latter library was a year ago (early 2023), so I am forced to revisit old code and switch to py3Dmol in my Colab demos. Today I figured out how to use custom carbon colours.</p><span><a name='more'></a></span><p>Given a hex code for a colour (<code>#📕📕📗📗📘📘</code>), in PyMOL one can set it to the carbons via <code>color 0x📕📕📗📗📘📘, element C and 👾👾</code>, which is a bit cryptic (0x is for hex numbers): so even with a common program that's a desktop binary and not a JS widget in Python the syntax can be tricky. In py3Dmol it is very tricky. If the colour is a CSS built-in, in py3Dmol / 3Dmol.js name+'Carbon' will work. Example:</p><pre><code>import py3Dmol
from rdkit import Chem
from rdkit.Chem import AllChem
# A fun 3D molecule (if you don't like the look of a tropane/adamantane ring please go to A&E / ER)
atropine = Chem.MolFromSmiles('CN(C)C(=O)OC1C2CC3CC1CC(C2)(C3)OC(=O)C')
AllChem.EmbedMolecule(atropine)
viewer = py3Dmol.view(width=350, height=350)
viewer.addModel(Chem.MolToMolBlock(atropine), 'sdf')
viewer.setStyle({'model':-1}, {'stick':{'colorscheme':'coralCarbon', 'opacity': 1}})
viewer.zoomTo()
viewer.show()
</code></pre><pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTDzzT6T-TnADHKgckMOarMC-1kQZ1WzE00LINmjsoINbvUf2G8D3ycTjaULMvTdFsJI4Br7ueBpoSOjI9unuvc1kcm60XH7NOg688j1lEMyPoH9-MNYem0Xbj0Ti4YL8P4IYLRNmMpJnqg8-TQxksdGzmMrdBatJ2DHzZUxIn2cSe_iHmpqhYO6moxXI/s347/Screenshot%202024-01-22%20at%2017.42.55.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="121" data-original-width="347" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTDzzT6T-TnADHKgckMOarMC-1kQZ1WzE00LINmjsoINbvUf2G8D3ycTjaULMvTdFsJI4Br7ueBpoSOjI9unuvc1kcm60XH7NOg688j1lEMyPoH9-MNYem0Xbj0Ti4YL8P4IYLRNmMpJnqg8-TQxksdGzmMrdBatJ2DHzZUxIn2cSe_iHmpqhYO6moxXI/s320/Screenshot%202024-01-22%20at%2017.42.55.png" width="320" /></a></div><br /><code><br /></code></pre>
<p>If a hex code is passed instead, it will not work. The 3Dmol.js documentation describes defining custom colour schemes with a JS function <a href="https://3dmol.csb.pitt.edu/doc/tutorial-code.html#:~:text=%7D%3B-,Then%20we%20can%20apply,-that%20colouring%20function">here</a>, which from Python means passing JS code to the viewer. So how does py3Dmol inject JS?</p>
<p>This depends on whether a viewer is displayed or not. There are three string attributes <code>startjs</code>, <code>endjs</code>, and <code>updatejs</code>. The method <code>show</code> calls <code>_make_html</code>, which concatenates <code>startjs</code> and <code>endjs</code>, and adds it via <code>IPython.display.publish_display_data</code> —which is really cool and I only learnt about it this way. The <code>update</code> method does the same but on the <code>updatejs</code> string.<br /> So to inject JS in a yet to be shown viewer one can do:</p>
<pre><code>viewer = py3Dmol.view(width=350, height=350)
viewer.addModel(Chem.MolToMolBlock(atropine), 'sdf')
viewer.startjs += '''\n
let customColorize = function(atom){
    // attribute elem is from https://3dmol.csb.pitt.edu/doc/AtomSpec.html#elem
    if (atom.elem === 'C'){
        return "#00FF00"
    }else{
        return $3Dmol.getColorFromStyle(atom, {colorscheme: "whiteCarbon"});
    }
}
\n'''
viewer.setStyle({'model':-1}, {'stick':{'colorfunc': 'customColorize', 'opacity': 0.7}})
# make it a function not a string "customColorize"
viewer.startjs = viewer.startjs.replace('"customColorize"', 'customColorize')
viewer.zoomTo()
viewer.show()</code></pre><p style="text-align: left;">In the above, a trick happens: <code>customColorize</code> is added as a string, but the code is changed to make it a variable. Actually this is not the only case where this happens: the code has a main-namespace variable called viewer_UNIQUEID, in which UNIQUEID gets replaced by the value of <code>viewer.uniqueid</code>. Main namespace pollution is frowned upon by purists, but this is very handy for debugging as one can make the dev console pop up, type <code>window.viewer_👾👾👾👾👾</code> and do whatever test!</p><h2 style="text-align: left;">Footnote</h2><div>The Python code of py3Dmol is on GitHub (https://github.com/3dmol/3Dmol.js/tree/master/py3Dmol). But at the time, I was unable to find it, so I used this function:</div><br /><pre><code>def display_source(function):
    """
    Display the source code formatted
    """
    import inspect
    from IPython.display import HTML, display
    from pygments import highlight
    from pygments.lexers import PythonLexer
    from pygments.formatters import HtmlFormatter
    code:str = inspect.getsource(function)
    html:str = highlight(code, PythonLexer(), HtmlFormatter(style='colorful'))
    stylesheet:str = f"<style>{HtmlFormatter().get_style_defs('.highlight')}</style>"
    display(HTML(f"{stylesheet}{html}"))
display_source(type(viewer))</code></pre>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com2tag:blogger.com,1999:blog-9015174234871442237.post-12780816551972285042023-12-31T04:49:00.000-08:002024-01-02T02:18:44.946-08:00A possible BioB bypass routeNearly a decade ago there was something bugging me and I believe I have figured it out, although it's pointless now. Namely, is there another way of making biotin without using BioB (biotin synthase), an incredibly slow multistep radical SAM enzyme? <span><a name='more'></a></span><div><h3 style="text-align: left;">Biotin, the mysterious cofactor</h3><div>In my opinion, there are many things that make biotin interesting.</div><div><br /></div><div>Starting with its boat-like (endo envelope) puckered structure: it has two aliphatic fused 5-membered rings, one with a ureido group and the other with a bridging sulfur. The former is involved in catalysis (cf. acetyl-CoA carboxylase mechanism), while the latter is not conjugated with it and acts solely as a steric hindrance.</div><div><br /></div><div>One of the few enzymes using it, acetyl-CoA carboxylase, makes a metabolite needed to make it. Chicken-and-egg paradoxical pathways are not too uncommon and here a simple metabolite (malonyl-CoA) is made, so this can be explained via the Horowitz retrograde hypothesis of pathway formation. </div><div><br /></div><div>In urushiol and olivetolic acid (THC precursor) the fatty acid part is an off-the-shelf metabolite that gets used in the usual polyketide way; for biotin, a methyl ester is smuggled into the fatty acid biosynthesis pathway.</div><div><br /></div><div>However, the strangest part is that the thiophane is really hard to make. Closing this ring is the last step in biotin biosynthesis and is done by the radical SAM enzyme BioB. 
Radical SAM enzymes have a vast repertoire, generally involving C-H activation via a radical reaction, and mostly centred on methylation or skeletal rearrangement, possibly with carbon insertion. This enzyme instead inserts a sulfur atom into desthiobiotin (the sulfur is generally thought to come from the enzyme's own iron-sulfur cluster rather than from SAM), sulfur being the only part missing.</div><div><br /></div><h3 style="text-align: left;">Thiophane ring</h3><div>A further oddity is that the C-H activation part to make the thiophane ring happens on a carbon that was the beta-carbon of an alanine, which was added by BioF via the usual PLP-dependent decarboxylative aldol condensation.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTLjcmPXvNVY9gfjz2vnU1qEcHRK7sE2LUHuX428FvcfdQcRvnznOERk8dRPB4cOEgORGSQRZtaX5FfbbEW0F7MEj06I6HEfFVwArMxlMZC5SOg33qJfTz18s7NyjY2oJn2FfDOHmR48sUNAB6_yhObF2e1bSDHseil80kIFKxA6NIiv4TSjPmTZ_KTVE/s4961/bioF.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="785" data-original-width="4961" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTLjcmPXvNVY9gfjz2vnU1qEcHRK7sE2LUHuX428FvcfdQcRvnznOERk8dRPB4cOEgORGSQRZtaX5FfbbEW0F7MEj06I6HEfFVwArMxlMZC5SOg33qJfTz18s7NyjY2oJn2FfDOHmR48sUNAB6_yhObF2e1bSDHseil80kIFKxA6NIiv4TSjPmTZ_KTVE/w400-h64/bioF.png" width="400" /></a></div><div><br /></div><div>So why does BioF not use cysteine?</div><div><br /></div><div>If it used cysteine then the ring would still need to be closed in a reaction requiring C-H activation, albeit a simpler one than that of BioB. Several enzymes do this kind of reaction. Penicillins have a thiophane-like ring, but it's made differently, namely a short non-ribosomal peptide is cyclised at the Cys-Val part via a curious oxidoreductase enzyme, isopenicillin N synthase, which, as a member of the αKG-dependent hydroxylase family, uses a non-haem iron to generate a radical from dioxygen (cytochrome P450s are predominantly eukaryotic). 
In this enzyme the electron is passed to the sulfur, which attacks the highly substituted beta-carbon of the valine sidechain. For isopenicillin N it's a textbook anti-Markovnikov radical reaction; for the thiophane, the strain would probably favour a 5-membered ring. So it is not that.</div><div><br /></div><div>Cysteine cannot undergo a PLP-carbanion step without undergoing a spontaneous elimination in which the thiol leaves. During my PhD, I encountered this partially with serine racemase, which led me to look into cysteine racemase, an enzyme reported in the 1970s and then not spoken of again, as it is likely impossible, as I <a href="https://blog.matteoferla.com/2016/08/cysteine-racemase-impossible-enzyme.html" target="_blank">blogged about nearly a decade ago</a>. Methionine racemase and methionine transaminase have it easier as they have to fight not beta- but gamma-elimination.</div><div><br /></div><div>However, thioethers don’t have this problem (or not as badly). Out of curiosity I recently read up on <a href="https://www.pnas.org/doi/10.1073/pnas.1902095116" target="_blank">the biosynthesis of the fireworm (not firefly) luciferin</a>, not because I had ever heard of <i>Odontosyllis</i>, but because its luciferin sports a cata-fused tricyclic system with a thiazole and a thiazine ring. These are made by forming two thioethers with levodopa and two cysteines, which are then transaminated, triggering a spontaneous dehydrative intramolecular cyclisation (like glutamic semialdehyde cyclisation and several others).</div><div>Therefore, whereas it is impossible that nature could start with pimeloyl-CoA and cysteine via BioF, it is <i>conceivable</i> that nature could start not with pimeloyl-CoA and alanine, but with a beta-enoyl version of pimeloyl-CoA and cysteine via a hypothetical enzyme and a BioF-like enzyme that operates intramolecularly. 
The acyl compound (pimelenoyl-CoA) is generated in biosynthesis/degradation albeit with a different carrier, but then again pimeloyl-CoA is too. The missing hypothetical enzyme to make the S-conjugated cysteine-pimeloyl-CoA would need to do a thiol-Michael addition (like MetB), so could be an oxidase similar to fireworm tyrosinase, or could be a PLP-dependent enzyme that acts on an "activated" cysteine such as O-acetylcysteine or O-succinylcysteine or phosphoserine (maybe) as seen in methionine metabolism (if you want to know more, <a href="https://doi.org/10.1099/mic.0.077826-0" target="_blank">Ferla and Patrick 2014</a> is a nice read!). The enzyme that does the operation of BioF, but intramolecularly with the cysteine conjugate, may struggle with the tightness given the boat-like conformation, as after all the sulfur bridge is there to sterically hinder carbonyls binding instead of carbon dioxide. BioA would need to contend with a geometry change, but a transamination or reductive amination is easy for enzymes. If this hypothesis were correct there would be zero need for a BioB as the tetrahydrothiophene part is already made at the BioF step.</div><div><br /></div><div>There are several clades that inexplicably lack BioB, so I would bet in its favour. 
The major counterargument is that there is a clade without it, the <i>Rhizobiales</i> if I recall correctly, whose members secrete a lot of desthiobiotin, so this would favour the hypothesis that their BioF accepts cysteine and has a poor yield (unusual for an enzyme but not unheard of in secondary metabolism), that after the unwanted elimination the PLP gets unstuck (somehow magically), and that for the last step some enzyme radically activates the sulfur for ring closure (either another, simpler, radical SAM or an oxidoreductase working like isopenicillin N synthase).</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheZ4kLKlvsuOjI1Rqp3G7vfEroBu5SQnOR6UIPS9_avcfEtLsKohXPs-MGm78RuRkFiLrzfOpeAv71AKUgiS2MfcmuJebY_6czR83hCLhRwGlBjtW2bgbGy0P_TCf00SQU2X2_6lgLbNNlypN8K3x0HcjPXrnSL73Kh0ORPlVq3yYhotjBzaoOATtl__g/s3759/bioF-02.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="670" data-original-width="3759" height="71" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheZ4kLKlvsuOjI1Rqp3G7vfEroBu5SQnOR6UIPS9_avcfEtLsKohXPs-MGm78RuRkFiLrzfOpeAv71AKUgiS2MfcmuJebY_6czR83hCLhRwGlBjtW2bgbGy0P_TCf00SQU2X2_6lgLbNNlypN8K3x0HcjPXrnSL73Kh0ORPlVq3yYhotjBzaoOATtl__g/w400-h71/bioF-02.png" width="400" /></a></div><br /><div><br /></div><div>Getting hypothetical answers is always nice, even if they come a decade too late and are unimportant. 
And what better candidate enzyme to possibly solve the riddle than a PLP enzyme (I might be biased)!</div></div>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-59242562161003549442023-08-24T03:27:00.004-07:002024-02-09T00:20:48.453-08:00Reading compressed molecular files on NFS<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghs_R4yClJOV-8hT2QoMCF9HB6qWGNEwnL2SwfiSSubB9pu-YmNzrJZGk5-dmNTc27Xmb70H5JP0fzNJW7BDBNKq-iHxkfJYUkCrdMSSdfxJevPee6UJlkXzuFzqkdyzjo11bnzcnlAV9VqZf6jiUAtJDFCkLdj15KNNSVqM0uhcJh-w0FhwqfJijBmXo/s1024/DALL%C2%B7E%202024-02-09%2008.20.17%20-%20Imagine%20a%20cartoon-style%20illustration%20that%20features%20a%20box%20filled%20to%20the%20brim%20and%20overflowing%20with%20colorful%20molecular%20structures,%20all%20compressed%20within%20.webp" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="1024" data-original-width="1024" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghs_R4yClJOV-8hT2QoMCF9HB6qWGNEwnL2SwfiSSubB9pu-YmNzrJZGk5-dmNTc27Xmb70H5JP0fzNJW7BDBNKq-iHxkfJYUkCrdMSSdfxJevPee6UJlkXzuFzqkdyzjo11bnzcnlAV9VqZf6jiUAtJDFCkLdj15KNNSVqM0uhcJh-w0FhwqfJijBmXo/w200-h200/DALL%C2%B7E%202024-02-09%2008.20.17%20-%20Imagine%20a%20cartoon-style%20illustration%20that%20features%20a%20box%20filled%20to%20the%20brim%20and%20overflowing%20with%20colorful%20molecular%20structures,%20all%20compressed%20within%20.webp" width="200" /></a></div><br />There are some tasks that make one feel like a failed door-to-door evangelist; one amongst these is proselytising about using compressed files on networked file systems. Namely, networked file systems are slower than local SSD drives, so most often it is actually quicker to read compressed files and decompress them in memory rather than decompress them to disk first. 
Here are two Python snippets for dealing with small molecule files.<p></p><span><a name='more'></a></span>
<h2 style="text-align: left;"><span>Zip files</span></h2>
<p>These exist for Windows users and are annoying [intentional ambiguous subject].<br />A zip is a multifile archive, a tarball-equivalent. Below it's assumed there's a single file in the archive, but if there are more one could iterate across the list given by the <code>.infolist()</code> method of the compressed archive filehandle object (<code>zipfile.ZipFile</code>).<br />In RDKit, a filehandle of an SDF can be read, not via <code>Chem.SDMolSupplier</code>, but via <code>Chem.ForwardSDMolSupplier</code>. In particular, it needs to be a binary stream, not decoded text, which is what one gets here anyway; otherwise things would get complicated in order to avoid doing <code>binary = text.encode('utf8')</code>, which would fill memory up quickly. Another thing to note is that text and binary streams can be typehinted with <code>typing.TextIO</code> and <code>typing.BinaryIO</code> respectively, while the in-memory io classes are <code>io.StringIO</code> and <code>io.BytesIO</code>.</p>
<pre><code>import zipfile
from rdkit import Chem
from typing import BinaryIO
with zipfile.ZipFile('👾👾👾_sdf.zip', mode="r") as zah:
    # assumes a single file in the archive
    zfh: BinaryIO = zah.open(zah.infolist().pop())
    with Chem.ForwardSDMolSupplier(zfh) as sdfh:
        mol: Chem.Mol
        for mol in sdfh:
            ...</code></pre>
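<p>For archives holding more than one file, the iteration over <code>.infolist()</code> mentioned above could look roughly like the sketch below. It only uses the standard library: the archive is a toy one built in memory, the member names and contents are made up, and the helper <code>count_sdf_records</code> is hypothetical. It simply counts <code>$$$$</code> record terminators instead of parsing molecules.</p>

```python
import io
import zipfile
from typing import BinaryIO, Dict


def count_sdf_records(zip_source) -> Dict[str, int]:
    """Count ``$$$$`` record terminators in every ``.sdf`` member of a zip archive."""
    counts: Dict[str, int] = {}
    with zipfile.ZipFile(zip_source, mode="r") as zah:
        for info in zah.infolist():
            # skip folders and anything that is not an SDF
            if info.is_dir() or not info.filename.endswith(".sdf"):
                continue
            zfh: BinaryIO
            with zah.open(info) as zfh:  # binary stream of the decompressed member
                counts[info.filename] = sum(1 for line in zfh if line.strip() == b"$$$$")
    return counts


# Toy in-memory archive: the members are placeholders, not valid molblocks
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as zah:
    zah.writestr("a.sdf", "mol1\n$$$$\nmol2\n$$$$\n")
    zah.writestr("b.sdf", "mol3\n$$$$\n")
    zah.writestr("readme.txt", "not a molecule file\n")
buffer.seek(0)
print(count_sdf_records(buffer))
```

<p>In real use, each <code>zfh</code> would presumably be handed to <code>Chem.ForwardSDMolSupplier</code> as in the single-file snippet, rather than counted line by line.</p>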
<h2 style="text-align: left;"><span>GNU Zip files</span></h2><p>Gunzip is a comically named command, but let's put that down.</p>
<pre><code>import gzip
from rdkit import Chem
from typing import BinaryIO
with gzip.open('👾👾👾.sdf.gz', mode="r") as zfh:
    with Chem.ForwardSDMolSupplier(zfh) as sdfh:
        mol: Chem.Mol
        for mol in sdfh:
            ...
</code></pre>
<h2 style="text-align: left;"><span>Writing</span></h2>
<p>In RDKit, <code>Chem.SDWriter</code> can accept text streams. So the command would be <code>gzip.open('👾👾👾.sdf.gz', mode="wt")</code>.
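</p>
<p>What the <code>"wt"</code> mode does can be checked without RDKit. The sketch below is only an illustration: the filename is hypothetical and the records are placeholder text standing in for whatever a writer would emit. It shows that the handle is a text stream that accepts <code>str</code> and that the content survives a round trip.</p>

```python
import gzip
import io
import os
import tempfile

# Placeholder SDF-like text; a real writer would produce the molblocks
records = ["mol1\n$$$$\n", "mol2\n$$$$\n"]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.sdf.gz")  # hypothetical filename
    # mode="wt" wraps the compressed binary stream in a text layer
    with gzip.open(path, mode="wt", encoding="utf-8") as zfh:
        is_text_stream = isinstance(zfh, io.TextIOBase)
        for record in records:
            zfh.write(record)  # str, not bytes
    # mode="rt" decodes the text on the way back out
    with gzip.open(path, mode="rt", encoding="utf-8") as zfh:
        roundtrip = zfh.read()

print(is_text_stream, roundtrip == "".join(records))
```

<p>With RDKit, the <code>zfh</code> opened for writing is what would be passed to <code>Chem.SDWriter</code>.</p>
<p>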
</p><h2 style="text-align: left;"><span>Bash</span></h2>
<p>One limitation is that bashfu gets a bit more complicated. <code>zcat</code> replaces <code>cat</code>, but for <code>head</code> one has to pipe the decompressed stream: <code>gunzip -c 👾👾👾.gz | head -n 🤖</code>.
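</p>
<p>The same peek can be done from Python. This is a minimal sketch with a hypothetical helper, <code>gz_head</code>, and a throwaway file written just for the demonstration; it reads only as many lines as requested rather than the whole file.</p>

```python
import gzip
import os
import tempfile
from itertools import islice
from typing import List


def gz_head(path: str, n: int = 10) -> List[str]:
    """Return the first ``n`` lines of a gzipped text file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as zfh:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in islice(zfh, n)]


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt.gz")  # hypothetical file
    with gzip.open(path, mode="wt", encoding="utf-8") as zfh:
        zfh.write("\n".join(f"line {i}" for i in range(100)))
    first_three = gz_head(path, 3)

print(first_three)
```

<p>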
</p><p></p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-17141053793543411032023-07-02T02:58:00.005-07:002023-07-21T11:25:52.237-07:00A note on PLIP interactions<p><a href="https://github.com/pharmai/plip" target="_blank">PLIP</a> is a handy tool to enumerate the interactions of a given ligand.
However, a few of the tripping points I keep hitting are related to the fact that the interactions are namedtuples. Here are some notes to circumvent the traps.
</p><span><a name='more'></a></span>
<h2>No SMILES and no connectivity?</h2>
<p>One thing to note is that PLIP does not accept a SMILES. It runs <code>OBMol.PerceiveBondOrders</code> to determine the bond orders (<code>OEDetermineConnectivity</code> and <code>OEPerceiveBondOrders</code> are the OpenEye equivalents), after first determining the connectivity if absent.<br/>
This is far from ideal for compounds that are far from ideal in the first place. As a result, generating a protonated structure with the correct bond orders encoded by repeated CONECT lines is preferable in my opinion (<a href="https://github.com/matteoferla/Fragment-hit-follow-up-chemistry/blob/main/notebooks/plip_initials.ipynb">example script</a>).<br>
To tell PLIP you have protons already, use:</p>
<pre><code>from plip.basic import config
config.NOHYDRO = True</code></pre>
<p><b>EDIT</b>. Something funky goes on behind the scenes, so it might be best to remove the hydrogens and accept any errors from incorrectly perceived bond orders.</p><br>
<pre><code>from rdkit import Chem
from rdkit.Chem import AllChem
from openbabel import pybel
from plip.basic import config
from plip.structure.preparation import PDBComplex
config.NOHYDRO = True
mol = AllChem.AddHs(Chem.MolFromSmiles('c1ccccc1'))
AllChem.EmbedMolecule(mol)
block = Chem.MolToPDBBlock(mol) # rubbish CONECT is flavor=8, but it's still okay
pb_mol = pybel.readstring('pdb', block, {'s': None})
print([a.type for a in pb_mol.atoms])
pb_mol.OBMol.PerceiveBondOrders()
print([a.type for a in pb_mol.atoms])
p = PDBComplex()
p.load_pdb(block, as_string=True)
print('this is wrong!')
print([a.type for a in p.ligands[0].mol.atoms])</code></pre>
<h2>These are not the interactions you are looking for</h2>
<p>
It is easy to trip up with namedtuples. The <code>collections.namedtuple</code> is a class factory, a function that returns a class. This returned class is a subclass of <code>tuple</code> (a class), not of <code>collections.namedtuple</code>, which is a function, so subclassing it would be impossible. It is far from the labyrinth of madness of Boost types, but I do trip up on this.</p>
<pre><code>from collections import namedtuple
from types import FunctionType
Foo = namedtuple('Foo', ['a', 'b', 'c'])
foo = Foo(1,2,3)
assert isinstance(foo, Foo)
assert isinstance(namedtuple, FunctionType)
assert isinstance(Foo, type) # The type of a class is type
print(Foo.__mro__) # ('__main__.Foo','tuple','object')
assert isinstance(tuple, type)
</code></pre>
<p>The name given to namedtuple becomes <code>__name__</code>, which is not used for isinstance checking.</p>
<pre><code>Foo = namedtuple('Foo', ['a', 'b', 'c'])
FauxFoo = namedtuple('Foo', ['a', 'b', 'c'])
isinstance(Foo(1,2,3), FauxFoo), Foo.__name__, FauxFoo.__name__ # (False, 'Foo', 'Foo')</code></pre>
<p>This is obvious, but is a problem as the interaction classes are not exposed, cf. <a href="https://github.com/pharmai/plip/blob/1fced62c8aeacfaf41008624579bd241e0a3a271/plip/structure/detection.py">code</a>.<br>
For the sake of sanity here are the relevant lines.</p>
<pre><code>
from collections import namedtuple
hydroph_interaction = namedtuple('hydroph_interaction', 'bsatom bsatom_orig_idx ligatom ligatom_orig_idx '
'distance restype resnr reschain restype_l, resnr_l, reschain_l')
hbond = namedtuple('hbond', 'a a_orig_idx d d_orig_idx h distance_ah distance_ad angle type protisdon resnr '
'restype reschain resnr_l restype_l reschain_l sidechain atype dtype')
pistack = namedtuple(
'pistack',
'proteinring ligandring distance angle offset type restype resnr reschain restype_l resnr_l reschain_l')
pication = namedtuple(
'pication', 'ring charge distance offset type restype resnr reschain restype_l resnr_l reschain_l protcharged')
saltbridge = namedtuple(
'saltbridge', 'positive negative distance protispos resnr restype reschain resnr_l restype_l reschain_l')
halogenbond = namedtuple('halogenbond', 'acc acc_orig_idx don don_orig_idx distance don_angle acc_angle restype '
'resnr reschain restype_l resnr_l reschain_l donortype acctype sidechain')
waterbridge = namedtuple('waterbridge', 'a a_orig_idx atype d d_orig_idx dtype h water water_orig_idx distance_aw '
'distance_dw d_angle w_angle type resnr restype reschain resnr_l restype_l reschain_l protisdon')
metal_complex = namedtuple('metal_complex', 'metal metal_orig_idx metal_type target target_orig_idx target_type '
'coordination_num distance resnr restype '
'reschain restype_l reschain_l resnr_l location rms, geometry num_partners complexnum')
itxn_names = ('hydroph_interaction', 'hbond', 'pistack', 'pication', 'saltbridge', 'halogenbond', 'waterbridge', 'metal_complex')
itxn_classes = (hydroph_interaction, hbond, pistack, pication, saltbridge, halogenbond, waterbridge, metal_complex)
</code></pre>
<p>However, I cannot check if something is an instance of any of these, as the classes are redeclared on each call (i.e. the class of an interaction from one call of <code>plip.structure.preparation.PDBComplex.analyze</code> will be different from that of another call, but will be identical within a call). There are four options as a result: (a) override the builtin <code>isinstance</code> to check by name (a cosmic no-no), (b) rummage through the garbage collector (a big no-no), (c) subclass the namedtuple-generated class and give it a custom metaclass with a custom <code>__instancecheck__</code> dunder method that compares the <code>__name__</code> not the memory address (so faffy), or (d) don't touch <code>isinstance</code> and have a function that does the check, without summoning Cthulhu in the process:</p>
<pre><code>from typing import Union

def isintxn(instance, cls_or_name: Union[type, str]) -> bool:
    """
    Is ``instance`` an instance of a class with the same name as ``cls_or_name``?
    """
    if isinstance(cls_or_name, str):
        name: str = cls_or_name
    else:
        name = cls_or_name.__name__
    return instance.__class__.__name__ == name</code></pre>
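<p>To see why the name-based check helps, here is a small self-contained sketch. The function <code>fake_analysis</code> is a hypothetical stand-in for a call that redeclares its namedtuple class every time (it is not PLIP code), and <code>isintxn</code> is restated in condensed form so the block runs on its own.</p>

```python
from collections import namedtuple
from typing import Union


def isintxn(instance, cls_or_name: Union[type, str]) -> bool:
    """Name-based check, restated from the snippet above so this block is standalone."""
    name = cls_or_name if isinstance(cls_or_name, str) else cls_or_name.__name__
    return type(instance).__name__ == name


def fake_analysis():
    """Hypothetical stand-in: redeclares the namedtuple class on every call."""
    hbond = namedtuple('hbond', 'distance_ad angle')
    return hbond(distance_ad=2.9, angle=165.0)


first, second = fake_analysis(), fake_analysis()
same_class = isinstance(first, type(second))  # compares class objects, which differ per call
same_name = isintxn(first, type(second))      # compares names, both are 'hbond'
by_string = isintxn(first, 'hbond')
print(same_class, same_name, by_string)
```

<p>The caveat of this design is that any class that happens to be called <code>hbond</code> will pass the check, which is acceptable here as the names are unique within PLIP's interaction types listed above.</p>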
<h2>Is it dist or distance or distance_l?</h2>
<p>Each interaction seems to have slightly different fields. Here are the fields:</p>
<pre><code>import pandas as pd
key_tally = pd.DataFrame([{'name': cls.__name__, **{f: True for f in cls._fields}} for cls in itxn_classes]).fillna(False).set_index('name')
prefered_order = key_tally.sum().sort_values(ascending=False).index.to_list()
key_tally = key_tally[prefered_order]
key_tally.transpose()</code></pre>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>name</th>
<th>hydroph_interaction</th>
<th>hbond</th>
<th>pistack</th>
<th>pication</th>
<th>saltbridge</th>
<th>halogenbond</th>
<th>waterbridge</th>
<th>metal_complex</th>
</tr>
</thead>
<tbody>
<tr>
<th>restype</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>resnr</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>reschain</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>restype_l</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>resnr_l</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>reschain_l</th>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
</tr>
<tr>
<th>distance</th>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>type</th>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>h</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>offset</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>dtype</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>atype</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>sidechain</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>protisdon</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>angle</th>
<td>False</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>d_orig_idx</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>d</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>a_orig_idx</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>a</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>distance_aw</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>distance_dw</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>metal_type</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>d_angle</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>water_orig_idx</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>w_angle</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>metal</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>water</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>metal_orig_idx</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>bsatom</th>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>target</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>target_orig_idx</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>target_type</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>coordination_num</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>donortype</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>location</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>rms</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>geometry</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>num_partners</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>acctype</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>protcharged</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>acc_angle</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>charge</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>ligatom</th>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>ligatom_orig_idx</th>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>distance_ah</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>distance_ad</th>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>proteinring</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>ligandring</th>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>ring</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>bsatom_orig_idx</th>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>don_angle</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>positive</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>negative</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>protispos</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>acc</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>acc_orig_idx</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>don</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>don_orig_idx</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
</tr>
<tr>
<th>complexnum</th>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
</tr>
</tbody>
</table>
<h2>Get atom name</h2>
<p>Atom names are a problem throughout compchem (cf. <a href="https://blog.matteoferla.com/2020/03/atom-names-purely-in-rdkit.html">past post</a>) and, bar saving them in the <code>molFileAlias</code> atom property of a mol file, they are easily lost. PLIP does not care for them, but they can be rescued (<a href="https://github.com/matteoferla/PLIP-PyRosetta-hotspots-test/blob/main/plipspots_docking/plipspots/serial.py">code from my PLIPspots repo</a>):</p>
<pre><code>from typing import Union
from openbabel.pybel import Atom, Residue
from openbabel.pybel import ob

def get_atomname(atom: Union[Atom, ob.OBAtom]) -> str:
    """
    Given an atom, return its name.
    """
    if isinstance(atom, Atom):
        res: ob.OBResidue = atom.residue.OBResidue
        obatom = atom.OBAtom
    elif isinstance(atom, ob.OBAtom):
        obatom: ob.OBAtom = atom
        res: ob.OBResidue = obatom.GetResidue()
    else:
        raise TypeError
    return res.GetAtomID(obatom)

def get_atom_by_atomname(residue: Union[ob.OBResidue, Residue], atomname: str) -> ob.OBAtom:
    """
    Get an atom by its name in a residue.
    """
    if isinstance(residue, Residue):
        residue = residue.OBResidue
    obatom: ob.OBAtom
    for obatom in ob.OBResidueAtomIter(residue):
        if residue.GetAtomID(obatom).strip() == atomname:
            return obatom
    else:
        raise ValueError(f'No atom with name {atomname} in residue {residue.GetName()}')
</code></pre>
<p>I am not going to go into the rabbit hole of <code>Atom</code> vs. <code>ob.OBAtom</code> as it's not a rabbit hole, but an eldritch gate directly to R'lyeh...</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-44221217294166098592023-03-05T02:06:00.005-08:002023-03-05T03:50:36.258-08:007 colour electronic paper<p>For Christmas I received a 5.65" seven-colour e-paper display, which is awesome. The catch, as with everything Raspberry Pi or Arduino, is that beyond the gloss of the advert lies something far from a flexible plug-and-play system. I enjoyed my voyage, but it was rather odd even if typical of a Raspberry Pi project.</p><a name='more'></a>
<p>I was given a 5.65" Waveshare e-ink/e-paper display. It's a nice size, although I am not sure why it's point 65: weird things in freedom units like UNC screws come in ratios like 2/3 (which is .67), yet no ratio seems to match, and converted into metric it is 143.51 mm. But that is the least of the issues. The problems come from the data processing side.</p>
<h3>Module</h3>
<p>Waveshare provides a Python repo with demo scripts, <a href="https://github.com/waveshare/e-Paper" target="_blank">waveshare/e-Paper</a>, as does <a href="https://github.com/pimoroni/inky" target="_blank">Pimoroni inky</a>. I have the former, as after all it is cheaper, being from China and not Sheffield unfortunately, so I cannot comment much on the latter, which is written far better but seems to have the same issues. Neither module is pip installable; given how CircuitPython modules are moved to the lib folder on a Pi Nano, one would expect that is what is going on, but no.</p>
<h3>Plan A: Pillow vs. Pi Nano</h3>
<p>I put this in a picture frame with a Pi Nano on its back: I remove it from the wall, connect the Pi Nano to my laptop via a micro-USB cable and change the image.<br />
The Python script from Waveshare requires Pillow. The Pillow requirement runs deep throughout the code and is not a mockable modular affair. Pillow is a no-go on the Pi Nano, and for that matter NumPy on a Pi Nano is also a red flag, which means that the code has to run on a system that can run Pillow. Pillow works on a Pi Zero, so hello plan B.</p>
<h3>Plan B: Floyd-Steinberg dithering algorithm vs. Pi Zero</h3>
<p>I put this in a picture frame with a Pi Zero W on its back: I remove it from the wall and connect it to a power source, go to my laptop and upload the image I want for full processing.<br />
Pillow and NumPy can be installed on a Pi Zero. There is an annoyance in that one of the modules is released as a universal Arm-architecture wheel, compiled for Arm7, yet the Pi Zero requires Arm6 compiled code, resulting in a segmentation-fault-like error on import. <code>ARCHFLAGS='-arch arm6' python3 -m pip install pillow</code> avoids this.<br />
The image needs to be made into a 7-colour image. The 7 colours are on or off with no gradient, so the images need to be in a style like a Roy Lichtenstein artwork with nice <a href="https://en.wikipedia.org/wiki/Ben_Day_process" target="_blank">Ben Day dots</a>. The scant documentation describes lowering the image to 7 colours via the <a href="https://en.wikipedia.org/wiki/Floyd%E2%80%93Steinberg_dithering">Floyd-Steinberg dithering algorithm</a> and suggests using their Windows exe to do so. As a Linux user, I ignored this and sought a Python implementation. A simple example is found in a <a href="https://scipython.com/blog/floyd-steinberg-dithering/" target="_blank">SciPython blog post</a>, which goes through the basics and has a commented-out alternative function that addresses the case when one has a palette (uninformatively called <code>p</code> in that post). There is also <a href="https://gist.github.com/bzamecnik/33e10b13aae34358c16d1b6c69e89b01" target="_blank">a post on GitHub Gist</a> wherein <code>numba.jit</code> is used to speed up the calculation to milliseconds. However, whereas Numba is awesome, its antipathy towards functions and other simple things makes it painful to implement, plus it does not work on a Pi Zero in my hands anyway, so I will give it a miss.<br />
The Waveshare module provided uses <code>Image.quantize()</code> upon a <code>Image.putpalette(...)</code> palette, which actually does exactly this, so my coding adventure was utterly unnecessary. One step that is necessary is scaling the image:</p>
<pre><code>from warnings import warn
from PIL import Image

def scale(image: Image.Image, target_width=600, target_height=448) -> Image.Image:
    """
    Given a Pillow image and the two target dimensions, scale it,
    cropping centrally if required.
    """
    width, height = image.size
    if height/width < target_height/target_width:
        warn('too wide: cropping')
        new_height = target_height
        new_width = int(width * new_height / height)
    else:
        warn('too tall: cropping')
        new_width = target_width
        new_height = int(height * new_width / width)
    print(height, width, target_height, target_width, new_height, new_width)
    # Image.ANTIALIAS is deprecated --> Image.Resampling.LANCZOS
    # but a fresh install of pillow via ``ARCHFLAGS='-arch arm6' python3 -m pip install pillow``
    # yielded 8.1.2 as of 26/02/23
    ANTIALIAS = Image.Resampling.LANCZOS if hasattr(Image, 'Resampling') else Image.ANTIALIAS
    img = image.resize((new_width, new_height), ANTIALIAS)
    # (left, top, right, bottom)
    half_width_delta = (new_width - target_width) // 2
    half_height_delta = (new_height - target_height) // 2
    img = img.crop((half_width_delta, half_height_delta,
                    half_width_delta + target_width, half_height_delta + target_height
                    ))
    return img
</code></pre>
<p><code>Image.crop</code> accepts negative numbers too in which case the image gets padded, making it a viable alternative solution.</p>
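<p>The arithmetic in <code>scale</code> can be checked without Pillow. What follows is only a minimal sketch: <code>cover_geometry</code> is a hypothetical helper (it is not in the Waveshare module) that follows the same two branches and returns the resize dimensions and the central crop box.</p>

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def cover_geometry(width: int, height: int,
                   target_width: int = 600,
                   target_height: int = 448) -> Tuple[Tuple[int, int], Box]:
    """Return ((new_width, new_height), (left, top, right, bottom))."""
    if height / width < target_height / target_width:
        # too wide: match the height, the sides get cropped
        new_height = target_height
        new_width = int(width * new_height / height)
    else:
        # too tall (or exact): match the width, top and bottom get cropped
        new_width = target_width
        new_height = int(height * new_width / width)
    left = (new_width - target_width) // 2
    top = (new_height - target_height) // 2
    return (new_width, new_height), (left, top, left + target_width, top + target_height)


# a 4:3 photo is marginally too tall for 600x448
print(cover_geometry(4000, 3000))
# a 16:9 photo is too wide
print(cover_geometry(1600, 900))
```

<p>For a 4:3 photo only a single row of pixels is trimmed from the top and bottom, which is why the crop is barely noticeable for camera images.</p>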
<p>A classic consolation statement is always, at least it was didactic... In my case I am not sure I learnt anything in modifying the code for the Floyd-Steinberg dithering algorithm as it felt very unnumpyish iterating pixel by pixel. Here is my code which is mostly the same bar for typehints.</p>
<pre><code>import numpy as np
import numpy.typing as npt
from typing import Iterator
import operator
from PIL import Image
# numba does not work on my Pi Zero, hence the import and the decorator are commented out
# from numba import jit

colors = [
    {"name": "black", "hex": "#000000", "rgb": (0, 0, 0)},
    {"name": "white", "hex": "#FFFFFF", "rgb": (255, 255, 255)},
    {"name": "green", "hex": "#008000", "rgb": (0, 255, 0)},
    {"name": "blue", "hex": "#0000FF", "rgb": (0, 0, 255)},
    {"name": "red", "hex": "#FF0000", "rgb": (255, 0, 0)},
    {"name": "yellow", "hex": "#FFFF00", "rgb": (255, 255, 0)},
    {"name": "orange", "hex": "#FFA500", "rgb": (255, 165, 0)}
]
_c: Iterator = map(operator.itemgetter('rgb'), colors)
color_palette: npt.NDArray[np.float64] = np.array(list(map(list, _c))) / 255
RGBTriplet = npt.NDArray[np.float64]  # These will have 3 elements

def get_new_val(old_val: RGBTriplet) -> RGBTriplet:
    # index of the palette colour closest (squared Euclidean distance) to the pixel
    idx: int = np.argmin(np.sum((old_val[None,:] - color_palette)**2, axis=1))
    return color_palette[idx]

#@jit(nopython=True)
def _fs_inner(pixels: npt.NDArray[np.float64]) -> None:
    """
    ``pixels`` gets modified in place
    """
    height, width, _ = pixels.shape
    for ir in range(height):
        for ic in range(width):
            old_val: RGBTriplet = pixels[ir, ic].copy()
            idx: int = np.argmin(np.sum((old_val[None,:] - color_palette)**2, axis=1))
            new_val: RGBTriplet = color_palette[idx]
            pixels[ir, ic] = new_val
            # spread the quantisation error over the not yet visited neighbours
            err: RGBTriplet = old_val - new_val
            if ic < width - 1:
                pixels[ir, ic+1] += err * 7/16
            if ir < height - 1:
                if ic > 0:
                    pixels[ir+1, ic-1] += err * 3/16
                pixels[ir+1, ic] += err * 5/16
                if ic < width - 1:
                    pixels[ir+1, ic+1] += err / 16

def fs_dither(img: Image.Image) -> Image.Image:
    """
    Floyd-Steinberg dither the image img into a palette with colors specified
    in a global variable ``color_palette``.

    .. code-block:: python

        dithered_image: Image = fs_dither(scaled_image)
    """
    pixels = np.array(img, dtype=float) / 255
    _fs_inner(pixels)
    corr_pixels = np.array(pixels/np.max(pixels, axis=(0,1)) * 255, dtype=np.uint8)
    return Image.fromarray(corr_pixels)</code></pre>
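<p>To see the error diffusion at work without NumPy or Pillow, here is a toy version on a greyscale grid with only two levels. It is a sketch for illustration, not the code run on the Pi: <code>fs_dither_gray</code> and its <code>levels</code> argument are made-up names, and ties are resolved in favour of the first level listed.</p>

```python
from typing import List, Sequence


def fs_dither_gray(rows: Sequence[Sequence[float]],
                   levels: Sequence[float] = (0.0, 1.0)) -> List[List[float]]:
    """Floyd-Steinberg dither a grid of floats to the nearest of ``levels``."""
    pixels = [[float(value) for value in row] for row in rows]
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for ir in range(height):
        for ic in range(width):
            old_val = pixels[ir][ic]
            # nearest allowed level (first one wins a tie)
            new_val = min(levels, key=lambda level: abs(level - old_val))
            pixels[ir][ic] = new_val
            err = old_val - new_val
            # same 7/16, 3/16, 5/16, 1/16 weights as in the NumPy version
            if ic < width - 1:
                pixels[ir][ic + 1] += err * 7 / 16
            if ir < height - 1:
                if ic > 0:
                    pixels[ir + 1][ic - 1] += err * 3 / 16
                pixels[ir + 1][ic] += err * 5 / 16
                if ic < width - 1:
                    pixels[ir + 1][ic + 1] += err / 16
    return pixels


# a flat mid-grey row becomes alternating black and white
print(fs_dither_gray([[0.5, 0.5, 0.5, 0.5]]))  # [[0.0, 1.0, 0.0, 1.0]]
```

<p>The mid-grey row ends up half black and half white, i.e. the average brightness is preserved, which is the whole point of pushing the rounding error onto the neighbours.</p>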
<p>Traditionally in image processing a picture of a lady with a hat, known as Lenna (cropped from a Playboy nude), was used; now everyone has their own standard. Here I will use the image I want to display, Altas in the Yorkshire Dales.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6P7L7mdh8ozwg73NqbGRmTsRGdWOoB-YiCU7HV4NSjbXAE_vX0qXxaIa8r90SN6PUe3AAPUuxi4e1o8esmADOnV1PgCcxRmmd4Miwi1ngAA1qxM7cu-z-UmyZPPE7jS44JcVSydsgSZVFdX2Smi0GAFN2J1Lzp2yx8bA9PbXReD-sg5UVVGBGCSWy/s600/scaled.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="448" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6P7L7mdh8ozwg73NqbGRmTsRGdWOoB-YiCU7HV4NSjbXAE_vX0qXxaIa8r90SN6PUe3AAPUuxi4e1o8esmADOnV1PgCcxRmmd4Miwi1ngAA1qxM7cu-z-UmyZPPE7jS44JcVSydsgSZVFdX2Smi0GAFN2J1Lzp2yx8bA9PbXReD-sg5UVVGBGCSWy/s320/scaled.png" width="320" /></a></div>
<p>This is the image after my code:</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWnjVQ4yo2ffAWHNKdJXuX8Ej6VvKPQS7g_eIHhmrOnDFJ_h1OIJ1js8m_F5NUA62sOZdJR_-qMUfbFZtXgzbnm-VIjHfc-ShkW7_d_MplZ5GXqvQmsP7edkk1p5ybR7YRh4e0EoW4WuhdtQhCWmQOnURhdeCCr9wbhwhEBZZcg5wIRJeuYp7IO05Z/s600/dithered.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="448" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWnjVQ4yo2ffAWHNKdJXuX8Ej6VvKPQS7g_eIHhmrOnDFJ_h1OIJ1js8m_F5NUA62sOZdJR_-qMUfbFZtXgzbnm-VIjHfc-ShkW7_d_MplZ5GXqvQmsP7edkk1p5ybR7YRh4e0EoW4WuhdtQhCWmQOnURhdeCCr9wbhwhEBZZcg5wIRJeuYp7IO05Z/s320/dithered.png" width="320" /></a></div>
<p>Whereas using the Pillow quantize method I get the following:</p>
<pre><code>pal_image = Image.new("P", (1,1))
pal_image.putpalette( (0,0,0, 255,255,255, 0,255,0, 0,0,255, 255,0,0, 255,255,0, 255,128,0) + (0,0,0)*249)
quanti = scaled_image.convert("RGB").quantize(palette=pal_image)
</code></pre>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe5UtO2WxDscZKGKTT0aSD5onEiXH2dKBywgSndBOOdZTbWI1xqrpZztuza9_DMEqDy8Mi2mu7en6LgMhqxM6-nUG5MWZVuGS17o3eerp0M1yXk7kNDSt0L5RLijYYeKbLthZDI7UMDxXnBxrUVHddhZAwDdagk1IAAcaQoPms1tU1DDsmWljGsfx_/s600/quanti.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="448" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe5UtO2WxDscZKGKTT0aSD5onEiXH2dKBywgSndBOOdZTbWI1xqrpZztuza9_DMEqDy8Mi2mu7en6LgMhqxM6-nUG5MWZVuGS17o3eerp0M1yXk7kNDSt0L5RLijYYeKbLthZDI7UMDxXnBxrUVHddhZAwDdagk1IAAcaQoPms1tU1DDsmWljGsfx_/s320/quanti.png" width="320" /></a></div>
<p>The Pillow method is actually better: my code is very slow and takes 10 seconds, whereas the Pillow method takes under a second. So I had best use that!</p>
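<p>The long tuple passed to <code>putpalette</code> is simply the seven RGB triplets flattened and padded with black up to 256 entries (768 values). A small sketch of how it could be built rather than typed out; <code>flat_palette</code> is a hypothetical helper name of mine, not a Pillow function.</p>

```python
from typing import List, Sequence, Tuple


def flat_palette(triplets: Sequence[Tuple[int, int, int]], size: int = 256) -> List[int]:
    """Flatten RGB triplets and pad with black to ``size`` palette entries."""
    if len(triplets) > size:
        raise ValueError(f'A palette holds at most {size} colours')
    flat = [channel for triplet in triplets for channel in triplet]
    flat += [0, 0, 0] * (size - len(triplets))
    return flat


seven_colours = [(0, 0, 0), (255, 255, 255), (0, 255, 0), (0, 0, 255),
                 (255, 0, 0), (255, 255, 0), (255, 128, 0)]
palette = flat_palette(seven_colours)
print(len(palette))  # 768
```

<p>The resulting list has the same values as the tuple in the snippet above, so it could be handed to <code>pal_image.putpalette(palette)</code> instead.</p>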
<h3>Parenthesis on colours</h3>
<p>In the above I copied the palette RGB triplets from the Waveshare module, where orange is full red channel (255) and half green channel (128), whereas in the enum-like value <code>epd.ORANGE</code> the green channel is meant to be 168. I have not tweaked the code to test whether the colours are better on the display with the latter scheme. The colour scheme is curious anyway as it has black, white, the three additive primary colours (green, blue, red) and yellow and orange. Yellow, cyan and magenta are secondary colours, but only the first is present. Orange is a tertiary colour. The scheme is more akin to the historical <a href="https://en.wikipedia.org/wiki/RYB_color_model" target="_blank">RYB colour model</a> (red, yellow, blue => orange, green and purple), but without purple.</p>
<pre><code>colors = [epd.BLACK, epd.ORANGE, epd.GREEN, epd.BLUE, epd.RED, epd.YELLOW]
print(', '.join(map('{0:0>6x}'.format, colors)))</code></pre>
<p>This gives '000000, 0080ff, 00ff00, ff0000, 0000ff, 00ffff'. Rather oddly, this is blue-green-red and not RGB, and it is not an endianness issue.</p>
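<p>Reading the printed values as blue-green-red packed integers gives back sensible RGB triplets. A minimal sketch, assuming the 0xBBGGRR packing inferred from the output above; <code>bgr_int_to_rgb</code> is a name I made up for the illustration.</p>

```python
from typing import Tuple


def bgr_int_to_rgb(value: int) -> Tuple[int, int, int]:
    """Unpack a 0xBBGGRR integer into an (R, G, B) triplet."""
    blue = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    red = value & 0xFF
    return red, green, blue


# the hex strings printed above, in the same order
names = ['black', 'orange', 'green', 'blue', 'red', 'yellow']
printed = ['000000', '0080ff', '00ff00', 'ff0000', '0000ff', '00ffff']
for name, text in zip(names, printed):
    print(name, bgr_int_to_rgb(int(text, 16)))
```

<p>Orange comes out as (255, 128, 0), i.e. the same triplet used in the palette passed to <code>putpalette</code>.</p>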
<pre><code>from IPython.display import display, HTML
colors = [
    {"name": "black", "hex": "#000000", "rgb": (0, 0, 0)},
    {"name": "white", "hex": "#FFFFFF", "rgb": (255, 255, 255)},
    {"name": "green", "hex": "#008000", "rgb": (0, 255, 0)},
    {"name": "blue", "hex": "#0000FF", "rgb": (0, 0, 255)},
    {"name": "red", "hex": "#FF0000", "rgb": (255, 0, 0)},
    {"name": "yellow", "hex": "#FFFF00", "rgb": (255, 255, 0)},
    {"name": "orange", "hex": "#FFA500", "rgb": (255, 165, 0)},
    {"name": "mid-orange", "hex": "#FF8000", "rgb": (255, 128, 0)}
]
to_color_span = lambda color: f'&lt;span style="color:{color["hex"]}"&gt;{color["name"]} ▀ &lt;/span&gt;'
display(HTML('\n'.join(map(to_color_span, colors))))</code></pre>
<div dir="auto"><span style="color:#000000">black ▀ </span>
<span style="color:#FFFFFF">white ▀ </span>
<span style="color:#008000">green ▀ </span>
<span style="color:#0000FF">blue ▀ </span>
<span style="color:#FF0000">red ▀ </span>
<span style="color:#FFFF00">yellow ▀ </span>
<span style="color:#FFA500">orange ▀ </span>
<span style="color:#FF8000">mid-orange ▀ </span></div>
<h3>Plan C: Pi Nano, full Pillow</h3>
<p>One thing that stands out is that the image is rather washed out after the dithering, so boosting the saturation helps:</p><pre><code>from PIL import ImageEnhance
enhanced_image: Image = ImageEnhance.Color(scaled_image).enhance(2)
endithered: Image = enhanced_image.convert("RGB").quantize(palette=pal_image)</code></pre>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggUgQq_I3WU8xhuMRoM2CVQPg53wvAkX4RvdmkdoBKwwsdN4bHyvDrCEvluFRV_u8pSgvpLVIXq9iXaq4FSBFFl94tnqhHOb2Fo_lw32dTuGBPIjHibq69CLu9fYSd_Pzm_zgOREm2b3ocDHIGdMft3ZmuxMmOLn9NX5Ju2-Qiv4vTv7c5b1jjCCsw/s600/enhanced.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="448" data-original-width="600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggUgQq_I3WU8xhuMRoM2CVQPg53wvAkX4RvdmkdoBKwwsdN4bHyvDrCEvluFRV_u8pSgvpLVIXq9iXaq4FSBFFl94tnqhHOb2Fo_lw32dTuGBPIjHibq69CLu9fYSd_Pzm_zgOREm2b3ocDHIGdMft3ZmuxMmOLn9NX5Ju2-Qiv4vTv7c5b1jjCCsw/s320/enhanced.png" width="320" /></a></div>
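<p>My understanding is that <code>ImageEnhance.Color</code> blends each pixel between its greyscale value and its original colour, with factors above 1 extrapolating away from grey. The per-pixel sketch below approximates that with the usual 0.299/0.587/0.114 luma weights; it is an assumption-laden illustration that ignores Pillow's integer rounding, and <code>boost_saturation</code> is a made-up name.</p>

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def boost_saturation(rgb: RGB, factor: float) -> RGB:
    """Move a pixel away from (factor > 1) or towards (factor < 1) its grey value."""
    r, g, b = rgb
    gray = 0.299 * r + 0.587 * g + 0.114 * b  # approximate luma
    # linear blend/extrapolation, clipped to the 0-255 range
    return tuple(max(0, min(255, round(gray + factor * (c - gray)))) for c in rgb)


print(boost_saturation((200, 100, 50), 1))  # unchanged
print(boost_saturation((200, 100, 50), 2))  # channels pushed apart
print(boost_saturation((128, 128, 128), 2))  # greys stay grey
```

<p>Pushing the channels apart moves pixels closer to the pure palette colours, which is presumably why the quantised image looks less washed out.</p>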
<h3>Setup</h3>
<p>As mentioned previously, I set up my Raspberry Pis to serve Jupyter notebooks (<a href="https://github.com/matteoferla/Somewhat-Smart-Home/blob/main/setting_up.md">instructions</a>). Then:</p>
<pre><code>python -m pip install -q pillow waveshare-epaper
raspi-config # enable SPI</code></pre>
<p>A kind user uploaded Waveshare's module to PyPI, which is rather telling...</p>
<pre><code>import logging
import time
from PIL import Image,ImageDraw,ImageFont,ImageEnhance
from warnings import warn
import traceback
# ------------
def scale(image: Image, target_width=600, target_height=448) -> Image:
    ...  # See above
original_image = Image.open('DSC_0168.JPG')
scaled_image = scale(original_image)
scaled_image
# ------------
enhanced_image: Image = ImageEnhance.Color(scaled_image).enhance(3)
pal_image = Image.new("P", (1,1))
pal_image.putpalette( (0,0,0, 255,255,255, 0,255,0, 0,0,255, 255,0,0, 255,255,0, 255,128,0) + (0,0,0)*249)
endithered: Image = enhanced_image.convert("RGB").quantize(palette=pal_image)
endithered
# ------------
import epaper
epd = epaper.epaper('epd5in65f').EPD()
epd.init()
epd.Clear()
epd.Clear() # double tap.
# ------------
epd.display(epd.getbuffer(endithered))
</code></pre>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBGYrxTJ5o5Y0n95T20yqxs9e4hcCmadhrDkpJ2eTlb2W0-s0tj-Qc7AL36mLHAQp1QuqUwFhSzQpQ0gzRY5jnwsfqHW30jbVFFFxIVSbldapHI-zreEfmpsmxeycY23e3822RW913yEYzqbniVE4hULyiF1cIvKagCa60xtnzAz172tecV8A4GEr9/s4032/IMG_1622.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="3024" data-original-width="4032" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBGYrxTJ5o5Y0n95T20yqxs9e4hcCmadhrDkpJ2eTlb2W0-s0tj-Qc7AL36mLHAQp1QuqUwFhSzQpQ0gzRY5jnwsfqHW30jbVFFFxIVSbldapHI-zreEfmpsmxeycY23e3822RW913yEYzqbniVE4hULyiF1cIvKagCa60xtnzAz172tecV8A4GEr9/s320/IMG_1622.jpg"/></a></div>
<h3>Plan D: Pi Nano with L shaped headers</h3>
<p>Well, I have a deep picture frame from Ikea (Ribba) and a sheet of plasticard to hold the frame, however with straight header pins on the Pi Nano it does not work at all as it's thicker than the picture frame... This is not my best thought out project.</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSuZWKs1t_5wcutsBCiEuXcNCCrzbe99iFXAnlpMJBamvXQZoZrfZtXz-5fdaqmfQwlOhdLz7RHGWgyLs04d556Kin5vXzyV_tQ4tgYiDHovwtq1MU0fMhyN6xWifxTw2qUUOIKrzFaa82J3l409BCoR4anqQSZGqGb5jPOG4mRqlHcq4uHUCBcaco/s4032/IMG_1627.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="3024" data-original-width="4032" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSuZWKs1t_5wcutsBCiEuXcNCCrzbe99iFXAnlpMJBamvXQZoZrfZtXz-5fdaqmfQwlOhdLz7RHGWgyLs04d556Kin5vXzyV_tQ4tgYiDHovwtq1MU0fMhyN6xWifxTw2qUUOIKrzFaa82J3l409BCoR4anqQSZGqGb5jPOG4mRqlHcq4uHUCBcaco/s320/IMG_1627.jpg"/></a></div>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-28914882686668662642023-02-18T06:30:00.002-08:002023-02-18T09:49:30.004-08:00Swapped university logo colour generator<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBDQNvfkLUAIHdL3l2wkCZ8-ACJKJtvOD_HKkPjPmXfW_rUAeYZa6_EnUIfb3SZx7NL5b7e4DwiM8vm9MU_fmZhnYQbQNKVa3uWTRpkbiYI0DUBJwd3oWJyoaDHjeJW9vEaK0vXet3hc9NksT2YY0Phf-xYqnBpBe8cRYdirVdeuSczF5RW6PT87m3/s960/heresy.jpg" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="641" data-original-width="960" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBDQNvfkLUAIHdL3l2wkCZ8-ACJKJtvOD_HKkPjPmXfW_rUAeYZa6_EnUIfb3SZx7NL5b7e4DwiM8vm9MU_fmZhnYQbQNKVa3uWTRpkbiYI0DUBJwd3oWJyoaDHjeJW9vEaK0vXet3hc9NksT2YY0Phf-xYqnBpBe8cRYdirVdeuSczF5RW6PT87m3/s320/heresy.jpg" width="320" /></a></div><p>Like many in academia I have moved across a few universities, each with their own colours, blue, gold, grey (I think) and even pine green (yes, like John Deer merch). 
Universities are quite possessive of their logos and have guidelines on their 'brand identity', which feels alien to academia as we are used to logos for tools being made in PowerPoint, if they even have one. One thing that is frowned upon is changing the colours. But that should not stand in the way of the fondness for one's former and present affiliations. Luckily I have written a JS tool to help you swap the colours!</p>
<a name='more'></a>
<p>For the university name you can type (case insensitive) the placename and the suggestion would likely match, whereas alternative names may fail as it is a simple string match (for example 'Oxford University' won't work for 'University of Oxford').</p>
<div id="swapper"></div>
<script>fetch("https://www.matteoferla.com/unicolors/universities.json").then(e=>e.json()).then(e=>(window.universities=e,import("https://www.matteoferla.com/unicolors/university.js"))).then(e=>{window.University=e.University,window.UniCombineColor=e.UniCombineColor,window.combiner=new UniCombineColor(document.getElementById("swapper"))});</script><br />
<h3>Technical details</h3>
<p>This blog post was originally intended for the <a href="https://www.blopig.com/" target="_blank">Blopig blog</a> (the OPIG's blog), but ended up here as I was not able to embed JS or an HTML frame. However, I did write up there <a href="https://www.blopig.com/blog/2023/02/datamining-wikipedia-and-writing-js-with-chatgtp-just-to-swap-the-colours-on-university-logos/" target="_blank">the technical issues and strategies employed in the making of this tool</a>, which, in my opinion, are more interesting than the tool itself. For more see
<a href="https://github.com/matteoferla/wiki-university-colours" target="_blank">The GitHub repository with the code and data</a>.
</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-90775238625551010422023-02-05T09:40:00.001-08:002023-02-05T09:40:13.971-08:00Reading a mmCIF from PyMOL in PyRosetta<p>The mmCIF (PDBx, as in extended PDB) format is meant to replace the PDB format. Soon the RCSB PDB will have to adopt 4-letter codes for novel chemical components, which will break the PDB format. The PDBx format is space separated, as opposed to the really annoying column positions of the PDB format, and in it the metadata can be stored in a nearly sensible manner. However, PDBx is used solely as a deposition format and not really as an analysis format, regardless of what the PDB claims. I personally had to add support for it because a reviewer asked me to. This lack of adoption is often attributed to the "if it ain't broken don't fix it" principle, although I personally would argue that it may be due to how it's implemented: opening a PDBx file written by one program ought to work in another, but this is often not the case. An example of this is PyMOL files read in PyRosetta.</p><span><a name='more'></a></span>
<h4>Format</h4>
<p>In the PDBx format each line is a dictionary key prefixed with an underscore, followed by its space-separated value (single quote marks allow spaces within a value). Tables, instead, are declared with the <code>loop_</code> directive followed by the various headers line by line, followed by the rows of the table, and are closed by... violating the formatting of the table, such as declaring a new entry or EOF.<br />The PDB <code>ATOM</code> entries become an <code>_atom_site</code> <code>loop_</code>:</p>
<pre><code>loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . VAL A 1 3 ? 16.783 48.812 26.447 1.00 30.15 ? 3 VAL A N 1
ATOM 2 C CA . VAL A 1 3 ? 17.591 48.101 25.416 1.00 27.93 ? 3 VAL A CA 1
ATOM 3 C C . VAL A 1 3 ? 16.643 47.160 24.676 1.00 25.52 ? 3 VAL A C 1 </code></pre>
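<p>To make the <code>loop_</code> layout concrete, here is a deliberately naive reader for the first table in a block of text. It is a sketch only: <code>parse_first_loop</code> is a hypothetical function, <code>shlex</code> merely approximates the CIF quoting rules (an unquoted atom name containing a prime would trip it), semicolon-delimited multi-line values are ignored, and the sample is a trimmed-down version of the table above with a made-up trailing item.</p>

```python
import shlex
from typing import Dict, List


def parse_first_loop(text: str) -> List[Dict[str, str]]:
    """Return the rows of the first loop_ table as dictionaries keyed by header."""
    headers: List[str] = []
    rows: List[Dict[str, str]] = []
    in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if not in_loop:
            in_loop = (line == 'loop_')
            continue
        if line in ('', '#') or line.startswith(('loop_', 'data_')):
            break  # blank line, separator or a new block closes the table
        if line.startswith('_'):
            if rows:
                break  # a new item after the rows closes the table
            headers.append(line)
            continue
        values = shlex.split(line)  # honours quoted values with spaces
        if not headers or len(values) != len(headers):
            break  # malformed row: the table is over
        rows.append(dict(zip(headers, values)))
    return rows


sample = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N VAL 16.783 48.812 26.447
ATOM 2 CA VAL 17.591 48.101 25.416
_citation.title 'A made-up title'
"""
atoms = parse_first_loop(sample)
print(len(atoms), atoms[0]['_atom_site.Cartn_x'])
```

<p>Note how nothing in the file says "end of table": the parser only finds out when something that is not a row turns up.</p>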
<p>The dictionary-item definitions are non-trivial to find, beyond the basics (which can be found in the <a href="https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/beginner%E2%80%99s-guide-to-pdb-structures-and-the-pdbx-mmcif-format" target="_blank">pdb101 help page</a>), which roughly cover the mandatory fields.<br />However, say I want to add a SMILES of a chemical component and add a post-it like comment for a future self, what would I do? Honestly, making a separate text file or adding a # comment would be as constructive as following the rules. There is no official <code>_chem_comp</code> subfield called ".smiles" or ".inchi" or ".systematic_name" and instead a <code>loop_</code> table is used with <code>_pdbx_chem_comp_descriptor.comp_id</code> for the three letter code, <code>_pdbx_chem_comp_descriptor.type</code> for the word SMILES or InChI or systematic name and <code>_pdbx_chem_comp_descriptor.descriptor</code> for the actual value. In terms of comments, a <code>loop_</code> with <code>_database_PDB_remark.text</code> is AFAIK the best option, despite actually being a backwards compatibility hack (in a traditional PDB, <code>REMARK</code> records are messily classified notes).</p>
<p>A nice addition is <code>_pdbx_sifts_xref_db</code>, which allows the addition of per-residue annotations such as Pfam domains (not ranges, so it's a bit verbose).</p>
<h4>Tools</h4>
<p>For something simple, i.e. reading a file, there is a <a href="http://mmcif.rcsb.org/docs/software-resources.html" target="_blank">wealth of tools listed in the PDB help pages</a>. This is normally a cause for alarm bells: the wheel reinvention happens because either the solution was hidden or it was broken. Signs point to the latter case: the provided C++ library, <code>CIFPARSE-OBJ</code>, is rather cryptic and rough around the edges. PyRosetta uses it in reading mmCIF files and it carries over the glitch that if the entry ends in a <code>loop_</code> the EOF does not trigger it to complete, but instead the table is lost.</p>
<p>A Python library that is nice to use and does not suffer from this glitch is <code>pdbx</code>.</p>
<pre><code>from typing import List
import pandas as pd
import pdbx

with open(filename, 'r') as fh:
    dcs: List[pdbx.DataContainer] = pdbx.load(fh)
pdbblock: str = pdbx.dumps(dcs)
dc: pdbx.DataContainer = dcs[0]
items: List[str] = dc.get_object_name_list()
cat: pdbx.Category = dc.get_object('entity_poly')
print(cat.name, cat.row_count)
pd.DataFrame(cat.row_list, columns=cat.attribute_list)
</code></pre>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>entity_id</th>
<th>type</th>
<th>nstd_linkage</th>
<th>nstd_monomer</th>
<th>pdbx_seq_one_letter_code</th>
<th>pdbx_seq_one_letter_code_can</th>
<th>pdbx_strand_id</th>
<th>pdbx_target_identifier</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>polypeptide(L)</td>
<td>no</td>
<td>no</td>
<td>AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS\nGFRNSDRILYSSDWLIYKTTDHYQTFTKIR</td>
<td>AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS\nGFRNSDRILYSSDWLIYKTTDHYQTFTKIR</td>
<td>A,B,C</td>
<td>None</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>polypeptide(L)</td>
<td>no</td>
<td>no</td>
<td>KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE\nGADITIILS</td>
<td>KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE\nGADITIILS</td>
<td>D,E,F</td>
<td>None</td>
</tr>
</tbody>
</table>
<p>A file generally has a single entry (<code>data_</code> in PDBx, <code>MODEL</code> in old PDB), but it can have multiple, hence the list business. The last step shows how useful the <code>loop_</code> table can be when used in a civilised manner.</p>
<h4>PyMOL</h4>
<p>When saving a file from PyMOL, most details bar the atomic coordinates get lost, in both PDB and PDBx formats. This includes the secondary structure information (<code>HELIX</code> &amp; <code>SHEET</code> in PDB and <code>_struct_conf</code> in PDBx) and connections beyond the standard peptide backbone, such as isopeptide bonds, disulfide bonds and crosslinks to ligands (<code>LINK</code> &amp; <code>SSBOND</code> in PDB and <code>_struct_conn</code> in PDBx). This is not great, as one needs to append this information to make certain usages work (such as showing these in NGL).</p>
<p>This issue combines with the former one: when opening a PyMOL-saved mmCIF file, PyRosetta will fail to load the coordinates and, as the output suggests when it is not muted, appending a dummy extra item will fix it.</p>
<pre><code>
import pymol2
filename = '1brs.original.cif'  # 1BRS is the barnase-barstar complex
roundname = filename.replace('.original', '.round')
with pymol2.PyMOL() as pymol:
    pymol.cmd.set("cif_keepinmemory")  # does not affect save
    pymol.cmd.load(filename)
    pymol.cmd.save(roundname)
# append a dummy extra item so the file does not end in a loop_
with open(roundname, 'a') as fh:
    fh.write('\n_citation.title ""\n')
import pyrosetta
pyrosetta.init(extra_options='-mute all')
from types import ModuleType
prc: ModuleType = pyrosetta.rosetta.core
pose = pyrosetta.Pose()
prc.import_pose.pose_from_file(pose, filename, False, pyrosetta.rosetta.core.import_pose.FileType.CIF_file)
pose.sequence(), pose.total_residue()
</code></pre>
<p>In the docs, searching for mmCIF may lead one to find either</p>
<pre><code>chemical.mmCIF.mmCIFParser()</code></pre>
<p>(which is for chemical components) or</p>
<pre><code>
oc: pyrosetta.rosetta.utility.options.OptionCollection = pyrosetta.rosetta.basic.options.initialize()
sfr_opts = prc.io.StructFileReaderOptions(oc)
prc.io.mmcif.create_sfr_from_cif_file_op(❓, sfr_opts)</code></pre>
<p>(which requires a C++ CIFfile object to be passed, which is not accessible AFAIK).
Instead, the regular pose import from file does require a wrapper to the init options, but passed to a different class, which from what I can tell does something along the lines of:</p>
<pre><code>rts: prc.chemical.ResidueTypeSet = pose.residue_type_set_for_pose()
sfr_opts = prc.io.StructFileReaderOptions(oc)
builder = prc.io.pose_from_sfr.PoseFromSFRBuilder(rts=rts,
                                                  options=sfr_opts)
with open(filename, 'r') as fh:
    pdblock = fh.read()
# this is wrong for some reason and will segfault
filerep: prc.io.StructFileRep = prc.io.pdb.create_sfr_from_pdb_file_contents(pdblock, sfr_opts)
builder.build_pose(filerep, pyrosetta.Pose())
# then foldtree information is added
ipo = prc.import_pose.ImportPoseOptions(oc)
prc.import_pose.read_additional_pdb_data('1brs.cif', pose, ipo)
</code></pre>
<p>This is rather convoluted and not actually viable as it will segfault (I am making a mistake at some step). The pdbinfo label + foldtree business is controlled in the more straightforward <code>pose_from_file</code> by the third argument, which I set to False because the CIF was from PyMOL or the PDB, so it is irrelevant. Given that the CIF reader will preserve non-peptide backbone connections, there is not too much to be gained from this route.</p>
<h4>Labels</h4>
<p>There are some pieces of information one may want to associate with a residue, for example above I mentioned <code>_pdbx_sifts_xref_db</code> information, which is not always present anyway. Were one to want to annotate certain residues for selection, one could use the residue labels in the pdbinfo, which are easy to set up:</p>
<pre><code>pi = pose.pdb_info()  # the PDBInfo object of the pose
pi.add_reslabel(1, 'foo')
print(pi.get_reslabels(1)) # ['foo']
res_sele: ModuleType = prc.select.residue_selector
pru: ModuleType = pyrosetta.rosetta.utility
v: pru.vector1_bool = res_sele.ResiduePDBInfoHasLabelSelector('foo').apply(pose)
res_sele.ResidueVector(v) # 1
</code></pre>
Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-3011592319525413552023-01-22T01:15:00.006-08:002023-01-23T05:19:59.060-08:00Typing emoji with a Pico keypad<h2>Typing emoji with a Pico keypad</h2>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJl-GgK3ao--QlY8q0-5soCFmXIHN4jw0WIlgE0o6Cf9aQhydaVYFVjr-WR9hAjlyAz7DGDNy3QabYJ6EtDgIxdDxliAjp12ZCP10fw1_9uvm1F7puBM4fHyevUYsY7p8QtyEb2e9AjCHcCSPvSaOvurL19cTT7bm0LyxH3KeZt1wbyrcXe_W5dfYD/s4032/IMG_1199.jpeg" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="4032" data-original-width="3024" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJl-GgK3ao--QlY8q0-5soCFmXIHN4jw0WIlgE0o6Cf9aQhydaVYFVjr-WR9hAjlyAz7DGDNy3QabYJ6EtDgIxdDxliAjp12ZCP10fw1_9uvm1F7puBM4fHyevUYsY7p8QtyEb2e9AjCHcCSPvSaOvurL19cTT7bm0LyxH3KeZt1wbyrcXe_W5dfYD/w150-h200/IMG_1199.jpeg" width="150" /></a></div><br />I got myself a Pimoroni RGB keypad, a keypad with 16 coloured buttons controlled by a Raspberry Pico.
So the first thing I wanted to do was code it to output emoji, because I am a very professional person.
However, this was not as simple a task as I had hoped.<p></p>
<a name='more'></a>
<h3>Raspberry Pico</h3>
<p>A Pico is a board with the Raspberry Pi Foundation's own RP2040 microcontroller (the big chip) and 2MB of flash memory (the square chip to the right of it), which is partitioned into controller firmware and mass storage. The former can run MicroPython (or its derivative CircuitPython), a bare-metal Python implementation, while the code it runs is stored in the latter.
It is less powerful than a Raspberry Pi, which runs a full OS,
but it is smaller and in most applications you don't need an OS. The ESP32 works similarly to the Pico, whereas on an Arduino you don't generally touch the bootloader and upload Arduino-flavoured C, and on an STM32 Nucleo you go full-on bare-metal coding. A Pico is simple but not hard-core to control, which is nice.</p><p>A funny thing about the Pico is that it has some Rubber Ducky traits to it: one flashes the firmware, albeit by pressing a boot selection button as opposed to cracking open a USB memory stick and shorting two pins; the main memory appears as a regular USB memory stick; and with the hid module it is also seen as a keyboard...</p><p>One detail to note is that the Python is a slimmed-down version: several standard library modules are missing, packages are copied into the <code>lib</code> folder in lieu of installation, and so forth.</p>
<h3>Pimoroni RGB Keypad</h3>
<p>The Pimoroni RGB Keypad is a 4x4 matrix of buttons with RGB LEDs.
The buttons are read by the Pico via I2C, and the LEDs are addressable APA102 LEDs driven over SPI.
The Python library controlling the keypad is Pimoroni's <code>pmk</code>, while <code>adafruit_hid</code> lets the Pico pose as a keyboard (HID is Human Interface Device).</p>
<h3>The problems</h3>
<p>There are two issues to outputting emoji:</p>
<ul>
<li>emoji are not part of the Basic Multilingual Plane (BMP) of Unicode,</li>
<li>Mac does not support Unicode input on a standard GB/US keyboard layout.</li>
</ul>
<p>When you press a key, the keyboard sends a code to the computer,
which then looks up the character in a table. These key codes (HID usage IDs) are different from ASCII or Unicode.
For example, the code for the letter <code>a</code> is <code>0x61</code> in ASCII,
but on the US keyboard layout it is <code>0x04</code>.</p>
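<p>To make the mismatch concrete, here is a minimal sketch comparing the ASCII code of a letter with its keyboard code. It assumes the standard USB HID keyboard usage table, where the letters run consecutively from <code>a</code> at <code>0x04</code> to <code>z</code> at <code>0x1d</code>; the helper name <code>hid_usage_id</code> is made up for this illustration and is not part of any library.</p>

```python
def hid_usage_id(letter: str) -> int:
    """Return the USB HID keyboard usage ID of a single letter a-z (hypothetical helper)."""
    letter = letter.lower()
    if len(letter) != 1 or not "a" <= letter <= "z":
        raise ValueError(f"only single letters a-z are handled, got {letter!r}")
    # assumption: HID keyboard usage table, 'a' is 0x04 and the alphabet is consecutive
    return 0x04 + ord(letter) - ord("a")


for letter in "az":
    # ASCII code on the left, keyboard code on the right
    print(letter, hex(ord(letter)), hex(hid_usage_id(letter)))
```

<p>Shift, digits and punctuation follow their own entries in that table, which is why a keyboard library generally ships a lookup of named key codes rather than doing arithmetic like this.</p>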
<p>The notation <code>0xnn</code> is hexadecimal, where <code>0x</code> is the prefix and <code>nn</code> is a hex number between <code>00</code> and <code>FF</code>.
Were I to want to write a binary number I would use <code>0bnn</code>, where <code>nn</code> is a binary number between <code>00</code> and <code>11</code>.
Hexadecimal, base 16, is easier to read than binary, base 2, and each hex digit encodes 4 bits (a nibble):
a byte is 8 bits, so 2 hex digits.
For Unicode specifically, there is also the notation <code>U+nnnn</code>, where <code>nnnn</code> is a number between <code>0000</code> and <code>FFFF</code>.
In Python the latter is expressed as <code>\unnnn</code> in a string.</p>
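<p>A quick sketch of these notations in plain Python, using the code point of the acute-accented a as the number (the variable names are only for illustration):</p>

```python
value = 0xE1                      # hexadecimal literal
print(value)                      # the same number in decimal
print(hex(value), bin(value))     # back to hex and to binary

# one hex digit is one nibble (4 bits), so a byte splits into two hex digits
high_nibble, low_nibble = value >> 4, value & 0xF
print(hex(high_nibble), hex(low_nibble))

# the \unnnn escape in a string is the same character as chr() of the number
char = "\u00e1"
print(char == chr(value))
print(f"U+{ord(char):04X}")       # formatted in the U+nnnn notation
```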
<p>The BMP is the first 65,536 code points of Unicode, which means that 2 bytes are enough to represent any one of them.
The rest of Unicode is called the Supplementary Planes, and their characters are represented by 4 bytes in UTF-16.</p>
<h3>Unicode keyboard on a Mac</h3>
<p>The keypad, posing as a keyboard, can send only the keys on a standard US keyboard.
On a Windows machine AltGr+number will output Unicode characters,
but this is not the case with the default Mac keyboard layout.
Instead, one needs to add an alternative keyboard layout to the Mac in
<code>System Preferences</code> > <code>Keyboard</code> > <code>Input Sources</code> > <code>Other</code> > <code>Unicode Hex Input</code>.
One can switch between keyboard layouts with <code>control</code>+<code>option</code>+<code>space</code>.
Once this is done one can type ⌥+Unicode hex digits.
Say <code>alt</code> + <code>00E1</code> is for the letter <code>á</code>, an acute-accented <code>a</code>.</p>
<p>In the regular layout <code>alt</code> + <code>e</code>
is the dead key <code>´</code> which waits for the next letter to be pressed to modify that.
In Unicode there are two ways actually to write <code>á</code>: the other is with the combining acute accent <code>´</code> at U+0301,
combined with <code>a</code>.</p>
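<p>The two spellings look identical on screen but are different strings. A small sketch with the standard library's <code>unicodedata</code> module shows the difference and how normalisation converts one into the other:</p>

```python
import unicodedata

precomposed = "\u00e1"   # single code point, U+00E1
combined = "a\u0301"     # plain 'a' followed by the combining acute accent U+0301

print(precomposed == combined)           # the raw strings differ
print(len(precomposed), len(combined))   # one code point versus two

# NFC composes to the single code point, NFD decomposes into letter + accent
nfc = unicodedata.normalize("NFC", combined)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == combined)
```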
<p>On Windows, the AltGr key is combined with a decimal number (not hex), so this step is not needed.</p>
<h3>Unicode digression</h3>
<p>The Basic Multilingual Plane covers a fair amount of characters, but not all of them.
The CJK (Chinese, Japanese, Korean) characters are not all in the BMP, only about 27,000 of them.
Actually, this covers a large number of characters: Japanese school children learn 2,136 kanji.
There are differences between the Japanese, Chinese and Korean characters,
for example the traditional Kangxi character for an East-Asian dragon is <code>龍</code> (U+9F8D), while its descendants are the Japanese character <code>竜</code> (U+7ADC) and the simplified Chinese character <code>龙</code> (U+9F99). All of these fit in 2 bytes. The most complicated character in Japanese (a joke like hippopotomonstrosesquippedaliophobia) is <code>𱁬</code> (U+3106C),
which is 4 bytes long in UTF-16 as it is not in the BMP,
whereas the mad character for the snaking flight-movement of an East-Asian dragon (wingless) is <code>龘</code> (U+9F98), which is in the BMP. Korean script, Hangul, is syllabic and dragon happens to be a single 2-byte character <code>용</code> (U+C6A9), but in one of the Japanese syllabic scripts (hiragana) it's りゅう, which is 3 characters. A western dragon in Japanese, written in the other syllabic script (katakana), is ドラゴン, so 2*4=8 bytes are needed
and the word 'dragon' is 6 ASCII-characters/bytes. So stroke count does not correlate with memory footprint!</p>
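<p>These byte counts can be checked with <code>str.encode</code>. The sketch below tallies the dragons in both UTF-8 and UTF-16 (the counts quoted above are the UTF-16 ones, bar the ASCII word); the labels and the helper name <code>byte_lengths</code> are mine, and the strings are written as escapes so the snippet does not depend on the editor's encoding.</p>

```python
from typing import Tuple

samples = {
    "dragon (ASCII)": "dragon",
    "U+9F8D (traditional)": "\u9f8d",                    # 龍
    "U+9F98 (flying dragon)": "\u9f98",                  # 龘
    "U+3106C (taito)": "\U0003106C",                     # outside the BMP
    "U+C6A9 (Hangul)": "\uc6a9",                         # 용
    "ryuu (hiragana)": "\u308a\u3085\u3046",             # りゅう
    "doragon (katakana)": "\u30c9\u30e9\u30b4\u30f3",    # ドラゴン
}


def byte_lengths(text: str) -> Tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16 without a byte order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in samples.items():
    utf8_len, utf16_len = byte_lengths(text)
    print(f"{label}: {len(text)} code point(s), {utf8_len} B in UTF-8, {utf16_len} B in UTF-16")
```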
<p>In summary, 65,536 is a crazy amount of characters. In Unicode 15.0 there are 149,186 characters.</p>
<p>Funny factoid:
neither Tolkien's Elvish script (Tengwar) nor the Klingon script (pIqaD in tlhIngan Hol) is in Unicode:
they were in the former ConScript Unicode Registry area, but now emoji have taken the place of Tengwar,
while the warriors of Qo'noS are holding out.</p>
<p>Why is this important?
When typing or copy-pasting or saving a non-BMP character you often get weird gibberish.
Say 😊 (U+1F60A) will become ὠ (U+1F60; curiously 'uh?' is both my reaction and the sound of an aspirated omega).
This is because only the first 2 bytes (4 hex digits) are being read.
This happens with the keypad.</p>
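<p>A sketch of that truncation, under the assumption that only the first four hex digits of the code point get consumed (which is my reading of the symptom, not something I have traced in the OS):</p>

```python
emoji = "\U0001F60A"                  # the smiling face emoji
hex_digits = f"{ord(emoji):x}"        # hex digits of the code point
truncated = chr(int(hex_digits[:4], 16))  # keep only the first four hex digits

print(hex_digits)                     # five hex digits, one too many for the BMP
print(f"U+{ord(truncated):04X}")      # the Greek omega with a breathing mark
```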
<h3>Surrogate pairs</h3>
<p>To circumvent this one can encode them as surrogate pairs, which start with a high surrogate.
Wiktionary describes these as:
'A code point in the range U+D800 through U+DBFF (the High Surrogates and High Private Use Surrogates blocks),
used in UTF-16 to encode the high 10 bits of the 20-bit offset
above U+FFFF of the code point belonging to a supplementary character.'</p>
<p>That verbiage basically tells us it's an uninteresting technical hack,
wherein the surrogate character � acts as a kind of modifier for the next 2 bytes.
So 😀 (U+1F600) becomes � (U+D83D) � (U+DE00).</p>
<p>Herein I am using U+FFFD for a placeholder,
as this glyph may appear when displaying an actual encoding error
and is different than □ (U+25A1), which is a missing representation in the font used.</p>
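<p>The arithmetic hidden in that definition can be sketched with integers only: subtract <code>0x10000</code>, then split the 20-bit offset into its top and bottom 10 bits. The function name <code>to_surrogate_pair</code> is made up for this illustration, and I have not tested whether this route behaves on the Pico.</p>

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit offset above U+FFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)   # grinning face
print(f"U+{high:04X} U+{low:04X}")
```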
<p>To convert a non-BMP character to a surrogate pair one can use the <code>utf-16</code> encoding and <code>str.encode</code>.
Let's look at the character <code>á</code> (U+00E1): </p>
<div class="codehilite"><pre><span></span><code>>>> ord('á')
225
>>> hex(ord('á'))
'0xe1'
>>> 'á' #: string
'á'
>>> 'á'.encode('utf-8') #: bytes
b'\xc3\xa1'
>>> 'á'.encode('utf-16') #: bytes
b'\xff\xfe\xe1\x00'
>>> 'á'.encode('utf-16-le') #: bytes
b'\xe1\x00'
</code></pre></div>
<p>'LE' stands for little-endian, which is a way of ordering the bytes of a multi-byte value, named after
a massive pointless dispute the Lilliputians have in <em>Gulliver's Travels</em> over which way up an egg should go,
herein the egg is a multi-byte value, and for more on the pettiness of the discussion of which byte to read first
see <a href="https://en.wikipedia.org/wiki/Endianness">this</a>.
x86 machines are little-endian, and so are ARM ones by default, although the latter can do both.
And there you thought this Unicode discussion could not get any more pedantic.</p>
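<p>For the avoidance of doubt, this is what the two byte orders look like for the same 2-byte number, using <code>int.to_bytes</code> and the explicit <code>utf-16-le</code> and <code>utf-16-be</code> codecs:</p>

```python
import sys

value = 0x00E1                         # code point of the acute-accented a
little = value.to_bytes(2, "little")   # least significant byte first
big = value.to_bytes(2, "big")         # most significant byte first
print(little, big)

# the explicit UTF-16 codecs give exactly the same bytes
char = "\u00e1"
print(char.encode("utf-16-le") == little, char.encode("utf-16-be") == big)

# and this is the byte order of the machine running the interpreter
print(sys.byteorder)
```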
<p>Were we to have more characters in the decoded string, in the utf-16 and utf-16-le encodings
the bytes encoding each character would still run left to right.
Okay, there's the right-to-left mark, but let's not get into that.</p>
<p>So now back to our non-BMP character:</p>
<div class="codehilite"><pre><span></span><code>>>> ord('😊')
128522
>>> hex(ord('😊'))
'0x1f60a'
>>> '😊'.encode('utf-16')
b'\xff\xfe\x3d\xd8\x0a\xde'
>>> '😊'.encode('utf-16-le')
b'\x3d\xd8\x0a\xde' # or b'=\xd8\n\xde'
</code></pre></div>
<p>So in the interest of sanity let's assume magic and skip ahead to converting the last one into a surrogate pair:</p>
<div class="codehilite"><pre><span></span><code>>>> le16_emoji: bytes = '😊'.encode('utf-16-le')
>>> len(le16_emoji)
4
>>> f"U+{int.from_bytes(le16_emoji[:2], 'little'):0>4x} U+{int.from_bytes(le16_emoji[2:], 'little'):0>4x}"
'U+d83d U+de0a'
</code></pre></div>
<p>Typing <code>⌥</code>+<code>d83dde0a</code> will give you 😊.</p>
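<p>As a sanity check, the hex sequence can be turned back into the emoji: read as bytes, the eight hex digits are simply the big-endian UTF-16 encoding of the character. A minimal sketch:</p>

```python
hexsequence = "d83dde0a"                     # high surrogate then low surrogate
raw = bytes.fromhex(hexsequence)             # four bytes, most significant first
emoji = raw.decode("utf-16-be")              # big-endian, as the digits are typed

print(len(emoji), f"U+{ord(emoji):X}")       # a single code point again
```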
<p>Colours are also stored as RGB hex values, so <code>#ff0000</code> is red, <code>#00ff00</code> is green, <code>#0000ff</code> is blue.
One byte per channel. Colour theory is a whole other fun numerical mad-hatter tea party
and is touched upon in a past post,
<a href="https://blog.matteoferla.com/2022/01/ggplot-colours-in-python.html">ggplot colours in Python</a>.</p>
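<p>Since the lookup table further down stores the colours as hex strings, at some point they need splitting into three channel values. A minimal sketch of such a conversion follows; the helper name <code>hex_to_rgb</code> is mine and not part of the keypad library.</p>

```python
from typing import Tuple


def hex_to_rgb(color_hex: str) -> Tuple[int, int, int]:
    """Convert '#RRGGBB' into a (red, green, blue) tuple of 0-255 integers."""
    digits = color_hex.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {color_hex!r}")
    # two hex digits, i.e. one byte, per channel
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue


print(hex_to_rgb("#FF7F50"))   # coral
```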
<p>So on a real machine one can pre-make the surrogate pairs and store them in a lookup table.
Doing it on the fly on the Pico does not work (I am unsure why), but here is a simple table:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># define what is wanted</span>
<span class="n">emoji_settings</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">0</span><span class="p">,</span> <span class="s1">'😀'</span><span class="p">,</span> <span class="s1">'green'</span><span class="p">),</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'🤩'</span><span class="p">,</span> <span class="s1">'yellow'</span><span class="p">),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'🤣'</span><span class="p">,</span> <span class="s1">'blue'</span><span class="p">),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'😭'</span><span class="p">,</span> <span class="s1">'red'</span><span class="p">),</span>
<span class="c1"># new row</span>
<span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s1">'👾'</span><span class="p">,</span> <span class="s1">'purple'</span><span class="p">),</span>
<span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s1">'🪲'</span><span class="p">,</span> <span class="s1">'lime'</span><span class="p">),</span>
<span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="s1">'🤖'</span><span class="p">,</span> <span class="s1">'cerulean'</span><span class="p">),</span>
<span class="c1"># 7</span>
<span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s1">'🤦'</span><span class="p">,</span> <span class="s1">'coral'</span><span class="p">),</span>
<span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="s1">'🤷'</span><span class="p">,</span> <span class="s1">'sage'</span><span class="p">),</span>
<span class="p">]</span>
<span class="n">colors</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">green</span><span class="o">=</span><span class="s1">'#00FF00'</span><span class="p">,</span>
<span class="n">yellow</span><span class="o">=</span><span class="s1">'#FFFF00'</span><span class="p">,</span>
<span class="n">blue</span><span class="o">=</span><span class="s1">'#0000FF'</span><span class="p">,</span>
<span class="n">red</span><span class="o">=</span><span class="s1">'#FF0000'</span><span class="p">,</span>
<span class="n">purple</span><span class="o">=</span><span class="s1">'#A020F0'</span><span class="p">,</span>
<span class="n">lime</span><span class="o">=</span><span class="s1">'#32CD32'</span><span class="p">,</span>
<span class="n">cerulean</span><span class="o">=</span><span class="s1">'#2a52be'</span><span class="p">,</span>
<span class="n">coral</span><span class="o">=</span><span class="s1">'#FF7F50'</span><span class="p">,</span>
<span class="n">sage</span><span class="o">=</span><span class="s1">'#B2AC88'</span><span class="p">)</span>
<span class="n">rows</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">emoji2hexseq</span><span class="p">(</span><span class="n">emoji</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
    <span class="c1"># hex digits of the two UTF-16 surrogates, e.g. 'd83dde00'</span>
    <span class="k">return</span> <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="nb">int</span><span class="o">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">emoji</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-16-le'</span><span class="p">)[:</span><span class="mi">2</span><span class="p">],</span> <span class="s1">'little'</span><span class="p">)</span><span class="si">:</span><span class="s2">0>4x</span><span class="si">}</span><span class="s2">"</span><span class="o">+</span>\
           <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="nb">int</span><span class="o">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">emoji</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-16-le'</span><span class="p">)[</span><span class="mi">2</span><span class="p">:],</span> <span class="s1">'little'</span><span class="p">)</span><span class="si">:</span><span class="s2">0>4x</span><span class="si">}</span><span class="s2">"</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">emoji</span><span class="p">,</span> <span class="n">color_name</span> <span class="ow">in</span> <span class="n">emoji_settings</span><span class="p">:</span>
    <span class="n">rows</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">i</span><span class="p">,</span>
                     <span class="n">emoji</span><span class="o">=</span><span class="n">emoji</span><span class="p">,</span>
                     <span class="n">color_name</span><span class="o">=</span><span class="n">color_name</span><span class="p">,</span>
                     <span class="n">color_hex</span><span class="o">=</span> <span class="n">colors</span><span class="p">[</span><span class="n">color_name</span><span class="p">],</span>
                     <span class="n">hexsequence</span> <span class="o">=</span> <span class="n">emoji2hexseq</span><span class="p">(</span><span class="n">emoji</span><span class="p">)</span>
                     <span class="p">)</span>
                <span class="p">)</span>
</code></pre></div>
<p>Once that is done, I load the following in my <code>code.py</code> script on the pico:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># define settings</span>
<span class="n">emoji_settings</span> <span class="o">=</span> <span class="p">[{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'😀'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'green'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#00FF00'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83dde00'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'🤩'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'yellow'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#FFFF00'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83edd29'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'🤣'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'blue'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#0000FF'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83edd23'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'😭'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'red'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#FF0000'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83dde2d'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'👾'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'purple'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#A020F0'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83ddc7e'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'🪲'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'lime'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#32CD32'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83edeb2'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">6</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'🤖'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'cerulean'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#2a52be'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83edd16'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">8</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'🤦'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'coral'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#FF7F50'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83edd26'</span><span class="p">},</span>
<span class="p">{</span><span class="s1">'index'</span><span class="p">:</span> <span class="mi">9</span><span class="p">,</span>
<span class="s1">'emoji'</span><span class="p">:</span> <span class="s1">'🤷'</span><span class="p">,</span>
<span class="s1">'color_name'</span><span class="p">:</span> <span class="s1">'sage'</span><span class="p">,</span>
<span class="s1">'color_hex'</span><span class="p">:</span> <span class="s1">'#B2AC88'</span><span class="p">,</span>
<span class="s1">'hexsequence'</span><span class="p">:</span> <span class="s1">'d83edd37'</span><span class="p">}]</span>
<span class="kn">import</span> <span class="nn">usb_hid</span>
<span class="c1"># from circuitpython_typing import List</span>
<span class="kn">from</span> <span class="nn">adafruit_hid.keyboard</span> <span class="kn">import</span> <span class="n">Keyboard</span>
<span class="kn">from</span> <span class="nn">adafruit_hid.keycode</span> <span class="kn">import</span> <span class="n">Keycode</span>
<span class="kn">from</span> <span class="nn">pmk</span> <span class="kn">import</span> <span class="n">PMK</span><span class="p">,</span> <span class="n">Key</span><span class="p">,</span> <span class="n">hsv_to_rgb</span>
<span class="kn">from</span> <span class="nn">pmk.platform.rgbkeypadbase</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">RGBKeypadBase</span> <span class="k">as</span> <span class="n">Hardware</span><span class="p">,</span>
<span class="n">_ROTATED</span><span class="p">,</span>
<span class="p">)</span> <span class="c1"># for Pico RGB Keypad Base</span>
<span class="c1"># this is not needed, as I am circumventing this</span>
<span class="kn">from</span> <span class="nn">adafruit_hid.keyboard_layout_base</span> <span class="kn">import</span> <span class="n">KeyboardLayoutBase</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="c1"># import logging</span>
<span class="n">debug</span> <span class="o">=</span> <span class="nb">print</span>
<span class="c1"># debug = lambda *args, **kwargs: None</span>
<span class="n">pmk</span> <span class="o">=</span> <span class="n">PMK</span><span class="p">(</span><span class="n">Hardware</span><span class="p">())</span>
<span class="n">keys</span><span class="p">:</span> <span class="nb">list</span> <span class="o">=</span> <span class="n">pmk</span><span class="o">.</span><span class="n">keys</span> <span class="c1">#: List[Key]</span>
<span class="n">keyboard</span> <span class="o">=</span> <span class="n">Keyboard</span><span class="p">(</span><span class="n">usb_hid</span><span class="o">.</span><span class="n">devices</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">switch_keyboard</span><span class="p">():</span>
    <span class="sd">"""Switch the keyboard layout</span>

<span class="sd">    This assumes only one real layout and one Unicode layout.</span>
<span class="sd">    """</span>
    <span class="n">keyboard</span><span class="o">.</span><span class="n">press</span><span class="p">(</span><span class="n">Keycode</span><span class="o">.</span><span class="n">CONTROL</span><span class="p">,</span> <span class="n">Keycode</span><span class="o">.</span><span class="n">OPTION</span><span class="p">,</span> <span class="n">Keycode</span><span class="o">.</span><span class="n">SPACEBAR</span><span class="p">)</span>
    <span class="c1"># time.sleep(0.1)</span>
    <span class="n">keyboard</span><span class="o">.</span><span class="n">release_all</span><span class="p">()</span>
    <span class="n">debug</span><span class="p">(</span><span class="s2">"Switched keyboard"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">type_letter</span><span class="p">(</span><span class="n">letter</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">Keycode</span><span class="p">:</span>
    <span class="sd">"""</span>
<span class="sd">    The numbers of the Keycode enum are words"""</span>
    <span class="k">if</span> <span class="n">letter</span><span class="o">.</span><span class="n">isdigit</span><span class="p">():</span>
        <span class="n">letter</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s2">"ZERO"</span><span class="p">,</span>
            <span class="s2">"ONE"</span><span class="p">,</span>
            <span class="s2">"TWO"</span><span class="p">,</span>
            <span class="s2">"THREE"</span><span class="p">,</span>
            <span class="s2">"FOUR"</span><span class="p">,</span>
            <span class="s2">"FIVE"</span><span class="p">,</span>
            <span class="s2">"SIX"</span><span class="p">,</span>
            <span class="s2">"SEVEN"</span><span class="p">,</span>
            <span class="s2">"EIGHT"</span><span class="p">,</span>
            <span class="s2">"NINE"</span><span class="p">,</span>
        <span class="p">][</span><span class="nb">int</span><span class="p">(</span><span class="n">letter</span><span class="p">)]</span>
    <span class="n">code</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">Keycode</span><span class="p">,</span> <span class="n">letter</span><span class="o">.</span><span class="n">upper</span><span class="p">())</span>
    <span class="n">keyboard</span><span class="o">.</span><span class="n">press</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
    <span class="n">keyboard</span><span class="o">.</span><span class="n">release</span><span class="p">(</span><span class="n">code</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">code</span>
<span class="k">def</span> <span class="nf">type_unicode</span><span class="p">(</span><span class="n">char</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="sd">"""</span>
<span class="sd">    The high surrogate conversion does not work in the pico</span>
<span class="sd">    """</span>
    <span class="k">if</span> <span class="nb">ord</span><span class="p">(</span><span class="n">char</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mh">0xFFFF</span><span class="p">:</span>
        <span class="n">hexed</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="nb">ord</span><span class="p">(</span><span class="n">char</span><span class="p">)</span><span class="si">:</span><span class="s2">0>4x</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">debug</span><span class="p">(</span><span class="s2">"Non-Basic Multilingual Plane character... surrogate pair"</span><span class="p">)</span>
        <span class="n">hexed</span> <span class="o">=</span> <span class="p">(</span>
            <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="nb">int</span><span class="o">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">char</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-16-le'</span><span class="p">)[:</span><span class="mi">2</span><span class="p">],</span> <span class="s1">'little'</span><span class="p">)</span><span class="si">:</span><span class="s2">0>4x</span><span class="si">}</span><span class="s2">"</span>
            <span class="o">+</span> <span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="nb">int</span><span class="o">.</span><span class="n">from_bytes</span><span class="p">(</span><span class="n">char</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-16-le'</span><span class="p">)[</span><span class="mi">2</span><span class="p">:],</span> <span class="s1">'little'</span><span class="p">)</span><span class="si">:</span><span class="s2">0>4x</span><span class="si">}</span><span class="s2">"</span>
        <span class="p">)</span>
    <span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">char</span><span class="si">}</span><span class="s2"> -> 0x</span><span class="si">{</span><span class="n">hexed</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">type_unicode_sequence</span><span class="p">(</span><span class="n">hexed</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="n">switch_keyboard</span><span class="p">()</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">press</span><span class="p">(</span><span class="n">Keycode</span><span class="o">.</span><span class="n">ALT</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.05</span><span class="p">)</span>
<span class="k">for</span> <span class="n">letter</span> <span class="ow">in</span> <span class="n">hexed</span><span class="p">:</span>
<span class="n">type_letter</span><span class="p">(</span><span class="n">letter</span><span class="p">)</span>
<span class="c1"># time.sleep(0.05)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">release_all</span><span class="p">()</span>
<span class="n">switch_keyboard</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">test</span><span class="p">():</span>
<span class="n">debug</span><span class="p">(</span><span class="s2">"TEST"</span><span class="p">)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">press</span><span class="p">(</span><span class="n">Keycode</span><span class="o">.</span><span class="n">SHIFT</span><span class="p">)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">press</span><span class="p">(</span><span class="n">Keycode</span><span class="o">.</span><span class="n">ONE</span><span class="p">)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">release</span><span class="p">(</span><span class="n">Keycode</span><span class="o">.</span><span class="n">ONE</span><span class="p">)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">release</span><span class="p">(</span><span class="n">Keycode</span><span class="o">.</span><span class="n">SHIFT</span><span class="p">)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">release_all</span><span class="p">()</span>
<span class="c1"># ---------------------------------------------------------------------------</span>
<span class="c1"># ## Set colours</span>
<span class="n">k2c</span> <span class="o">=</span> <span class="p">{</span><span class="n">row</span><span class="p">[</span><span class="s1">'index'</span><span class="p">]:</span> <span class="n">row</span><span class="p">[</span><span class="s1">'color_hex'</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">emoji_settings</span><span class="p">}</span>
<span class="k">def</span> <span class="nf">set_color</span><span class="p">(</span><span class="n">key</span><span class="p">:</span> <span class="n">Key</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Set the colours of the keys as Tuple[int, int, int],</span>
<span class="sd"> whereas for sanity they are rgb hexes.</span>
<span class="sd"> Uses ``k2c``, which is a dict of key index to color hex derived from ``emoji_settings`` </span>
<span class="sd"> """</span>
<span class="n">color</span> <span class="o">=</span> <span class="n">k2c</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">_ROTATED</span><span class="p">[</span><span class="n">key</span><span class="o">.</span><span class="n">number</span><span class="p">],</span> <span class="kc">None</span><span class="p">)</span>
<span class="n">rgb</span> <span class="o">=</span> <span class="p">(</span>
<span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="n">color</span><span class="p">[</span><span class="mi">1</span> <span class="o">+</span> <span class="n">i</span> <span class="p">:</span> <span class="mi">3</span> <span class="o">+</span> <span class="n">i</span><span class="p">],</span> <span class="mi">16</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">)]</span> <span class="k">if</span> <span class="n">color</span> <span class="k">else</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">key</span><span class="o">.</span><span class="n">set_led</span><span class="p">(</span><span class="o">*</span><span class="n">rgb</span><span class="p">)</span>
<span class="c1"># ---------------------------------------------------------------------------</span>
<span class="c1"># ## Set actions</span>
<span class="n">k2e</span> <span class="o">=</span> <span class="p">{</span><span class="n">row</span><span class="p">[</span><span class="s1">'index'</span><span class="p">]:</span> <span class="n">row</span><span class="p">[</span><span class="s1">'hexsequence'</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">emoji_settings</span><span class="p">}</span>
<span class="n">key</span><span class="p">:</span> <span class="n">Key</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">keys</span><span class="p">:</span>
<span class="n">set_color</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="nd">@pmk</span><span class="o">.</span><span class="n">on_press</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">press_handler</span><span class="p">(</span><span class="n">key</span><span class="p">:</span> <span class="n">Key</span><span class="p">):</span>
<span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s2">"pressed #</span><span class="si">{</span><span class="n">key</span><span class="o">.</span><span class="n">number</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">key</span><span class="o">.</span><span class="n">set_led</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">hexed</span> <span class="o">=</span> <span class="n">k2e</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">_ROTATED</span><span class="p">[</span><span class="n">key</span><span class="o">.</span><span class="n">number</span><span class="p">],</span> <span class="kc">None</span><span class="p">)</span>
<span class="k">if</span> <span class="n">hexed</span><span class="p">:</span>
<span class="n">type_unicode_sequence</span><span class="p">(</span><span class="n">hexed</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">debug</span><span class="p">(</span><span class="s2">"Unassigned"</span><span class="p">)</span>
<span class="nd">@pmk</span><span class="o">.</span><span class="n">on_release</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">release_handler</span><span class="p">(</span><span class="n">key</span><span class="p">:</span> <span class="n">Key</span><span class="p">):</span>
<span class="n">debug</span><span class="p">(</span><span class="sa">f</span><span class="s2">"released #</span><span class="si">{</span><span class="n">key</span><span class="o">.</span><span class="n">number</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">set_color</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="n">keyboard</span><span class="o">.</span><span class="n">release_all</span><span class="p">()</span>
<span class="n">debug</span><span class="p">(</span><span class="s2">"Loaded successfully"</span><span class="p">)</span>
<span class="c1"># ---------------------------------------------------------------------------</span>
<span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">pmk</span><span class="o">.</span><span class="n">update</span><span class="p">()</span>
</code></pre></div>
<p>Three things are painful in that snippet.</p>
<ol>
<li>
<p>There is no typehinting via the <code>typing</code> module.
CircuitPython does not have the <code>typing</code> module in its standard library
—there is a library for it but it does not work as expected.</p>
</li>
<li>
<p>There is no <code>logging</code> module,
hence the debug function, which can be <code>print</code> or <code>lambda *args, **kwargs: None</code>.</p>
</li>
<li>
<p>I use British spelling in comments and docstrings,
but American spelling in code. There is no PEP order from Guido banning British spelling,
but due to dependencies etc. I find it easier to use American spelling in code.</p>
</li>
</ol>
<p>About the print business: when the board is plugged in, stdout is sent via the serial connection (USB).
On a Unix machine this will be <code>/dev/tty.usbmodem*</code> (macOS) or <code>/dev/ttyACM*</code> or <code>/dev/ttyUSB*</code> (Linux).
The <code>tty</code> stands for teletype, which is a device that can be used to send and receive text,
also called a controlling terminal, which is a different thing from a terminal emulator.
The mu editor can be used to view the output (serial button),
but the <code>/dev/tty.usbmodem14101</code> can be used in a Jupyter notebook via:</p>
<div class="codehilite"><pre><span></span><code><span class="sx">!screen /dev/tty.usbmodem14101</span><span class="w"></span>
</code></pre></div>
<p>I will admit that it is a bit crude, but it works...
A better solution would be using cell magic, <code>ipywidgets</code> and <code>threading</code>.
Another time: I need to figure out what emoji to add to my keypad...</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-23238723986388049122022-11-20T01:18:00.004-08:002023-01-15T22:55:45.798-08:00glibc 2.36 vs. CentOS 7: a tale of failure<p>My favourite part of coding is planning and implementing some cool idea for doing something,
especially if it involves some fun maths I read up on Wikipedia a minute beforehand.
In reality, polishing dirty data,
refactoring someone else's bad code,
reverse engineering the use of a module and trying to get stuff to work
are what take up most of my time.</p>
<p>Having got cocky, I thought I could get the latest GNU library for C (<code>glibc</code>) working
on CentOS 7. I failed miserably: here is my sorry tale down the rabbit hole.</p>
<a name='more'></a>
<p>In the cluster I am working in, the OS is still CentOS 7:
a dead distro that came out in 2014 and is no longer supported bar for security support,
and even that has its end of life (EOL) in 2024.
It is still in use as Red Hat abandoned the project and CentOS 8 was never properly completed.
CentOS Stream 9 is an attempt at moving it along,
but was never recommended by Red Hat nor gained universal usage.
Rocky Linux is the unofficial successor.
Most clusters are moving towards Ubuntu as a result.
For example, HTCondor was CentOS 7 only: CentOS Stream 9 is not an option supported by HTCondor, but Ubuntu and Rocky Linux are. This is a rather common situation unfortunately.</p>
<p>This is a problem for an increasing number of Python packages with C-bindings as the system glibc (GNU library for C) is version 2.17, which cannot be updated or circumvented to the best of my knowledge as discussed here.</p>
<p>Examples of such packages in computational biochemistry are pytorch, pyrosetta, rdkit and pymol.</p>
<p>When a package is installed from a wheel or conda and the glibc version requirement is not satisfied, one gets <code>/lib64/libm.so.6: version `GLIBC_2.27' not found (required by package_name)</code>.
Installing from source may or may not work. For example,
pyrosetta will complain about missing functions and if you force it
with different tricks the compiled result does not work, in my experience at least.</p>
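<p>To see in advance which glibc a compiled file expects, one crude option is to scan it for <code>GLIBC_x.y</code> version tags, much like <code>strings file.so | grep GLIBC_</code> would. The sketch below is only a heuristic under that assumption: the function name and the example path are hypothetical, and it does not parse the ELF format properly.</p>

```python
import re
from typing import Optional, Tuple

# Version tags such as GLIBC_2.17 are stored as plain strings in the binary.
GLIBC_TAG = re.compile(rb"GLIBC_(\d+)\.(\d+)")


def highest_glibc_tag(binary: bytes) -> Optional[Tuple[int, int]]:
    """Return the highest GLIBC_x.y tag found in the raw bytes, or None."""
    versions = [(int(major), int(minor)) for major, minor in GLIBC_TAG.findall(binary)]
    return max(versions) if versions else None


# Usage on a real file (the path is hypothetical):
# with open("some_extension.so", "rb") as fh:
#     print(highest_glibc_tag(fh.read()))
```

<p>Comparing integer tuples rather than strings matters here, because as text <code>2.9</code> sorts after <code>2.17</code>.</p>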
<h2>Classic work-arounds</h2>
<p>There are normally two ways to circumvent an old glibc with conda.
The first is setting the <code>CONDA_OVERRIDE_GLIBC</code> variable before environment creation in conda or mamba:</p>
<div class="codehilite"><pre><span></span><code><span class="nv">CONDA_OVERRIDE_GLIBC</span><span class="o">=</span><span class="m">2</span>.36 conda create -n my_new_py38_env <span class="nv">python</span><span class="o">=</span><span class="m">3</span>.8
</code></pre></div>
<p>The other is using the tool <code>patchelf</code>, which can replace the libraries used by a given package as <a href="https://gist.github.com/michaelchughes/85287f1c6f6440c060c3d86b4e7d764b">documented here</a>:</p>
<div class="codehilite">
<pre><span></span><code>patchelf --add-rpath /path/newer_glibc broken_package
</code></pre></div>
<p>This requires a compiled 2.36 glibc library. However, in the package distributions, there is no glibc version greater than 2.17 available for CentOS 7. This explains why the former method does not work:</p>
<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">platform</span><span class="o">,</span> <span class="nn">os</span>
<span class="k">assert</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'CONDA_DEFAULT_ENV'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'my_new_py38_env'</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'glibc_version = </span><span class="si">{</span><span class="n">platform</span><span class="o">.</span><span class="n">libc_ver</span><span class="p">()[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span> <span class="c1"># 2.17</span>
</code></pre></div>
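<p>The same call can be turned into a quick check of whether the system glibc satisfies a given requirement. This is a minimal sketch: the helper names are made up for illustration and the <code>2.27</code> requirement is just the value from the error message above.</p>

```python
import platform
from typing import Tuple


def as_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.17' into (2, 17) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def glibc_satisfies(required: str, available: str) -> bool:
    """True if the available glibc version is at least the required one."""
    return as_tuple(available) >= as_tuple(required)


# libc_ver() returns empty strings on systems where it cannot detect glibc
libc_name, libc_version = platform.libc_ver()
if libc_name == "glibc" and libc_version:
    print(f"glibc {libc_version} satisfies 2.27: {glibc_satisfies('2.27', libc_version)}")
```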
<p>There is in conda a package called <code>glibc</code>, but this is 9 years old and version 2.19, so utterly pointless even if it were to work.</p>
<h2>Compiling glibc 2.36</h2>
<p>To compile glibc 2.36, modern kernel headers are required, as CentOS 7 runs Linux kernel 3, not 6.
In <a href="https://bitsanddragons.wordpress.com/2020/08/26/glibc_2-25-compile-on-centos-7-8/">a blog post</a> there is a snippet, which makes it sound straightforward, but I failed to compile it myself. Here is what I tried:</p>
<ul>
<li>providing different kernel-headers</li>
<li>with the flag to not raise warnings as errors</li>
<li>using modern C compilers thanks to conda (clang or gcc)</li>
</ul>
<p>There is a conda package called <code>kernel-headers_linux-64</code>, which does not seem to take effect,
but there are packages for <code>clang</code> (C-language), <code>clangxx</code> (C++ language), <code>ninja</code>, <code>gcc</code> (GNU C compiler) and <code>libgcc</code>, which are handy (because I do not have root access and the system ones are ancient).</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># throwing everything at it, including the kitchen sink:</span>
mamba install -y -c anaconda -c conda-forge cmake make kernel-headers_linux-64 clang clangxx ninja gcc libgcc ld_impl_linux-64
mkdir <span class="nv">$CONDA_PREFIX_1</span>/custom_lib
wget https://ftp.gnu.org/gnu/glibc/glibc-2.25.tar.gz
tar -xvzf glibc-2.25.tar.gz
<span class="nb">cd</span> glibc-2.25/
mkdir build
<span class="nb">cd</span> build/
../configure --prefix<span class="o">=</span><span class="nv">$CONDA_PREFIX_1</span>/custom_lib/glibc-2.25/
</code></pre></div>
<p>In the above, <code>$CONDA_PREFIX_1</code> is the path to base conda, while <code>$CONDA_PREFIX</code> is the active environment.
The C compiler can be specified with <code>$BUILD_CC</code> or <code>$CC</code>:</p>
<div class="codehilite"><pre><span></span><code><span class="nv">CC</span><span class="o">=</span><span class="sb">`</span>which gcc<span class="sb">`</span> ../configure --prefix<span class="o">=</span><span class="nv">$CONDA_PREFIX_1</span>/custom_lib/glibc-2.25/
</code></pre></div>
<p>The above says the compiler is too old with <code>clang</code> (10.0), but with <code>gcc</code> (12.2) it gives:</p>
<div class="codehilite"><pre><span></span><code><span class="n">configure</span><span class="o">:</span><span class="w"> </span><span class="n">error</span><span class="o">:</span><span class="w"> </span><span class="n">GNU</span><span class="w"> </span><span class="n">libc</span><span class="w"> </span><span class="n">requires</span><span class="w"> </span><span class="n">kernel</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="n">files</span><span class="w"> </span><span class="n">from</span><span class="w"></span>
<span class="n">Linux</span><span class="w"> </span><span class="mf">3.2</span><span class="o">.</span><span class="mi">0</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="n">later</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">installed</span><span class="w"> </span><span class="n">before</span><span class="w"> </span><span class="n">configuring</span><span class="o">.</span><span class="w"></span>
<span class="n">The</span><span class="w"> </span><span class="n">kernel</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="n">files</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">found</span><span class="w"> </span><span class="n">usually</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="sr">/usr/include/</span><span class="n">asm</span><span class="w"> </span><span class="n">and</span><span class="w"></span>
<span class="sr">/usr/include/</span><span class="n">linux</span><span class="o">;</span><span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="n">sure</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="n">directories</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">files</span><span class="w"> </span><span class="n">from</span><span class="w"></span>
<span class="n">Linux</span><span class="w"> </span><span class="mf">3.2</span><span class="o">.</span><span class="mi">0</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="n">later</span><span class="o">.</span><span class="w"> </span><span class="n">This</span><span class="w"> </span><span class="n">check</span><span class="w"> </span><span class="n">uses</span><span class="w"> </span><span class="o"><</span><span class="n">linux</span><span class="o">/</span><span class="n">version</span><span class="o">.</span><span class="na">h</span><span class="o">>,</span><span class="w"> </span><span class="n">so</span><span class="w"></span>
<span class="n">make</span><span class="w"> </span><span class="n">sure</span><span class="w"> </span><span class="n">that</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="n">was</span><span class="w"> </span><span class="n">built</span><span class="w"> </span><span class="n">correctly</span><span class="w"> </span><span class="n">when</span><span class="w"> </span><span class="n">installing</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">kernel</span><span class="w"> </span><span class="n">header</span><span class="w"></span>
<span class="n">files</span><span class="o">.</span><span class="w"> </span><span class="n">To</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">kernel</span><span class="w"> </span><span class="n">headers</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="sr">/usr/include/</span><span class="n">linux</span><span class="o">,</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">the</span><span class="w"></span>
<span class="n">configure</span><span class="w"> </span><span class="n">option</span><span class="w"> </span><span class="o">--</span><span class="k">with</span><span class="o">-</span><span class="n">headers</span><span class="o">.</span><span class="w"></span>
</code></pre></div>
<h2>Kernel headers</h2>
<p>The <code>kernel-headers_linux-64</code> conda package seems relevant, but it does not seem to add <code>linux</code> or <code>asm</code> to <code>$CONDA_PREFIX/include</code>, so I am not sure what it does.</p>
<p>Downloading the latest version 3 kernel headers for CentOS 7 x86_64 and providing those also fails:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># https://centos.pkgs.org/7/centos-x86_64/kernel-headers-3.10.0-1160.el7.x86_64.rpm.html</span>
<span class="nb">cd</span>
wget http://mirror.centos.org/centos/7/os/x86_64/Packages/kernel-headers-3.10.0-1160.el7.x86_64.rpm
rpm2cpio kernel-headers-3.10.0-1160.el7.x86_64.rpm <span class="p">|</span> cpio -idmv
<span class="nb">cd</span> ~/glibc-2.25/build/
<span class="nv">CC</span><span class="o">=</span><span class="sb">`</span>which gcc<span class="sb">`</span> <span class="nv">LIBS</span><span class="o">=</span><span class="nv">$HOME</span>/usr/include ../configure --prefix<span class="o">=</span><span class="nv">$CONDA_PREFIX_1</span>/custom_lib/glibc-2.25/ --with-headers<span class="o">=</span><span class="nv">$HOME</span>/usr/include --disable-werror
</code></pre></div>
<p>As you can see, I tried several steps, way more than most would try before giving up and either using an older version of the glibc-grumpy packages or, if possible, using a Docker or Singularity image.</p>
<p>Ironically in my case I tried the Docker universe in HTCondor, but I am one version behind and that is a rabbit hole tale for another time!</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com1tag:blogger.com,1999:blog-9015174234871442237.post-88671804511104468632022-11-04T00:57:00.004-07:002023-02-24T23:20:32.309-08:00In ML a module is not a namespace but a base class, because... ?<p>Deep learning is changing the world and fast. The list of achievements is impressive,
however, why focus on the positive, when we can moan about the negative?
In this blog post I will discuss three minor details that I find annoying about deep learning,
namely the keyword Module, the limited use of Google/Coral Edge TPUs and the coding quality of the field.</p><a name='more'></a>
<h2>Module</h2>
<p>In programming a 'module' is a contained collection of objects, functions and variables.
In deep learning a module is an abstract base class that is used to build neural network classes.
This is confusing because the word module is used in both contexts.
The latter is most commonly found in Title case, 'Module', as it's a class.</p>
<p>How on earth something like this was allowed to happen is beyond me. I could not find any mathematical precedent, so the only explanation I can give is that it was chosen in haste (<i>vide infra </i>the cowboy coding point).</p>
<p>As deep learning is cooler and shinier than programming theory,
advocating that the community use the word 'base class' would not work,
so what synonyms can be used for the standard module?</p>
<ul>
<li><code>namespace</code>: there is a technical difference (scope) between a namespace and a module,
just like there's a difference between a list and an array, but it's close enough.
In Python there is no namespace type, but there is a module type,
in C++ you have only the former (modules arrived only with C++20) and in TypeScript you have both.</li>
<li>Not <code>package</code>: a package is a directory of modules, a filesystem thing. A module or submodule is generally a file,
but you can declare a module dynamically thanks to <code>types.ModuleType</code>.</li>
<li>Not <code>library</code> or <code>collection</code>: a library is a collection of functionality/modules.
A module ought to encapsulate a single functionality per the principle of information hiding.</li>
<li>Not <code>container</code>: containerization is an OS-level virtualization technique.</li>
<li>Not <code>bucket</code>: an Amazon S3 bucket is a storage container accessed via its web service interface.</li>
<li>Not <code>bundle</code>: that is a videogame sale term for something old that would not sell otherwise.</li>
<li>Not <code>shelf</code>, <code>shelving</code>, <code>rack</code>: these are all physical cluster objects.</li>
<li>Not <code>bag</code>: a not-quite synonym for <code>Counter</code>.</li>
<li>Not any other physical object like <code>crate</code>, <code>chest</code>, <code>trunk</code>, <code>sack</code>, <code>pouch</code>, <code>basket</code>, <code>box</code>, <code>can</code>, <code>jar</code>,
<code>pot</code>, <code>vessel</code>, <code>vase</code>, <code>case</code>, <code>cabinet</code>, <code>cupboard</code> as most likely they are already taken...</li>
</ul>
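<p>To make the standard meaning concrete, here is a minimal sketch of declaring a module dynamically with <code>types.ModuleType</code>, as mentioned in the list above. The module name <code>toolbox</code> and the <code>greet</code> function are hypothetical and purely illustrative.</p>

```python
import sys
import types

# A module object created at runtime rather than loaded from a file
toolbox = types.ModuleType("toolbox", "A module declared at runtime.")


def greet(name: str) -> str:
    return f"Hello {name}"


# Attributes of the module object are the names in its namespace
toolbox.greet = greet

# Optional: registering it makes a regular import statement find it
sys.modules["toolbox"] = toolbox

import toolbox as imported_toolbox

print(imported_toolbox.greet("world"))
```

<p>Nothing here is a base class: the module is just a named bag of functions and variables, which is the sense of the word used in the rest of Python.</p>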
<h2>Google Coral Edge TPUs</h2>
<p>An Nvidia A100 GPU costs as much as a car. A Google Coral Edge TPU costs as much as a beer in a Youngs pub (~£50).
A further cool feature is that there's a USB stick version of the latter.
However, it can only be used with TensorFlow Lite, which is a subset of TensorFlow.
Models cannot be trained in TensorFlow Lite, only converted from TensorFlow.
A model trained with TensorFlow can be used in TensorFlow Lite, if the numbers are reduced from 32-bit floating point numbers to
fixed point 8-bit integers (post-training integer quantization).
This is not great as precision near zero is important.
There was criticism earlier this year about several transformer models
tinkering with the kernel's subnormals (numbers between 0 and the smallest normal number) which is a terrible thing to do,
but illustrates why precision near zero is important and why reducing this precision is not great.
The Posit format of number storage is gaining momentum: in floating-point arithmetic a real number is stored
as a power, e.g. <code>-2.1e10</code>, where you have a sign bit, some exponent bits and some significand/mantissa bits (<code>sign * mantissa * 2^exponent</code>);
in a posit you have an extra term (<code>sign * mantissa * 2^exponent</code> times <code>useed^regime-ish</code>),
where the extra term is based on the exponential part, making it an exponential of an exponential,
making it more precise near zero (tapered precision) and allowing it to be system-friendly (i.e. no new hardware).
So potentially, with better number storage, the fixed point 8-bit integer quantisation could be replaced with a Posit format.
That is in the future, but I am sure it will come fast and we will stop drooling over GPU cards with
five-digit price tags in the same way folk stopped drooling over clock speeds in the 90s.</p>
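<p>To illustrate why 8-bit quantisation hurts near zero, here is a minimal sketch of symmetric integer quantisation in plain Python. It is an assumption-laden toy, not how TensorFlow Lite implements it: the function names and the <code>max_abs</code> calibration value are hypothetical.</p>

```python
from typing import List


def quantize_int8(values: List[float], max_abs: float) -> List[int]:
    """Map floats in [-max_abs, +max_abs] onto integers in [-127, 127]."""
    scale = max_abs / 127  # size of one integer step
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(quantized: List[int], max_abs: float) -> List[float]:
    """Map the integers back to floats (the lost precision stays lost)."""
    scale = max_abs / 127
    return [q * scale for q in quantized]


small = [0.001, 0.002, 0.003]
codes = quantize_int8(small, max_abs=1.0)
print(codes)                                 # three distinct inputs share one integer code
print(dequantize_int8(codes, max_abs=1.0))   # and come back as the same value
```

<p>With a step of roughly 0.008, anything smaller than about half a step collapses to the same code, which is the loss of precision near zero lamented above.</p>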
<h2>Cowboy coding</h2>
<p>The field of deep learning is young and the code is not great.
The speed at which the field advances favours 'cowboy coding': write fast, document and fix later, never refactor.
This is not great as the code is harder to read and to evolve.
Jax is expanded by Haiku to replace Keras (Haiku is inspired by Sonnet, neither related to Poetry). TensorFlow is losing ground to PyTorch.
The different packages gain and lose popularity quickly.
A prime example is installing them, as they were dumped on PyPI without any thought for dependencies.
There is a big difference between CUDA- and CPU-based arithmetic, but shipping a separate drop-in replacement package, as happens with tensorflow-gpu, is a terrible choice.
Many of these packages change their API frequently, which is not great for the maintainability of
the packages that depend on them. As a result the solution adopted by the community is to pin the versions,
which results in installation woes, which are resolved by conda environments, dockerisation and containerisation
(these are meant for other purposes).
This is not Pythonic in any way, but this laziness is becoming so common that some argue that it will kill Python.</p><p>Parenthetical disclaimer: installing tensorflow is not the simplest task, partly through no fault of its own, as it requires the CUDA drivers and the CUDA toolkit on the system. The latter can be installed with conda, but one must make sure that its version (nvcc --version) matches the drivers.</p>
<p>I switched from Perl a decade ago, because the package Moose (object-oriented programming) was too messy.
Python is changing however: typehinting is a fantastic addition and 3.11 is said to be faster.
Typehinting allows one to get a quick idea of what a variable is. It makes debugging much quicker
and makes revising old code a lot less painful.
The speed is something I don't quite get, as folk get pedantic over the speed of a few microseconds,
yet code that runs slowly is generally so because of poor design, not because of the language.</p>
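<p>As a small illustration of the typehinting point, here is a sketch of a function whose signature alone says what goes in and what comes out. The function and its name are hypothetical, picked only because counting residues is a familiar task.</p>

```python
from typing import Dict


def count_residues(sequence: str) -> Dict[str, int]:
    """Count how often each one-letter residue occurs in a protein sequence."""
    counts: Dict[str, int] = {}
    for residue in sequence.upper():
        counts[residue] = counts.get(residue, 0) + 1
    return counts


print(count_residues("MKKL"))
```

<p>Without the hints one would have to read the body to learn that the argument is a string and not, say, a list of residue objects.</p>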
<p>Pip is changing too:
the release of a package to PyPI is often bungled due to user inexperience, but also because the process is not straightforward.
Packaging non-Python code is a common tripping point and the source of many bugs.
Alternative installation routes rise and fall (remember easy_install?); the flavour of the month is poetry.
Wheels for compiled code and pip-based virtual environments are great additions
that are making conda a thing of the past.</p>
<p>However, these do not address the root of the problem: cowboy coding.
Hopefully, the deep learning field will mature and move away from cowboy coding.
But there is excitement about AI writing code, so I am not holding my breath...</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-24750401161971548672022-10-08T01:05:00.005-07:002022-10-08T01:11:09.809-07:00Star imports trick<p>Star-imports (<code>from typing import *</code>) in Python are handy, but dangerous. They are meant for quick coding, <i>i.e.</i> like on a jupyterlab notebook. However, they are bad as they can mask other variables and cause issues down the line. They are ubiquitous online, as are guides explaining why they are bad, so here I just want to share a handy snippet to iron out star-imports.</p><a name='more'></a>
<p>Here is an example:</p>
<pre><code>from collections import Counter
from typing import *
counter = Counter(...)</code></pre>
<p>In typing there is a variable <code>Counter</code>, which by virtue of being imported into the namespace second overrides the first import.
The module <code>typing</code> can be dangerous for the above reason, but I'd say I have a good idea of what is in there. PyRosetta is very large and its example code is full of naughty star imports, which really ought not to be mimicked to avoid namespace pollution, even if they do not really cause odd clashes.</p>
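<p>To check for this kind of clash before it bites, here is a minimal sketch of my own (the function names <code>star_names</code> and <code>find_clashes</code> are made up for illustration, they are not from any library). It assumes the usual star-import rule: the names in <code>__all__</code> are taken if the module defines it, otherwise every name without a leading underscore.</p>

```python
import importlib
from typing import Iterable, List, Set


def star_names(module_name: str) -> List[str]:
    """Names that ``from module_name import *`` would bring into the namespace."""
    module = importlib.import_module(module_name)
    if hasattr(module, '__all__'):
        return list(module.__all__)
    # without __all__, a star import takes every public (non-underscore) name
    return [name for name in vars(module) if not name.startswith('_')]


def find_clashes(module_name: str, existing: Iterable[str]) -> Set[str]:
    """Names already in use that a star import of the module would overwrite."""
    return set(star_names(module_name)) & set(existing)


# the name brought in by `from collections import Counter`
print(find_clashes('typing', ['Counter']))
```

<p>In a notebook one would probably pass <code>globals()</code> as the second argument just before adding the star import.</p>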
<p>Note: herein the word module obviously means a namespace as per standard Python, not a base class for an ANN model, which is inexplicably named so.</p>
There are four solutions to the problem:
<ul>
<li>gain omniscience of the star-imported module: bad choice</li>
<li>use star imports within the scope of a function: repetitive</li>
<li>import the module as itself or a shorthand: makes the code look clunky, but a must for big modules, hence
<code>pd</code>,
<code>np</code>,
<code>tf</code> and so forth.</li>
<li>use star import and fix it later: quicker</li>
</ul>
<p>For the last option, one can list all the variables in a module:</p>
<pre><code>import typing
print(f'({", ".join(typing.__all__)})')</code></pre>
<p><span style="color: #666666; font-size: x-small;">(Any, Callable, ClassVar, ForwardRef, Generic, Optional, Tuple, Type, TypeVar, Union, AbstractSet, ByteString, Container, ContextManager, Hashable, ItemsView, Iterable, Iterator, KeysView, Mapping, MappingView, MutableMapping, MutableSequence, MutableSet, Sequence, Sized, ValuesView, Awaitable, AsyncIterator, AsyncIterable, Coroutine, Collection, AsyncGenerator, AsyncContextManager, Reversible, SupportsAbs, SupportsBytes, SupportsComplex, SupportsFloat, SupportsInt, SupportsRound, ChainMap, Counter, Deque, Dict, DefaultDict, List, OrderedDict, Set, FrozenSet, NamedTuple, Generator, AnyStr, cast, get_type_hints, NewType, no_type_check, no_type_check_decorator, NoReturn, overload, Text, TYPE_CHECKING)</span><br />I have this copy-pasted in my notes: I replace the star with this and optimise imports in PyCharm and done!*</p><p><br /></p><p>*) Well, mostly, as a lot of these are missing in 3.7, while fun things like <code>Unpack</code> appear in 3.11. So one ends up doing something sneaky, like a monkeypatch early on: subsequent imports of typing will see it, as the variable name of a module points to the same object as is stored in the <code>sys.modules</code> dictionary.</p>
<pre><code>import sys
import typing
import typing_extensions
if sys.version_info < (3, 8):
    typing.TypedDict = typing_extensions.TypedDict
</code></pre>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-65175528012030900852022-10-01T14:23:00.008-07:002022-11-06T02:51:50.963-08:00Move aside coIP Westerns, ColabFold has got this!<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMyjaPp-nwzzMeYglKlRgRCl7QMHUweA7cffdfVgcmrycjJwvFIhcFFiLqAPJdr82P4xgkaQ-glDhaAeS1ESt9t7g7r9ahNnDKZqTtIqH3-8D9Y6xixzXbs5TTKBroLVs6ctazINdbjWHRsOiABsXbSxf0bHvSzvKBzz41l_ZqOJ_ty4n5iASgs5N8/s3592/upgrade-01.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: justify;">
<img border="0" data-original-height="1769" data-original-width="3592" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMyjaPp-nwzzMeYglKlRgRCl7QMHUweA7cffdfVgcmrycjJwvFIhcFFiLqAPJdr82P4xgkaQ-glDhaAeS1ESt9t7g7r9ahNnDKZqTtIqH3-8D9Y6xixzXbs5TTKBroLVs6ctazINdbjWHRsOiABsXbSxf0bHvSzvKBzz41l_ZqOJ_ty4n5iASgs5N8/s320/upgrade-01.png" width="320" />
</a>
</div>
<p style="text-align: justify;">Recently AlphaFold2 released a new batch of models, this time covering all of the TrEMBL sequences in UniProt, resulting in a huge number, which got hashtag-academic-twitter and some news editors very excited about the stamp-collecting feat. Personally, I find it annoying, not because it's pointless, but because, as of writing this, it has swamped any search for a target by name with irrelevant sequences. <br />However, AlphaFold is great for other feats. <br /> I have blogged about it a few times (e.g. <a href="https://blog.matteoferla.com/search/label/alphafold2" target="_blank">link</a>), which gives away my positive view of it! It can predict oligomers, with a lot more precision and confidence than docking. It does not always work, either technically or in supporting the hypothesis. I did a long series of experiments with a hypothesis in mind which wasn't valid in the end (<a href="https://github.com/matteoferla/autophagic-cell-death-complex-models" target="_blank">here</a>), but they revealed novel science, took a few minutes to set up and a few hours to run, and would have taken years if done by Western blot of a co-immunoprecipitation or by cross-linking mass-spec.</p>
<a name='more'></a>
<p style="text-align: justify;">
<i>First a disclaimer: </i>I have never run a Western and the last time I held a pipette I had to be reminded how to hold it (release button rests towards one's thenar webspace), so I am not sure years is accurate.
</p>
<h2 style="text-align: justify;">ColabFold</h2>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;">
<tbody>
<tr>
<td style="text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP8YXxEXfHs2g-wEpO7vhTMktcdBYazUfGAbgKQJ1Wo5Z58W4y7y2kHVpPLNmd7Rk5zTxOXzqcXtTqhOXP0m5BIHaTngImQuRC_Aan47oyvBykM3S-qK9WeG1XhPrzFEBzVQ008gdlXupiI4nEnPvFZVM5VMr2tfAWTgwSAvBLZQa-_PIM3hC02DZ3/s408/colab.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;">
<img border="0" data-original-height="175" data-original-width="408" height="137" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjP8YXxEXfHs2g-wEpO7vhTMktcdBYazUfGAbgKQJ1Wo5Z58W4y7y2kHVpPLNmd7Rk5zTxOXzqcXtTqhOXP0m5BIHaTngImQuRC_Aan47oyvBykM3S-qK9WeG1XhPrzFEBzVQ008gdlXupiI4nEnPvFZVM5VMr2tfAWTgwSAvBLZQa-_PIM3hC02DZ3/s320/colab.jpg" width="320" />
</a>
</td>
</tr>
<tr>
<td class="tr-caption" style="text-align: center;">
<span style="color: #666666;">A corgi wandering on the top of a colab notebook <br />(Atlas for comparison) </span>
</td>
</tr>
</tbody>
</table>
<p style="text-align: justify;">
<a href="https://github.com/sokrypton/ColabFold" target="_blank">ColabFold</a> is a descendant of sorts of AlphaFold's colaboratory notebook, which has a faster MSA step, easier sequence input syntax and more accessible devs. It is meant to be run on a Google Colaboratory notebook, but it totally can be run in a Jupyter notebook or in bash.
</p>
<h4 style="text-align: justify;">Showcasing ≠ everyday use</h4>
<p style="text-align: justify;">Google Colab is a great way to <i>showcase</i> tools, but not necessarily to use them. For Fragmenstein I have made a Colab notebook as a demo for the manuscript and shared it on Twitter. This is a very powerful tool as it's a form of "try before you buy" and is well received (if a picture thumbnail is provided for the initial attention grab). I do not use Colab for ColabFold, but a remote Jupyter-lab notebook: the tool is the same (with some minor changes), which I discuss in a separate blog post TBC. The benefits are tidiness and access to much more compute power; the disadvantage is that a Jupyter notebook does not allow wandering corgwn on the title bar. </p>
<h2 style="text-align: justify;">Spaghetti</h2>
<div class="separator" style="clear: both;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7_f-tvl0nMfv6JBBUrGslih-4wnDpfaM9pOQdaBjd9F0nLJOHvdRHfUDVtOZI0Wu5KY69s_WpCTcLi0Mhq7C-mBDOEDb8ThEYpWP33HGQYqcrE2jT_FXx8YIL-Ybl5U3zehi0QX6V9plwvFfbjpO-DHntqKdQTpEdcuDO5_FYKpFmdtTNNGX64MCx/s2559/histone_spaghetti.png" style="clear: right; display: block; float: right; padding: 1em 0px; text-align: justify;">
<img alt="" border="0" data-original-height="2004" data-original-width="2559" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7_f-tvl0nMfv6JBBUrGslih-4wnDpfaM9pOQdaBjd9F0nLJOHvdRHfUDVtOZI0Wu5KY69s_WpCTcLi0Mhq7C-mBDOEDb8ThEYpWP33HGQYqcrE2jT_FXx8YIL-Ybl5U3zehi0QX6V9plwvFfbjpO-DHntqKdQTpEdcuDO5_FYKpFmdtTNNGX64MCx/s320/histone_spaghetti.png" width="320" />
</a>
</div>
<p style="text-align: justify;">With a model from EBI-AlphaFold2 there are a few things to look out for as <a href="https://blog.matteoferla.com/2021/07/what-to-look-out-for-with-alphafold2.html" target="_blank">discussed here</a>. Briefly, the off-the-shelf model is a monomer, which may not be biologically real, and it may have long low confidence loops dubbed "spaghetti", which may bind "something" (see below). This is exceptionally common in transcription factors, which recruit the mediator complex and can bind it in one of a dozen sections given it is such a big protein. <br /> A key concept to discuss is the confidence metric. The model quality is measured in pLDDT, which is available per structure and per residue, and is complemented by a between-residue metric (the predicted aligned error, PAE). The latter goes into making the green plots in the AF-DB, while the per-residue pLDDT replaces the b-factor column, with values below 70 indicating low confidence. Parenthetically, the quality metric pLDDT is written with a lowercase L for local in its original paper, but l is ambiguous with I, so most places seem to rightfully not adopt the original use. It's rather like in Klingon (tlhIngan Hol), where all i are uppercase, which confuses people, hence why they angrily glare when seeing a message in Klingon on social media (those petaQpu'). </p><h3 style="text-align: justify;">Colour hexes</h3><p style="text-align: justify;">Parenthetically, one may want to copy the colours used by EBI-Alphafold2, which are not CSS primary colours as some snippets online claim. These are: <br />
</p>
<ul>
<li>The blue colour is <span style="color: #0053d6;">#0053D6</span> (pLDDT ≥ 90) </li>
<li>The cyan/teal colour is <span style="color: #65cbf3;">#65CBF3</span> (90>pLDDT≥70) </li>
<li>The yellow colour is <span style="color: #ffdb13;">#FFDB13</span> (70>pLDDT≥50) </li>
<li>The orange colour is <span style="color: #ff7d45;">#FF7D45</span> (pLDDT&lt;50) </li>
</ul>
<p>(It is not a gradient nor is it a linear shift in hue —it would go blue>teal>green>yellow>red if it were). <br />To apply this in PyMOL, the command is: </p>
<pre><code>color 0x0053D6, element C
color 0x65CBF3, b < 90 and element C
color 0xFFDB13, b < 70 and element C
color 0xFF7D45, b < 50 and element C</code></pre>
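<p>Outside of PyMOL (say, for a plot or a web viewer) the same four bands can be written as a small lookup. This is only a sketch: the function name <code>plddt_to_hex</code> is my own invention, and the thresholds are simply the ones listed above.</p>

```python
def plddt_to_hex(plddt: float) -> str:
    """Map a per-residue pLDDT (0-100) to the four-band colour listed above."""
    if plddt >= 90:
        return '#0053D6'  # blue
    elif plddt >= 70:
        return '#65CBF3'  # cyan/teal
    elif plddt >= 50:
        return '#FFDB13'  # yellow
    return '#FF7D45'  # orange


for value in (95.0, 90.0, 72.5, 50.0, 12.3):
    print(value, plddt_to_hex(value))
```

<p>The bands are discrete on purpose, matching the note above that the scheme is not a gradient.</p>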
<p>To make loops more sausage-like the command <code>set cartoon_loop_radius, 1</code> will do the trick.</p><p>However, there is more to pLDDT than pretty pictures. In fact, protein-protein complexes may involve these spaghetti loops and, if they do, the span involved will likely have a confidence in the 40–70 range, while the rest of the spaghetti loop is below 40, as we will see.</p>
<h2 style="text-align: justify;">The unexplored complexes</h2>
<p style="text-align: justify;">Having gone over pLDDT, let's talk about complexes.<br /><br />I often get asked what does a given residue do. I streamlined this task into <a href="https://venus.cmd.ox.ac.uk/venus" target="_blank">Venus</a>, but all too often I end up with a surface residue, where I have to hunt down the binding partner. I will admit that I do not do all the reading I am meant to do, but instead read the blurb in the Uniprot entry, read an abstract or two, search for figures of that protein and look on PDB for what models in complex are available. </p>
<p style="text-align: justify;">Parenthetically, the latter step can be slow unless optimised, but in my opinion one should not overdo it as one loses oversight. It can be automated with the packages pypdb, requests (for the simple PDBe API), Bio.pairwise2 and pymol2 in Python, but that is beyond the scope of this post and is tricky anyway (species, ranges etc.), so doing it semi-manually is actually more convenient. What I do is go to RCSB PDB in a browser, search for the protein name, click on the tabular report dropdown, choose <code>custom</code> and add only <code>Macromolecule Name</code> (or with <code>Molecular Weight (Entity)</code> if I don't want peptides). This gives a table of PDB entries and the proteins they contain. If needed Uniprot data can come from <a href="https://www.ebi.ac.uk/pdbe/api/doc/sifts.html" target="_blank">SIFTS which is easy to query automatedly</a>. The chosen parts can be assembled into a PSE with the Python package pymol2, which then can be opened for viewing with the PyMOL binary. Even when including paralogues and checking SWISSMODEL (and stealing partners from the templates) one however often gets only a limited number of interacting partners. Luckily, AlphaFold2 and friends save the day. A nice example of a complex series of interactions is the ubiquitination pathway in Eukaryotes discussed below. </p>
<div style="text-align: justify;">Before AlphaFold2, the main tool was <i>protein-protein docking</i> —cue initial brass fanfare from Richard Strauss's Also Sprach Zarathustra.</div>
<div style="text-align: justify;">Docking consists of finding the conformation of the complex with the lowest potential energy. A series of progressively weaker poses is returned and, ideally, the top ten should give a consistent picture and a very low ∆G_bound. The former is rarely true and the latter is a number that is very hard to contextualise: remembering that one hydrogen bond is 1 kcal/mol and a salt bridge is 2 kcal/mol helps, but not by much, as larger proteins give larger values.</div>
<div style="text-align: justify;">The problem is that proteins are somewhat flexible and many interactions are water mediated. The classic example is barnase+barstar: the interaction is really strong (–19 kcal/mol), but involves lots of water mediated hydrogen bonding. Googling images for this pair will reveal endless graphs showing how complicated this is to predict. ColabFold on the other hand is different. I hate to say it, but it does not do molecular thermodynamics and instead revolves around residue covariance and a well trained model. Consequently, it will spot the residues that show epistasis, but will not give a damn if they form water mediated bonding or not. In fact, without the final AMBER step one often gets overlapping sidechains at the interface. With ColabFold a dimer will either resolve consistently across models or will be floating in random locations, with cases in between being rare.</div>
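<div style="text-align: justify;">One way to contextualise a binding free energy is to convert it to a dissociation constant with the textbook relation ∆G = RT ln(Kd) at a 1 M standard state. The sketch below does just that for the numbers quoted above; the function name is hypothetical and 25 °C is an assumed temperature. It only makes sense for real free energies: a raw docking score is not guaranteed to be one.</div>

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol·K)


def dissociation_constant(delta_g: float, temperature: float = 298.15) -> float:
    """Kd in mol/L from a binding free energy in kcal/mol (negative = favourable)."""
    return math.exp(delta_g / (R_KCAL * temperature))


examples = [('one hydrogen bond', -1.0),
            ('one salt bridge', -2.0),
            ('barnase+barstar', -19.0)]
for label, dg in examples:
    print(f'{label}: {dg} kcal/mol, Kd of about {dissociation_constant(dg):.1e} M')
```

<div style="text-align: justify;">Because the relation is exponential, every extra 1.4 kcal/mol or so tightens the Kd roughly tenfold at room temperature, which is why small errors in ∆G matter so much.</div>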
<p></p>
<p style="text-align: justify;">In <a href="https://blog.matteoferla.com/2021/08/tweaking-alphafold2-models-with.html" target="_blank">another post</a> and in the <a href="https://colab.research.google.com/github/matteoferla/pyrosetta_help/blob/main/colab_notebooks/colab-pyrosetta-dimer.ipynb" target="_blank">associated notebook</a>, I talk about calculating the Gibbs free energy of the interface. Molecular thermodynamics calculations definitely come in handy to test certain hypotheses. For example, in a model I made of a certain protein the interaction was 5/5 consistent except for a bit of change in angle for two out of five due to an everted loop. When the models were phosphorylated with PyRosetta using data from <a href="https://www.phosphosite.org/" target="_blank">PhosphoSitePlus</a>, it yielded a stronger interface for the everted loop. This is interesting: after all, showing alternate conformations is a main strength of AlphaFold2. </p>
<h3 style="text-align: justify;">Ubiquitination, an example</h3>
<p style="text-align: justify;">The ubiquitination system is a great example of the power and limits of AlphaFold2. In the canonical ubiquitin-conjugation system, the C-terminus of ubiquitin is ligated to the sidechain of a lysine via a first enzyme E1 in an ATP dependent manner, which is then transamidated to a carrier E2, which passes it to E3 which passes to the target protein, which is generally presented thanks to an adaptor protein. ColabFold will make great E1+ubiquitin, E2+ubiquitin, E2+E3+ubiquitin and E3+ubiquitin models, wherein generating 5 models will result in very consistent binding positions (&lt;3 Å RMSD). A problem arises with substrates and adaptors: the diverging paralogues bind different targets, which will not be linked together correctly in the MSA, resulting in loss of their covariance signal.</p>
<p style="text-align: justify;">Recently, I was asked whether a given residue is an interface residue with a given binding partner. Autophagy is controlled by a series of proteins (with names in the form ATG👾, where 👾 is a number), which act like the ubiquitination system, with ATG12 as the ubiquitin tag, although it is actually an adaptor. ATG7 is the E1 ligase (ATP-dependent first step), ATG10 is the carrier to which the tag is passed, and ATG5 is the E2 ligase, which accepts the tag. However, ATG5 does not pass it to a final substrate-binding E3 ligase, but instead binds to a localisation scaffold (ATG16) and a target (ATG8 via ATG3) in order to bring it into place for myristoylation, thus altering membrane curvature. This process is inhibited/interfered-with by ASPP members, which have an internal ubiquitin-like domain, which is known to compete with ATG12 for binding to ATG5. So I started off looking at diagrams and testing a few things, which led me to this model:</p>
<div class="separator" style="clear: both; text-align: justify;">
<a href="https://github.com/matteoferla/autophagic-cell-death-complex-models/raw/main/ATG-system.png" style="margin-left: 1em; margin-right: 1em;">
<img border="0" data-original-height="116" data-original-width="800" height="58" src="https://github.com/matteoferla/autophagic-cell-death-complex-models/raw/main/ATG-system.png" width="400" />
</a>
</div>
<div style="text-align: justify;">
<br />
</div>
<p style="text-align: justify;">In a first model I was not expecting ATG10 to bind where ATG16 binds, but that makes sense and is possible thanks to the long C-terminal domain of ATG12 (and ubiquitin). Curiously, this means that ATG7 can still bind when ASPP is bound as it is on the opposite side. </p>
<div class="separator" style="clear: both; text-align: justify;">
<a href="https://github.com/matteoferla/autophagic-cell-death-complex-models/raw/main/ATG7_CTD-ASPP2_NTD.png" style="margin-left: 1em; margin-right: 1em;">
<img border="0" data-original-height="670" data-original-width="800" height="335" src="https://github.com/matteoferla/autophagic-cell-death-complex-models/raw/main/ATG7_CTD-ASPP2_NTD.png" width="400" />
</a>
</div>
<p style="text-align: justify;">This all sounds great. However, things do not often go as hoped. A common case is that the proteins are not sticking together or, if they are, they are not sticking consistently.</p>
<div class="separator" style="clear: both; text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiOjeSQ5pqawo8k0Rryh5fjOLsrxevA0Odlf4F__jsILj0zSyNMT4DFxtLso4MQddYfMzKaVX6gyLy9ctVJZnvnCe35i_PSsm0ethyM3BpeeUTkDqsxF7FeiqYY7vhGaGwx0xB-0FYncW9kPb4FZNHod0O2iuQYtoEvSEAUrQHg3BTLn7PyPZzsUHA/s1280/away.png" style="margin-left: 1em; margin-right: 1em;">
<img border="0" data-original-height="1006" data-original-width="1280" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiOjeSQ5pqawo8k0Rryh5fjOLsrxevA0Odlf4F__jsILj0zSyNMT4DFxtLso4MQddYfMzKaVX6gyLy9ctVJZnvnCe35i_PSsm0ethyM3BpeeUTkDqsxF7FeiqYY7vhGaGwx0xB-0FYncW9kPb4FZNHod0O2iuQYtoEvSEAUrQHg3BTLn7PyPZzsUHA/s320/away.png" width="320" />
</a>
</div>
<h2 style="text-align: justify;">Troubleshooting</h2><p style="text-align: justify;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBQhCusmQPjhNWHS_JeccHh9wfa3YkjFkJOoA8yiO2Rmd_8siDyIoa9beiab1Shx_cT-3L2JsDBeGwPLwSI9MrGVtaoCT1jmv3BokS43kbKkJzfZPrOXRQFiphSO0hJC23hup8WVfXFjN-hiMHcOBLWdEpE_fdL5_FZmp91gUTCdrNCgzcdOoMiicD/s2326/colabfold_bingo-01.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="2326" data-original-width="2077" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBQhCusmQPjhNWHS_JeccHh9wfa3YkjFkJOoA8yiO2Rmd_8siDyIoa9beiab1Shx_cT-3L2JsDBeGwPLwSI9MrGVtaoCT1jmv3BokS43kbKkJzfZPrOXRQFiphSO0hJC23hup8WVfXFjN-hiMHcOBLWdEpE_fdL5_FZmp91gUTCdrNCgzcdOoMiicD/s320/colabfold_bingo-01.png" width="286" /></a></div>I do not know what works best for a given situation, but there are three routes:<p></p>
<p style="text-align: left;"></p>
<ul style="text-align: left;">
<li style="text-align: justify;">Tinker with the sequence</li>
<li style="text-align: justify;">Tinker with the settings</li>
<li style="text-align: justify;">Tinker with the MSA</li>
</ul><div style="text-align: justify;">In fact, I made a joke bingo card to check off the things tested. Most of the cells are jokes. Whereas the omamori from the Kanda shrine in Akihabara, Tōkyō, are phenomenal, I would not consider going to the effort of sourcing them in Europe a good strategy...</div><div style="text-align: justify;">Reinstalling the environment is a good tactic if there's an actual error, but generally with everything tensorflow related, a seemingly innocuous package installation can inexplicably result in the waste of a good chunk of time firefighting, so a reinstall is not too wise.</div>
<h3 style="text-align: justify;">Sequence: remove spaghetti</h3>
<div style="text-align: justify;">The longer a sequence is the more places things can go wrong, so removing end domains that do not matter or removing spaghetti loops can help. For example, here is a complex (unrelated to ubiquitination), where the accuracy goes up by trimming the tail spaghetti (5 models):</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiov6YiO23TJo-DMyBEJe7FrzUm2dbN3zgo98PyjQ9AUviPDX7DXSHNqK6Mt_tc9USVQg1J_tWkFLCHzE_9QWJfmWeMheYdwOq6N_wk4wSO29JcnC1VniJN1LHBNRnYIy-c_DmkiO0l_yj0f-al54IWuXsFZuOtCE0Ms-3o0jNXOk0f0OMzR3BVYgk/s1500/trimmed.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="1500" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiov6YiO23TJo-DMyBEJe7FrzUm2dbN3zgo98PyjQ9AUviPDX7DXSHNqK6Mt_tc9USVQg1J_tWkFLCHzE_9QWJfmWeMheYdwOq6N_wk4wSO29JcnC1VniJN1LHBNRnYIy-c_DmkiO0l_yj0f-al54IWuXsFZuOtCE0Ms-3o0jNXOk0f0OMzR3BVYgk/s320/trimmed.jpg" width="320" /></a></div><br /><div style="text-align: justify;"><br /></div>
<p style="text-align: justify;">If the spaghetti loop is in the middle of the protein, then there are four options:</p><p style="text-align: justify;"></p><ul><li>Split the protein in two separate peptides</li><li>Replace a long span of residues with a series of glycines (gap distance divided by 3.4 Å with an extra 20–50% should suffice)</li><li>Edit the MSA, as an A3M has insertions simplified: very complicated</li><li>Use an older colabfold notebook, wherein cuts within a protein were marked with a /: this is a bad idea</li></ul><div>As mentioned earlier not all spaghetti is junk. Some may be involved in protein-protein interactions. These represent a very small fraction of the PDB so AF2 performs badly at predicting these, but can still manage in some cases, followed by cycles of refinement. AF2, however, can possibly improve on PDB models. For example here is a certain protein bound to tubulin alpha and tubulin beta as predicted by ColabFold: in the PDB model, a dodecapeptide is bound, while in the models a 24-mer (icositetrapeptide) is bound, although in 2 of the 5 models different residues are bound (in the same way): the sequence is a repeat, so in reality it would bind 4 tubulin monomers.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9jS525Q5IoBIXn_c_IlUxblxfm52EaMZuNxJ3Di0bV-aH1rLZjhYu3owiDmKn1ndTuw6TPmGLKoQBb2oHS5MnjZ1aL_uGveL9XxQMVyg5t0XyUGbMMkfFRHkIo94Hbr4PPvY0vaXQEKF8d52YLqiVFSIAJdiDbyRlDhzrjvPUCIh6RDjlEO1Pe2Dl/s1038/bound.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="780" data-original-width="1038" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9jS525Q5IoBIXn_c_IlUxblxfm52EaMZuNxJ3Di0bV-aH1rLZjhYu3owiDmKn1ndTuw6TPmGLKoQBb2oHS5MnjZ1aL_uGveL9XxQMVyg5t0XyUGbMMkfFRHkIo94Hbr4PPvY0vaXQEKF8d52YLqiVFSIAJdiDbyRlDhzrjvPUCIh6RDjlEO1Pe2Dl/s320/bound.png" width="320" /></a></div><br /><h4 style="text-align: 
left;">Linguine</h4><div>In isolation the binding sequence is a disordered spaghetti loop, but as mentioned there seems to be a difference in confidence in these regions, and in fact other disordered regions with a pLDDT higher than 40 bind other targets. I call these linguine loops:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguFWoYaNM82usE0rvkTIg8oYcz6L1XlQswhJa8DCo3A4SwWM0gU_CB5JsjaMsmn2CzcU58ZrKjRPXBzR-3fXzgDhkQoXgnhhllP5rSaNokBNhcY5mWfA_R-AjC-14hT4t2rqdXStLHZ_L3OEqCWZPo1z2_HN9aLOylfaY2_itrwOxSFh8SchE1Ff1d/s2559/spaghetti.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2004" data-original-width="2559" height="251" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguFWoYaNM82usE0rvkTIg8oYcz6L1XlQswhJa8DCo3A4SwWM0gU_CB5JsjaMsmn2CzcU58ZrKjRPXBzR-3fXzgDhkQoXgnhhllP5rSaNokBNhcY5mWfA_R-AjC-14hT4t2rqdXStLHZ_L3OEqCWZPo1z2_HN9aLOylfaY2_itrwOxSFh8SchE1Ff1d/s320/spaghetti.png" width="320" /></a></div><div><br /></div><p></p><h4 style="text-align: justify;">Interdomain loops</h4><div>A curious case is interdomain loops. If the pLDDT is very low on the segment removed then there is little chance of harm. The only consideration is that the remaining parts ought to be able to close up: measuring the distance in PyMOL between the good residues (edit mode>click on the two atoms or type <code>print(cmd.get_distance('name CA and resi 👾', 'name CA and resi 👾'))</code>), dividing it by 3.6 and multiplying that by 1.5 should give a rough number of residues required.</div><div><br /></div><div>Another caveat is that if the alignment is generated by blasting and only the first HSP is taken per hit, even if the HSP are non-overlapping, there will be a bit of the alignable hit missing. MMSeqs2 does not use Blast, but HMM (cf. 
HH-suite) on a pregenerated dataset of clusters so this does not apply.</div><div><br /></div><div>The major catch with removing termini or removing inner loops is that it becomes complicated to keep track of the sequence when one wants to convert the model back to the correct numbering.</div><div>To correct the numbering in PyMOL, say in chain B by 100 residues, one can do <code>alter chain B, resv+=100</code>, followed by <code>sort</code>. If one has forgotten what the protein splicing was, then a global alignment can do the trick, as happens in this function: <a href="https://pyrosetta-help.readthedocs.io/en/latest/_modules/pyrosetta_help/common_ops/utils.html#correct_numbering">link</a>.</div><h4 style="text-align: left;">Splicing</h4><div>A detail to note about interdomain loops is that these may arise because a non-biologically relevant splicing isoform was chosen.</div><div>When a non-biologically relevant isoform is used, the differing sequence will be a disordered spaghetti.</div><div>Uniprot is nice for choosing the isoform which is most frequently found in transcriptomics experiments, while Ensembl chooses the longest, which may not be relevant.<br />The <a href="https://gtexportal.org/home/" target="_blank">database GTex</a> is a really nice resource for double-checking this, as it shows the abundance of mRNA reads spanning junctions and which exons are found in human tissues (as the genes can be big and the reads small it does require some squinting at times). To play around with the sequences of all the possible isoforms in Python the pyensembl package, which works off local data, can help. For example, here are their sizes:</div>
<pre><code>
# remember to download the local data first (notebook shell escape)...
!pyensembl install --release 77 --species homo_sapiens
from Bio import SeqUtils
from pyensembl import EnsemblRelease

data = EnsemblRelease(77)
ts = data.transcript_ids_of_gene_name('ALG13')
# keep only the transcripts that encode a protein
transcripts = [t for t in map(data.transcript_by_id, ts)
               if t is not None and t.protein_sequence is not None]

def get_mass(t):
    # unknown residues (X) are treated as alanine for the mass calculation
    return SeqUtils.molecular_weight(t.protein_sequence.replace('X', 'A'), 'protein')

for t in sorted(transcripts, key=get_mass):
    # transcript id, length in residues, mass in kDa
    print(t.transcript_id, len(t.protein_sequence), get_mass(t)//1_000)</code></pre>
<p style="text-align: justify;">A side-effect of this is that one would want the other sequences in the MSA to be correct or else the quality will be poor.</p><h3 style="text-align: justify;">Sequence: try different domains separately</h3><p style="text-align: justify;"><span style="text-align: left;">In general, longer proteins will require greater resources. Therefore, splitting a protein into domains is a very solid plan. In terms of complexes, this may help, although it will result in different MSAs, which can be counterproductive for repeat proteins.</span></p><h3 style="text-align: justify;">Settings</h3>
<div style="text-align: justify;">One can also increase the number of recycles, <i>i.e.</i> how many times the prediction is fed back into the model, which is generally considered a good choice. The AlphaFold2 paper performs 3 recycles, but one can do a lot more. These simply increase the time requirement (linearly I would guess).</div><div style="text-align: justify;"><br /></div>
<div style="text-align: justify;">There is also the option to use templates in ColabFold. These are for specific applications (e.g. alternative conformations or novel structures), not for giving ColabFold a hand. ColabFold has to rediscover structures it already predicted, there is no circumventing that: it actually runs slower with templates. And I suspect it uses them in a fragment-based approach, because when I altered all non-glycine/proline residues to alanine in a ubiquitin structure (PDB:1UBQ, <a href="https://gist.github.com/matteoferla/c318502491cca500880a03e493bcf7cc" target="_blank">file of the heretical structure</a>) and used its sequence as sole alignment I got the same helical structure repeated:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_sU1xOHUzZuRR2YuMtNyerAdExFo6nav87guxXnZeuNTIyCPfaALcWiQJ5PEi22RXmGbY__GgDoe-BpG_5lqgSsvkSXQL5NSMcsSxrDWNPhHPPlWcreHJXbEAP0m9J6mfOYXii1PyUWoSbU8JgUFMjSJIJScXccqJMU8k0aC4cXEqiNP2lPGELARZ/s2559/templates_test.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="903" data-original-width="2559" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_sU1xOHUzZuRR2YuMtNyerAdExFo6nav87guxXnZeuNTIyCPfaALcWiQJ5PEi22RXmGbY__GgDoe-BpG_5lqgSsvkSXQL5NSMcsSxrDWNPhHPPlWcreHJXbEAP0m9J6mfOYXii1PyUWoSbU8JgUFMjSJIJScXccqJMU8k0aC4cXEqiNP2lPGELARZ/s320/templates_test.png" width="320" /></a></div>Personally, I have not had much luck with templates, but I am told they could help.<h3 style="text-align: justify;">MSA</h3>
<div style="text-align: justify;">The MSA is a key part and can play a massive role. In <a href="https://github.com/matteoferla/Snippets-for-ColabFold" target="_blank">a GitHub repository I have put a few snippets for working with A3M files</a>, which can help. Here are some key take-home ideas.</div><div style="text-align: justify;"><br /></div><h3 style="text-align: justify;">A3M</h3><div style="text-align: justify;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIjMv9BQ1n31rnnkjDoAu1cL20N6TddsgtqlqQEA4Fa3W4e-joo2WFSXpr4_JNy5WaBNjVNWNRnwmsAXJwkbLMs4Yq2jKvDJIwJy-wy3xwWecNw9DKalKBJV0jMJg21zuw93trt-cKklbkjBQP84tAjqgZBVqvE2dTt66c2q3ElOCt8RzuwbtdiDLW/s854/Spanish_Inquisition.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="640" data-original-width="854" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIjMv9BQ1n31rnnkjDoAu1cL20N6TddsgtqlqQEA4Fa3W4e-joo2WFSXpr4_JNy5WaBNjVNWNRnwmsAXJwkbLMs4Yq2jKvDJIwJy-wy3xwWecNw9DKalKBJV0jMJg21zuw93trt-cKklbkjBQP84tAjqgZBVqvE2dTt66c2q3ElOCt8RzuwbtdiDLW/w200-h150/Spanish_Inquisition.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="color: #999999;">Cardinality of a protein<br />does not involve RC clergy</span></td></tr></tbody></table>An A3M alignment looks like a normal Fasta formatted alignment, but with two differences.</div><div style="text-align: justify;"><ul><li>For complexes, the first line starts with a hash and is immediately followed by the comma-separated lengths of the peptides, then a tab, then the comma-separated cardinality (oligomeric number) of the peptides.</li><li>Insertions in the non-target sequences are marked with lowercase letters and the target sequence does not have gaps for them.</li></ul><div>To convert a regular
Fasta alignment to an A3M one, there is a Perl script in the HH-suite repo:</div>
<pre><code>wget https://github.com/soedinglab/hh-suite/raw/master/scripts/reformat.pl
perl reformat.pl fas a3m out_al.fasta out.a3m</code></pre>
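<p>The two differences listed above can be handled in a few lines of Python. What follows is only a minimal sketch: the function names <code>parse_a3m_header</code> and <code>strip_insertions</code> and the example strings are made up for illustration and are not part of ColabFold or of the snippets repository.</p>

```python
from typing import List, Tuple


def parse_a3m_header(line: str) -> Tuple[List[int], List[int]]:
    """Split a '#len1,len2<TAB>card1,card2' header line into two lists of ints."""
    if not line.startswith('#'):
        raise ValueError('not a complex header line')
    lengths, cardinalities = line[1:].strip().split('\t')
    return ([int(n) for n in lengths.split(',')],
            [int(n) for n in cardinalities.split(',')])


def strip_insertions(aligned: str) -> str:
    """Drop the lowercase (insertion) letters so the row lines up with the target."""
    return ''.join(c for c in aligned if not c.islower())


# made-up header: two peptides of 59 and 68 residues, as a 1:2 complex
lengths, cardinalities = parse_a3m_header('#59,68\t1,2')
print(lengths, cardinalities)          # [59, 68] [1, 2]
print(strip_insertions('MKt-LVaaG'))   # MK-LVG
```

<p>Once the lowercase letters are removed, every row has the same number of columns as the target sequence, which is handy for any column-wise tally.</p>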
<h4 style="text-align: justify;">Uniprot</h4><div style="text-align: justify;">One detail that should be noted is that in a ColabFold-MMSeqs2 MSA the sequences are Uniprot and/or environmental sequences. Uniprot is really good for human proteins, as these are curated, including via user submissions: I have messaged about a few mistakes I have spotted and within a week I got a human thank-you reply and the page was altered. For other organisms this is less so, to the point that I have found NCBI much better in terms of gene length and presence. This is because, as far as I understand, a genome sequencing project for a non-human organism is rushed through the doors and is not subsequently re-refined. A case example is the lamprey, which featured heavily in <a href="https://blog.matteoferla.com/2021/10/multiple-sequence-alignments.html" target="_blank">my post about anthropocentric MSAs</a>. Briefly, this fish is basal to bony fish and tetrapods and its divergence predates two (tetrapods) or more (ray-finned fish like zebrafish) genome duplications. This ugly chap is useful because it has fewer paralogues, but is very poorly annotated in Uniprot.</div><div style="text-align: justify;">A problem is that the Uniprot accessions are not too helpful for a human reader. Here is how one can get images of what is what thanks to Wikipedia using the snippets in the mentioned GitHub repository:</div>
<pre style="text-align: left;"><code>from IPython.display import display
from gist_import import GistImporter
import operator
import functools
import pandas as pd
import ipyplot
# get codeblocks
gi = GistImporter.from_github('https://github.com/matteoferla/Snippets-for-ColabFold/blob/main/analyse_a3m.py')
AnalyseA3M = gi['AnalyseA3M']
gi = GistImporter('313b5c7e1845f36205b4b9dcf05be10d')
WikiSpeciesPhoto = gi['WikiSpeciesPhoto']
a3m = AnalyseA3M('👾👾👾👾.a3m')
a3m.load_uniprot_data()
df: pd.DataFrame = a3m.to_df()
wikis:pd.Series = df.species.apply(functools.partial(WikiSpeciesPhoto, catch_error=True, retrieve_image=False))
urls:pd.Series = wikis.apply(operator.attrgetter('image_url'))
pretties = wikis.apply(operator.attrgetter('preferred_name'))
ipyplot.plot_images(urls.loc[~urls.isna()].values,
pretties.loc[~urls.isna()].values,
img_width=150)</code></pre>
<div class="separator" style="clear: both; text-align: center;"><a href="https://github.com/matteoferla/Snippets-for-ColabFold/raw/main/animals.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="364" data-original-width="574" height="203" src="https://github.com/matteoferla/Snippets-for-ColabFold/raw/main/animals.png" width="320" /></a></div><h3 style="text-align: justify;">Paralogues</h3>
<p style="text-align: justify;">The loss of signal from diverging paralogues ought to be correctable, but would require a lot of time to set up: I have neither found an example of this being achieved nor set it up myself. The MSA goes quite far back, allowing the conserved core of the protein to be resolved, but one can make a shallower one. In <a href="https://blog.matteoferla.com/2021/10/multiple-sequence-alignments.html" target="_blank">another post</a>, I discuss eukaryotic species choice and mention that for non-model organisms the number of missing genes or genes with shorter transcripts is rather high: for this MSA this does not matter, whereas genetic diversity does, i.e. mouse vs. human does little, whereas a series of fish become very helpful unless the double genome duplication gets in the way.<br />One rather crude way to remove paralogues is to filter out of the MSA all genes whose name does not match a given criterion. From the GitHub repo of snippets:</p>
<pre><code>from IPython.display import display
from gist_import import GistImporter
import pandas as pd
gi = GistImporter.from_github('https://github.com/matteoferla/Snippets-for-ColabFold/blob/main/analyse_a3m.py')
AnalyseA3M = gi['AnalyseA3M']
a3m = AnalyseA3M('👾👾👾👾.a3m')
a3m.load_uniprot_data()
omnia = a3m.to_df()
boney = a3m.to_boney_subset()
print(f'{len(boney)} out of {len(omnia)} are tetrapods & boney-fish')
a3m.display_name_tallies(a3m.to_boney_subset())
display(boney.sample(5))
# And do whatever filtering:
# names are messy...
cleaned = boney.name_B.str.lower()\
               .str.replace(r'( ?\d+)$', '', regex=True)\
               .str.replace(r' proteinous', '')\
               .str.replace(r' protein', '')\
               .str.replace(r'📛📛', '📛📛2')\
               .str.strip()
# `filtro` is a boolean mask of one's own choosing: this one is only a placeholder
filtro: pd.Series = cleaned.notna()
subsetted: pd.DataFrame = boney.loc[filtro & (boney.name_C == 'BH3 interacting domain death agonist')]
a3m.dump_a3m(subsetted, '👾👾👾👾_filtered.a3m')</code></pre>
<p>Doing this on a ColabFold-MMSeqs2 A3M file will result in very few entries, which will not work. As a result one may have to resort to making one's own MSA.</p><h3 style="text-align: left;">Custom MSA</h3><p>Honestly, I have not looked into creating a custom database for MMSeqs2 or its predecessor HH-suite, but I have used the classical method of simply doing a blast or psiblast search in NCBI and making the MSA from that: there's a class for this in the repo.
</p><h3>Removing diversity from known irrelevant parts</h3><p>For a transmembrane protein, I knew which face was used for binding; however, there is no official way to add constraints. Consequently, I changed all non-gap residues in the non-target sequences in the alignment to be the same as the residue in the target sequence. I did not get the result I was hoping for (a complex on that side); instead, the protein did not fold as I hoped even when I gave a template. This suggests that the other residues were not interfering with the signal, but rather that the signal was not strong enough: inspection revealed that there were only three vertebrate pairs, so I am not surprised...</p><p>About the inability to impose a constraint/restraint, officially there are no reports, but I am guessing that if one hypothetically vandalised the residue-residue pair edges/representations in an iteration/recycle it ought to be possible, albeit mathematically complicated. From what I can tell one would need to have the time to monkeypatch the <code>AlphaFoldIteration</code> class in <code>alphafold.model</code>, which is a haiku Module (=abstract class) subclass used by AlphaFold for each recycle, which in turn uses <code>EmbeddingsAndEvoformer</code>, which does the actual mathemagic. So it looks possible but very painful. This brings me back to the original point from the lead: a headline-grabbing experiment may not actually be the most useful implementation, but is an easy one, and the reverse is also true: the most useful implementation may not be easy and may be rather dull... and as a result one ends up running through every possible settings tweak to get a good result.</p><p><br /></p><p>I say official as there might be an unofficial way, but I have not tried it or heard of it being tried.</p><p><br /></p><h1>Footnote about my run</h1>
<h3>My runs</h3>
<p>The installation is well documented within ColabFold and is straightforward. Three things are worth noting though:</p>
<h4>Out of date info</h4><p>Things change fast, so be vigilant about the age of posts: ignore any suggestion to install <code>tensorflow-gpu</code> with pip for example.</p>
<h4>tensorflow</h4>
<p>The tensorflow package(s) can be installed at the latest version: this will just give deprecation warnings, but works fine. There are two flavours, <code>tensorflow-cpu</code> and <code>tensorflow-gpu</code>. Installing the latter on a system with only CPUs will result in a fallback to CPU.</p>
<h4>CPU</h4><p>jax and tensorflow can be installed CPU-only via <code>pip install "jax[cpu]" tensorflow-cpu</code> without issues.</p>
<h4>GPU</h4><p>GPU is a different problem. The package <code>tensorflow-gpu</code> will either install fine or be a nightmare. The best way to install it is with conda or mamba (vide infra), which requires CUDA drivers and the CUDA toolkit. A thing to note is that AlphaFold2 works with CUDA version 11 drivers and not version 10: this is not the end of the world, because one can now install them within the conda environment (a must on a cluster where one does not have root access) with
<code>conda install -c nvidia -y cuda</code>. However, the CUDA toolkit needs to match the CUDA drivers, so installing everything that uses CUDA at once would be ideal. The dependency resolution gets slow with conda, hence the mention of mamba, which is a drop-in replacement and works like conda:</p>
<pre><code># to be safe
conda update -n base -c defaults conda -y;
# mamba!
conda install mamba -n base -c conda-forge;
# note the -c nvidia is for cuda
# while -c conda-forge is for tensorflow-gpu, jax, cudnn and cudatoolkit.
# openmm=7.5.1 and pdbfixer are just for the AMBER step, but might muck up the installation if pip installed separately
mamba install -c nvidia -y -c conda-forge cuda tensorflow-gpu jax cudnn cudatoolkit openmm=7.5.1 pdbfixer;</code></pre>
<h4>OpenMM</h4>
<p>The <code>alphafold</code> module/namespace required is version 2.2.0, as found on DeepMind's GitHub or pip installed with <code>alphafold-colabfold</code>, as opposed to <code>alphafold</code>, which is version 2.0.<br/>
Taking a step back, the module <code>alphafold</code> in AlphaFold2 is from the package <code>alphafold</code>, while in ColabFold it is from <code>alphafold-colabfold</code>. Unfortunately, the two have diverged, so the latter works only for ColabFold and does not act as a drop-in replacement in AlphaFold. If random errors happen, the package corresponding to the module can be checked thus:</p>
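<pre><code>import importlib_metadata
from typing import List, Dict
# maps each importable top-level module to the package(s) that provide it
module2packages: Dict[str, List[str]] = importlib_metadata.packages_distributions()
print(module2packages['alphafold'])</code></pre>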
<p>The package <code>alphafold=2.2.0</code>, the latest and the one used by ColabFold, requires <code>openMM=7.5.1</code>: anything higher (7.6) will break it.<br/>
Python cannot be 3.10 or higher because of one of the dependencies as of November 2022.</p>
<p>
NB. The following is no longer true!<br/>
As <a href="https://blog.matteoferla.com/2020/11/remote-notebooks-and-jupyter-themes.html">discussed in a past post</a>, I used to run JupyterLab on a private 32-CPU 250 GB node in a cluster intended for genome sequencing, which I can use while it is idle and whence I ssh-port-forward my jupyter-lab. From there I can also launch jobs to the cluster via a regular SGE job submission. <br />My colabfold notebook run is mostly the same, except that a few cells were removed, such as installs, and any <code>google.colab</code> function was circumvented. In the regular run the function <code>colabfold.batch.run</code> gets called, which is the same call (after argument-parsing) as for the bash command <code>colabfold_batch</code>. I use the latter for SGE job submission, wherein roughly the following command gets called: </p>
<pre><code>$CONDA_PREFIX/bin/colabfold_batch foo.a3m foo --cpu --model-type AlphaFold2-multimer-v2 --data 'data' --pair-mode 'unpaired+paired'</code>
</pre>
<p>The variable <code>$CONDA_PREFIX</code> is the path to the active Conda environment (<code>$CONDA_DEFAULT_ENV</code> is its name): both are only available once a Conda environment is activated. In the notebook, the corresponding call is roughly:</p>
<pre><code>from pathlib import Path
from colabfold.batch import run

run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    custom_template_path='templates' if use_templates else None,
    use_amber=False,
    msa_mode='custom',
    model_type=model_type,
    num_models=num_models,
    num_recycles=num_recycles,
    model_order=[1, 2, 3, 4, 5],
    is_complex=is_complex,
    data_dir=Path("../ColabFoldData"),
    keep_existing_results=False,
    recompile_padding=1.0,
    rank_by="auto",
    pair_mode="paired",
    stop_at_score=float(100),
    #prediction_callback=prediction_callback,
    dpi=dpi
)
</code>
</pre>
<p>One thing worth noting is that it takes up all the CPUs available, so in the case of a run not as a cluster job, running in a cell <code>os.nice(19)</code> can be handy (ColabFold gets less priority over other tasks). </p>
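<p>As a minimal sketch of the above (Unix only; the helper name <code>lower_priority</code> is made up for illustration), the niceness can be raised in a cell before the run is started. Note that an unprivileged process generally cannot lower its niceness again afterwards, so this is a one-way street for that kernel.</p>

```python
import os


def lower_priority(increment: int = 19) -> int:
    """Add `increment` to the niceness of the current process and return the new value."""
    # os.nice is only available on Unix; the value is capped by the operating system
    return os.nice(increment)


niceness = lower_priority()
print(f'Niceness is now {niceness}')
```

<p>Calling <code>os.nice(0)</code> returns the current niceness without changing it, which is a quick way to check that the cell was actually run.</p>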
<p>I prefer using a notebook as I need to retrieve sequences, trim sequences, write notes <i>etc</i>. For example given a uniprot id in <code>uniprot</code> one can retrieve the sequence simply with: </p>
<pre><code>import requests
# drop the FASTA header line and join the remaining sequence lines
seq: str = ''.join(requests.get(f'https://rest.uniprot.org/uniprotkb/{uniprot}.fasta').text.split('\n')[1:])</code></pre>
<p>Like most clusters, the compute nodes available to me are not exposed to the internet. As a consequence, I need to do a two-step process: run ColabFold in my Jupyter node (which has internet) asking for zero models, which will generate an A3M MSA file, which I can then use for inference in a node without internet by submitting a 'custom' MSA job.<br /><br />If <code>num_models</code> is set to zero, then the run will generate the A3M file and stop, which means the job can then be submitted to the cluster (bsub, qsub etc.), say by creating the actual command block with the same parameters passed to <code>run</code>, but assuming <code>msa_mode</code> is custom:</p>
<pre><code>import os
from IPython.display import clear_output
try:
    run(...)  # same arguments as the call above, but with num_models=0
except (IndexError, KeyError) as error:
    clear_output()
    print(f'Retrieved A3M: {jobname} {error}')
cmd = os.environ['CONDA_PREFIX'] + '/bin/colabfold_batch '
cmd += f'"{a3m}" "{out_folder}" --cpu'
for k, v in {'model-type': model_type,
             'data': data_folder,
             'pair-mode': pair_mode,
             'num-recycle': num_recycles,
             }.items():
    cmd += f' --{k} "{v}"'</code></pre>
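<p>The double quotes in the f-strings above cope with spaces, but not with values that themselves contain quotes. A hedged alternative sketch is to collect the arguments in a list and let <code>shlex.join</code> (Python 3.8+) do the quoting. The helper name <code>build_command</code> and the example file names are hypothetical, whereas the flags are the ones used above.</p>

```python
import shlex
from typing import Dict, List


def build_command(executable: str, a3m: str, out_folder: str, options: Dict[str, object]) -> str:
    """Assemble a shell-safe command line from positional arguments and --key value options."""
    args: List[str] = [executable, a3m, out_folder, '--cpu']
    for key, value in options.items():
        args += [f'--{key}', str(value)]
    # shlex.join quotes every argument that needs it (spaces, quotes, etc.)
    return shlex.join(args)


cmd = build_command('colabfold_batch', 'my protein.a3m', 'out',
                    {'model-type': 'AlphaFold2-multimer-v2', 'num-recycle': 3})
print(cmd)
```

<p>The resulting string can be written into a job script as is, and <code>shlex.split(cmd)</code> gives back the original list of arguments, which makes for an easy sanity check.</p>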
<p>Parenthetically, I should stress that my setup is unusual and one should not normally run Jupyter notebooks in a log-in or a compile node (unless explicitly approved) and instead one ought to do the MSA step on one's own machine as it is server-based anyway (<a href="https://colabfold.mmseqs.com/" target="_blank">see MMSeqs2</a>). ColabFold is meant for GPU acceleration, but for me the queue for space on an oversubscribed GPU node is longer than a multicore CPU run even if it's 5x slower, so I use the latter happily. I assume this is common and I would say one ought to hold off investing in an Nvidia Tesla GPU board: things are moving fast with the Edge TPU, which currently supports only TensorFlow Lite and not Jax (it's for <a href="https://store.google.com/intl/en/ideas/articles/what-is-an-ai-camera/" target="_blank">Pixel mobiles</a> and <a href="https://coral.ai/products/" target="_blank">Coral boards for Raspberry Pis</a>), and I would be surprised if within the next 2 years there was not some amazing TPU board (obviously still printed in 5 or 7 nm as lithography seems to have predictably hit an atomic scale wall). </p><p>Now, that was about CPUs.</p>
</div>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-89586099446678722342022-06-19T14:19:00.004-07:002022-10-23T02:24:08.583-07:00Top 10 silliest PDB residue names for ligands!<p><b>UPDATE: </b>The PDB will run out of 3-letter chemical component IDs sometime before 2024, at which point it will switch to 5-letter codes, which will be usable solely in CIF format: <a href="https://www.wwpdb.org/news/news?year=2022#630fee4cebdf34532a949c34">https://www.wwpdb.org/news/news?year=2022#630fee4cebdf34532a949c34</a></p><p>In some situations it is handy to use in an in silico experiment a 3-letter residue name that is not taken in the PDB.
For example, PyRosetta has a system of pregenerated topologies for PDB components, which can cause issues when a ligand is loaded and the movers may use that over an incorrectly provided residue type / param file, resulting in a blown-up misshapen ligand, an overly common incident*. As a result, having a list handy of what is taken is helpful. Herein are some silly observations about what the taken and untaken names are, but not ranked as a top 10, because this is a science blog, not my local newspaper.</p><a name='more'></a>
<p>∗) This feature can be disabled with <code>load_PDB_components</code>, but often one may want to keep it on and not use <code>ignore_unrecognized_res</code> either.</p>
<h4><span>Untaken chemical components</span></h4>
<p>The European PDB provides several formats of the chemical components that are in the database; one of them is just a list of entries (<a href="http://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/chem_comp.list" target="_blank">chem_comp.list</a>). So let's extract these and see how many names are available:</p>
<pre><code>import string
import itertools
from typing import List, Set
# names already taken in the PDB, one per line
with open('chem_comp.list') as fh:
    chem_resns: Set[str] = set(map(str.strip, fh))
# all 1, 2 and 3 character names made of digits and uppercase letters
possibles: Set[str] = set()
for repeat in (1, 2, 3):
    possibles.update(map(''.join, itertools.product(string.digits + string.ascii_uppercase, repeat=repeat)))
available: List[str] = sorted(possibles - chem_resns)</code></pre>
<p>The <code>itertools.product</code> was called with repeat 1, 2 and 3, because single and double letters are valid names.
For example, residue <a href="https://www.rcsb.org/ligand/A">A</a> is adenosine monophosphate, the polymeric nucleotide in RNA.
Parenthetically, chemical compound definitions are nominally in the unreacted/monomeric state and with neutral protonation: for example <a href="https://www.rcsb.org/ligand/ALA" target="_blank">alanine</a> in the PDB definition is with the OXT atom. However, confusingly some covalent compounds may be submitted in the reacted form without a dummy atom (* in SMILES, R in drawn/mol files).</p>
<p>As of March 2022, of the possible 47,988 names, 36,757 are taken leaving 11,231 available (<a href="https://gist.githubusercontent.com/matteoferla/c72ba425692e8b240dcf2d3c2b1b5e27/raw/5a54917076409f28366e938b6e52923bfd3371d6/unassigned.json" target="_blank">JSON of untaken 1–3 letter names</a>).</p>
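<p>The arithmetic can be double-checked quickly: with 36 characters (10 digits and 26 uppercase letters) there are 36 + 36² + 36³ possible names. A small sketch, where the tally of taken names is simply the figure quoted above rather than something recomputed from the file:</p>

```python
import string

alphabet = string.digits + string.ascii_uppercase  # 36 characters
# names of length 1, 2 and 3
total = sum(len(alphabet) ** n for n in (1, 2, 3))
taken = 36_757  # figure quoted above (March 2022)
print(total, total - taken)  # 47988 11231
```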
<p>So what does the distribution of untaken names look like?</p>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEityJq5OD5vVSWByXZyNdOjt0yk1ay1aOvj_KfqhO54yxT_aXGdUH2pJSIQh4t2AKLtEGo_q3D43BvzgxK0eA2DmXfpu7cYANQwudf9IiHasoRE0l2LkPTrcUlkNvwyybECi8LGXtObs4M8wA7COcaecvW5lI8fuYqXY0lHii0Wde_4Yj969vwp_kh9/s985/newplot.png" style="display: block; padding: 1em 0px; text-align: center;"><img border="0" data-original-height="525" data-original-width="985" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEityJq5OD5vVSWByXZyNdOjt0yk1ay1aOvj_KfqhO54yxT_aXGdUH2pJSIQh4t2AKLtEGo_q3D43BvzgxK0eA2DmXfpu7cYANQwudf9IiHasoRE0l2LkPTrcUlkNvwyybECi8LGXtObs4M8wA7COcaecvW5lI8fuYqXY0lHii0Wde_4Yj969vwp_kh9/w400-h214/newplot.png" width="400" /></a></div>
<p>As expected few of the names starting with early letters are untaken, but the available ones of these <i>appear</i> random. For example for A, we have untaken:
<a href="https://www.rcsb.org/ligand/A0E" target="_blank">A0E</a>,
<a href="https://www.rcsb.org/ligand/A0F" target="_blank">A0F</a>,
<a href="https://www.rcsb.org/ligand/A0I" target="_blank">A0I</a>,
<a href="https://www.rcsb.org/ligand/A0M" target="_blank">A0M</a>,
<a href="https://www.rcsb.org/ligand/A0N" target="_blank">A0N</a>,
<a href="https://www.rcsb.org/ligand/A0X" target="_blank">A0X</a>,
<a href="https://www.rcsb.org/ligand/A10" target="_blank">A10</a>,
<a href="https://www.rcsb.org/ligand/A1D" target="_blank">A1D</a>,
<a href="https://www.rcsb.org/ligand/A1W" target="_blank">A1W</a>,
<a href="https://www.rcsb.org/ligand/A2A" target="_blank">A2A</a>,
<a href="https://www.rcsb.org/ligand/A2S" target="_blank">A2S</a>,
<a href="https://www.rcsb.org/ligand/A2U" target="_blank">A2U</a>,
<a href="https://www.rcsb.org/ligand/A30" target="_blank">A30</a>,
<a href="https://www.rcsb.org/ligand/A3I" target="_blank">A3I</a>,
<a href="https://www.rcsb.org/ligand/A3L" target="_blank">A3L</a>,
<a href="https://www.rcsb.org/ligand/A3O" target="_blank">A3O</a>,
<a href="https://www.rcsb.org/ligand/A3U" target="_blank">A3U</a>,
<a href="https://www.rcsb.org/ligand/A3Z" target="_blank">A3Z</a>,
<a href="https://www.rcsb.org/ligand/A4U" target="_blank">A4U</a>,
<a href="https://www.rcsb.org/ligand/A5S" target="_blank">A5S</a>,
<a href="https://www.rcsb.org/ligand/A67" target="_blank">A67</a>,
<a href="https://www.rcsb.org/ligand/A6F" target="_blank">A6F</a>,
<a href="https://www.rcsb.org/ligand/A6X" target="_blank">A6X</a>,
<a href="https://www.rcsb.org/ligand/A7F" target="_blank">A7F</a>,
<a href="https://www.rcsb.org/ligand/A7J" target="_blank">A7J</a>,
<a href="https://www.rcsb.org/ligand/A7P" target="_blank">A7P</a>,
<a href="https://www.rcsb.org/ligand/A7U" target="_blank">A7U</a>,
<a href="https://www.rcsb.org/ligand/A7V" target="_blank">A7V</a>,
<a href="https://www.rcsb.org/ligand/A7X" target="_blank">A7X</a>,
<a href="https://www.rcsb.org/ligand/A8A" target="_blank">A8A</a>,
<a href="https://www.rcsb.org/ligand/A8I" target="_blank">A8I</a>,
<a href="https://www.rcsb.org/ligand/A8J" target="_blank">A8J</a>,
<a href="https://www.rcsb.org/ligand/A8R" target="_blank">A8R</a>,
<a href="https://www.rcsb.org/ligand/A8Y" target="_blank">A8Y</a>,
<a href="https://www.rcsb.org/ligand/A9A" target="_blank">A9A</a>,
<a href="https://www.rcsb.org/ligand/A9I" target="_blank">A9I</a>,
<a href="https://www.rcsb.org/ligand/A9X" target="_blank">A9X</a>,
<a href="https://www.rcsb.org/ligand/AAJ" target="_blank">AAJ</a>,
<a href="https://www.rcsb.org/ligand/AAW" target="_blank">AAW</a>,
<a href="https://www.rcsb.org/ligand/AC3" target="_blank">AC3</a>,
<a href="https://www.rcsb.org/ligand/AD0" target="_blank">AD0</a>,
<a href="https://www.rcsb.org/ligand/AEB" target="_blank">AEB</a>,
<a href="https://www.rcsb.org/ligand/AEC" target="_blank">AEC</a>,
<a href="https://www.rcsb.org/ligand/AGZ" target="_blank">AGZ</a>,
<a href="https://www.rcsb.org/ligand/AH5" target="_blank">AH5</a>,
<a href="https://www.rcsb.org/ligand/AHJ" target="_blank">AHJ</a>,
<a href="https://www.rcsb.org/ligand/AHV" target="_blank">AHV</a>,
<a href="https://www.rcsb.org/ligand/AI0" target="_blank">AI0</a>,
<a href="https://www.rcsb.org/ligand/AI4" target="_blank">AI4</a>,
<a href="https://www.rcsb.org/ligand/AI5" target="_blank">AI5</a>,
<a href="https://www.rcsb.org/ligand/AI6" target="_blank">AI6</a>,
<a href="https://www.rcsb.org/ligand/AID" target="_blank">AID</a>,
<a href="https://www.rcsb.org/ligand/AIE" target="_blank">AIE</a>,
<a href="https://www.rcsb.org/ligand/AII" target="_blank">AII</a>,
<a href="https://www.rcsb.org/ligand/AIP" target="_blank">AIP</a>,
<a href="https://www.rcsb.org/ligand/AIY" target="_blank">AIY</a>,
<a href="https://www.rcsb.org/ligand/AJ0" target="_blank">AJ0</a>,
<a href="https://www.rcsb.org/ligand/AJ9" target="_blank">AJ9</a>,
<a href="https://www.rcsb.org/ligand/AJC" target="_blank">AJC</a>,
<a href="https://www.rcsb.org/ligand/AJF" target="_blank">AJF</a>,
<a href="https://www.rcsb.org/ligand/AJO" target="_blank">AJO</a>,
<a href="https://www.rcsb.org/ligand/AJS" target="_blank">AJS</a>,
<a href="https://www.rcsb.org/ligand/AJT" target="_blank">AJT</a>,
<a href="https://www.rcsb.org/ligand/AJW" target="_blank">AJW</a>,
<a href="https://www.rcsb.org/ligand/AK9" target="_blank">AK9</a>,
<a href="https://www.rcsb.org/ligand/AKF" target="_blank">AKF</a>,
<a href="https://www.rcsb.org/ligand/AKQ" target="_blank">AKQ</a>,
<a href="https://www.rcsb.org/ligand/AMK" target="_blank">AMK</a>,
<a href="https://www.rcsb.org/ligand/AO0" target="_blank">AO0</a>,
<a href="https://www.rcsb.org/ligand/AQF" target="_blank">AQF</a>,
<a href="https://www.rcsb.org/ligand/AQI" target="_blank">AQI</a>,
<a href="https://www.rcsb.org/ligand/AQL" target="_blank">AQL</a>,
<a href="https://www.rcsb.org/ligand/AQR" target="_blank">AQR</a>,
<a href="https://www.rcsb.org/ligand/AQX" target="_blank">AQX</a>,
<a href="https://www.rcsb.org/ligand/ARY" target="_blank">ARY</a>,
<a href="https://www.rcsb.org/ligand/AT0" target="_blank">AT0</a>,
<a href="https://www.rcsb.org/ligand/ATB" target="_blank">ATB</a>,
<a href="https://www.rcsb.org/ligand/ATN" target="_blank">ATN</a>,
<a href="https://www.rcsb.org/ligand/AU0" target="_blank">AU0</a>,
<a href="https://www.rcsb.org/ligand/AU9" target="_blank">AU9</a>,
<a href="https://www.rcsb.org/ligand/AUA" target="_blank">AUA</a>,
<a href="https://www.rcsb.org/ligand/AUM" target="_blank">AUM</a>,
<a href="https://www.rcsb.org/ligand/AUS" target="_blank">AUS</a>,
<a href="https://www.rcsb.org/ligand/AV8" target="_blank">AV8</a>,
<a href="https://www.rcsb.org/ligand/AVS" target="_blank">AVS</a>,
<a href="https://www.rcsb.org/ligand/AWI" target="_blank">AWI</a>,
<a href="https://www.rcsb.org/ligand/AX9" target="_blank">AX9</a>,
<a href="https://www.rcsb.org/ligand/AY3" target="_blank">AY3</a>,
<a href="https://www.rcsb.org/ligand/AYF" target="_blank">AYF</a>,
<a href="https://www.rcsb.org/ligand/AYP" target="_blank">AYP</a>,
<a href="https://www.rcsb.org/ligand/AYY" target="_blank">AYY</a>.
In reality these are likely embargoed structures that never saw the light of day. There are many similarly missing PDB codes, but most of those are due to models that were initially allowed but were then removed. In the list above, <code>AQX</code> strikes me as rather cyberpunk. Of the taken names starting with A, <code>ALF</code> is tetrafluoroaluminate, which is a pretty alien inorganic compound.</p>
<p>Residue <code>AWS</code> is not taken up by Amazon Web Services, but by a compound from an Oxford SGC/Diamond fragment-based drug discovery screen (a PanDDA event). Given that accidents with AWS can be rather costly, it would have been nice if it were the most expensive compound in the PDB. Unfortunately, that would be near impossible to determine; consequently, I do not know what the most expensive compound is, but it is probably some natural compound extracted from a tropical plant only found on an inaccessible island.
</p>
<p>Barring odd spellings, the only giggle words in three letters I can think of are <code>SEX</code>, <code>WEE</code>, <code>PEE</code>, <code>POO</code>, <code>CUM</code> and <code>POP</code>. Possibly on purpose, the ligand <code>SEX</code> is untaken (Sildenafil is <code>VIA</code>). <code>PEE</code> is a phospholipid. <code>POO</code> is a herpes C polymerase inhibitor, while <code>POP</code> is pyrophosphate, which also exists as <code>PPV</code> in a different protonation state (crystal structures do not have protons, so the chemical components are a mess). Odd spellings seem okay: <code>FUK</code> is ASTX660 (Astex), an apoptosis inhibition inhibitor targeted at cancers, so an unfortunate allocation from the PDB's system.</p>
<h4><span>Scrabble legal</span></h4>
<p>I cross-referenced the taken names with Scrabble-legal words. As expected, Qi-like Scrabble words dominate.</p>
<pre><code>import requests
from typing import List
scrabble: List[str] = requests.get('https://raw.githubusercontent.com/benjamincrom/scrabble/master/scrabble/dictionary.json').json()
# chem_resns holds the taken names read from chem_comp.list above
possible: List[str] = sorted(set(map(str.upper, scrabble)).intersection(chem_resns))
print(', '.join(possible))</code></pre>
<p><a href="https://www.rcsb.org/ligand/AA" target="_blank">AA</a>,
<a href="https://www.rcsb.org/ligand/AAH" target="_blank">AAH</a>,
<a href="https://www.rcsb.org/ligand/AAL" target="_blank">AAL</a>,
<a href="https://www.rcsb.org/ligand/AAS" target="_blank">AAS</a>,
<a href="https://www.rcsb.org/ligand/ABA" target="_blank">ABA</a>,
<a href="https://www.rcsb.org/ligand/ABO" target="_blank">ABO</a>,
<a href="https://www.rcsb.org/ligand/ABS" target="_blank">ABS</a>,
<a href="https://www.rcsb.org/ligand/ABY" target="_blank">ABY</a>,
<a href="https://www.rcsb.org/ligand/ACE" target="_blank">ACE</a>,
<a href="https://www.rcsb.org/ligand/ACT" target="_blank">ACT</a>,
<a href="https://www.rcsb.org/ligand/ADD" target="_blank">ADD</a>,
<a href="https://www.rcsb.org/ligand/ADO" target="_blank">ADO</a>,
<a href="https://www.rcsb.org/ligand/ADS" target="_blank">ADS</a>,
<a href="https://www.rcsb.org/ligand/ADZ" target="_blank">ADZ</a>,
<a href="https://www.rcsb.org/ligand/AFF" target="_blank">AFF</a>,
<a href="https://www.rcsb.org/ligand/AFT" target="_blank">AFT</a>,
<a href="https://www.rcsb.org/ligand/AG" target="_blank">AG</a>,
<a href="https://www.rcsb.org/ligand/AGA" target="_blank">AGA</a>,
<a href="https://www.rcsb.org/ligand/AGE" target="_blank">AGE</a>,
<a href="https://www.rcsb.org/ligand/AGO" target="_blank">AGO</a>,
<a href="https://www.rcsb.org/ligand/AHA" target="_blank">AHA</a>,
<a href="https://www.rcsb.org/ligand/AI" target="_blank">AI</a>,
<a href="https://www.rcsb.org/ligand/AIL" target="_blank">AIL</a>,
<a href="https://www.rcsb.org/ligand/AIM" target="_blank">AIM</a>,
<a href="https://www.rcsb.org/ligand/AIN" target="_blank">AIN</a>,
<a href="https://www.rcsb.org/ligand/AIR" target="_blank">AIR</a>,
<a href="https://www.rcsb.org/ligand/AIS" target="_blank">AIS</a>,
<a href="https://www.rcsb.org/ligand/AIT" target="_blank">AIT</a>,
<a href="https://www.rcsb.org/ligand/AL" target="_blank">AL</a>,
<a href="https://www.rcsb.org/ligand/ALA" target="_blank">ALA</a>,
<a href="https://www.rcsb.org/ligand/ALB" target="_blank">ALB</a>,
<a href="https://www.rcsb.org/ligand/ALE" target="_blank">ALE</a>,
<a href="https://www.rcsb.org/ligand/ALL" target="_blank">ALL</a>,
<a href="https://www.rcsb.org/ligand/ALP" target="_blank">ALP</a>,
<a href="https://www.rcsb.org/ligand/ALS" target="_blank">ALS</a>,
<a href="https://www.rcsb.org/ligand/ALT" target="_blank">ALT</a>,
<a href="https://www.rcsb.org/ligand/AM" target="_blank">AM</a>,
<a href="https://www.rcsb.org/ligand/AMA" target="_blank">AMA</a>,
<a href="https://www.rcsb.org/ligand/AMI" target="_blank">AMI</a>,
<a href="https://www.rcsb.org/ligand/AMP" target="_blank">AMP</a>,
<a href="https://www.rcsb.org/ligand/AMU" target="_blank">AMU</a>,
<a href="https://www.rcsb.org/ligand/ANA" target="_blank">ANA</a>,
<a href="https://www.rcsb.org/ligand/AND" target="_blank">AND</a>,
<a href="https://www.rcsb.org/ligand/ANE" target="_blank">ANE</a>,
<a href="https://www.rcsb.org/ligand/ANI" target="_blank">ANI</a>,
<a href="https://www.rcsb.org/ligand/ANT" target="_blank">ANT</a>,
<a href="https://www.rcsb.org/ligand/ANY" target="_blank">ANY</a>,
<a href="https://www.rcsb.org/ligand/APE" target="_blank">APE</a>,
<a href="https://www.rcsb.org/ligand/APT" target="_blank">APT</a>,
<a href="https://www.rcsb.org/ligand/AR" target="_blank">AR</a>,
<a href="https://www.rcsb.org/ligand/ARB" target="_blank">ARB</a>,
<a href="https://www.rcsb.org/ligand/ARC" target="_blank">ARC</a>,
<a href="https://www.rcsb.org/ligand/ARE" target="_blank">ARE</a>,
<a href="https://www.rcsb.org/ligand/ARF" target="_blank">ARF</a>,
<a href="https://www.rcsb.org/ligand/ARK" target="_blank">ARK</a>,
<a href="https://www.rcsb.org/ligand/ARM" target="_blank">ARM</a>,
<a href="https://www.rcsb.org/ligand/ARS" target="_blank">ARS</a>,
<a href="https://www.rcsb.org/ligand/ART" target="_blank">ART</a>,
<a href="https://www.rcsb.org/ligand/AS" target="_blank">AS</a>,
<a href="https://www.rcsb.org/ligand/ASH" target="_blank">ASH</a>,
<a href="https://www.rcsb.org/ligand/ASK" target="_blank">ASK</a>,
<a href="https://www.rcsb.org/ligand/ASP" target="_blank">ASP</a>,
<a href="https://www.rcsb.org/ligand/ASS" target="_blank">ASS</a>,
<a href="https://www.rcsb.org/ligand/ATE" target="_blank">ATE</a>,
<a href="https://www.rcsb.org/ligand/ATT" target="_blank">ATT</a>,
<a href="https://www.rcsb.org/ligand/AUK" target="_blank">AUK</a>,
<a href="https://www.rcsb.org/ligand/AVA" target="_blank">AVA</a>,
<a href="https://www.rcsb.org/ligand/AVE" target="_blank">AVE</a>,
<a href="https://www.rcsb.org/ligand/AVO" target="_blank">AVO</a>,
<a href="https://www.rcsb.org/ligand/AWA" target="_blank">AWA</a>,
<a href="https://www.rcsb.org/ligand/AWE" target="_blank">AWE</a>,
<a href="https://www.rcsb.org/ligand/AWL" target="_blank">AWL</a>,
<a href="https://www.rcsb.org/ligand/AWN" target="_blank">AWN</a>,
<a href="https://www.rcsb.org/ligand/AXE" target="_blank">AXE</a>,
<a href="https://www.rcsb.org/ligand/AYE" target="_blank">AYE</a>,
<a href="https://www.rcsb.org/ligand/AYS" target="_blank">AYS</a>,
<a href="https://www.rcsb.org/ligand/AZO" target="_blank">AZO</a>,
<a href="https://www.rcsb.org/ligand/BA" target="_blank">BA</a>,
<a href="https://www.rcsb.org/ligand/BAA" target="_blank">BAA</a>,
<a href="https://www.rcsb.org/ligand/BAG" target="_blank">BAG</a>,
<a href="https://www.rcsb.org/ligand/BAH" target="_blank">BAH</a>,
<a href="https://www.rcsb.org/ligand/BAL" target="_blank">BAL</a>,
<a href="https://www.rcsb.org/ligand/BAM" target="_blank">BAM</a>,
<a href="https://www.rcsb.org/ligand/BAN" target="_blank">BAN</a>,
<a href="https://www.rcsb.org/ligand/BAP" target="_blank">BAP</a>,
<a href="https://www.rcsb.org/ligand/BAR" target="_blank">BAR</a>,
<a href="https://www.rcsb.org/ligand/BAS" target="_blank">BAS</a>,
<a href="https://www.rcsb.org/ligand/BAT" target="_blank">BAT</a>,
<a href="https://www.rcsb.org/ligand/BAY" target="_blank">BAY</a>,
<a href="https://www.rcsb.org/ligand/BED" target="_blank">BED</a>,
<a href="https://www.rcsb.org/ligand/BEE" target="_blank">BEE</a>,
<a href="https://www.rcsb.org/ligand/BEG" target="_blank">BEG</a>,
<a href="https://www.rcsb.org/ligand/BEL" target="_blank">BEL</a>,
<a href="https://www.rcsb.org/ligand/BEN" target="_blank">BEN</a>,
<a href="https://www.rcsb.org/ligand/BET" target="_blank">BET</a>,
<a href="https://www.rcsb.org/ligand/BEY" target="_blank">BEY</a>,
<a href="https://www.rcsb.org/ligand/BIB" target="_blank">BIB</a>,
<a href="https://www.rcsb.org/ligand/BID" target="_blank">BID</a>,
<a href="https://www.rcsb.org/ligand/BIG" target="_blank">BIG</a>,
<a href="https://www.rcsb.org/ligand/BIN" target="_blank">BIN</a>,
<a href="https://www.rcsb.org/ligand/BIO" target="_blank">BIO</a>,
<a href="https://www.rcsb.org/ligand/BIS" target="_blank">BIS</a>,
<a href="https://www.rcsb.org/ligand/BIT" target="_blank">BIT</a>,
<a href="https://www.rcsb.org/ligand/BIZ" target="_blank">BIZ</a>,
<a href="https://www.rcsb.org/ligand/BOA" target="_blank">BOA</a>,
<a href="https://www.rcsb.org/ligand/BOB" target="_blank">BOB</a>,
<a href="https://www.rcsb.org/ligand/BOD" target="_blank">BOD</a>,
<a href="https://www.rcsb.org/ligand/BOG" target="_blank">BOG</a>,
<a href="https://www.rcsb.org/ligand/BOP" target="_blank">BOP</a>,
<a href="https://www.rcsb.org/ligand/BOS" target="_blank">BOS</a>,
<a href="https://www.rcsb.org/ligand/BOT" target="_blank">BOT</a>,
<a href="https://www.rcsb.org/ligand/BOW" target="_blank">BOW</a>,
<a href="https://www.rcsb.org/ligand/BOX" target="_blank">BOX</a>,
<a href="https://www.rcsb.org/ligand/BOY" target="_blank">BOY</a>,
<a href="https://www.rcsb.org/ligand/BRA" target="_blank">BRA</a>,
<a href="https://www.rcsb.org/ligand/BRO" target="_blank">BRO</a>,
<a href="https://www.rcsb.org/ligand/BRR" target="_blank">BRR</a>,
<a href="https://www.rcsb.org/ligand/BUB" target="_blank">BUB</a>,
<a href="https://www.rcsb.org/ligand/BUD" target="_blank">BUD</a>,
<a href="https://www.rcsb.org/ligand/BUG" target="_blank">BUG</a>,
<a href="https://www.rcsb.org/ligand/BUM" target="_blank">BUM</a>,
<a href="https://www.rcsb.org/ligand/BUN" target="_blank">BUN</a>,
<a href="https://www.rcsb.org/ligand/BUR" target="_blank">BUR</a>,
<a href="https://www.rcsb.org/ligand/BUT" target="_blank">BUT</a>,
<a href="https://www.rcsb.org/ligand/BUY" target="_blank">BUY</a>,
<a href="https://www.rcsb.org/ligand/BYE" target="_blank">BYE</a>,
<a href="https://www.rcsb.org/ligand/BYS" target="_blank">BYS</a>,
<a href="https://www.rcsb.org/ligand/CAB" target="_blank">CAB</a>,
<a href="https://www.rcsb.org/ligand/CAD" target="_blank">CAD</a>,
<a href="https://www.rcsb.org/ligand/CAM" target="_blank">CAM</a>,
<a href="https://www.rcsb.org/ligand/CAN" target="_blank">CAN</a>,
<a href="https://www.rcsb.org/ligand/CAP" target="_blank">CAP</a>,
<a href="https://www.rcsb.org/ligand/CAR" target="_blank">CAR</a>,
<a href="https://www.rcsb.org/ligand/CAT" target="_blank">CAT</a>,
<a href="https://www.rcsb.org/ligand/CAW" target="_blank">CAW</a>,
<a href="https://www.rcsb.org/ligand/CAY" target="_blank">CAY</a>,
<a href="https://www.rcsb.org/ligand/CEE" target="_blank">CEE</a>,
<a href="https://www.rcsb.org/ligand/CEL" target="_blank">CEL</a>,
<a href="https://www.rcsb.org/ligand/CEP" target="_blank">CEP</a>,
<a href="https://www.rcsb.org/ligand/CHI" target="_blank">CHI</a>,
<a href="https://www.rcsb.org/ligand/CIS" target="_blank">CIS</a>,
<a href="https://www.rcsb.org/ligand/COB" target="_blank">COB</a>,
<a href="https://www.rcsb.org/ligand/COD" target="_blank">COD</a>,
<a href="https://www.rcsb.org/ligand/COG" target="_blank">COG</a>,
<a href="https://www.rcsb.org/ligand/COL" target="_blank">COL</a>,
<a href="https://www.rcsb.org/ligand/CON" target="_blank">CON</a>,
<a href="https://www.rcsb.org/ligand/COO" target="_blank">COO</a>,
<a href="https://www.rcsb.org/ligand/COP" target="_blank">COP</a>,
<a href="https://www.rcsb.org/ligand/COR" target="_blank">COR</a>,
<a href="https://www.rcsb.org/ligand/COS" target="_blank">COS</a>,
<a href="https://www.rcsb.org/ligand/COT" target="_blank">COT</a>,
<a href="https://www.rcsb.org/ligand/COW" target="_blank">COW</a>,
<a href="https://www.rcsb.org/ligand/COX" target="_blank">COX</a>,
<a href="https://www.rcsb.org/ligand/COY" target="_blank">COY</a>,
<a href="https://www.rcsb.org/ligand/COZ" target="_blank">COZ</a>,
<a href="https://www.rcsb.org/ligand/CRY" target="_blank">CRY</a>,
<a href="https://www.rcsb.org/ligand/CUB" target="_blank">CUB</a>,
<a href="https://www.rcsb.org/ligand/CUD" target="_blank">CUD</a>,
<a href="https://www.rcsb.org/ligand/CUE" target="_blank">CUE</a>,
<a href="https://www.rcsb.org/ligand/CUM" target="_blank">CUM</a>,
<a href="https://www.rcsb.org/ligand/CUP" target="_blank">CUP</a>,
<a href="https://www.rcsb.org/ligand/CUR" target="_blank">CUR</a>,
<a href="https://www.rcsb.org/ligand/CUT" target="_blank">CUT</a>,
<a href="https://www.rcsb.org/ligand/CWM" target="_blank">CWM</a>,
<a href="https://www.rcsb.org/ligand/DAB" target="_blank">DAB</a>,
<a href="https://www.rcsb.org/ligand/DAD" target="_blank">DAD</a>,
<a href="https://www.rcsb.org/ligand/DAG" target="_blank">DAG</a>,
<a href="https://www.rcsb.org/ligand/DAH" target="_blank">DAH</a>,
<a href="https://www.rcsb.org/ligand/DAK" target="_blank">DAK</a>,
<a href="https://www.rcsb.org/ligand/DAL" target="_blank">DAL</a>,
<a href="https://www.rcsb.org/ligand/DAM" target="_blank">DAM</a>,
<a href="https://www.rcsb.org/ligand/DAP" target="_blank">DAP</a>,
<a href="https://www.rcsb.org/ligand/DAW" target="_blank">DAW</a>,
<a href="https://www.rcsb.org/ligand/DAY" target="_blank">DAY</a>,
<a href="https://www.rcsb.org/ligand/DEB" target="_blank">DEB</a>,
<a href="https://www.rcsb.org/ligand/DEE" target="_blank">DEE</a>,
<a href="https://www.rcsb.org/ligand/DEL" target="_blank">DEL</a>,
<a href="https://www.rcsb.org/ligand/DEN" target="_blank">DEN</a>,
<a href="https://www.rcsb.org/ligand/DEV" target="_blank">DEV</a>,
<a href="https://www.rcsb.org/ligand/DEW" target="_blank">DEW</a>,
<a href="https://www.rcsb.org/ligand/DEX" target="_blank">DEX</a>,
<a href="https://www.rcsb.org/ligand/DEY" target="_blank">DEY</a>,
<a href="https://www.rcsb.org/ligand/DIB" target="_blank">DIB</a>,
<a href="https://www.rcsb.org/ligand/DID" target="_blank">DID</a>,
<a href="https://www.rcsb.org/ligand/DIE" target="_blank">DIE</a>,
<a href="https://www.rcsb.org/ligand/DIG" target="_blank">DIG</a>,
<a href="https://www.rcsb.org/ligand/DIM" target="_blank">DIM</a>,
<a href="https://www.rcsb.org/ligand/DIN" target="_blank">DIN</a>,
<a href="https://www.rcsb.org/ligand/DIP" target="_blank">DIP</a>,
<a href="https://www.rcsb.org/ligand/DIS" target="_blank">DIS</a>,
<a href="https://www.rcsb.org/ligand/DIT" target="_blank">DIT</a>,
<a href="https://www.rcsb.org/ligand/DOC" target="_blank">DOC</a>,
<a href="https://www.rcsb.org/ligand/DOE" target="_blank">DOE</a>,
<a href="https://www.rcsb.org/ligand/DOG" target="_blank">DOG</a>,
<a href="https://www.rcsb.org/ligand/DOL" target="_blank">DOL</a>,
<a href="https://www.rcsb.org/ligand/DOM" target="_blank">DOM</a>,
<a href="https://www.rcsb.org/ligand/DON" target="_blank">DON</a>,
<a href="https://www.rcsb.org/ligand/DOR" target="_blank">DOR</a>,
<a href="https://www.rcsb.org/ligand/DOS" target="_blank">DOS</a>,
<a href="https://www.rcsb.org/ligand/DOT" target="_blank">DOT</a>,
<a href="https://www.rcsb.org/ligand/DOW" target="_blank">DOW</a>,
<a href="https://www.rcsb.org/ligand/DRY" target="_blank">DRY</a>,
<a href="https://www.rcsb.org/ligand/DUB" target="_blank">DUB</a>,
<a href="https://www.rcsb.org/ligand/DUD" target="_blank">DUD</a>,
<a href="https://www.rcsb.org/ligand/DUE" target="_blank">DUE</a>,
<a href="https://www.rcsb.org/ligand/DUG" target="_blank">DUG</a>,
<a href="https://www.rcsb.org/ligand/DUN" target="_blank">DUN</a>,
<a href="https://www.rcsb.org/ligand/DUO" target="_blank">DUO</a>,
<a href="https://www.rcsb.org/ligand/DUP" target="_blank">DUP</a>,
<a href="https://www.rcsb.org/ligand/EAR" target="_blank">EAR</a>,
<a href="https://www.rcsb.org/ligand/EAT" target="_blank">EAT</a>,
<a href="https://www.rcsb.org/ligand/EAU" target="_blank">EAU</a>,
<a href="https://www.rcsb.org/ligand/EBB" target="_blank">EBB</a>,
<a href="https://www.rcsb.org/ligand/ECU" target="_blank">ECU</a>,
<a href="https://www.rcsb.org/ligand/EDH" target="_blank">EDH</a>,
<a href="https://www.rcsb.org/ligand/EEL" target="_blank">EEL</a>,
<a href="https://www.rcsb.org/ligand/EFF" target="_blank">EFF</a>,
<a href="https://www.rcsb.org/ligand/EFS" target="_blank">EFS</a>,
<a href="https://www.rcsb.org/ligand/EFT" target="_blank">EFT</a>,
<a href="https://www.rcsb.org/ligand/EGG" target="_blank">EGG</a>,
<a href="https://www.rcsb.org/ligand/EGO" target="_blank">EGO</a>,
<a href="https://www.rcsb.org/ligand/EKE" target="_blank">EKE</a>,
<a href="https://www.rcsb.org/ligand/EL" target="_blank">EL</a>,
<a href="https://www.rcsb.org/ligand/ELD" target="_blank">ELD</a>,
<a href="https://www.rcsb.org/ligand/ELF" target="_blank">ELF</a>,
<a href="https://www.rcsb.org/ligand/ELK" target="_blank">ELK</a>,
<a href="https://www.rcsb.org/ligand/ELL" target="_blank">ELL</a>,
<a href="https://www.rcsb.org/ligand/ELM" target="_blank">ELM</a>,
<a href="https://www.rcsb.org/ligand/ELS" target="_blank">ELS</a>,
<a href="https://www.rcsb.org/ligand/EME" target="_blank">EME</a>,
<a href="https://www.rcsb.org/ligand/EMF" target="_blank">EMF</a>,
<a href="https://www.rcsb.org/ligand/EMS" target="_blank">EMS</a>,
<a href="https://www.rcsb.org/ligand/EMU" target="_blank">EMU</a>,
<a href="https://www.rcsb.org/ligand/END" target="_blank">END</a>,
<a href="https://www.rcsb.org/ligand/ENG" target="_blank">ENG</a>,
<a href="https://www.rcsb.org/ligand/ENS" target="_blank">ENS</a>,
<a href="https://www.rcsb.org/ligand/EON" target="_blank">EON</a>,
<a href="https://www.rcsb.org/ligand/ERA" target="_blank">ERA</a>,
<a href="https://www.rcsb.org/ligand/ERE" target="_blank">ERE</a>,
<a href="https://www.rcsb.org/ligand/ERG" target="_blank">ERG</a>,
<a href="https://www.rcsb.org/ligand/ERN" target="_blank">ERN</a>,
<a href="https://www.rcsb.org/ligand/ERR" target="_blank">ERR</a>,
<a href="https://www.rcsb.org/ligand/ERS" target="_blank">ERS</a>,
<a href="https://www.rcsb.org/ligand/ESS" target="_blank">ESS</a>,
<a href="https://www.rcsb.org/ligand/ET" target="_blank">ET</a>,
<a href="https://www.rcsb.org/ligand/ETA" target="_blank">ETA</a>,
<a href="https://www.rcsb.org/ligand/ETH" target="_blank">ETH</a>,
<a href="https://www.rcsb.org/ligand/EVE" target="_blank">EVE</a>,
<a href="https://www.rcsb.org/ligand/EWE" target="_blank">EWE</a>,
<a href="https://www.rcsb.org/ligand/EYE" target="_blank">EYE</a>,
<a href="https://www.rcsb.org/ligand/FA" target="_blank">FA</a>,
<a href="https://www.rcsb.org/ligand/FAD" target="_blank">FAD</a>,
<a href="https://www.rcsb.org/ligand/FAG" target="_blank">FAG</a>,
<a href="https://www.rcsb.org/ligand/FAN" target="_blank">FAN</a>,
<a href="https://www.rcsb.org/ligand/FAR" target="_blank">FAR</a>,
<a href="https://www.rcsb.org/ligand/FAS" target="_blank">FAS</a>,
<a href="https://www.rcsb.org/ligand/FAT" target="_blank">FAT</a>,
<a href="https://www.rcsb.org/ligand/FAX" target="_blank">FAX</a>,
<a href="https://www.rcsb.org/ligand/FAY" target="_blank">FAY</a>,
<a href="https://www.rcsb.org/ligand/FED" target="_blank">FED</a>,
<a href="https://www.rcsb.org/ligand/FEE" target="_blank">FEE</a>,
<a href="https://www.rcsb.org/ligand/FEH" target="_blank">FEH</a>,
<a href="https://www.rcsb.org/ligand/FEM" target="_blank">FEM</a>,
<a href="https://www.rcsb.org/ligand/FEN" target="_blank">FEN</a>,
<a href="https://www.rcsb.org/ligand/FER" target="_blank">FER</a>,
<a href="https://www.rcsb.org/ligand/FEU" target="_blank">FEU</a>,
<a href="https://www.rcsb.org/ligand/FEW" target="_blank">FEW</a>,
<a href="https://www.rcsb.org/ligand/FEY" target="_blank">FEY</a>,
<a href="https://www.rcsb.org/ligand/FEZ" target="_blank">FEZ</a>,
<a href="https://www.rcsb.org/ligand/FIB" target="_blank">FIB</a>,
<a href="https://www.rcsb.org/ligand/FID" target="_blank">FID</a>,
<a href="https://www.rcsb.org/ligand/FIG" target="_blank">FIG</a>,
<a href="https://www.rcsb.org/ligand/FIL" target="_blank">FIL</a>,
<a href="https://www.rcsb.org/ligand/FIN" target="_blank">FIN</a>,
<a href="https://www.rcsb.org/ligand/FIR" target="_blank">FIR</a>,
<a href="https://www.rcsb.org/ligand/FIT" target="_blank">FIT</a>,
<a href="https://www.rcsb.org/ligand/FIX" target="_blank">FIX</a>,
<a href="https://www.rcsb.org/ligand/FLU" target="_blank">FLU</a>,
<a href="https://www.rcsb.org/ligand/FLY" target="_blank">FLY</a>,
<a href="https://www.rcsb.org/ligand/FOB" target="_blank">FOB</a>,
<a href="https://www.rcsb.org/ligand/FOE" target="_blank">FOE</a>,
<a href="https://www.rcsb.org/ligand/FOG" target="_blank">FOG</a>,
<a href="https://www.rcsb.org/ligand/FOH" target="_blank">FOH</a>,
<a href="https://www.rcsb.org/ligand/FON" target="_blank">FON</a>,
<a href="https://www.rcsb.org/ligand/FOP" target="_blank">FOP</a>,
<a href="https://www.rcsb.org/ligand/FOR" target="_blank">FOR</a>,
<a href="https://www.rcsb.org/ligand/FOU" target="_blank">FOU</a>,
<a href="https://www.rcsb.org/ligand/FOX" target="_blank">FOX</a>,
<a href="https://www.rcsb.org/ligand/FOY" target="_blank">FOY</a>,
<a href="https://www.rcsb.org/ligand/FRO" target="_blank">FRO</a>,
<a href="https://www.rcsb.org/ligand/FRY" target="_blank">FRY</a>,
<a href="https://www.rcsb.org/ligand/FUB" target="_blank">FUB</a>,
<a href="https://www.rcsb.org/ligand/FUD" target="_blank">FUD</a>,
<a href="https://www.rcsb.org/ligand/FUG" target="_blank">FUG</a>,
<a href="https://www.rcsb.org/ligand/FUN" target="_blank">FUN</a>,
<a href="https://www.rcsb.org/ligand/FUR" target="_blank">FUR</a>,
<a href="https://www.rcsb.org/ligand/GAB" target="_blank">GAB</a>,
<a href="https://www.rcsb.org/ligand/GAD" target="_blank">GAD</a>,
<a href="https://www.rcsb.org/ligand/GAE" target="_blank">GAE</a>,
<a href="https://www.rcsb.org/ligand/GAG" target="_blank">GAG</a>,
<a href="https://www.rcsb.org/ligand/GAL" target="_blank">GAL</a>,
<a href="https://www.rcsb.org/ligand/GAM" target="_blank">GAM</a>,
<a href="https://www.rcsb.org/ligand/GAN" target="_blank">GAN</a>,
<a href="https://www.rcsb.org/ligand/GAP" target="_blank">GAP</a>,
<a href="https://www.rcsb.org/ligand/GAR" target="_blank">GAR</a>,
<a href="https://www.rcsb.org/ligand/GAS" target="_blank">GAS</a>,
<a href="https://www.rcsb.org/ligand/GAT" target="_blank">GAT</a>,
<a href="https://www.rcsb.org/ligand/GEE" target="_blank">GEE</a>,
<a href="https://www.rcsb.org/ligand/GEL" target="_blank">GEL</a>,
<a href="https://www.rcsb.org/ligand/GEM" target="_blank">GEM</a>,
<a href="https://www.rcsb.org/ligand/GEN" target="_blank">GEN</a>,
<a href="https://www.rcsb.org/ligand/GET" target="_blank">GET</a>,
<a href="https://www.rcsb.org/ligand/GEY" target="_blank">GEY</a>,
<a href="https://www.rcsb.org/ligand/GIG" target="_blank">GIG</a>,
<a href="https://www.rcsb.org/ligand/GIN" target="_blank">GIN</a>,
<a href="https://www.rcsb.org/ligand/GIP" target="_blank">GIP</a>,
<a href="https://www.rcsb.org/ligand/GIT" target="_blank">GIT</a>,
<a href="https://www.rcsb.org/ligand/GNU" target="_blank">GNU</a>,
<a href="https://www.rcsb.org/ligand/GOA" target="_blank">GOA</a>,
<a href="https://www.rcsb.org/ligand/GOB" target="_blank">GOB</a>,
<a href="https://www.rcsb.org/ligand/GOD" target="_blank">GOD</a>,
<a href="https://www.rcsb.org/ligand/GOO" target="_blank">GOO</a>,
<a href="https://www.rcsb.org/ligand/GOT" target="_blank">GOT</a>,
<a href="https://www.rcsb.org/ligand/GOX" target="_blank">GOX</a>,
<a href="https://www.rcsb.org/ligand/GOY" target="_blank">GOY</a>,
<a href="https://www.rcsb.org/ligand/GUL" target="_blank">GUL</a>,
<a href="https://www.rcsb.org/ligand/GUM" target="_blank">GUM</a>,
<a href="https://www.rcsb.org/ligand/GUN" target="_blank">GUN</a>,
<a href="https://www.rcsb.org/ligand/GUT" target="_blank">GUT</a>,
<a href="https://www.rcsb.org/ligand/GUV" target="_blank">GUV</a>,
<a href="https://www.rcsb.org/ligand/GUY" target="_blank">GUY</a>,
<a href="https://www.rcsb.org/ligand/GYM" target="_blank">GYM</a>,
<a href="https://www.rcsb.org/ligand/GYP" target="_blank">GYP</a>,
<a href="https://www.rcsb.org/ligand/HAD" target="_blank">HAD</a>,
<a href="https://www.rcsb.org/ligand/HAE" target="_blank">HAE</a>,
<a href="https://www.rcsb.org/ligand/HAG" target="_blank">HAG</a>,
<a href="https://www.rcsb.org/ligand/HAH" target="_blank">HAH</a>,
<a href="https://www.rcsb.org/ligand/HAJ" target="_blank">HAJ</a>,
<a href="https://www.rcsb.org/ligand/HAM" target="_blank">HAM</a>,
<a href="https://www.rcsb.org/ligand/HAO" target="_blank">HAO</a>,
<a href="https://www.rcsb.org/ligand/HAP" target="_blank">HAP</a>,
<a href="https://www.rcsb.org/ligand/HAS" target="_blank">HAS</a>,
<a href="https://www.rcsb.org/ligand/HAT" target="_blank">HAT</a>,
<a href="https://www.rcsb.org/ligand/HAW" target="_blank">HAW</a>,
<a href="https://www.rcsb.org/ligand/HAY" target="_blank">HAY</a>,
<a href="https://www.rcsb.org/ligand/HEH" target="_blank">HEH</a>,
<a href="https://www.rcsb.org/ligand/HEM" target="_blank">HEM</a>,
<a href="https://www.rcsb.org/ligand/HEN" target="_blank">HEN</a>,
<a href="https://www.rcsb.org/ligand/HEP" target="_blank">HEP</a>,
<a href="https://www.rcsb.org/ligand/HER" target="_blank">HER</a>,
<a href="https://www.rcsb.org/ligand/HES" target="_blank">HES</a>,
<a href="https://www.rcsb.org/ligand/HET" target="_blank">HET</a>,
<a href="https://www.rcsb.org/ligand/HEW" target="_blank">HEW</a>,
<a href="https://www.rcsb.org/ligand/HEX" target="_blank">HEX</a>,
<a href="https://www.rcsb.org/ligand/HEY" target="_blank">HEY</a>,
<a href="https://www.rcsb.org/ligand/HIC" target="_blank">HIC</a>,
<a href="https://www.rcsb.org/ligand/HID" target="_blank">HID</a>,
<a href="https://www.rcsb.org/ligand/HIE" target="_blank">HIE</a>,
<a href="https://www.rcsb.org/ligand/HIN" target="_blank">HIN</a>,
<a href="https://www.rcsb.org/ligand/HIP" target="_blank">HIP</a>,
<a href="https://www.rcsb.org/ligand/HIS" target="_blank">HIS</a>,
<a href="https://www.rcsb.org/ligand/HIT" target="_blank">HIT</a>,
<a href="https://www.rcsb.org/ligand/HMM" target="_blank">HMM</a>,
<a href="https://www.rcsb.org/ligand/HO" target="_blank">HO</a>,
<a href="https://www.rcsb.org/ligand/HOB" target="_blank">HOB</a>,
<a href="https://www.rcsb.org/ligand/HOE" target="_blank">HOE</a>,
<a href="https://www.rcsb.org/ligand/HOG" target="_blank">HOG</a>,
<a href="https://www.rcsb.org/ligand/HOP" target="_blank">HOP</a>,
<a href="https://www.rcsb.org/ligand/HOT" target="_blank">HOT</a>,
<a href="https://www.rcsb.org/ligand/HOW" target="_blank">HOW</a>,
<a href="https://www.rcsb.org/ligand/HUB" target="_blank">HUB</a>,
<a href="https://www.rcsb.org/ligand/HUE" target="_blank">HUE</a>,
<a href="https://www.rcsb.org/ligand/HUG" target="_blank">HUG</a>,
<a href="https://www.rcsb.org/ligand/HUH" target="_blank">HUH</a>,
<a href="https://www.rcsb.org/ligand/HUM" target="_blank">HUM</a>,
<a href="https://www.rcsb.org/ligand/HUN" target="_blank">HUN</a>,
<a href="https://www.rcsb.org/ligand/HUP" target="_blank">HUP</a>,
<a href="https://www.rcsb.org/ligand/HUT" target="_blank">HUT</a>,
<a href="https://www.rcsb.org/ligand/HYP" target="_blank">HYP</a>,
<a href="https://www.rcsb.org/ligand/ICE" target="_blank">ICE</a>,
<a href="https://www.rcsb.org/ligand/ICH" target="_blank">ICH</a>,
<a href="https://www.rcsb.org/ligand/ICY" target="_blank">ICY</a>,
<a href="https://www.rcsb.org/ligand/IDS" target="_blank">IDS</a>,
<a href="https://www.rcsb.org/ligand/IFS" target="_blank">IFS</a>,
<a href="https://www.rcsb.org/ligand/ILL" target="_blank">ILL</a>,
<a href="https://www.rcsb.org/ligand/IMP" target="_blank">IMP</a>,
<a href="https://www.rcsb.org/ligand/IN" target="_blank">IN</a>,
<a href="https://www.rcsb.org/ligand/INK" target="_blank">INK</a>,
<a href="https://www.rcsb.org/ligand/INN" target="_blank">INN</a>,
<a href="https://www.rcsb.org/ligand/INS" target="_blank">INS</a>,
<a href="https://www.rcsb.org/ligand/ION" target="_blank">ION</a>,
<a href="https://www.rcsb.org/ligand/IRE" target="_blank">IRE</a>,
<a href="https://www.rcsb.org/ligand/IRK" target="_blank">IRK</a>,
<a href="https://www.rcsb.org/ligand/ISM" target="_blank">ISM</a>,
<a href="https://www.rcsb.org/ligand/ITS" target="_blank">ITS</a>,
<a href="https://www.rcsb.org/ligand/JAB" target="_blank">JAB</a>,
<a href="https://www.rcsb.org/ligand/JAG" target="_blank">JAG</a>,
<a href="https://www.rcsb.org/ligand/JAR" target="_blank">JAR</a>,
<a href="https://www.rcsb.org/ligand/JAW" target="_blank">JAW</a>,
<a href="https://www.rcsb.org/ligand/JAY" target="_blank">JAY</a>,
<a href="https://www.rcsb.org/ligand/JEE" target="_blank">JEE</a>,
<a href="https://www.rcsb.org/ligand/JET" target="_blank">JET</a>,
<a href="https://www.rcsb.org/ligand/JEU" target="_blank">JEU</a>,
<a href="https://www.rcsb.org/ligand/JEW" target="_blank">JEW</a>,
<a href="https://www.rcsb.org/ligand/JIN" target="_blank">JIN</a>,
<a href="https://www.rcsb.org/ligand/JOB" target="_blank">JOB</a>,
<a href="https://www.rcsb.org/ligand/JOE" target="_blank">JOE</a>,
<a href="https://www.rcsb.org/ligand/JOG" target="_blank">JOG</a>,
<a href="https://www.rcsb.org/ligand/JOT" target="_blank">JOT</a>,
<a href="https://www.rcsb.org/ligand/JOW" target="_blank">JOW</a>,
<a href="https://www.rcsb.org/ligand/JOY" target="_blank">JOY</a>,
<a href="https://www.rcsb.org/ligand/JUG" target="_blank">JUG</a>,
<a href="https://www.rcsb.org/ligand/JUN" target="_blank">JUN</a>,
<a href="https://www.rcsb.org/ligand/JUS" target="_blank">JUS</a>,
<a href="https://www.rcsb.org/ligand/JUT" target="_blank">JUT</a>,
<a href="https://www.rcsb.org/ligand/KAB" target="_blank">KAB</a>,
<a href="https://www.rcsb.org/ligand/KAF" target="_blank">KAF</a>,
<a href="https://www.rcsb.org/ligand/KAS" target="_blank">KAS</a>,
<a href="https://www.rcsb.org/ligand/KAT" target="_blank">KAT</a>,
<a href="https://www.rcsb.org/ligand/KAY" target="_blank">KAY</a>,
<a href="https://www.rcsb.org/ligand/KEA" target="_blank">KEA</a>,
<a href="https://www.rcsb.org/ligand/KEF" target="_blank">KEF</a>,
<a href="https://www.rcsb.org/ligand/KEG" target="_blank">KEG</a>,
<a href="https://www.rcsb.org/ligand/KEN" target="_blank">KEN</a>,
<a href="https://www.rcsb.org/ligand/KEP" target="_blank">KEP</a>,
<a href="https://www.rcsb.org/ligand/KEY" target="_blank">KEY</a>,
<a href="https://www.rcsb.org/ligand/KIF" target="_blank">KIF</a>,
<a href="https://www.rcsb.org/ligand/KIN" target="_blank">KIN</a>,
<a href="https://www.rcsb.org/ligand/KIR" target="_blank">KIR</a>,
<a href="https://www.rcsb.org/ligand/KOB" target="_blank">KOB</a>,
<a href="https://www.rcsb.org/ligand/KOP" target="_blank">KOP</a>,
<a href="https://www.rcsb.org/ligand/KOR" target="_blank">KOR</a>,
<a href="https://www.rcsb.org/ligand/KOS" target="_blank">KOS</a>,
<a href="https://www.rcsb.org/ligand/KUE" target="_blank">KUE</a>,
<a href="https://www.rcsb.org/ligand/LA" target="_blank">LA</a>,
<a href="https://www.rcsb.org/ligand/LAB" target="_blank">LAB</a>,
<a href="https://www.rcsb.org/ligand/LAC" target="_blank">LAC</a>,
<a href="https://www.rcsb.org/ligand/LAD" target="_blank">LAD</a>,
<a href="https://www.rcsb.org/ligand/LAG" target="_blank">LAG</a>,
<a href="https://www.rcsb.org/ligand/LAM" target="_blank">LAM</a>,
<a href="https://www.rcsb.org/ligand/LAP" target="_blank">LAP</a>,
<a href="https://www.rcsb.org/ligand/LAR" target="_blank">LAR</a>,
<a href="https://www.rcsb.org/ligand/LAS" target="_blank">LAS</a>,
<a href="https://www.rcsb.org/ligand/LAT" target="_blank">LAT</a>,
<a href="https://www.rcsb.org/ligand/LAV" target="_blank">LAV</a>,
<a href="https://www.rcsb.org/ligand/LAX" target="_blank">LAX</a>,
<a href="https://www.rcsb.org/ligand/LAY" target="_blank">LAY</a>,
<a href="https://www.rcsb.org/ligand/LEA" target="_blank">LEA</a>,
<a href="https://www.rcsb.org/ligand/LED" target="_blank">LED</a>,
<a href="https://www.rcsb.org/ligand/LEE" target="_blank">LEE</a>,
<a href="https://www.rcsb.org/ligand/LEG" target="_blank">LEG</a>,
<a href="https://www.rcsb.org/ligand/LEI" target="_blank">LEI</a>,
<a href="https://www.rcsb.org/ligand/LEK" target="_blank">LEK</a>,
<a href="https://www.rcsb.org/ligand/LET" target="_blank">LET</a>,
<a href="https://www.rcsb.org/ligand/LEU" target="_blank">LEU</a>,
<a href="https://www.rcsb.org/ligand/LEV" target="_blank">LEV</a>,
<a href="https://www.rcsb.org/ligand/LEX" target="_blank">LEX</a>,
<a href="https://www.rcsb.org/ligand/LEY" target="_blank">LEY</a>,
<a href="https://www.rcsb.org/ligand/LEZ" target="_blank">LEZ</a>,
<a href="https://www.rcsb.org/ligand/LI" target="_blank">LI</a>,
<a href="https://www.rcsb.org/ligand/LIB" target="_blank">LIB</a>,
<a href="https://www.rcsb.org/ligand/LID" target="_blank">LID</a>,
<a href="https://www.rcsb.org/ligand/LIE" target="_blank">LIE</a>,
<a href="https://www.rcsb.org/ligand/LIN" target="_blank">LIN</a>,
<a href="https://www.rcsb.org/ligand/LIP" target="_blank">LIP</a>,
<a href="https://www.rcsb.org/ligand/LIS" target="_blank">LIS</a>,
<a href="https://www.rcsb.org/ligand/LIT" target="_blank">LIT</a>,
<a href="https://www.rcsb.org/ligand/LOB" target="_blank">LOB</a>,
<a href="https://www.rcsb.org/ligand/LOG" target="_blank">LOG</a>,
<a href="https://www.rcsb.org/ligand/LOP" target="_blank">LOP</a>,
<a href="https://www.rcsb.org/ligand/LOT" target="_blank">LOT</a>,
<a href="https://www.rcsb.org/ligand/LOW" target="_blank">LOW</a>,
<a href="https://www.rcsb.org/ligand/LOX" target="_blank">LOX</a>,
<a href="https://www.rcsb.org/ligand/LUG" target="_blank">LUG</a>,
<a href="https://www.rcsb.org/ligand/LUM" target="_blank">LUM</a>,
<a href="https://www.rcsb.org/ligand/LUV" target="_blank">LUV</a>,
<a href="https://www.rcsb.org/ligand/LUX" target="_blank">LUX</a>,
<a href="https://www.rcsb.org/ligand/LYE" target="_blank">LYE</a>,
<a href="https://www.rcsb.org/ligand/MA" target="_blank">MA</a>,
<a href="https://www.rcsb.org/ligand/MAC" target="_blank">MAC</a>,
<a href="https://www.rcsb.org/ligand/MAD" target="_blank">MAD</a>,
<a href="https://www.rcsb.org/ligand/MAE" target="_blank">MAE</a>,
<a href="https://www.rcsb.org/ligand/MAG" target="_blank">MAG</a>,
<a href="https://www.rcsb.org/ligand/MAN" target="_blank">MAN</a>,
<a href="https://www.rcsb.org/ligand/MAP" target="_blank">MAP</a>,
<a href="https://www.rcsb.org/ligand/MAR" target="_blank">MAR</a>,
<a href="https://www.rcsb.org/ligand/MAS" target="_blank">MAS</a>,
<a href="https://www.rcsb.org/ligand/MAT" target="_blank">MAT</a>,
<a href="https://www.rcsb.org/ligand/MAW" target="_blank">MAW</a>,
<a href="https://www.rcsb.org/ligand/MAX" target="_blank">MAX</a>,
<a href="https://www.rcsb.org/ligand/MAY" target="_blank">MAY</a>,
<a href="https://www.rcsb.org/ligand/MED" target="_blank">MED</a>,
<a href="https://www.rcsb.org/ligand/MEL" target="_blank">MEL</a>,
<a href="https://www.rcsb.org/ligand/MEM" target="_blank">MEM</a>,
<a href="https://www.rcsb.org/ligand/MEN" target="_blank">MEN</a>,
<a href="https://www.rcsb.org/ligand/MET" target="_blank">MET</a>,
<a href="https://www.rcsb.org/ligand/MEW" target="_blank">MEW</a>,
<a href="https://www.rcsb.org/ligand/MHO" target="_blank">MHO</a>,
<a href="https://www.rcsb.org/ligand/MIB" target="_blank">MIB</a>,
<a href="https://www.rcsb.org/ligand/MID" target="_blank">MID</a>,
<a href="https://www.rcsb.org/ligand/MIG" target="_blank">MIG</a>,
<a href="https://www.rcsb.org/ligand/MIL" target="_blank">MIL</a>,
<a href="https://www.rcsb.org/ligand/MIM" target="_blank">MIM</a>,
<a href="https://www.rcsb.org/ligand/MIR" target="_blank">MIR</a>,
<a href="https://www.rcsb.org/ligand/MIS" target="_blank">MIS</a>,
<a href="https://www.rcsb.org/ligand/MIX" target="_blank">MIX</a>,
<a href="https://www.rcsb.org/ligand/MO" target="_blank">MO</a>,
<a href="https://www.rcsb.org/ligand/MOA" target="_blank">MOA</a>,
<a href="https://www.rcsb.org/ligand/MOB" target="_blank">MOB</a>,
<a href="https://www.rcsb.org/ligand/MOC" target="_blank">MOC</a>,
<a href="https://www.rcsb.org/ligand/MOD" target="_blank">MOD</a>,
<a href="https://www.rcsb.org/ligand/MOG" target="_blank">MOG</a>,
<a href="https://www.rcsb.org/ligand/MOL" target="_blank">MOL</a>,
<a href="https://www.rcsb.org/ligand/MOM" target="_blank">MOM</a>,
<a href="https://www.rcsb.org/ligand/MON" target="_blank">MON</a>,
<a href="https://www.rcsb.org/ligand/MOO" target="_blank">MOO</a>,
<a href="https://www.rcsb.org/ligand/MOP" target="_blank">MOP</a>,
<a href="https://www.rcsb.org/ligand/MOR" target="_blank">MOR</a>,
<a href="https://www.rcsb.org/ligand/MOS" target="_blank">MOS</a>,
<a href="https://www.rcsb.org/ligand/MOT" target="_blank">MOT</a>,
<a href="https://www.rcsb.org/ligand/MOW" target="_blank">MOW</a>,
<a href="https://www.rcsb.org/ligand/MUD" target="_blank">MUD</a>,
<a href="https://www.rcsb.org/ligand/MUG" target="_blank">MUG</a>,
<a href="https://www.rcsb.org/ligand/MUM" target="_blank">MUM</a>,
<a href="https://www.rcsb.org/ligand/MUN" target="_blank">MUN</a>,
<a href="https://www.rcsb.org/ligand/MUS" target="_blank">MUS</a>,
<a href="https://www.rcsb.org/ligand/MUT" target="_blank">MUT</a>,
<a href="https://www.rcsb.org/ligand/NA" target="_blank">NA</a>,
<a href="https://www.rcsb.org/ligand/NAB" target="_blank">NAB</a>,
<a href="https://www.rcsb.org/ligand/NAE" target="_blank">NAE</a>,
<a href="https://www.rcsb.org/ligand/NAG" target="_blank">NAG</a>,
<a href="https://www.rcsb.org/ligand/NAH" target="_blank">NAH</a>,
<a href="https://www.rcsb.org/ligand/NAM" target="_blank">NAM</a>,
<a href="https://www.rcsb.org/ligand/NAN" target="_blank">NAN</a>,
<a href="https://www.rcsb.org/ligand/NAP" target="_blank">NAP</a>,
<a href="https://www.rcsb.org/ligand/NAW" target="_blank">NAW</a>,
<a href="https://www.rcsb.org/ligand/NAY" target="_blank">NAY</a>,
<a href="https://www.rcsb.org/ligand/NEB" target="_blank">NEB</a>,
<a href="https://www.rcsb.org/ligand/NEE" target="_blank">NEE</a>,
<a href="https://www.rcsb.org/ligand/NET" target="_blank">NET</a>,
<a href="https://www.rcsb.org/ligand/NEW" target="_blank">NEW</a>,
<a href="https://www.rcsb.org/ligand/NIL" target="_blank">NIL</a>,
<a href="https://www.rcsb.org/ligand/NIM" target="_blank">NIM</a>,
<a href="https://www.rcsb.org/ligand/NIP" target="_blank">NIP</a>,
<a href="https://www.rcsb.org/ligand/NIT" target="_blank">NIT</a>,
<a href="https://www.rcsb.org/ligand/NIX" target="_blank">NIX</a>,
<a href="https://www.rcsb.org/ligand/NO" target="_blank">NO</a>,
<a href="https://www.rcsb.org/ligand/NOB" target="_blank">NOB</a>,
<a href="https://www.rcsb.org/ligand/NOD" target="_blank">NOD</a>,
<a href="https://www.rcsb.org/ligand/NOG" target="_blank">NOG</a>,
<a href="https://www.rcsb.org/ligand/NOH" target="_blank">NOH</a>,
<a href="https://www.rcsb.org/ligand/NOM" target="_blank">NOM</a>,
<a href="https://www.rcsb.org/ligand/NOO" target="_blank">NOO</a>,
<a href="https://www.rcsb.org/ligand/NOR" target="_blank">NOR</a>,
<a href="https://www.rcsb.org/ligand/NOS" target="_blank">NOS</a>,
<a href="https://www.rcsb.org/ligand/NOT" target="_blank">NOT</a>,
<a href="https://www.rcsb.org/ligand/NOW" target="_blank">NOW</a>,
<a href="https://www.rcsb.org/ligand/NTH" target="_blank">NTH</a>,
<a href="https://www.rcsb.org/ligand/NUB" target="_blank">NUB</a>,
<a href="https://www.rcsb.org/ligand/NUT" target="_blank">NUT</a>,
<a href="https://www.rcsb.org/ligand/OAF" target="_blank">OAF</a>,
<a href="https://www.rcsb.org/ligand/OAK" target="_blank">OAK</a>,
<a href="https://www.rcsb.org/ligand/OAR" target="_blank">OAR</a>,
<a href="https://www.rcsb.org/ligand/OBE" target="_blank">OBE</a>,
<a href="https://www.rcsb.org/ligand/OBI" target="_blank">OBI</a>,
<a href="https://www.rcsb.org/ligand/OCA" target="_blank">OCA</a>,
<a href="https://www.rcsb.org/ligand/ODD" target="_blank">ODD</a>,
<a href="https://www.rcsb.org/ligand/ODE" target="_blank">ODE</a>,
<a href="https://www.rcsb.org/ligand/ODS" target="_blank">ODS</a>,
<a href="https://www.rcsb.org/ligand/OES" target="_blank">OES</a>,
<a href="https://www.rcsb.org/ligand/OFF" target="_blank">OFF</a>,
<a href="https://www.rcsb.org/ligand/OFT" target="_blank">OFT</a>,
<a href="https://www.rcsb.org/ligand/OH" target="_blank">OH</a>,
<a href="https://www.rcsb.org/ligand/OHM" target="_blank">OHM</a>,
<a href="https://www.rcsb.org/ligand/OHO" target="_blank">OHO</a>,
<a href="https://www.rcsb.org/ligand/OHS" target="_blank">OHS</a>,
<a href="https://www.rcsb.org/ligand/OIL" target="_blank">OIL</a>,
<a href="https://www.rcsb.org/ligand/OKA" target="_blank">OKA</a>,
<a href="https://www.rcsb.org/ligand/OKE" target="_blank">OKE</a>,
<a href="https://www.rcsb.org/ligand/OLD" target="_blank">OLD</a>,
<a href="https://www.rcsb.org/ligand/OLE" target="_blank">OLE</a>,
<a href="https://www.rcsb.org/ligand/OMS" target="_blank">OMS</a>,
<a href="https://www.rcsb.org/ligand/ONE" target="_blank">ONE</a>,
<a href="https://www.rcsb.org/ligand/ONS" target="_blank">ONS</a>,
<a href="https://www.rcsb.org/ligand/OOH" target="_blank">OOH</a>,
<a href="https://www.rcsb.org/ligand/OPE" target="_blank">OPE</a>,
<a href="https://www.rcsb.org/ligand/OPS" target="_blank">OPS</a>,
<a href="https://www.rcsb.org/ligand/OPT" target="_blank">OPT</a>,
<a href="https://www.rcsb.org/ligand/ORA" target="_blank">ORA</a>,
<a href="https://www.rcsb.org/ligand/ORB" target="_blank">ORB</a>,
<a href="https://www.rcsb.org/ligand/ORC" target="_blank">ORC</a>,
<a href="https://www.rcsb.org/ligand/ORE" target="_blank">ORE</a>,
<a href="https://www.rcsb.org/ligand/ORS" target="_blank">ORS</a>,
<a href="https://www.rcsb.org/ligand/ORT" target="_blank">ORT</a>,
<a href="https://www.rcsb.org/ligand/OS" target="_blank">OS</a>,
<a href="https://www.rcsb.org/ligand/OSE" target="_blank">OSE</a>,
<a href="https://www.rcsb.org/ligand/OUD" target="_blank">OUD</a>,
<a href="https://www.rcsb.org/ligand/OUR" target="_blank">OUR</a>,
<a href="https://www.rcsb.org/ligand/OUT" target="_blank">OUT</a>,
<a href="https://www.rcsb.org/ligand/OVA" target="_blank">OVA</a>,
<a href="https://www.rcsb.org/ligand/OWL" target="_blank">OWL</a>,
<a href="https://www.rcsb.org/ligand/OWN" target="_blank">OWN</a>,
<a href="https://www.rcsb.org/ligand/OX" target="_blank">OX</a>,
<a href="https://www.rcsb.org/ligand/OXO" target="_blank">OXO</a>,
<a href="https://www.rcsb.org/ligand/OXY" target="_blank">OXY</a>,
<a href="https://www.rcsb.org/ligand/PAC" target="_blank">PAC</a>,
<a href="https://www.rcsb.org/ligand/PAD" target="_blank">PAD</a>,
<a href="https://www.rcsb.org/ligand/PAH" target="_blank">PAH</a>,
<a href="https://www.rcsb.org/ligand/PAL" target="_blank">PAL</a>,
<a href="https://www.rcsb.org/ligand/PAM" target="_blank">PAM</a>,
<a href="https://www.rcsb.org/ligand/PAN" target="_blank">PAN</a>,
<a href="https://www.rcsb.org/ligand/PAP" target="_blank">PAP</a>,
<a href="https://www.rcsb.org/ligand/PAR" target="_blank">PAR</a>,
<a href="https://www.rcsb.org/ligand/PAS" target="_blank">PAS</a>,
<a href="https://www.rcsb.org/ligand/PAT" target="_blank">PAT</a>,
<a href="https://www.rcsb.org/ligand/PAW" target="_blank">PAW</a>,
<a href="https://www.rcsb.org/ligand/PAX" target="_blank">PAX</a>,
<a href="https://www.rcsb.org/ligand/PAY" target="_blank">PAY</a>,
<a href="https://www.rcsb.org/ligand/PEA" target="_blank">PEA</a>,
<a href="https://www.rcsb.org/ligand/PEC" target="_blank">PEC</a>,
<a href="https://www.rcsb.org/ligand/PED" target="_blank">PED</a>,
<a href="https://www.rcsb.org/ligand/PEE" target="_blank">PEE</a>,
<a href="https://www.rcsb.org/ligand/PEG" target="_blank">PEG</a>,
<a href="https://www.rcsb.org/ligand/PEH" target="_blank">PEH</a>,
<a href="https://www.rcsb.org/ligand/PEP" target="_blank">PEP</a>,
<a href="https://www.rcsb.org/ligand/PER" target="_blank">PER</a>,
<a href="https://www.rcsb.org/ligand/PET" target="_blank">PET</a>,
<a href="https://www.rcsb.org/ligand/PEW" target="_blank">PEW</a>,
<a href="https://www.rcsb.org/ligand/PHI" target="_blank">PHI</a>,
<a href="https://www.rcsb.org/ligand/PHT" target="_blank">PHT</a>,
<a href="https://www.rcsb.org/ligand/PI" target="_blank">PI</a>,
<a href="https://www.rcsb.org/ligand/PIA" target="_blank">PIA</a>,
<a href="https://www.rcsb.org/ligand/PIC" target="_blank">PIC</a>,
<a href="https://www.rcsb.org/ligand/PIE" target="_blank">PIE</a>,
<a href="https://www.rcsb.org/ligand/PIG" target="_blank">PIG</a>,
<a href="https://www.rcsb.org/ligand/PIN" target="_blank">PIN</a>,
<a href="https://www.rcsb.org/ligand/PIP" target="_blank">PIP</a>,
<a href="https://www.rcsb.org/ligand/PIS" target="_blank">PIS</a>,
<a href="https://www.rcsb.org/ligand/PIT" target="_blank">PIT</a>,
<a href="https://www.rcsb.org/ligand/PIU" target="_blank">PIU</a>,
<a href="https://www.rcsb.org/ligand/PIX" target="_blank">PIX</a>,
<a href="https://www.rcsb.org/ligand/PLY" target="_blank">PLY</a>,
<a href="https://www.rcsb.org/ligand/POD" target="_blank">POD</a>,
<a href="https://www.rcsb.org/ligand/POH" target="_blank">POH</a>,
<a href="https://www.rcsb.org/ligand/POI" target="_blank">POI</a>,
<a href="https://www.rcsb.org/ligand/POL" target="_blank">POL</a>,
<a href="https://www.rcsb.org/ligand/POM" target="_blank">POM</a>,
<a href="https://www.rcsb.org/ligand/POP" target="_blank">POP</a>,
<a href="https://www.rcsb.org/ligand/POT" target="_blank">POT</a>,
<a href="https://www.rcsb.org/ligand/POW" target="_blank">POW</a>,
<a href="https://www.rcsb.org/ligand/POX" target="_blank">POX</a>,
<a href="https://www.rcsb.org/ligand/PRO" target="_blank">PRO</a>,
<a href="https://www.rcsb.org/ligand/PRY" target="_blank">PRY</a>,
<a href="https://www.rcsb.org/ligand/PSI" target="_blank">PSI</a>,
<a href="https://www.rcsb.org/ligand/PUB" target="_blank">PUB</a>,
<a href="https://www.rcsb.org/ligand/PUD" target="_blank">PUD</a>,
<a href="https://www.rcsb.org/ligand/PUG" target="_blank">PUG</a>,
<a href="https://www.rcsb.org/ligand/PUL" target="_blank">PUL</a>,
<a href="https://www.rcsb.org/ligand/PUN" target="_blank">PUN</a>,
<a href="https://www.rcsb.org/ligand/PUP" target="_blank">PUP</a>,
<a href="https://www.rcsb.org/ligand/PUR" target="_blank">PUR</a>,
<a href="https://www.rcsb.org/ligand/PUS" target="_blank">PUS</a>,
<a href="https://www.rcsb.org/ligand/PUT" target="_blank">PUT</a>,
<a href="https://www.rcsb.org/ligand/PYA" target="_blank">PYA</a>,
<a href="https://www.rcsb.org/ligand/PYE" target="_blank">PYE</a>,
<a href="https://www.rcsb.org/ligand/PYX" target="_blank">PYX</a>,
<a href="https://www.rcsb.org/ligand/QAT" target="_blank">QAT</a>,
<a href="https://www.rcsb.org/ligand/QUA" target="_blank">QUA</a>,
<a href="https://www.rcsb.org/ligand/RAD" target="_blank">RAD</a>,
<a href="https://www.rcsb.org/ligand/RAH" target="_blank">RAH</a>,
<a href="https://www.rcsb.org/ligand/RAJ" target="_blank">RAJ</a>,
<a href="https://www.rcsb.org/ligand/RAM" target="_blank">RAM</a>,
<a href="https://www.rcsb.org/ligand/RAN" target="_blank">RAN</a>,
<a href="https://www.rcsb.org/ligand/RAP" target="_blank">RAP</a>,
<a href="https://www.rcsb.org/ligand/RAS" target="_blank">RAS</a>,
<a href="https://www.rcsb.org/ligand/RAT" target="_blank">RAT</a>,
<a href="https://www.rcsb.org/ligand/RAW" target="_blank">RAW</a>,
<a href="https://www.rcsb.org/ligand/RAX" target="_blank">RAX</a>,
<a href="https://www.rcsb.org/ligand/RAY" target="_blank">RAY</a>,
<a href="https://www.rcsb.org/ligand/RE" target="_blank">RE</a>,
<a href="https://www.rcsb.org/ligand/REB" target="_blank">REB</a>,
<a href="https://www.rcsb.org/ligand/REC" target="_blank">REC</a>,
<a href="https://www.rcsb.org/ligand/RED" target="_blank">RED</a>,
<a href="https://www.rcsb.org/ligand/REE" target="_blank">REE</a>,
<a href="https://www.rcsb.org/ligand/REF" target="_blank">REF</a>,
<a href="https://www.rcsb.org/ligand/REG" target="_blank">REG</a>,
<a href="https://www.rcsb.org/ligand/REI" target="_blank">REI</a>,
<a href="https://www.rcsb.org/ligand/REM" target="_blank">REM</a>,
<a href="https://www.rcsb.org/ligand/REP" target="_blank">REP</a>,
<a href="https://www.rcsb.org/ligand/RES" target="_blank">RES</a>,
<a href="https://www.rcsb.org/ligand/RET" target="_blank">RET</a>,
<a href="https://www.rcsb.org/ligand/REV" target="_blank">REV</a>,
<a href="https://www.rcsb.org/ligand/REX" target="_blank">REX</a>,
<a href="https://www.rcsb.org/ligand/RHO" target="_blank">RHO</a>,
<a href="https://www.rcsb.org/ligand/RIA" target="_blank">RIA</a>,
<a href="https://www.rcsb.org/ligand/RIB" target="_blank">RIB</a>,
<a href="https://www.rcsb.org/ligand/RID" target="_blank">RID</a>,
<a href="https://www.rcsb.org/ligand/RIF" target="_blank">RIF</a>,
<a href="https://www.rcsb.org/ligand/RIM" target="_blank">RIM</a>,
<a href="https://www.rcsb.org/ligand/RIN" target="_blank">RIN</a>,
<a href="https://www.rcsb.org/ligand/RIP" target="_blank">RIP</a>,
<a href="https://www.rcsb.org/ligand/ROB" target="_blank">ROB</a>,
<a href="https://www.rcsb.org/ligand/ROC" target="_blank">ROC</a>,
<a href="https://www.rcsb.org/ligand/ROD" target="_blank">ROD</a>,
<a href="https://www.rcsb.org/ligand/ROE" target="_blank">ROE</a>,
<a href="https://www.rcsb.org/ligand/ROM" target="_blank">ROM</a>,
<a href="https://www.rcsb.org/ligand/RUB" target="_blank">RUB</a>,
<a href="https://www.rcsb.org/ligand/RUE" target="_blank">RUE</a>,
<a href="https://www.rcsb.org/ligand/RUG" target="_blank">RUG</a>,
<a href="https://www.rcsb.org/ligand/RUM" target="_blank">RUM</a>,
<a href="https://www.rcsb.org/ligand/RUN" target="_blank">RUN</a>,
<a href="https://www.rcsb.org/ligand/RUT" target="_blank">RUT</a>,
<a href="https://www.rcsb.org/ligand/RYA" target="_blank">RYA</a>,
<a href="https://www.rcsb.org/ligand/RYE" target="_blank">RYE</a>,
<a href="https://www.rcsb.org/ligand/SAB" target="_blank">SAB</a>,
<a href="https://www.rcsb.org/ligand/SAC" target="_blank">SAC</a>,
<a href="https://www.rcsb.org/ligand/SAD" target="_blank">SAD</a>,
<a href="https://www.rcsb.org/ligand/SAE" target="_blank">SAE</a>,
<a href="https://www.rcsb.org/ligand/SAG" target="_blank">SAG</a>,
<a href="https://www.rcsb.org/ligand/SAL" target="_blank">SAL</a>,
<a href="https://www.rcsb.org/ligand/SAP" target="_blank">SAP</a>,
<a href="https://www.rcsb.org/ligand/SAT" target="_blank">SAT</a>,
<a href="https://www.rcsb.org/ligand/SAU" target="_blank">SAU</a>,
<a href="https://www.rcsb.org/ligand/SAW" target="_blank">SAW</a>,
<a href="https://www.rcsb.org/ligand/SAX" target="_blank">SAX</a>,
<a href="https://www.rcsb.org/ligand/SAY" target="_blank">SAY</a>,
<a href="https://www.rcsb.org/ligand/SEA" target="_blank">SEA</a>,
<a href="https://www.rcsb.org/ligand/SEC" target="_blank">SEC</a>,
<a href="https://www.rcsb.org/ligand/SEE" target="_blank">SEE</a>,
<a href="https://www.rcsb.org/ligand/SEG" target="_blank">SEG</a>,
<a href="https://www.rcsb.org/ligand/SEI" target="_blank">SEI</a>,
<a href="https://www.rcsb.org/ligand/SEL" target="_blank">SEL</a>,
<a href="https://www.rcsb.org/ligand/SEN" target="_blank">SEN</a>,
<a href="https://www.rcsb.org/ligand/SER" target="_blank">SER</a>,
<a href="https://www.rcsb.org/ligand/SET" target="_blank">SET</a>,
<a href="https://www.rcsb.org/ligand/SHA" target="_blank">SHA</a>,
<a href="https://www.rcsb.org/ligand/SHH" target="_blank">SHH</a>,
<a href="https://www.rcsb.org/ligand/SHY" target="_blank">SHY</a>,
<a href="https://www.rcsb.org/ligand/SIB" target="_blank">SIB</a>,
<a href="https://www.rcsb.org/ligand/SIC" target="_blank">SIC</a>,
<a href="https://www.rcsb.org/ligand/SIM" target="_blank">SIM</a>,
<a href="https://www.rcsb.org/ligand/SIN" target="_blank">SIN</a>,
<a href="https://www.rcsb.org/ligand/SIP" target="_blank">SIP</a>,
<a href="https://www.rcsb.org/ligand/SIR" target="_blank">SIR</a>,
<a href="https://www.rcsb.org/ligand/SIS" target="_blank">SIS</a>,
<a href="https://www.rcsb.org/ligand/SIX" target="_blank">SIX</a>,
<a href="https://www.rcsb.org/ligand/SKA" target="_blank">SKA</a>,
<a href="https://www.rcsb.org/ligand/SKY" target="_blank">SKY</a>,
<a href="https://www.rcsb.org/ligand/SLY" target="_blank">SLY</a>,
<a href="https://www.rcsb.org/ligand/SOD" target="_blank">SOD</a>,
<a href="https://www.rcsb.org/ligand/SOL" target="_blank">SOL</a>,
<a href="https://www.rcsb.org/ligand/SON" target="_blank">SON</a>,
<a href="https://www.rcsb.org/ligand/SOP" target="_blank">SOP</a>,
<a href="https://www.rcsb.org/ligand/SOS" target="_blank">SOS</a>,
<a href="https://www.rcsb.org/ligand/SOT" target="_blank">SOT</a>,
<a href="https://www.rcsb.org/ligand/SOX" target="_blank">SOX</a>,
<a href="https://www.rcsb.org/ligand/SOY" target="_blank">SOY</a>,
<a href="https://www.rcsb.org/ligand/SPA" target="_blank">SPA</a>,
<a href="https://www.rcsb.org/ligand/SPY" target="_blank">SPY</a>,
<a href="https://www.rcsb.org/ligand/SRI" target="_blank">SRI</a>,
<a href="https://www.rcsb.org/ligand/STY" target="_blank">STY</a>,
<a href="https://www.rcsb.org/ligand/SUB" target="_blank">SUB</a>,
<a href="https://www.rcsb.org/ligand/SUE" target="_blank">SUE</a>,
<a href="https://www.rcsb.org/ligand/SUN" target="_blank">SUN</a>,
<a href="https://www.rcsb.org/ligand/SUP" target="_blank">SUP</a>,
<a href="https://www.rcsb.org/ligand/SUQ" target="_blank">SUQ</a>,
<a href="https://www.rcsb.org/ligand/SYN" target="_blank">SYN</a>,
<a href="https://www.rcsb.org/ligand/TAB" target="_blank">TAB</a>,
<a href="https://www.rcsb.org/ligand/TAD" target="_blank">TAD</a>,
<a href="https://www.rcsb.org/ligand/TAE" target="_blank">TAE</a>,
<a href="https://www.rcsb.org/ligand/TAG" target="_blank">TAG</a>,
<a href="https://www.rcsb.org/ligand/TAJ" target="_blank">TAJ</a>,
<a href="https://www.rcsb.org/ligand/TAM" target="_blank">TAM</a>,
<a href="https://www.rcsb.org/ligand/TAN" target="_blank">TAN</a>,
<a href="https://www.rcsb.org/ligand/TAO" target="_blank">TAO</a>,
<a href="https://www.rcsb.org/ligand/TAP" target="_blank">TAP</a>,
<a href="https://www.rcsb.org/ligand/TAR" target="_blank">TAR</a>,
<a href="https://www.rcsb.org/ligand/TAS" target="_blank">TAS</a>,
<a href="https://www.rcsb.org/ligand/TAT" target="_blank">TAT</a>,
<a href="https://www.rcsb.org/ligand/TAU" target="_blank">TAU</a>,
<a href="https://www.rcsb.org/ligand/TAV" target="_blank">TAV</a>,
<a href="https://www.rcsb.org/ligand/TAW" target="_blank">TAW</a>,
<a href="https://www.rcsb.org/ligand/TAX" target="_blank">TAX</a>,
<a href="https://www.rcsb.org/ligand/TEA" target="_blank">TEA</a>,
<a href="https://www.rcsb.org/ligand/TED" target="_blank">TED</a>,
<a href="https://www.rcsb.org/ligand/TEE" target="_blank">TEE</a>,
<a href="https://www.rcsb.org/ligand/TEG" target="_blank">TEG</a>,
<a href="https://www.rcsb.org/ligand/TEL" target="_blank">TEL</a>,
<a href="https://www.rcsb.org/ligand/TEN" target="_blank">TEN</a>,
<a href="https://www.rcsb.org/ligand/TET" target="_blank">TET</a>,
<a href="https://www.rcsb.org/ligand/TEW" target="_blank">TEW</a>,
<a href="https://www.rcsb.org/ligand/THE" target="_blank">THE</a>,
<a href="https://www.rcsb.org/ligand/THO" target="_blank">THO</a>,
<a href="https://www.rcsb.org/ligand/THY" target="_blank">THY</a>,
<a href="https://www.rcsb.org/ligand/TIC" target="_blank">TIC</a>,
<a href="https://www.rcsb.org/ligand/TIL" target="_blank">TIL</a>,
<a href="https://www.rcsb.org/ligand/TIN" target="_blank">TIN</a>,
<a href="https://www.rcsb.org/ligand/TIS" target="_blank">TIS</a>,
<a href="https://www.rcsb.org/ligand/TIT" target="_blank">TIT</a>,
<a href="https://www.rcsb.org/ligand/TOD" target="_blank">TOD</a>,
<a href="https://www.rcsb.org/ligand/TOE" target="_blank">TOE</a>,
<a href="https://www.rcsb.org/ligand/TOG" target="_blank">TOG</a>,
<a href="https://www.rcsb.org/ligand/TOM" target="_blank">TOM</a>,
<a href="https://www.rcsb.org/ligand/TON" target="_blank">TON</a>,
<a href="https://www.rcsb.org/ligand/TOP" target="_blank">TOP</a>,
<a href="https://www.rcsb.org/ligand/TOR" target="_blank">TOR</a>,
<a href="https://www.rcsb.org/ligand/TOT" target="_blank">TOT</a>,
<a href="https://www.rcsb.org/ligand/TOW" target="_blank">TOW</a>,
<a href="https://www.rcsb.org/ligand/TOY" target="_blank">TOY</a>,
<a href="https://www.rcsb.org/ligand/TRY" target="_blank">TRY</a>,
<a href="https://www.rcsb.org/ligand/TSK" target="_blank">TSK</a>,
<a href="https://www.rcsb.org/ligand/TUB" target="_blank">TUB</a>,
<a href="https://www.rcsb.org/ligand/TUG" target="_blank">TUG</a>,
<a href="https://www.rcsb.org/ligand/TUI" target="_blank">TUI</a>,
<a href="https://www.rcsb.org/ligand/TUN" target="_blank">TUN</a>,
<a href="https://www.rcsb.org/ligand/TUP" target="_blank">TUP</a>,
<a href="https://www.rcsb.org/ligand/TUT" target="_blank">TUT</a>,
<a href="https://www.rcsb.org/ligand/TUX" target="_blank">TUX</a>,
<a href="https://www.rcsb.org/ligand/TWA" target="_blank">TWA</a>,
<a href="https://www.rcsb.org/ligand/TWO" target="_blank">TWO</a>,
<a href="https://www.rcsb.org/ligand/TYE" target="_blank">TYE</a>,
<a href="https://www.rcsb.org/ligand/UDO" target="_blank">UDO</a>,
<a href="https://www.rcsb.org/ligand/UGH" target="_blank">UGH</a>,
<a href="https://www.rcsb.org/ligand/UKE" target="_blank">UKE</a>,
<a href="https://www.rcsb.org/ligand/UMM" target="_blank">UMM</a>,
<a href="https://www.rcsb.org/ligand/UMP" target="_blank">UMP</a>,
<a href="https://www.rcsb.org/ligand/UNS" target="_blank">UNS</a>,
<a href="https://www.rcsb.org/ligand/UPS" target="_blank">UPS</a>,
<a href="https://www.rcsb.org/ligand/URB" target="_blank">URB</a>,
<a href="https://www.rcsb.org/ligand/URD" target="_blank">URD</a>,
<a href="https://www.rcsb.org/ligand/URN" target="_blank">URN</a>,
<a href="https://www.rcsb.org/ligand/USE" target="_blank">USE</a>,
<a href="https://www.rcsb.org/ligand/UTA" target="_blank">UTA</a>,
<a href="https://www.rcsb.org/ligand/UTS" target="_blank">UTS</a>,
<a href="https://www.rcsb.org/ligand/VAC" target="_blank">VAC</a>,
<a href="https://www.rcsb.org/ligand/VAR" target="_blank">VAR</a>,
<a href="https://www.rcsb.org/ligand/VAS" target="_blank">VAS</a>,
<a href="https://www.rcsb.org/ligand/VAT" target="_blank">VAT</a>,
<a href="https://www.rcsb.org/ligand/VAU" target="_blank">VAU</a>,
<a href="https://www.rcsb.org/ligand/VAW" target="_blank">VAW</a>,
<a href="https://www.rcsb.org/ligand/VEE" target="_blank">VEE</a>,
<a href="https://www.rcsb.org/ligand/VEG" target="_blank">VEG</a>,
<a href="https://www.rcsb.org/ligand/VET" target="_blank">VET</a>,
<a href="https://www.rcsb.org/ligand/VIA" target="_blank">VIA</a>,
<a href="https://www.rcsb.org/ligand/VIG" target="_blank">VIG</a>,
<a href="https://www.rcsb.org/ligand/VIS" target="_blank">VIS</a>,
<a href="https://www.rcsb.org/ligand/VUG" target="_blank">VUG</a>,
<a href="https://www.rcsb.org/ligand/WAD" target="_blank">WAD</a>,
<a href="https://www.rcsb.org/ligand/WAN" target="_blank">WAN</a>,
<a href="https://www.rcsb.org/ligand/WAS" target="_blank">WAS</a>,
<a href="https://www.rcsb.org/ligand/WAY" target="_blank">WAY</a>,
<a href="https://www.rcsb.org/ligand/WHA" target="_blank">WHA</a>,
<a href="https://www.rcsb.org/ligand/WHY" target="_blank">WHY</a>,
<a href="https://www.rcsb.org/ligand/WIN" target="_blank">WIN</a>,
<a href="https://www.rcsb.org/ligand/WOE" target="_blank">WOE</a>,
<a href="https://www.rcsb.org/ligand/WOG" target="_blank">WOG</a>,
<a href="https://www.rcsb.org/ligand/WOO" target="_blank">WOO</a>,
<a href="https://www.rcsb.org/ligand/WOP" target="_blank">WOP</a>,
<a href="https://www.rcsb.org/ligand/WOS" target="_blank">WOS</a>,
<a href="https://www.rcsb.org/ligand/WOT" target="_blank">WOT</a>,
<a href="https://www.rcsb.org/ligand/WOW" target="_blank">WOW</a>,
<a href="https://www.rcsb.org/ligand/WRY" target="_blank">WRY</a>,
<a href="https://www.rcsb.org/ligand/WUD" target="_blank">WUD</a>,
<a href="https://www.rcsb.org/ligand/WYE" target="_blank">WYE</a>,
<a href="https://www.rcsb.org/ligand/XIS" target="_blank">XIS</a>,
<a href="https://www.rcsb.org/ligand/YAK" target="_blank">YAK</a>,
<a href="https://www.rcsb.org/ligand/YAM" target="_blank">YAM</a>,
<a href="https://www.rcsb.org/ligand/YAP" target="_blank">YAP</a>,
<a href="https://www.rcsb.org/ligand/YEN" target="_blank">YEN</a>,
<a href="https://www.rcsb.org/ligand/YES" target="_blank">YES</a>,
<a href="https://www.rcsb.org/ligand/YET" target="_blank">YET</a>,
<a href="https://www.rcsb.org/ligand/YIN" target="_blank">YIN</a>,
<a href="https://www.rcsb.org/ligand/YIP" target="_blank">YIP</a>,
<a href="https://www.rcsb.org/ligand/YOD" target="_blank">YOD</a>,
<a href="https://www.rcsb.org/ligand/YOK" target="_blank">YOK</a>,
<a href="https://www.rcsb.org/ligand/YOM" target="_blank">YOM</a>,
<a href="https://www.rcsb.org/ligand/YUK" target="_blank">YUK</a>,
<a href="https://www.rcsb.org/ligand/YUM" target="_blank">YUM</a>,
<a href="https://www.rcsb.org/ligand/YUP" target="_blank">YUP</a>,
<a href="https://www.rcsb.org/ligand/ZAP" target="_blank">ZAP</a>,
<a href="https://www.rcsb.org/ligand/ZED" target="_blank">ZED</a>,
<a href="https://www.rcsb.org/ligand/ZEE" target="_blank">ZEE</a>,
<a href="https://www.rcsb.org/ligand/ZIG" target="_blank">ZIG</a>,
<a href="https://www.rcsb.org/ligand/ZIN" target="_blank">ZIN</a>,
<a href="https://www.rcsb.org/ligand/ZIP" target="_blank">ZIP</a>,
<a href="https://www.rcsb.org/ligand/ZIT" target="_blank">ZIT</a>,
<a href="https://www.rcsb.org/ligand/ZOA" target="_blank">ZOA</a>,
<a href="https://www.rcsb.org/ligand/ZOO" target="_blank">ZOO</a>.<br />
Most of these are synthetic compounds; few are natural products. Figuring out which are natural products is a rather convoluted process. Cofactors are easily spotted in the PDBe data, via the 'mapping' API route, wherein they will be in the 'function' field:</p>
<pre><code>import requests
import pandas as pd
from typing import Any, Dict
x = requests.post('https://www.ebi.ac.uk/pdbe/api/pdb/compound/mappings/', data=','.join(possibles)).json()
xref_dict: Dict[str, Any] = {n: l[0] if l else {} for n, l in x.items()}  # first record per compound
xref = pd.DataFrame(xref_dict).transpose()</code></pre>
<p>The 'mapping' PDBe API route gives a series of cross-reference keys, but they are not fully complete, as only some records will have ChEBI, ChEMBL or DrugBank IDs, and PubChem IDs are absent. ChEBI records whether a compound is a natural product (<i>e.g.</i> <a href="https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:108" target="_blank">phaseolin, a bean pterocarpan (a fancy flavonoid)</a>).</p>
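<p>To gauge how patchy the cross-references are, one can tally how often each key is actually filled in. The following is only a minimal sketch: the helper <code>key_coverage</code> and the toy records (key names and values alike) are made up for illustration and do not reproduce the actual PDBe response schema.</p>

```python
from collections import Counter
from typing import Any, Dict


def key_coverage(xref_dict: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Fraction of compounds for which each cross-reference key holds a non-empty value."""
    total = len(xref_dict)
    if total == 0:
        return {}
    counts = Counter(key
                     for record in xref_dict.values()
                     for key, value in record.items()
                     if value)  # empty strings, empty lists and None do not count
    return {key: n / total for key, n in counts.items()}


# toy stand-in for the ``xref_dict`` built above: placeholder keys and values only
toy = {'AAA': {'chebi': ['id-1'], 'drugbank': []},
       'BBB': {'chebi': ['id-2']},
       'CCC': {}}
coverage = key_coverage(toy)
```

<p>Keys that never hold a value simply do not appear in the result, which makes the gaps easy to spot.</p>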
<p>Of the above list, I would say (non-empirically) that <a href="https://www.rcsb.org/ligand/SIR" target="_blank">SIR</a> is a curious one: the autogenerated model of the cobalt-bound porphyrin ring is completely and utterly wrong, because metal-coordinated compounds crash most computational chemistry tools.</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-7746584207271002852022-06-04T05:54:00.007-07:002023-02-01T06:11:58.318-08:00Annotate as you go<p>There's a counter-constructive saying: <em>a project is dead as soon as you add documentation</em> (Aeschylus, I believe).</p>
<p>This could not be more incorrect.
Whereas it is true that documentation written for an evolving project
will quickly become invalid,
it is a planning truth that writing documentation once a project is finishing is impossible
because there are a hundred and one more pressing issues.
Therefore, adding docstrings to each function,
method and class in Python as one goes along is by far more advantageous.
Once this is done, however, this information needs to be transmuted into documentation.
Here is how one can set up <a target="_blank" href="https://readthedocs.org/">ReadTheDocs</a> without falling into a few traps,
as the documentation generator Sphinx is, ironically, weirdly documented. Ideally this should be done early on,
so one knows what mistakes one is making.</p><a name='more'></a>
<h3>Note</h3>
<p>I ran an internal workshop on adding extras to a GitHub repository
and one question in the feedback was whether I could write out the steps to do ReadTheDocs properly.
So this is it.</p>
<h3>Motivation</h3>
<p>Motivation is hard when it’s a big task.
When the project is complete,
the paper becomes the only focus and documentation falls by the wayside.
It does not help that reviewers rarely check code:
in my experience, half of reviewers do not even check a web app.
However, in the grand scheme of things it does matter.
Therefore, one should not leave it to last.
Every time I have left it as the last thing I have sorely regretted it.
Furthermore, the sentence “I wrote this,
but don’t remember what it does” does often arise when documentation is done late.</p>
<h3>Invalid top-down documentation</h3>
<p>The saying in the lead applies primarily to writing an overview.
In an ideal world, the overview is written first and acts as a roadmap of how the module ought to work and,
by virtue of the excellent planning resulting from the thoroughly thought-out overview,
will be true even at the end of the project. However, projects evolve and
more often than not they were not started with the idea of being evolvable
(the comparison of an American city vs. a European medieval city is a classic teaching example
of the concept of planning and evolvability in CS).
Nevertheless, I would still advocate thinking about what the end goals of a project are and sketching them out before starting.
But by far the most time-consuming part of writing documentation is the description of the parts,
hence my insistence on writing them as one goes along.</p>
<h3>Docstrings</h3>
<p>In Python, docstrings work really nicely, much nicer than Doxygen documentation in C++.
Docstrings are generally written in reStructuredText (rst) within triple quotes as the first statement of a function, method or class.</p>
<div class="codehilite"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span>
<span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="n">bar</span><span class="p">:</span> <span class="n">Any</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="sd">"""</span>
<span class="sd">    This is a docstring.</span>
<span class="sd">    :param bar: This is a parameter.</span>
<span class="sd">    :type bar: Any</span>
<span class="sd">    :return: This is a return value.</span>
<span class="sd">    :rtype: int</span>
<span class="sd">    """</span>
    <span class="o">...</span>
    <span class="k">return</span> <span class="n">bar</span>
</code></pre></div>
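<p>A docstring is not a mere comment: it is stored on the object itself, which is what lets tools like Sphinx harvest it. A minimal sketch with the standard library, re-declaring the example function so that the block runs on its own:</p>

```python
import inspect
from typing import Any


def foo(bar: Any) -> int:
    """
    This is a docstring.

    :param bar: This is a parameter.
    :type bar: Any
    :return: This is a return value.
    :rtype: int
    """
    return bar


raw = foo.__doc__               # the string exactly as typed, indentation included
clean = inspect.getdoc(foo)     # the same text with the common indentation removed
print(clean)
```

<p>Calling <code>help(foo)</code> in an interactive session shows the same text, so a docstring pays off even before any documentation site exists.</p>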
<p>All these docstrings can be converted by Sphinx into a nice documentation page.
Previously I wrote a blog post about
<a target="_blank" href="https://blog.matteoferla.com/2019/11/convert-python-docstrings-to-github.html">converting docstrings to markdown documentation for GitHub</a>,
which is helpful when the project is not intended to be pip released. For a proper project, however, this is
a bad idea: the correct course of action is to create ReadTheDocs documentation.
The preferred format for GitHub is markdown as it's easier and the Sphinx autodoc extension is not applicable there.
The preferred format for ReadTheDocs is ReStructuredText (rst).</p>
<p>The textbook way to generate the <code>conf.py</code> file is
with the Sphinx <code>sphinx-quickstart</code> command.
Out of the box this does not convert docstrings: you have to add that yourself.
The docstrings and module content are “API” documentation, and the command line tools <code>sphinx-apidoc</code> or <code>sphinx-autogen</code>
generate it. But some tweaks are often required to get the API documentation one wants.</p>
<p>At the base of the repo, we will create a <code>.readthedocs.yml</code> file for ReadTheDocs,
but first let's make a <code>.readthedocs</code> folder (or any other name you want) with the documentation.</p>
<div class="codehilite"><pre><span></span><code>sphinx-apidoc -o .readthedocs . .readthedocs --full -A <span class="s1">'Your name here'</span> -l <span class="s1">'en'</span><span class="p">;</span>
<span class="nb">cd</span> .readthedocs<span class="p">;</span>
</code></pre></div>
<p>Running <code>make html</code> in that folder will generate the documentation in the <code>html</code> folder,
for you to check out. Do this often as stuff breaks easily with Sphinx.</p>
<p>Some tweaks are a must.</p>
<p>In the folder there are two main files of interest, the <code>conf.py</code> file and the <code>index.rst</code> file.
The former controls how the project is parsed, the latter lays out the main menu.</p>
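<p>As a rough sketch, these are the lines in <code>conf.py</code> that matter for docstrings. It assumes the package sits one folder above <code>conf.py</code>; the file generated by <code>sphinx-apidoc --full</code> may already contain equivalents, in which case they only need checking.</p>

```python
# excerpt of conf.py: Sphinx executes this file as ordinary Python
import os
import sys

# make the package importable: autodoc imports the modules to read their docstrings
# (assumption: conf.py lives in a subfolder of the repo, such as .readthedocs)
sys.path.insert(0, os.path.abspath('..'))

extensions = [
    'sphinx.ext.autodoc',   # pulls docstrings into the pages
    'sphinx.ext.viewcode',  # optional: adds links to the highlighted source
]
```

<p>If the path is wrong, autodoc cannot import the module and that page comes out empty, so this is the first thing to check when a page is blank.</p>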
<h3>Automodule, autoclass, autobahn, automethod, autofunction</h3>
<p>The <code>index.rst</code> file is the main menu. It will refer to a file, without the <code>.rst</code> extension,
with the name of your module.</p>
<p>This file sits in the folder along with those of all the submodules,
named in the format <code>module.submodule.rst</code>, and will contain the following workhorse:</p>
<div class="codehilite"><pre><span></span><code><span class="p">..</span> <span class="ow">automodule</span><span class="p">::</span> module_name.submodule_name
   <span class="nc">:members:</span>
   <span class="nc">:undoc-members:</span>
   <span class="nc">:show-inheritance:</span>
</code></pre></div>
<p>There are a few directives like this that can be used to generate the documentation
and are discussed in the <a target="_blank" href="https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html">autodoc documentation</a>,
such as <code>autoclass</code>.
When you add a new Python file (submodule) to your project,
Sphinx will not know about it, so be vigilant and add a new definition to the <code>index.rst</code> file.</p>
<p>The following parameters are worth noting:</p>
<ul>
<li><code>:members:</code> will include all the members of the module and the order can be changed with <code>:member-order:</code>.</li>
<li>This will not include private (<code>_foo</code>) or magic (or dunder) methods. <code>:private-members:</code> will include the former,
while <code>:special-members:</code> will include magic methods (called special by nobody except Sphinx).</li>
<li><code>:undoc-members:</code> will also include the members that have no docstring.</li>
<li><code>:inherited-members:</code> will include all members that are inherited from a parent class, which is rather key.</li>
</ul>
<p>When a class gets too big, it should be split into multiple files, each with a single class in it
that has a functional theme. These classes will form a chain of inheritance, leading up to the main class.
Naming the split files with a leading underscore will get them ignored. Consequently,
it is an option to document only the main class, which thanks to <code>:inherited-members:</code> will have everything.</p>
<p>But <code>:inherited-members:</code> is not always welcome.
For example, when using typehinting (which is optional but actually a must),
one does resort to <code>typing.TypedDict</code> (which allows you to specify the expected names of the dictionary keys and the
type of its values) or <code>typing.TypeVar</code> (which is a placeholder for a type in generic code). The <code>:inherited-members:</code> on these will
make a mess of pointlessness.
Therefore it often gets easier to manually define how one wants things annotated via multiple <code>autoclass</code> directives
rather than the autogenerated blanket <code>automodule</code>.</p>
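<p>To make the inheritance-chain idea concrete, here is a minimal sketch in plain Python. The class names and the file names in the comments are made up for illustration. It lists which public methods the main class defines itself and which it only inherits, i.e. the ones that would need <code>:inherited-members:</code> to show up when only the main class is documented.</p>

```python
import inspect


class _FooBase:
    """Data handling (would live in a hypothetical ``_base.py``)."""

    def load(self, data: list) -> None:
        """Store the data."""
        self.data = list(data)


class _FooAnalysis(_FooBase):
    """Analysis methods (would live in a hypothetical ``_analysis.py``)."""

    def total(self) -> int:
        """Sum the stored data."""
        return sum(self.data)


class Foo(_FooAnalysis):
    """The main class: the only one that gets documented."""

    def describe(self) -> str:
        """Summarise the stored data."""
        return f'{len(self.data)} values summing to {self.total()}'


# public methods defined on Foo itself
own = sorted(name for name in vars(Foo) if not name.startswith('_'))
# public methods that Foo only has through its parents
inherited = sorted(name for name, _ in inspect.getmembers(Foo, inspect.isfunction)
                   if not name.startswith('_') and name not in vars(Foo))
print(own)        # ['describe']
print(inherited)  # ['load', 'total']
```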
<h3>conf.py file</h3>
<h4>Ignore sys.path.insert</h4>
<p>In the <code>conf.py</code> file, there's a commented out line with <code>sys.path.insert</code>. Leave it like so.
In the <code>.readthedocs.yml</code> file, there will be</p>
<div class="codehilite"><pre><span></span><code><span class="nt">python</span><span class="p">:</span>
  <span class="nt">install</span><span class="p">:</span>
    <span class="p p-Indicator">-</span> <span class="nt">method</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">pip</span>
      <span class="nt">path</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">.</span>
    <span class="p p-Indicator">-</span> <span class="nt">requirements</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">.readthedocs/requirements.txt</span>
    <span class="p p-Indicator">-</span> <span class="nt">requirements</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">requirements.txt</span>
</code></pre></div>
<p>So the module to be documented will be installed anyway (path: <code>.</code>).</p>
<h4>extensions</h4>
<p>The conf.py file does not call a function like <code>setup</code> in a <code>setup.py</code> file,
but just sets global variables for Sphinx.
One is the list <code>extensions</code> which tells Sphinx which extensions to use. E.g.</p>
<div class="codehilite"><pre><span></span><code><span class="n">extensions</span> <span class="o">=</span> <span class="p">[</span>
<span class="s1">'readthedocs_ext.readthedocs'</span><span class="p">,</span>
<span class="s1">'sphinx.ext.viewcode'</span><span class="p">,</span>
<span class="s1">'sphinx.ext.todo'</span><span class="p">,</span>
<span class="c1">#'sphinx_toolbox.more_autodoc',</span>
<span class="s1">'sphinx.ext.autodoc'</span><span class="p">,</span>
<span class="p">]</span>
</code></pre></div>
<p><code>readthedocs_ext.readthedocs</code> will be added by RTD, but it's nice for testing locally (it needs to be installed).
<code>sphinx.ext.viewcode</code> adds links to the highlighted source code in the documentation.
<code>sphinx_toolbox.more_autodoc</code> is a <a target="_blank" href="https://sphinx-toolbox.readthedocs.io/en/stable/extensions/more_autodoc/index.html">nice extension</a> that adds more autodoc directives,
but is hard to set up as it will crash on a million and one corner cases, more so than mypy.
But it is a good idea to check if it works in the first place: if something fails, use the subset of its extensions that works.
<code>sphinx_toolbox.more_autodoc.typehints</code> is the key one in my opinion as vanilla Sphinx does not do typehints.
In the <code>sphinx-quickstart</code> command <a target="_blank" href="https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html">documentation</a>
there's a list of vanilla extensions that one can use.</p>
<p>It should be noted that the classic way to specify typehint-only imports:</p>
<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">typing</span>
<span class="k">if</span> <span class="n">typing</span><span class="o">.</span><span class="n">TYPE_CHECKING</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">foo</span> <span class="kn">import</span> <span class="n">Foo</span>
</code></pre></div>
<p>needs to be altered to:</p>
<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">typing</span>
<span class="k">if</span> <span class="n">typing</span><span class="o">.</span><span class="n">TYPE_CHECKING</span> <span class="ow">or</span> <span class="s1">'sphinx'</span> <span class="ow">in</span> <span class="n">sys</span><span class="o">.</span><span class="n">modules</span><span class="p">:</span>
    <span class="kn">from</span> <span class="nn">foo</span> <span class="kn">import</span> <span class="n">Foo</span>
</code></pre></div>
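<p>As a runnable illustration of the guard above, the sketch below uses <code>decimal.Decimal</code> as a stand-in for the <code>Foo</code> import (an assumption made only so the example runs anywhere). The annotations are written as strings, so at normal runtime the guarded import is skipped and nothing breaks, whereas a type checker or a Sphinx build would take the branch and be able to resolve the name.</p>

```python
import sys
import typing

if typing.TYPE_CHECKING or 'sphinx' in sys.modules:
    # stand-in for ``from foo import Foo``: only needed to resolve annotations
    from decimal import Decimal


def double(value: 'Decimal') -> 'Decimal':
    """Return twice the value; the quoted annotation is not evaluated at runtime."""
    return value * 2


print(double(2))  # 4
```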
<h4>Other variables</h4>
<p>There is a variable <code>html_static_path</code>, which by default is as follows, but can be set to an empty list if there are no static files:</p>
<div class="codehilite"><pre><span></span><code><span class="n">html_static_path</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'_static'</span><span class="p">]</span>
</code></pre></div>
<p>This is because you cannot git commit an empty folder so without a <code>_static</code> folder it will fail. </p>
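<p>One possible way around the missing folder, sketched under the assumption that <code>conf.py</code> is evaluated from within its own folder, is to only list the static folders that actually exist. The helper name <code>existing_static_paths</code> is made up for this example.</p>

```python
from pathlib import Path


def existing_static_paths(conf_dir, candidates=('_static',)) -> list:
    """Return only the candidate folders that exist in the folder holding conf.py."""
    return [name for name in candidates if (Path(conf_dir) / name).is_dir()]


# in conf.py one could then write, for example:
# html_static_path = existing_static_paths('.')
```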
<p>There is also a line <code>html_theme = 'alabaster'</code> which is the default theme for Sphinx.
ReadTheDocs uses <code>'sphinx_rtd_theme'</code>. Therefore to use the <code>sphinx_rtd_theme</code> locally you need to install it.
So our installation list is looking like:</p>
<div class="codehilite"><pre><span></span><code>pip install sphinx-toolbox readthedocs-sphinx-ext sphinx-rtd-theme
</code></pre></div>
<p>Other variables worth adding for <code>more_autodoc</code> are:</p>
<div class="codehilite"><pre><span></span><code><span class="n">always_document_param_types</span> <span class="o">=</span> <span class="kc">True</span>
<span class="n">typehints_defaults</span> <span class="o">=</span> <span class="s1">'braces'</span> <span class="c1"># other styles are available</span>
</code></pre></div>
<p>The <code>root_doc</code> variable is a good way to store the rst files in a folder to declutter. By default it is <code>index</code>
as <code>index.rst</code> is the main page, so one can move it to <code>source/index.rst</code> and set <code>root_doc='source/index'</code>.
Alternatively, one could have the conf.py in that folder, but not the Makefile.</p>
<h3>__init__.py</h3>
<p>Counterintuitively, <code>__init__</code> method docstrings are skipped,
even if at first one would expect the documentation of how to initialise a module to be in the <code>__init__.py</code> file.
There are three solutions:</p>
<p>One can add it manually on an <code>autoclass</code> directive
via <code>:special-members: __init__</code> in the rst definition.</p>
<p>One can globally override its skipping by adding the following to the <code>conf.py</code> file:</p>
<div class="codehilite"><pre><span></span><code><span class="k">def</span> <span class="nf">skip</span><span class="p">(</span><span class="n">app</span><span class="p">,</span> <span class="n">what</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">obj</span><span class="p">,</span><span class="n">would_skip</span><span class="p">,</span> <span class="n">options</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">name</span> <span class="ow">in</span> <span class="p">(</span> <span class="s1">'__init__'</span><span class="p">,):</span>
        <span class="k">return</span> <span class="kc">False</span>
    <span class="k">return</span> <span class="n">would_skip</span>
<span class="k">def</span> <span class="nf">setup</span><span class="p">(</span><span class="n">app</span><span class="p">):</span>
    <span class="n">app</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">'autodoc-skip-member'</span><span class="p">,</span> <span class="n">skip</span><span class="p">)</span>
</code></pre></div>
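<p>The logic of such a handler can be checked outside Sphinx by calling it with placeholder arguments. In a real build Sphinx passes the application, the object and the directive options, so the <code>None</code> values below are stand-ins for illustration only.</p>

```python
def skip(app, what, name, obj, would_skip, options):
    """Same handler as above: never skip ``__init__``, otherwise defer to Sphinx."""
    if name in ('__init__',):
        return False
    return would_skip


# placeholders suffice to check the decision logic
print(skip(None, 'class', '__init__', None, True, None))  # False, i.e. it gets documented
print(skip(None, 'class', '_private', None, True, None))  # True, i.e. it stays skipped
```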
<p>One can document class initialisation in the class docstring, which is often done,
but one loses the typehints.</p>
<p>However, as codeclimate painfully reminds us, there should ideally be four or fewer arguments in a method,
and class initialisation often has many arguments, so you may end up using packed keyword arguments
annotated as a <code>TypedDict</code>. And to add insult to injury, the init may be overloaded:</p>
<div class="codehilite"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing_extensions</span> <span class="kn">import</span> <span class="n">Unpack</span><span class="p">,</span> <span class="n">TypedDict</span> <span class="c1"># Unpack is in typing itself only from 3.11</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">singledispatchmethod</span> <span class="c1"># standard library from 3.8</span>
<span class="k">class</span> <span class="nc">FooOptions</span><span class="p">(</span><span class="n">TypedDict</span><span class="p">):</span>
    <span class="n">a</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">b</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">c</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">d</span><span class="p">:</span> <span class="nb">bool</span>
    <span class="n">e</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">Foo</span><span class="p">:</span>
    <span class="sd">"""</span>
    <span class="sd">This class accepts a main argument, either as a dictionary or as a list,</span>
    <span class="sd">followed by various options as keyword arguments as specified in the `FooOptions` class.</span>
    <span class="sd">"""</span>
    <span class="nd">@singledispatchmethod</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">list</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">:</span> <span class="n">Unpack</span><span class="p">[</span><span class="n">FooOptions</span><span class="p">]):</span>
        <span class="sd">"""</span>
        <span class="sd">This docstring will be skipped. And also, are we talking of this dispatch or all?</span>
        <span class="sd">"""</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">data</span><span class="p">:</span><span class="n">List</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">a</span><span class="p">:</span><span class="nb">int</span> <span class="o">=</span> <span class="n">options</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">:</span><span class="nb">str</span> <span class="o">=</span> <span class="n">options</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'b'</span><span class="p">,</span> <span class="s1">'unknown'</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">c</span><span class="p">:</span><span class="nb">float</span> <span class="o">=</span> <span class="n">options</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'c'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s1">'nan'</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">d</span><span class="p">:</span><span class="nb">bool</span> <span class="o">=</span> <span class="n">options</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'d'</span><span class="p">,</span> <span class="kc">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">e</span><span class="p">:</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">options</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">'e'</span><span class="p">,</span> <span class="p">{})</span>
    <span class="nd">@__init__</span><span class="o">.</span><span class="n">register</span>
    <span class="k">def</span> <span class="nf">_</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">dict</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">:</span> <span class="n">Unpack</span><span class="p">[</span><span class="n">FooOptions</span><span class="p">]):</span>
        <span class="bp">self</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">values</span><span class="p">()),</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div>
<p>In this rather extreme case, annotating the class makes much more sense.
If this example seemed very alien, don't worry, but do make sure to read up on typehints
as they make coding easier and less error-prone and as a bonus PyCharm will give better suggestions.</p>
<h4>Mock</h4>
<p>Often some module is required, but it takes a dark magic ritual to get it running.
As a result the <code>Mock</code> class from <code>unittest.mock</code> is of great use.
This is used to make a mock of a module, which pretends to be there, but does nothing.
So in <code>conf.py</code> one can add:</p>
<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">from</span> <span class="nn">unittest.mock</span> <span class="kn">import</span> <span class="n">Mock</span><span class="p">,</span> <span class="n">MagicMock</span>
<span class="n">sys</span><span class="o">.</span><span class="n">modules</span><span class="p">[</span><span class="s1">'foo'</span><span class="p">]</span> <span class="o">=</span> <span class="n">MagicMock</span><span class="p">()</span>
</code></pre></div>
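<p>A small demonstration of what the mock buys you. The module name <code>heavy_dependency</code> and its <code>init</code> function are invented for the example. Once the mock is registered, the import succeeds and any attribute access or call simply returns another mock.</p>

```python
import sys
from unittest.mock import MagicMock

# 'heavy_dependency' is a made-up name standing in for a hard-to-install module
sys.modules['heavy_dependency'] = MagicMock()

import heavy_dependency  # succeeds although nothing of that name is installed

# any call on the mocked module returns another MagicMock instead of failing
result = heavy_dependency.init(extra_options='-mute all')
print(type(result).__name__)  # MagicMock
```

<p>Submodules imported by their dotted name may need their own entry in <code>sys.modules</code>. Sphinx's autodoc also has a <code>autodoc_mock_imports</code> list for the <code>conf.py</code> that serves a similar purpose.</p>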
<h4>Mixed Markdown</h4>
<p>GitHub runs off a README.md, while PyPI runs off the setuptools.setup call in setup.py, specifically whatever text is passed to the description and long_description arguments and flavoured via the long_description_content_type argument. However, most projects simply pass the text of the former to the latter. The same applies to the intro in RTD.
Therefore, it's beneficial to mix some markdown within the RST files. To make Sphinx accept both, the module <code>sphinx-mdinclude</code> can be used. In the <code>requirements.txt</code> it is hyphenated, while in the <code>extensions</code> list in the <code>conf.py</code> it is underscored.
Sphinx is messy and will populate its folder with RST files, hence why it was kept separate above. This however means that the markdown files at the root of the project will be missed. As a result they need to be copied over to the documentation folder, the links they contain fixed and the filenames changed to be more graceful (<code>README.md</code> to <code>Description.md</code>).
</p>
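<p>The copy-and-fix step could be scripted along these lines. This is only a sketch: the function name <code>copy_markdown</code> and the <code>base_url</code> argument (e.g. the address of the repository files on GitHub) are assumptions, and the regular expression only handles plain inline markdown links.</p>

```python
import re
from pathlib import Path


def copy_markdown(source: Path, destination: Path, base_url: str) -> Path:
    """Copy a markdown file, turning its relative links into absolute ones."""
    text = source.read_text(encoding='utf-8')

    def absolutise(match) -> str:
        target = match.group(1)
        if target.startswith('./'):
            target = target[2:]
        return f']({base_url.rstrip("/")}/{target})'

    # [label](relative/path) gets prefixed; web links, anchors and mailto are left alone
    fixed = re.sub(r'\]\((?!https?://|#|mailto:)([^)]+)\)', absolutise, text)
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(fixed, encoding='utf-8')
    return destination


# e.g. copy_markdown(Path('README.md'), Path('.readthedocs/Description.md'), base_url)
```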
<h3>Additional caveats</h3>
<h4>Stick to ReStructuredText</h4>
<p>One can write docstrings directly in markdown,
but this is not a great idea as RST is specifically designed
for code annotation as we will see in a later section.</p>
<h4>Catch formatting errors early</h4>
<p>PyCharm autofills docstrings for you if set to do so
(search preferences for “Automatic documentation”),
but a common mistake is to not add a blank line between the description and the parameters.
Without this the first parameter will be interpreted as part of the description and not as a bullet point.
Everyone makes this mistake, but if one starts checking early that the documentation is getting generated fine,
then one avoids it subsequently.</p>
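<p>For reference, this is what a docstring with the separating blank line looks like, together with a tiny checker for it. The checker <code>has_blank_line_before_fields</code> is a hypothetical helper written for this post, not part of Sphinx or PyCharm.</p>

```python
import inspect


def scale(values: list, factor: float = 2.0) -> list:
    """
    Multiply each value by a factor.

    :param values: numbers to scale
    :param factor: multiplier applied to each number
    :return: the scaled numbers
    """
    return [value * factor for value in values]


def has_blank_line_before_fields(function) -> bool:
    """Check that a blank line separates the description from the first ``:param:``-like field."""
    lines = inspect.getdoc(function).splitlines()
    first_field = next((i for i, line in enumerate(lines) if line.startswith(':')), None)
    return first_field is None or first_field == 0 or lines[first_field - 1].strip() == ''


print(has_blank_line_before_fields(scale))  # True
```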
<h4>Browser hard refresh</h4>
<p>In a browser it is critical to do a hard refresh of the pages (Shift+refresh button).
Silly but I'd say 90% of issues come from this.</p>
<h4>Tests are documentation</h4>
<p>Tests are documentation. You should always write tests. I test new features generally in a Jupyter notebook,
to see the outputs in full, but the key conclusions can be converted into a test.
Future you or a user will likely check out the code in the tests, so do add docstrings to them too.</p>
<h4>Check if possible</h4>
<p>Sphinx has many extra formatting features over markdown, so
if you have a need for something that may be a common requirement, check the documentation and
pick up the extra extensions or Sphinx formatting tricks as the need arises. For example, <code>:ivar:</code> and <code>:cvar:</code>
are worth adding to the documentation.</p>
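<p>As an example of those two fields, here is a made-up class whose docstring describes a class variable with <code>:cvar:</code> and an instance variable with <code>:ivar:</code>.</p>

```python
class Counter:
    """
    Count how many times something happened.

    :cvar step: amount added on each call to :meth:`increment`, shared by all instances
    :ivar total: running total of this instance
    """
    step: int = 1

    def __init__(self):
        self.total: int = 0

    def increment(self) -> int:
        """
        Add ``step`` to the total.

        :return: the updated total
        """
        self.total += self.step
        return self.total
```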
<h3>ReadTheDocs</h3>
<p>So far I have gone through Sphinx, which is only half of it.
The next step, once we have a working Sphinx, is to use ReadTheDocs.</p>
<h4>Yaml</h4>
<p>Add a <code>.readthedocs.yml</code> file to the root of your project. For example I like to have:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">version</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">2</span>
<span class="nt">build</span><span class="p">:</span>
  <span class="nt">os</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">ubuntu-20.04</span>
  <span class="nt">tools</span><span class="p">:</span>
    <span class="nt">python</span><span class="p">:</span> <span class="s">"3.8"</span>
<span class="nt">sphinx</span><span class="p">:</span>
  <span class="nt">configuration</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">.readthedocs/source/conf.py</span>
  <span class="nt">builder</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">html</span>
  <span class="nt">fail_on_warning</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="nt">python</span><span class="p">:</span>
  <span class="nt">install</span><span class="p">:</span>
    <span class="p p-Indicator">-</span> <span class="nt">method</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">pip</span>
      <span class="nt">path</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">.</span>
    <span class="p p-Indicator">-</span> <span class="nt">requirements</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">requirements.txt</span>
    <span class="p p-Indicator">-</span> <span class="nt">requirements</span><span class="p">:</span> <span class="l l-Scalar l-Scalar-Plain">.readthedocs/requirements.txt</span>
</code></pre></div>
<p>Namely, we install the module defined in the <code>setup.py</code> in the root with the method <code>pip</code> and
the requirements with <code>requirements.txt</code>.
But as mentioned there are a few requirements specific to Sphinx, which have nothing to do with the module's
operations, hence the additional <code>.readthedocs/requirements.txt</code> file.</p>
<p>The fail_on_warning set to true is rather wishful thinking but at the debug stage this is helpful.</p>
<p>In the case of PyRosetta, we have a problem as it does not install like normal.
Luckily one can have private environment variables in ReadTheDocs (set within the settings for the project on the
ReadTheDocs website). My package, <code>pyrosetta-help</code>, adds a command line tool, <code>install_pyrosetta</code>,
which requires the presence of the <code>PYROSETTA_USERNAME</code> and <code>PYROSETTA_PASSWORD</code> env variables.
This can be run by setting in the yaml file the following:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">build</span><span class="p">:</span>
  <span class="l l-Scalar l-Scalar-Plain">...</span>
  <span class="l l-Scalar l-Scalar-Plain">jobs</span><span class="p p-Indicator">:</span>
    <span class="nt">post_install</span><span class="p">:</span>
      <span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">install_pyrosetta</span>
</code></pre></div>
<p>Likewise for other options the <code>jobs</code> directives can be used to better set up the environment.</p>
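<p>When a job depends on private environment variables, a guard that reports which ones are missing can save some head-scratching in the build log. The sketch below is a hypothetical helper, not part of <code>pyrosetta-help</code>. It only assumes the two variable names mentioned above.</p>

```python
import os

REQUIRED = ('PYROSETTA_USERNAME', 'PYROSETTA_PASSWORD')


def missing_variables(environment=None, required=REQUIRED) -> list:
    """Return the names of the required variables that are absent or empty."""
    environment = os.environ if environment is None else environment
    return [name for name in required if not environment.get(name)]


# a dictionary can be passed instead of the real environment to try it out
print(missing_variables({'PYROSETTA_USERNAME': 'me'}))  # ['PYROSETTA_PASSWORD']
```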
<h4>Runtime</h4>
<p>Once the yaml file is complete, head over to the ReadTheDocs website and link your GitHub account
and create a new project from the repository of interest.</p>
<p>Once the project build is kicked off, you can see what happens in the 'Builds' tab.
Clicking on the top build gives a badge (green hopefully), a printout and two links on the right-hand side
saying <code>view docs</code> and <code>view raw</code>. The latter is crucial as it gives you the raw output.</p>
<p>Check for errors and warnings. If <code>fail_on_warning</code> is set to false,
then if the documentation was partially generated, it would claim to be a success
and only <code>view raw</code> would say otherwise.</p>
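<p>The same strictness can be had locally before pushing. A minimal sketch, assuming Sphinx is installed and that the folder names below match your layout (they are placeholders): the <code>-W</code> flag of <code>sphinx-build</code> turns warnings into errors and <code>--keep-going</code> makes it report all of them before failing.</p>

```python
def sphinx_build_command(source='.readthedocs', output='.readthedocs/_build/html',
                         fail_on_warning=True) -> list:
    """Assemble a ``sphinx-build`` call; the default folders are placeholders."""
    command = ['sphinx-build', '-b', 'html']
    if fail_on_warning:
        command += ['-W', '--keep-going']  # report every warning, then exit with an error
    return command + [source, output]


print(' '.join(sphinx_build_command()))
# to actually run it (requires Sphinx to be installed):
# import subprocess; subprocess.run(sphinx_build_command(), check=True)
```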
<p>And to reiterate, do make sure to do a hard refresh of the docs page.</p>
<h4>Slack</h4>
<p>In the settings on the site one can set up a webhook to a Slack channel to notify of build status.</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-54039705152646255522022-05-10T14:37:00.002-07:002022-05-11T04:02:34.909-07:00Show neighbours in nglview<p>Nglview is a really nice Python library which encodes a widget to show a NGL viewport, a JS 3D protein viewer used until recently by the PDB. One annoying feature is that one cannot select neighbours as easily as say PyMOL's "select byres HEM around 3". But it is possible and here is how.</p>
<span><a name='more'></a></span>
<p>There are two ways.<br />
The first is to generate a selection string of the neighbours* with, say, PyMOL or PyRosetta, and use that. The other is to use NGL's inbuilt functionality even if it is not exposed to Python.
<br /><small>* Neighbours or "neighbors" if you either are American or have muddled your brian by coding too much like me...</small></p>
<p>The latter consists in:</p>
<pre><code>
# load a structure:
import nglview as nv
view: nv.NGLWidget = nv.show_pdbid('1MBN')

# tamper with the JS
def add_neighbors(view, selection: str, comp_id: int = 0, radius: float = 5, style: str = 'hyperball', color: str = 'gainsboro'):
    view._js(f"""const comp = this.stage.compList[{comp_id}]
                 const target_sele = new NGL.Selection('{selection}');
                 const radius = {radius};
                 const neigh_atoms = comp.structure.getAtomSetWithinSelection( target_sele, radius );
                 const resi_atoms = comp.structure.getAtomSetWithinGroup( neigh_atoms );
                 comp.addRepresentation( "{style}", {{sele: resi_atoms.toSeleString(),
                                                      colorValue: "{color}",
                                                      multipleBond: true
                                                     }});
                 comp.addRepresentation( "contact", {{sele: resi_atoms.toSeleString()}});
              """)

add_neighbors(view, 'HEM')  # e.g. the residues around the haem of myoglobin
# show the viewer (via the IPython.display.display that wraps cell runs)
view
</code>
</pre>
<p>This snippet is nothing more than the example snippet in the <a href="http://nglviewer.org/ngl/api/manual/snippets.html">manual of ngl.js</a>, but with JavaScript's infamous <code>this</code>. Compare:</p>
<pre><code>stage.loadFile( "rcsb://3pqr" ).then( function( o ){
// get all atoms within 5 Angstrom of retinal
var selection = new NGL.Selection( "RET" );
var radius = 5;
var atomSet = o.structure.getAtomSetWithinSelection( selection, radius );
// expand selection to complete groups
var atomSet2 = o.structure.getAtomSetWithinGroup( atomSet );
o.addRepresentation( "licorice", { sele: atomSet2.toSeleString() } );
o.addRepresentation( "cartoon" );
o.autoView();
} );</code></pre>
<p>Looking back at the PyMOL <code>select byres HEM expand 3</code>, there are two parts: <code>HEM expand 3</code>, which will select the neighbouring atoms (or "neighboring atoms" if you will), and <code>byres sele</code> which expands the selection of atoms to complete residues. This is the same thing that happens in the snippet above. PyRosetta does it differently as it is residue-centric as opposed to atom-centric because residues are defined by residue types and it is not possible to have a random extra atom here and there.</p>
<p>The major difference between the snippet from the <a href="http://nglviewer.org/ngl/api/manual/snippets.html" target="_blank">JavaScript manual</a> and the one for NGLView is that in the former an object in the local namespace is called <code>stage</code> and
the function <code>loadFile</code> returns a <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise" target="_blank">JavaScript Promise</a> (because IO operations are classically asynchronous in JavaScript) whose result is a component, whereas in the Pythonically executed JS code
the stage object is reached via the object <code>this</code>, as <code>this.stage</code>. The object <code>this</code> is similar to <code>self</code> in Python, but a lot weirder: for example, it can get unbound and end up being the global namespace <code>window</code>, and each function has its own <code>this</code> (as a class is a fancy function after all in JavaScript) unless an arrow function is used or the method <code>bind</code> is called on the offending function.</p>
<h3>Executing JavaScript</h3>
<p>In a previous blog post I <a href="https://blog.matteoferla.com/2022/05/js-in-colab.html">discuss JS in colab and Jupyter</a>, but here we are dealing with a well made widget, so things are nice and tidy!</p>
<p>Firstly, adding a <code>IPython.display.display(IPython.display.Javascript(js_code))</code> to the Python code has the problem that the code will be run in the global namespace, where the stage object will not be available, nor would NGL for that matter, hence why the widget needs to be used.</p>
<p><code>nv.NGLWidget</code> is a subclass of <code>ipywidgets.DOMWidget</code> and you may notice that an instance will have an attribute <code>comm</code> and a method <code>send</code>, which is how the JS–Python dialogue happens. However, it is extremely tedious and weird and what can be done depends on how the frontend listens to changes, generally via a call to the method <code>listenTo</code> in <code>initialize</code> or <code>render</code> (see <a href="https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Low%20Level.html">documentation for more</a>).<br/>
Luckily this is abstracted out and a bit of reverse engineering can reveal how to use it to one's advantage, as seen in my snippet with the call to the private method <code>_js</code>.</p>
<h3>GitHub searches</h3>
<p>There are two ways to figure out how to use a piece of code: starting from the overall codebase and going down into the specifics, or starting from a specific method and seeing how it works.</p>
<h4>Top down</h4>
<p>The first place to check for a given functionality in any codebase on GitHub is to search in Issues. The next is to search in the repository for pieces of code you'd expect, like <code>self.send</code>: in <a href="https://github.com/nglviewer/nglview" target="_blank">nglview's GitHub</a> this leads us to <a href="https://github.com/nglviewer/nglview/blob/b1d50de25672a3849ec2d0feb770c73f14de7b27/nglview/base.py" target="_blank">base.py</a>, as one would expect, where one finds out that the send and its traitlets buddies are neatly wrapped up in the method <code>_call</code>, which is next to the method <code>_js</code>, which I used in my snippet. Now, there may be a better way to do this so searching for <code>_call</code> would be the next step. This is where the top-down and bottom-up approaches converge.</p>
<h4>Bottom up</h4>
<p>Given a specific method which is believed to do the functionality one wants to replicate one can see its code block in a Jupyter notebook with the handy module <code>inspect</code>. For example:</p>
<pre><code>import inspect
print_code = lambda fun: print(inspect.getsource(fun))
print_code(view.component_0.add_representation)
</code></pre>
<p>This makes a call to <code>self._view.add_representation</code>; checking <code>view.add_representation</code>, one finds:</p>
<pre><code>self._remote_call('addRepresentation',
target='compList',
args=[
d['type'],
],
kwargs=params)</code></pre>
<p>In JS <code>addRepresentation</code> is a function of the component, thus revealing how the Python to JavaScript dialogue works.
The <code>target='compList',</code> is promising. However, one key part of the JS neighbourhood selection snippet is instantiating a Selection class, so a dictionary mocking it may work, but would be overly complicated. Searching the GitHub code I could not find a call to NGL.Selection in the widget typescript file or in the python module, so this appears to be a dead-end. Similarly, this is a pretty common outcome with Python modules that wrap C++ code with Boost or pybind11: one often finds cases where in C++ there's a variable that does not make it to the Python side and the best bet is to try a different route.
Luckily we have in our case the <code>_js</code> method, which suits us perfectly!
</p>
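<p>The bottom-up inspection described above can be tried on any pure-Python function without needing nglview. Here is a minimal sketch using <code>json.dumps</code> from the standard library as the guinea pig (chosen only because it is written in Python).</p>

```python
import inspect
import json


def print_code(function) -> None:
    """Print the source of a function or method, as in the snippet above."""
    print(inspect.getsource(function))


source = inspect.getsource(json.dumps)
print(source.splitlines()[0])  # the line where json.dumps is defined
```

<p>Note that <code>inspect.getsource</code> fails for functions implemented in C, which is one more reason why wrapped libraries can be a dead-end for this approach.</p>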
Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-51943888811278951632022-05-07T05:10:00.007-07:002022-06-04T06:03:20.684-07:00JS in Colab<p>A Jupyter or Colab notebook has two sides, one is the Python kernel, which may be running on a remote machine,
and the front-end running in one's browser.
The JavaScript in the browser and the Python kernel as a result may be on separate machines,
yet it is possible to make them dialogue. However, this differs between Jupyter and Colab,
the latter being more restrictive. I have found this difference problematic and even though
I may not be fully versed in Colab functionality I want to share some pointers, discussed below. Majorly:</p>
<ul>
<li>Colab diverges greatly from Jupyter in terms of JS operations.</li>
<li>JS code injected into Colab is sandboxed within each cell.</li>
<li>There is no requireJS in Colab cells or window.</li>
<li>Imported modules have to be external to Colab/Drive.</li>
</ul>
<span><a name='more'></a></span>
<h3>Jupyter notebooks and Jupyter Lab</h3>
<p>Colab is derived from Jupyter notebooks, but shares some similarities with JupyterLab,
such as the file navigation panel, but not the tabbed layout.
Dialogue with JavaScript works differently in JupyterLab: chiefly,
there is no <code>IPython</code>/<code>Jupyter</code> object
in JS (cf. <a href="https://github.com/jupyterlab/jupyterlab/issues/5660">bug discussion</a>),
so no JS to Python communication. Colab does allow the latter, but differently.</p>
<p>As a result, here I am talking solely about the classic Jupyter notebook, not JupyterLab.</p>
<h3>Applications</h3>
<p>For proper applications, widgets, which have their own complex system, are made, but for simple things like an SVG where clicking on an item is registered in Python this seems overkill.<br/>Recently I tried making a widget out of an existing library (JSME), but did not manage, so I had to use a workaround, <a href="https://github.com/matteoferla/JSME_notebook_hack">resulting in a module that works</a>, but is not elegant.</p>
<h3>Output basics</h3>
<p>The output under a Jupyter cell shows the standard-out and standard-error streams,
so one can see the Python output of <code>print</code> and <code>warnings.warn</code>.
Exception tracebacks are a special case, as they are formatted and outputted by the IPython shell,
and each flavour of shell does it differently, as I discovered in my weekend project
for the <a href="https://github.com/matteoferla/notebook-error-reporter">reporting of errors from shared notebooks</a>
(cf. <a href="https://github.com/matteoferla/notebook-error-reporter/blob/main/experimentation.md">notes</a>).
With a Python kernel one can have custom outputs shown thanks to the function <code>IPython.display.display</code>,
which will render the passed object based on its <code>_repr_html_</code> (or <code>__repr__</code> if <code>_repr_html_</code> is absent).
Many libraries show plots and molecules thanks to this.
The value returned by the last command (<code>_</code>) is displayed this way.
Additionally, in <code>IPython.display</code> there are <a href="https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#classes">several classes</a>
to display particular formats,
from <code>Audio</code> to <code>YouTubeVideo</code>. In particular, <code>HTML</code>, <code>Javascript</code>, <code>SVG</code> and <code>FileLink</code> are very useful.
Therefore, one can inject JavaScript dynamically in Python like so: <code>display(Javascript(js_codeblock))</code>.</p>
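<p>As a minimal sketch of the <code>_repr_html_</code> mechanism described above, the toy class below (<code>ColourSwatch</code> is a made-up name, not from any library) returns an HTML fragment that a notebook would render when the object is the last value in a cell, while <code>__repr__</code> remains the plain-text fallback:</p>

```python
import html


class ColourSwatch:
    """Toy object that a notebook front-end can render as a coloured box."""

    def __init__(self, colour: str, label: str):
        self.colour = colour
        self.label = label

    def _repr_html_(self) -> str:
        # Looked up by IPython's display machinery for rich output
        return (f'<div style="background:{self.colour};padding:4px">'
                f'{html.escape(self.label)}</div>')

    def __repr__(self) -> str:
        # Plain-text fallback, e.g. in a terminal
        return f'ColourSwatch({self.colour!r}, {self.label!r})'


swatch = ColourSwatch('gold', 'Au < Pt')
print(swatch._repr_html_())
```

<p>In a notebook, ending a cell with <code>swatch</code> or calling <code>display(swatch)</code> would show the box rather than the printed string.</p>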
<p>A cell can be run as JavaScript itself with the cell magic <code>%%javascript</code> or <code>%%js</code>
(or <code>%%typescript</code> in future).</p>
<h3>JS import</h3>
<p>In Jupyter, injected JS code runs in the normal JS space,
while in Colab it runs sandboxed, i.e. in an iframe.
This means that in Colab the namespace will be isolated.</p>
<p>There are two ways to import a JS library, one in a script-element in HTML the other in JS
with RequireJS. Both will need an address whence to source the script/module.</p>
<p>To import a JS library in HTML, the attribute <code>src</code> of the element <code>script</code> will do the job
(with <code>type=module</code> as a special case), so one can do the same in a notebook cell with
<code>display(HTML('<script src="some_url"></script>'))</code>. Two attributes worth reading up on are
<code>crossorigin="anonymous"</code> and <code>async</code>.</p>
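<p>A small helper can keep the script element consistent across cells. This is only a sketch: <code>script_tag</code> and its parameters are hypothetical names, and the URL in the example is a placeholder.</p>

```python
import html


def script_tag(url: str, crossorigin: str = 'anonymous',
               is_async: bool = False, is_module: bool = False) -> str:
    """Build a script element for a JS library hosted at `url`."""
    attributes = [f'src="{html.escape(url, quote=True)}"']
    if is_module:
        attributes.append('type="module"')
    if crossorigin:
        attributes.append(f'crossorigin="{html.escape(crossorigin, quote=True)}"')
    if is_async:
        attributes.append('async')
    return '<script ' + ' '.join(attributes) + '></script>'


tag = script_tag('https://example.org/lib.js')  # placeholder URL
print(tag)

# In a notebook one would then inject it with:
# from IPython.display import display, HTML
# display(HTML(tag))
```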
<p>RequireJS helps one's code cope with asynchronous loading:
the second-biggest pain in JS is codeblocks running before everything is loaded.
To do so, RequireJS allows one to declare a variable or define a function only once the required module is loaded,
e.g. <code>const something = require(['resource'])</code> or <code>define('somefun', ['resource'], (something) => {...})</code>.
It normally runs off a preset configuration assigning a nice name to a longer URL,
but it can somewhat work with a URL directly.</p>
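<p>The "preset configuration" is a call to <code>require.config</code> with a <code>paths</code> mapping. As a hedged sketch, the hypothetical helper below builds that JS call from a Python dictionary; the module name and URL are placeholders. RequireJS appends <code>.js</code> to path entries itself, so the extension is stripped.</p>

```python
import json


def require_config_js(paths: dict) -> str:
    """Return a JS statement that registers short module names with RequireJS."""
    # RequireJS adds the ".js" suffix on its own, so remove it from each URL
    cleaned = {name: url[:-3] if url.endswith('.js') else url
               for name, url in paths.items()}
    config = json.dumps({'paths': cleaned})
    return f'require.config({config});'


config_js = require_config_js({'somelib': 'https://cdn.example.org/somelib.min.js'})
print(config_js)

# In a classic Jupyter notebook this could be followed by:
# display(Javascript(config_js + "require(['somelib'], (somelib) => {console.log(somelib);});"))
```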
<h3>RequireJS in Colab</h3>
<p>In Colab, RequireJS is somewhat available in the page's JS namespace, but not in the cells' JS namespaces.
Running:</p>
<div class="codehilite"><pre><span></span><code><span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="s1">'''</span>
<span class="s1"><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js" integrity="sha256-Ae2Vz/4ePdIu6ZyI/5ZGsYnb+m0JlOmKPjt6XZ9JJkA=" crossorigin="anonymous"></script></span>
<span class="s1">'''</span><span class="p">))</span>
</code></pre></div>
<p>fails for me due to a bad underscore.js package dependency, so I did not look further into it,
as it seems like it would be a nightmare given that there is nothing about this online.</p>
<p>I said RequireJS is "somewhat" available because it is not actually present: instead a similar system is present
via the monaco_editor (the editor component of VS Code), so I am utterly lost.
Enabling <code>nbextensions</code> in Python must do something, but not through this.</p>
<h3>URLs</h3>
<p>Both approaches require a URL.
Before talking about URLs, it may be constructive to quickly mention the different path types:</p>
<ul>
<li>a URL starting with two slashes has the <em>full</em> address, minus the protocol (<code>http:</code>/<code>https:</code>/etc.),</li>
<li>a URL starting with a single slash (<em>absolute path</em>) is from the root of the domain, just like in a filesystem,</li>
<li>a URL starting with any other valid character is <em>relative</em> to the referring file, i.e. a file in the same folder, just like in a filesystem.</li>
</ul>
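<p>The three path types can be checked from Python with the standard library's <code>urllib.parse.urljoin</code>, which resolves a reference against the referring page much as a browser would. The addresses below are placeholders chosen for illustration.</p>

```python
from urllib.parse import urljoin

# Hypothetical address of the page doing the referring
page = 'https://example.org/notebooks/demo/analysis.ipynb'

resolved = {
    'protocol-relative': urljoin(page, '//cdn.example.net/lib.js'),
    'absolute path': urljoin(page, '/static/lib.js'),
    'relative path': urljoin(page, 'lib.js'),
}

for kind, url in resolved.items():
    print(f'{kind:>17}: {url}')
```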
<h4>Jupyter routes</h4>
<p>In a Jupyter notebook everything within
the base directory of the jupyter server is accessible.
You can navigate through this folder (dashboard view) in the <code>/tree/{path}</code> route,
you can edit any file in the <code>/edit/{path}</code> route or get served
a notebook in the <code>/notebooks/{path}</code> route. The latter also serves raw files if not a notebook,
because it redirects to <code>/files/{path}</code>.
There is also an additional route that serves files, <code>/static/{path}</code>, which serves files
that are in the <code>site-packages/notebook/static</code> folder in the Python path.
For completeness, there is also the <code>/api/{several}/{stuff}</code> routes, which are <a href="https://jupyter-server.readthedocs.io/en/latest/developers/rest-api.html">well documented</a>
and actually do all the heavy lifting, including session management.
As a result, in RequireJS or a script tag one can use a path relative to the notebook file, and it will work.
One can dump files in <code>site-packages/notebook/static</code> and they will be served by <code>/static</code>,
without authentication and without having to restart the server.</p>
<p>In Colab things are different: the URL has a UUID in it and none of the non-API routes exist.
One therefore has to rely on an externally hosted address.</p>
<h4>CDNs</h4>
<p>A CDN (content delivery network) is a host for libraries
and is generally fast. A shameful amount of internet traffic is actually the same libraries
being served over and over (jQuery, Bootstrap, React, Google Fonts etc.).
In some cases there may not be a working CDN for a given JS library, in which case one
can host it oneself, but with the caveat that the server needs to have
the <code>Header add Access-Control-Allow-Origin "*"</code> directive set (or the Nginx <em>etc.</em> equivalents)
in the Apache config or in the <code>.htaccess</code> file in that folder or one of its parent folders,
because the request is cross-origin and would otherwise be refused
(I apologise for stating what may be obvious, but this is a classic tripping hazard when starting out).
Rawgit was a handy CDN for getting a GitHub repo, but is no longer active.
Dropbox and others used to allow it, but no longer do due to excessive requests and websites with illegal content.
My university provides user file storage for staff and students that can be used as a CDN,
but it is far from a well-known service, so it is worth a Google search if one is an academic.</p>
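<p>For local testing, the same header can be added with Python's built-in HTTP server instead of Apache. The sketch below is an assumption-laden stand-in, not a production setup: <code>CORSRequestHandler</code> and <code>make_server</code> are made-up names, and a notebook served over HTTPS may still refuse plain HTTP content from elsewhere.</p>

```python
import functools
import http.server


class CORSRequestHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that allows requests from any origin."""

    def end_headers(self):
        # Same effect as the Apache directive:
        # Header add Access-Control-Allow-Origin "*"
        self.send_header('Access-Control-Allow-Origin', '*')
        super().end_headers()


def make_server(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Bind a local server for `directory`; port 0 lets the OS pick a free port."""
    handler = functools.partial(CORSRequestHandler, directory=directory)
    return http.server.ThreadingHTTPServer(('127.0.0.1', port), handler)


# Usage (blocks until interrupted); 'js_libs' is a placeholder folder name:
# server = make_server('js_libs', 8000)
# server.serve_forever()
```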
<h3>Dialogue with Python</h3>
<p>The JS can talk to Python via functions in <code>Jupyter.notebook.kernel</code> in Jupyter (or <code>IPython.notebook.kernel</code>), which is available in
both the JS console and in injected or run JS code. For example, <code>Jupyter.notebook.kernel.execute</code>
will run a Python code block when the current Python execution finishes.
There are also <a href="https://jupyter-notebook.readthedocs.io/en/stable/comms.html">comms</a>
that allow a smoother data exchange. And there are widgets that build on these, allowing user interactions
to affect Python: sliders are the demo example, but one can do a lot more.</p>
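<p>When the JS is generated from Python, the Python code handed to <code>Jupyter.notebook.kernel.execute</code> has to be quoted as a JS string. A minimal sketch, assuming the classic Jupyter notebook and a hypothetical helper name <code>kernel_execute_js</code>, is to let <code>json.dumps</code> do the quoting:</p>

```python
import json


def kernel_execute_js(python_code: str) -> str:
    """Return a JS statement asking the classic notebook kernel to run `python_code`."""
    # json.dumps produces a double-quoted, escaped literal that JS also accepts
    return f'Jupyter.notebook.kernel.execute({json.dumps(python_code)});'


js_statement = kernel_execute_js("clicked_id = 'atom_3'")
print(js_statement)

# In a classic Jupyter notebook (not JupyterLab or Colab):
# from IPython.display import display, Javascript
# display(Javascript(js_statement))
```

<p>Once the JS has run and the kernel is free, <code>clicked_id</code> would exist in the Python namespace.</p>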
<p>Colab's documentation page
<a href="https://colab.research.google.com/notebooks/snippets/advanced_outputs.ipynb">advanced_output</a>
covers some of this, but it is nowhere near as well documented as Jupyter's.
The kernel interaction machinery in JS is in <code>google.colab.kernel</code>,
but is rather different. This is available in the cells' outputs,
but not in the page's namespace, which makes for a fun time debugging.
I have not figured out where to get <code>DOMWidgetView</code>, for example:
this is part of the <code>@jupyter-widgets/base</code> library (for nbextensions).</p>
<h3>Proper way</h3>
<p>Like in JupyterLab, the sanest way to make a JS<->Python interaction is creating
a proper <a href="https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Custom.html">widget extension</a>
(which will run NPM and have
<a href="https://github.com/jupyter-widgets/widget-ts-cookiecutter/blob/master/%7B%7Bcookiecutter.github_project_name%7D%7D/package.json">@jupyter-widgets/base</a>
): this will work in the classic Jupyter notebook, JupyterLab and Colab.
But, depending on the task, it can be very much overkill.</p>
<h3>Concluding thoughts</h3>
<p>JS operations in Jupyter are not straightforward despite the great documentation.
They are not important only to people who make Python modules with widgets:
they are a useful thing to know how to use. For example, I worked on a port-forwarded
notebook served by a cluster that did
not have access to the internet, but as my local browser obviously had access to the web,
I could download the data I needed via JS on my machine and feed it to the Python kernel for it
to crunch.</p>
<p>Colab is great for demoing a feature: it runs without the user having to do anything.
It visually diverges from Jupyter, as the latter uses the Bootstrap 3 framework
while Colab uses LitElement (Google is disconcertingly inconsistent with its frontend frameworks).
This means that nicely formatted <code>_repr_html_</code> functionality using BS3 will not work.
Parenthetically, I feel like the BS3 functionality is under-used in the Jupyter ecosystem,
because we are on Bootstrap 5 now, and I assume everyone is waiting for the switch.
I was tempted to make a nice modal a few times, but I thought I'd be tempting fate,
as BS4 diverges from BS3 a lot (and for the better) and JupyterLab does not use Bootstrap,
but backbone.js.</p>
<p>However, the divergence from Jupyter notebook is painful, as discussed,
especially as JS interaction is already tricky in Jupyter. Custom widgets
(which work differently) appear to be the sole solution.
That route is a lot more laborious and is not overly flexible: to end with an <a href="https://xkcd.com/1890/">XKCD reference</a>,
it is like bringing a gun to a knife fight, you win in two cases, but cannot easily put out fires with it.</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-806849741621173992022-04-02T04:34:00.002-07:002022-04-05T07:54:10.039-07:00Covalents, patches and N-O-S bridges in PyRosetta<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41589-021-00966-5/MediaObjects/41589_2021_966_Figa_HTML.png?as=webp" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="393" data-original-width="685" height="115" src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41589-021-00966-5/MediaObjects/41589_2021_966_Figa_HTML.png?as=webp" width="200" /></a></div>Crosslinked residues are common, but for sure make up for it by being simultaneously highly intriguing and highly
technically problematic. Oddly, I seem to keep bumping into them. During my PhD a decade ago I saw a talk by the father
of Kiwi structural biochemistry, Ted Baker, about a curious case where they found an isopeptide bond hidden in their
crystal density. In a postdoc I worked with isopeptide bonds
(I <a href="https://blog.matteoferla.com/2018/09/everything-you-wanted-to-know-about.html">blogged about isopeptide bonds in Rosetta</a>
four years ago). During the start of the pandemic I did some covalent docking of compounds with PyRosetta for
the <a href="https://covid.postera.ai/covid">COVID Moonshot</a> project, which evolved
into <a href="https://github.com/matteoferla/Fragmenstein">Fragmenstein</a>. Most tools have a hard time with crosslinks. And last
month the Twittersphere was abuzz with the news of lysine-hydroxycysteine (N-O-S) bridges in proteins.<p></p>
<p>PyMOL will strip LINK entries from PDBs on saving, while NGL obeys only CONECT entries in PDBs. An exception is
PyRosetta: it behaves very nicely with disulfides, isopeptide bonds (
cf. <a href="https://github.com/matteoferla/DogCatcher">repo of PyRosetta code</a>
from <a href="https://linkinghub.elsevier.com/retrieve/pii/S2451945621003159">Keeble et al.</a>) and other crosslinks, mostly. As a
result I thought I'd add a note on how to add them in PyRosetta.<span></span></p><a name='more'></a><p></p>
<h3 id="covalents-in-pdb-files">Covalents in PDB files</h3>
<p>The easiest and most cheatsome way is adding a LINK (or SSBOND for disulfides) entry to the PDB file and opening it in
PyRosetta, which will do its magic to add that crosslink:</p>
<pre><code>LINK NZ LYS A <span class="hljs-number">11</span> CD GLU A <span class="hljs-number">34</span> <span class="hljs-number">1.33</span>
SSBOND CYS A <span class="hljs-number">24</span> CYS A <span class="hljs-number">52</span> <span class="hljs-number">2.06</span>
</code></pre><p>Rosetta rightfully cares about the official column spacing, so typing it by hand will be tricky, but this is the only major challenge. However,
the question that begs asking is <em>what is this magic that happens behind the scenes?</em></p>
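<p>Because the LINK record is column-based, generating it programmatically avoids spacing mistakes. The sketch below follows the column positions of the wwPDB format (atom names in columns 13 to 16 and 43 to 46, bond length in columns 74 to 78). The function name <code>format_link</code> is made up, the atom-name padding rule is a simplification that assumes one-letter elements, and the default symmetry operator <code>1555</code> is an assumption.</p>

```python
def format_link(name1: str, res1: str, chain1: str, seq1: int,
                name2: str, res2: str, chain2: str, seq2: int,
                length: float, sym1: str = '1555', sym2: str = '1555') -> str:
    """Format a PDB LINK record with fixed-width columns."""

    def pad(atom_name: str) -> str:
        # Simplification: one-letter elements start in the second column of the field
        return atom_name if len(atom_name) == 4 else f' {atom_name:<3}'

    def partner(name: str, res: str, chain: str, seq: int) -> str:
        # atom name (4) + altLoc (1) + residue name (3) + blank + chain + number (4) + iCode (1)
        return f'{pad(name)} {res:>3} {chain}{seq:>4d} '

    return (f"{'LINK':<6}{'':6}"                   # columns 1-12
            f"{partner(name1, res1, chain1, seq1)}"  # columns 13-27
            f"{'':15}"                              # columns 28-42
            f"{partner(name2, res2, chain2, seq2)}"  # columns 43-57
            f"{'':2}{sym1:>6} {sym2:>6} "           # columns 58-73
            f"{length:5.2f}")                       # columns 74-78


link_line = format_link('NZ', 'LYS', 'A', 11, 'CD', 'GLU', 'A', 34, 1.33)
print(link_line)
```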
<h3 id="basics">Basics</h3>
<p>First, the basics. In computational biochemistry, the definition of residue includes nucleobases, ligands, ions and
water molecules. Each residue ought to abide by a preset topology. In Rosetta the topology is stored in a params file,
which is loaded as a residue type, the archetype of that residue, each with a unique 3-letter code (mostly three
letters). In Rosetta there's the global residue type set, which can be altered with the command line
argument <code>-extra_res_fa</code>, and mutable residue type sets that can be altered (<em>vide infra</em>). Additionally, a residue in
Rosetta can be "patched", for example the terminal amino acids in a peptide chain may have the <code>NtermProteinFull</code>
and <code>CtermProteinFull</code> patches applied: in the former there are two extra protons on the backbone nitrogen, and in the latter an extra
oxygen (named <code>OXT</code>) on the carboxyl carbon. In the case of lysine and glutamate, these are canonical amino acids and
accept the sidechain conjugation patch (<code>SidechainConjugation</code>). When a PDB file or PDB block is read, termini are
added automatically by patching the terminal residues (unless disabled with the <code>use_truncated_termini</code> command line
argument) and the covalent bond between residues is added as directed in the LINK entry by patching them depending on the
type of interaction (such as <code>SidechainConjugation</code>).</p>
<pre><code>pose.residue(<span class="hljs-number">1</span>).annotated_name() <span class="hljs-comment"># 'A[ALA:NtermProteinFull]'</span>
</code></pre><p>There are many patches available. Some are alternate forms of residues, which are normally stored in PDB files as
alternate residue names, such as phosphoserine, which is SEP in PDB but <code>S[SER:phosphorylated]</code> in Rosetta: doing a
round trip will dump it as SEP.</p>
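<p>The annotated names above follow a simple pattern: a one-letter code, optionally followed by square brackets holding the residue type name and any patch names separated by colons. As a rough sketch in plain Python (no PyRosetta needed; <code>parse_annotated_name</code> is a hypothetical helper, and the pattern is inferred only from the examples shown here), they can be split like so:</p>

```python
import re
from typing import NamedTuple, Tuple


class AnnotatedName(NamedTuple):
    letter: str               # one-letter code
    name3: str                # residue type name, empty if only the letter is given
    patches: Tuple[str, ...]  # patch names, in order


_PATTERN = re.compile(r'^(?P<letter>[A-Za-z])(?:\[(?P<inner>[^\]]+)\])?$')


def parse_annotated_name(annotated: str) -> AnnotatedName:
    """Split a single-residue annotated name into letter, name and patches."""
    match = _PATTERN.match(annotated)
    if match is None:
        raise ValueError(f'Unrecognised annotated name: {annotated!r}')
    inner = match.group('inner')
    if inner is None:
        return AnnotatedName(match.group('letter'), '', ())
    name3, *patches = inner.split(':')
    return AnnotatedName(match.group('letter'), name3, tuple(patches))


for example in ('A[ALA:NtermProteinFull]', 'S[SER:phosphorylated]', 'Q'):
    print(parse_annotated_name(example))
```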
<h3 id="anatomy-of-a-connection">Anatomy of a connection</h3>
<p>In PyRosetta, every residue has zero or more residue connections, which can be crosslinks, peptide bonds <em>etc.</em> These
residue connections are numbered, but the polymer connections have an extra layer of syntax, wherein the N-terminal
connection is called the "lower" and the C-terminal connection is the "upper". There's probably a reason for this
nomenclature, but a way I remember it is that the lower
connects to a residue with a lower residue index.<br />A connection between two residues (either an amino acid or ligand) can be inspected as follows:</p>
<pre><code>this_idx:<span class="hljs-keyword">int</span> = 11 # the known residue
this_residue:pyrosetta.rosetta.core.conformation.Residue = pose.residue(this_idx)
number_connections:<span class="hljs-keyword">int</span> = this_residue.n_current_residue_connections()
other_idx:<span class="hljs-keyword">int</span> = this_residue.connected_residue_at_resconn(3) # residue connection no. 3 (see note below).
other_residue:pyrosetta.rosetta.core.conformation.Residue = pose.residue(other_idx)
# Residue.connect_atom(other) returns the index of the calling residue's atom that is bonded to `other`
this_atomno:<span class="hljs-keyword">int</span> = this_residue.connect_atom(other_residue)
this_atomname:str = this_residue.atom_name(this_atomno)
other_atomno:<span class="hljs-keyword">int</span> = other_residue.connect_atom(this_residue)
other_atomname:str = other_residue.atom_name(other_atomno)
distance:<span class="hljs-keyword">float</span> = ( this_residue.xyz(this_atomno) - other_residue.xyz(other_atomno) ).norm()
print(this_idx, this_atomname, this_atomno, other_idx, other_atomname, other_atomno, distance)
</code></pre><p>An amino acid residue has a lower and an upper connection (N-terminal and C-terminal connection) and if it's got a
crosslink it will be the third connection (resconn). In the params file, this will be CONN3. Using CONN1 alongside a LOWER and
an UPPER will fail. Therefore, for an amino acid, connection 3 will certainly be the non-polymer one.</p>
<p>As a test subject I will use <a href="https://github.com/matteoferla/segfaultin">segfaultin</a>, a fictional protein that I never
finished making, with multiple troublesome parts, such as an isopeptide bond, a cystine bond and a non-canonical amino acid.</p>
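<pre><code>import pyrosetta  # pyrosetta.init() is assumed to have been called already
import requests
from rdkit_to_params import Params
pose = pyrosetta.Pose()  # empty pose that will receive the custom residue type set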
<span class="hljs-selector-tag">p</span> = Params.from_smiles(<span class="hljs-string">'CCCCC(N*)C(*)=O'</span>, name=<span class="hljs-string">'NLE'</span>)
<span class="hljs-selector-tag">p</span><span class="hljs-selector-class">.PROPERTIES</span><span class="hljs-selector-class">.append</span>(<span class="hljs-string">'ALIPHATIC'</span>)
<span class="hljs-selector-tag">p</span><span class="hljs-selector-class">.PROPERTIES</span><span class="hljs-selector-class">.append</span>(<span class="hljs-string">'HYDROPHOBIC'</span>)
rts = <span class="hljs-selector-tag">p</span>.add_residuetype(pose)
pdbblock = requests.get(<span class="hljs-string">'https://raw.githubusercontent.com/matteoferla/segfaultin/main/1ubq.iso.ss.sep.nle.pdb'</span>)<span class="hljs-selector-class">.text</span>
assert <span class="hljs-string">'LINK'</span> <span class="hljs-keyword">in</span> pdbblock
pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.import_pose</span><span class="hljs-selector-class">.pose_from_pdbstring</span>(pose, pdbblock, rts, <span class="hljs-string">'foo'</span>)
</code></pre><p>Parenthetically, RDKit calls a string with a PDB block in it a <code>pdbblock</code>, PyMOL a <code>pdb_str</code> and pyrosetta
a <code>pdb_string</code>, so the variable name choice tells something about one's loyalties!<br />Let's inspect what the residues call themselves:</p>
<pre><code><span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.residue</span>(1)<span class="hljs-selector-class">.annotated_name</span>() # <span class="hljs-selector-tag">Z</span><span class="hljs-selector-attr">[NLE:NtermProteinFull]</span>
<span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.residue</span>(2)<span class="hljs-selector-class">.annotated_name</span>() # <span class="hljs-selector-tag">Q</span>
<span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.residue</span>(24)<span class="hljs-selector-class">.annotated_name</span>() # <span class="hljs-selector-tag">C</span><span class="hljs-selector-attr">[CYS:disulfide]</span>
<span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.residue</span>(76)<span class="hljs-selector-class">.annotated_name</span>() # <span class="hljs-selector-tag">G</span><span class="hljs-selector-attr">[GLY:CtermProteinFull]</span>
<span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.residue</span>(100)<span class="hljs-selector-class">.annotated_name</span>() # <span class="hljs-selector-tag">w</span><span class="hljs-selector-attr">[HOH]</span>
</code></pre><p>This shows that a regular amino acid is just a letter, a non-canonical amino acid or ligand is Z/X/w and its three
letter code, while the patched residues have a colon and the patch name. This is important for mutations. So let's do a
roundtrip:</p>
<p>First let's remove the bond by mutating to alanine.
I chose alanine because that is what is done in alanine scanning experiments in the lab,
but it could have been any residue except the unpatched versions of the same residues, as those would have kept the coordinates.
There is a method in <code>MutateResidue</code> called <code>.set_preserve_atom_coords</code> which preserves the coordinates of atoms
with the same name, which can backfire and result in highly contorted residues.</p>
<pre><code>import pyrosetta_help as ph  # assumed origin of the `ph` alias, which provides get_pdbstr
<span class="hljs-keyword">for</span> res <span class="hljs-keyword">in</span> (<span class="hljs-number">11</span>, <span class="hljs-number">34</span>):
    pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.protocols</span><span class="hljs-selector-class">.simple_moves</span><span class="hljs-selector-class">.MutateResidue</span>(target=res, new_res=<span class="hljs-string">'ALA'</span>).apply(pose)
assert <span class="hljs-string">'LINK'</span> not <span class="hljs-keyword">in</span> ph.get_pdbstr(pose)
</code></pre><p>Now that we have a blank slate, let's add the bonds back, first by adding the residues with the patch.
A <code>SidechainConjugation</code> patch with no connection will be just a regular residue
without that atom, i.e. no protons or oxygens added.</p>
<pre><code>pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.protocols</span><span class="hljs-selector-class">.simple_moves</span><span class="hljs-selector-class">.MutateResidue</span>(target=<span class="hljs-number">11</span>, new_res=<span class="hljs-string">'LYS:SidechainConjugation'</span>).apply(pose)
pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.protocols</span><span class="hljs-selector-class">.simple_moves</span><span class="hljs-selector-class">.MutateResidue</span>(target=<span class="hljs-number">34</span>, new_res=<span class="hljs-string">'GLU:SidechainConjugation'</span>).apply(pose)
</code></pre><p>These are not linked. To achieve this there are two possible commands:</p>
<pre><code><span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.conformation</span>()<span class="hljs-selector-class">.declare_chemical_bond</span>(seqpos1=<span class="hljs-number">11</span>, atom_name1=<span class="hljs-string">' NZ '</span>, seqpos2=<span class="hljs-number">34</span>, atom_name2=<span class="hljs-string">' CD '</span>)
</code></pre><p>or</p>
<pre><code>pyrosetta.rosetta.core.util.add_covalent_linkage(<span class="hljs-attr">pose=pose,</span> <span class="hljs-attr">resA_pos=11,</span> <span class="hljs-attr">resB_pos=34,</span> <span class="hljs-attr">resA_At=pose.residue(11).atom_index('NZ'),</span> <span class="hljs-attr">resB_At=pose.residue(34).atom_index('CD'),</span> <span class="hljs-attr">remove_hydrogens=True)</span>
</code></pre><p>The latter will remove relevant hydrogens,
create a new patch if not a normally linked residue and call the former function.</p>
<p>However, even though Rosetta thinks there's a covalent bond between these residues,
the distance will be off.</p>
<pre><code>assert <span class="hljs-string">'LINK'</span> <span class="hljs-keyword">in</span> ph.get_pdbstr(<span class="hljs-keyword">pose</span>)
<span class="hljs-keyword">print</span>(<span class="hljs-string">'distance: '</span>, (<span class="hljs-keyword">pose</span>.residue(<span class="hljs-number">11</span>).xyz(<span class="hljs-string">"NZ"</span>) - <span class="hljs-keyword">pose</span>.residue(<span class="hljs-number">34</span>).xyz(<span class="hljs-string">"CD"</span>)).norm())
</code></pre><p>To fix it, one can just do a FastRelax cycle, preferably in cartesian mode:</p>
<pre><code>scorefxn=pyrosetta.create_score_function(<span class="hljs-string">'ref2015_cart'</span>)
cycles=<span class="hljs-number">3</span>
relax = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.protocols</span><span class="hljs-selector-class">.relax</span><span class="hljs-selector-class">.FastRelax</span>(scorefxn, cycles)
movemap = pyrosetta.MoveMap()
idxs = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.utility</span><span class="hljs-selector-class">.vector1_unsigned_long</span>(<span class="hljs-number">2</span>)
idxs[<span class="hljs-number">1</span>] = <span class="hljs-number">11</span>
idxs[<span class="hljs-number">2</span>] = <span class="hljs-number">34</span>
resi_sele = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.select</span><span class="hljs-selector-class">.residue_selector</span><span class="hljs-selector-class">.ResidueIndexSelector</span>(idxs)
movemap.set_chi(allow_chi=resi_sele.apply(pose))
movemap.set_bb(False)
relax.set_movemap(movemap)
relax.minimize_bond_angles(True)
relax.minimize_bond_lengths(True)
relax.apply(pose)
</code></pre><p>Now we can have a gander, but with the caveat that NGL ignores LINK entries... as I said, crosslinks are tricky.</p>
<pre><code>import nglview <span class="hljs-keyword">as</span> nv
<span class="hljs-keyword">view</span> = nv.show_rosetta(pose)
<span class="hljs-keyword">view</span>.update_cartoon(smoothSheet=True)
<span class="hljs-keyword">view</span>.add_hyperball(<span class="hljs-string">'11 or 34'</span>)
<span class="hljs-keyword">view</span>
</code></pre><h3 id="anatomy-of-a-patch">Anatomy of a patch</h3>
<p>Patches live in the database under <code>database/chemical/residue_type_sets/fa_standard/patches/</code>.
In PyRosetta there is a database folder, which can be accessed by navigating to its installation folder:</p>
<pre><code><span class="hljs-selector-tag">os</span><span class="hljs-selector-class">.path</span><span class="hljs-selector-class">.split</span>(<span class="hljs-selector-tag">pyrosetta</span><span class="hljs-selector-class">.__file__</span>)<span class="hljs-selector-attr">[0]</span>
</code></pre><p>Sidechain conjugation patches, for example, are added thanks to the file <code>SidechainConjugation.txt</code>.
A cool thing is that there can be two definitions loaded, so the original <code>SidechainConjugation.txt</code>
can stay. The easiest way to add a patch is from the command line.
Patches are controlled by the residuetypeset:</p>
<pre><code>rts:pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.PoseResidueTypeSet</span> = pose.residue_type_set_for_pose()
</code></pre><p>or</p>
<pre><code>params_paths = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.utility</span><span class="hljs-selector-class">.vector1_string</span>()
params_paths.extend([paramsfile])
resiset = pyrosetta.generate_nonstandard_residue_set(pose, params_paths)
</code></pre><p>or</p>
<pre><code>FULL_ATOM_t = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.FULL_ATOM_t</span>
rts:pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.PoseResidueTypeSet</span> = pose.conformation().modifiable_residue_type_set_for_conf(FULL_ATOM_t)
</code></pre><p>The first is the global one, the second allows the addition of params files,
while the last is the modifiable version that
allows shenanigans like adding a residuetype from a string with a params block:</p>
<pre><code>rts = pose.conformation().modifiable_residue_type_set_for_conf(pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.FULL_ATOM_t</span>)
buffer = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.std</span><span class="hljs-selector-class">.stringbuf</span>(params_block)
stream = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.std</span><span class="hljs-selector-class">.istream</span>(buffer)
new = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.read_topology_file</span>(stream,
name,
rts)
rts.add_base_residue_type(new)
pose.conformation().reset_residue_type_set_for_conf(rts)
</code></pre><p>Patches are stored as a vector given by the method <code>.patches</code>:</p>
<pre><code>patches:pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.utility</span><span class="hljs-selector-class">.vector1_std_shared_ptr_const_core_chemical_Patch_t</span> = rts.patches()
patch:pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.Patch</span> = patches[<span class="hljs-number">1</span>]
vt:pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.VariantType</span>=patch.types()[<span class="hljs-number">1</span>]
</code></pre><p>The variant type is an enum, whose <a href="https://graylab.jhu.edu/PyRosetta.documentation/pyrosetta.rosetta.core.chemical.html#pyrosetta.rosetta.core.chemical.VariantType">members are listed here</a>.
We can check what patches apply to a given residue:</p>
<pre><code>cys_rt:pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.chemical</span><span class="hljs-selector-class">.ResidueType</span> = pose.residue_type_set_for_pose().get_base_types_name3(<span class="hljs-string">'CYS'</span>)[<span class="hljs-number">1</span>]
[(patch.name(), [vt<span class="hljs-selector-class">.name</span> <span class="hljs-keyword">for</span> vt <span class="hljs-keyword">in</span> patch.types()]) <span class="hljs-keyword">for</span> patch <span class="hljs-keyword">in</span> rts.patches() <span class="hljs-keyword">if</span> patch.applies_to(cys_rt)]
</code></pre><p>which yields: </p>
<pre><code>[('D', []),
('CtermProteinFull', ['FIRST_VARIANT']),
('NtermProteinFull', ['LOWER_TERMINUS_VARIANT']),
('N_Methylation', ['N_METHYLATION']),
('protein_cutpoint_upper', ['CUTPOINT_UPPER']),
('protein_cutpoint_lower', ['CUTPOINT_LOWER']),
('disulfide', ['DISULFIDE']),
('SidechainConjugation', ['SIDECHAIN_CONJUGATION']),
('VirtualMetalConjugation', ['VIRTUAL_METAL_CONJUGATION']),
('Virtual_Protein_SideChain', ['VIRTUAL_SIDE_CHAIN']),
('N_acetylated', ['N_ACETYLATION']),
('C_methylamidated', ['C_METHYLAMIDATION']),
('NtermProteinMethylated', ['METHYLATED_NTERM_VARIANT']),
('S-conjugated', ['SC_BRANCH_POINT']),
('Cterm_amidation', ['CTERM_AMIDATION']),
('Virtual_Residue', ['VIRTUAL_RESIDUE_VARIANT']),
('acetylated', ['ACETYLATION']),
('MethylatedCtermProteinFull', ['METHYLATED_CTERMINUS_VARIANT']),
('AcetylatedNtermProteinFull', ['ACETYLATED_NTERMINUS_VARIANT']),
('AcetylatedNtermConnectionProteinFull',
['ACETYLATED_NTERMINUS_CONNECTION_VARIANT']),
('DimethylatedCtermProteinFull', ['DIMETHYLATED_CTERMINUS_VARIANT']),
('hbs_pre', ['HBS_PRE']),
('hbs_post', ['HBS_POST']),
('a3b_hbs_pre', ['A3B_HBS_PRE']),
('a3b_hbs_post', ['A3B_HBS_POST']),
('oop_pre', ['OOP_PRE']),
('oop_post', ['OOP_POST']),
('triazolamerN', ['TRIAZOLAMERN']),
('triazolamerC', ['TRIAZOLAMERC']),
('ProteinReplsBB', ['REPLS_BB'])]
</code></pre><p>Note that <code>pyrosetta.rosetta.core.chemical.VariantType.DEPROTONATED</code> isn't there,
because that and <code>pyrosetta.rosetta.core.chemical.VariantType.PROTONATED</code>
and <code>pyrosetta.rosetta.core.chemical.VariantType.ALTERNATIVE_PROTONATION</code>
aren't for patches.</p>
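<p>A listing like the one above is easier to query once inverted into a lookup from variant type name to patch names. The following is a minimal sketch in plain Python, assuming a hand-copied subset of the printed tuples; <code>patch_listing</code>, <code>variant_to_patches</code> and <code>patches_for</code> are illustrative names, not part of PyRosetta.</p>

```python
from collections import defaultdict

# Hand-copied subset of the (patch name, variant type names) tuples printed above.
# With PyRosetta loaded, the list comprehension above gives the full list instead.
patch_listing = [
    ('D', []),
    ('disulfide', ['DISULFIDE']),
    ('SidechainConjugation', ['SIDECHAIN_CONJUGATION']),
    ('VirtualMetalConjugation', ['VIRTUAL_METAL_CONJUGATION']),
    ('S-conjugated', ['SC_BRANCH_POINT']),
]

# Invert the listing: variant type name -> names of the patches that add it.
variant_to_patches = defaultdict(list)
for patch_name, variant_names in patch_listing:
    for variant_name in variant_names:
        variant_to_patches[variant_name].append(patch_name)


def patches_for(variant_name):
    """Return the patch names carrying this variant type (empty list if none)."""
    return variant_to_patches.get(variant_name, [])


print(patches_for('SIDECHAIN_CONJUGATION'))
print(patches_for('DEPROTONATED'))  # empty, as no patch in the listing carries it
```

<p>An empty result for <code>DEPROTONATED</code> mirrors the note above: that variant type does not come from any patch in the listing.</p>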
<h3 id="n-o-s-bridge">N-O-S bridge</h3>
<p>In February this year (2022) <a href="https://www.nature.com/articles/s41589-021-00966-5">a paper came out in Nature Chemical Biology</a>
that showed that many proteins have N-O-S bridges,
a feature <a href="https://www.nature.com/articles/s41586-021-03513-3">reported last year (2021, Nature)</a>.
An N-O-S bridge is the sidechain conjugation of a lysine and a hydroxycysteine,
via the atoms nitrogen, oxygen and sulfur.</p>
<p>This would make for a nice example.</p>
<p>The PDB file <code>6ZWJ</code> from the latter paper contains an example of this, between lysine (LYS) 8 and hydroxycysteine (<code>CSO</code>) 38.</p>
<p>If we wanted to treat this properly,
we would need to add the patch <code>SidechainConjugation</code> for the residue <code>CSO</code>.
Making a residue <code>CSO</code> defined by a params file with <code>CONN3</code> entry is also an option
but would be inelegant as it could not be used in isolation, even though <code>CSO</code> would not
be found in this context anyway.</p>
<p>Parenthetically, the other oxidised forms of cysteine are S-oxycysteine (<code>CSX</code>),
S-nitrosocysteine (<code>SNC</code>), cysteine sulfinic acid (<code>CSD</code>) and cysteinesulfonic acid (<code>OCS</code>).</p>
<p>So let's start by making a CSO residuetype and patch for it.</p>
<p>CSO.params:</p>
<pre><code>NAME CSO
# *C(=O)[C@@]([H])(N(*)[H])C([H])([H])SO[H]
IO_STRING CSO Z
TYPE POLYMER
AA UNK
ATOM N Nbb X <span class="hljs-number">-0.2957777</span>
ATOM CA CAbb X <span class="hljs-number">0.0787753</span>
ATOM C CObb X <span class="hljs-number">0.1595662</span>
ATOM O OCbb X <span class="hljs-number">-0.2971563</span>
ATOM CB CH2 X <span class="hljs-number">0.0420447</span>
ATOM SG S X <span class="hljs-number">-0.0107892</span>
ATOM OD OH X <span class="hljs-number">-0.3299774</span>
ATOM H HNbb X <span class="hljs-number">0.1237177</span>
ATOM HA Hapo X <span class="hljs-number">0.0557563</span>
ATOM HB1 Hapo X <span class="hljs-number">0.0425142</span>
ATOM HB2 Hapo X <span class="hljs-number">0.0425142</span>
ATOM HD Hpol X <span class="hljs-number">0.2284798</span>
BOND CB CA
BOND CA C
BOND_TYPE C O <span class="hljs-number">2</span>
BOND CA N
BOND CB SG
BOND SG OD
BOND CB HB1
BOND CB HB2
BOND CA HA
BOND N H
BOND OD HD
PROPERTIES PROTEIN ALPHA_AA L_AA POLAR SC_ORBITALS
FIRST_SIDECHAIN_ATOM CB
BACKBONE_AA ALA
CHI <span class="hljs-number">1</span> N CA CB SG
CHI <span class="hljs-number">2</span> N CA C O
CHI <span class="hljs-number">3</span> CA CB SG OD
CHI <span class="hljs-number">4</span> CB SG OD HD
UPPER_CONNECT C
LOWER_CONNECT N
NBR_ATOM CA
NBR_RADIUS <span class="hljs-number">4.556737904333845</span>
ICOOR_INTERNAL N <span class="hljs-number">0.000000</span> <span class="hljs-number">0.000000</span> <span class="hljs-number">0.000000</span> N CA C
ICOOR_INTERNAL CA <span class="hljs-number">0.000000</span> <span class="hljs-number">180.000000</span> <span class="hljs-number">1.479082</span> N CA C
ICOOR_INTERNAL C <span class="hljs-number">0.000000</span> <span class="hljs-number">69.099439</span> <span class="hljs-number">1.516005</span> CA N C
ICOOR_INTERNAL UPPER <span class="hljs-number">55.959949</span> <span class="hljs-number">59.496115</span> <span class="hljs-number">1.552345</span> C CA N
ICOOR_INTERNAL O <span class="hljs-number">173.362709</span> <span class="hljs-number">51.230560</span> <span class="hljs-number">1.218710</span> C CA UPPER
ICOOR_INTERNAL LOWER <span class="hljs-number">66.923203</span> <span class="hljs-number">66.676171</span> <span class="hljs-number">1.468631</span> N CA C
ICOOR_INTERNAL CB <span class="hljs-number">129.676916</span> <span class="hljs-number">70.236808</span> <span class="hljs-number">1.535445</span> CA C N
ICOOR_INTERNAL SG <span class="hljs-number">169.561221</span> <span class="hljs-number">68.101623</span> <span class="hljs-number">1.828532</span> CB CA C
ICOOR_INTERNAL OD <span class="hljs-number">-177.727087</span> <span class="hljs-number">81.275795</span> <span class="hljs-number">1.662137</span> SG CB CA
ICOOR_INTERNAL H <span class="hljs-number">-123.392905</span> <span class="hljs-number">70.662367</span> <span class="hljs-number">1.024697</span> N CA LOWER
ICOOR_INTERNAL HA <span class="hljs-number">-113.868432</span> <span class="hljs-number">72.686858</span> <span class="hljs-number">1.099708</span> CA CB C
ICOOR_INTERNAL HB1 <span class="hljs-number">122.852027</span> <span class="hljs-number">67.860323</span> <span class="hljs-number">1.095516</span> CB CA SG
ICOOR_INTERNAL HB2 <span class="hljs-number">-118.751443</span> <span class="hljs-number">70.100055</span> <span class="hljs-number">1.096489</span> CB CA SG
ICOOR_INTERNAL HD <span class="hljs-number">-101.189326</span> <span class="hljs-number">68.731845</span> <span class="hljs-number">0.989399</span> OD SG CB
</code></pre><p>Patch:</p>
<pre><code>NAME SidechainConjugation
TYPES SIDECHAIN_CONJUGATION
<span class="hljs-comment">## general requirements for this patch</span>
<span class="hljs-comment">## require protein, ignore anything that's already nterm patched:</span>
<span class="hljs-keyword">BEGIN_SELECTOR
</span>NAME3 CSO <span class="hljs-comment"># Add to this list as more sidechain-conjugable types are added.</span>
NOT VARIANT_TYPE <span class="hljs-keyword">SC_BRANCH_POINT
</span>NOT VARIANT_TYPE PROTONATED
NOT VARIANT_TYPE VIRTUAL_METAL_CONJUGATION
NOT VARIANT_TYPE SIDECHAIN_CONJUGATION
NOT VARIANT_TYPE TRIMETHYLATION
NOT VARIANT_TYPE <span class="hljs-keyword">DIMETHYLATION
</span>NOT VARIANT_TYPE METHYLATION
NOT VARIANT_TYPE ACETYLATION
END_SELECTOR
<span class="hljs-keyword">BEGIN_CASE </span><span class="hljs-comment">## S-hydroxycysteine (CSO), L- or D-version #################################################</span>
<span class="hljs-comment">## These define which residues match this case:</span>
<span class="hljs-keyword">BEGIN_SELECTOR
</span>NAME3 CSO <span class="hljs-comment">#L-version.</span>
END_SELECTOR
<span class="hljs-comment"># These are the operations involved:</span>
<span class="hljs-comment"># Delete a sidechain hydrogen:</span>
DELETE_ATOM HD
<span class="hljs-comment"># Add a sidechain connection and a virtual atom:</span>
<span class="hljs-keyword">ADD_CONNECT </span>OD ICOOR <span class="hljs-number">180</span>.<span class="hljs-number">0</span> <span class="hljs-number">75</span>.<span class="hljs-number">00</span> <span class="hljs-number">1</span>.<span class="hljs-number">793</span> OD SG CB
<span class="hljs-keyword">ADD_ATOM </span> <span class="hljs-built_in">V1</span> VIRT VIRT <span class="hljs-number">0</span>.<span class="hljs-number">00</span>
<span class="hljs-comment"># Add new bonds:</span>
<span class="hljs-keyword">ADD_BOND </span>OD <span class="hljs-built_in">V1</span>
<span class="hljs-comment"># Set position of the new V1 atom:</span>
SET_ICOOR <span class="hljs-built_in">V1</span> <span class="hljs-number">0</span>.<span class="hljs-number">00</span> <span class="hljs-number">75</span>.<span class="hljs-number">00</span> <span class="hljs-number">1</span>.<span class="hljs-number">793</span> OD SG CONN%LASTCONN
REDEFINE_CHI <span class="hljs-number">4</span> CB SG OD <span class="hljs-built_in">V1</span>
END_CASE
</code></pre><p>Import them:</p>
<pre><code><span class="hljs-keyword">import</span> pyrosetta
<span class="hljs-keyword">import</span> pyrosetta_help <span class="hljs-keyword">as</span> ph
<span class="hljs-title">pyrosetta</span>.init('-<span class="hljs-keyword">in</span>:file:extra_res_fa <span class="hljs-type">CSO</span>.params -<span class="hljs-keyword">in</span>:file:extra_patch_fa <span class="hljs-type">SidechainConjugation</span>.txt')
</code></pre><p>Alternatively, if imported via options, do remember to register these with the relevant mover or residuetypeset.</p>
<pre><code>patch_files = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.utility</span><span class="hljs-selector-class">.vector1_std_string</span>(<span class="hljs-number">1</span>)
patch_files[<span class="hljs-number">1</span>] = <span class="hljs-string">'...'</span>
pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.basic</span><span class="hljs-selector-class">.options</span><span class="hljs-selector-class">.set_file_vector_option</span>(<span class="hljs-string">'in:file:extra_patch_fa'</span>, patch_files)
pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.protocols</span><span class="hljs-selector-class">.simple_moves</span><span class="hljs-selector-class">.MutateResidue</span>(...).register_options()
</code></pre><p>Create a pose and link:</p>
<pre><code><span class="hljs-keyword">pose</span> = pyrosetta.pose_from_sequence(<span class="hljs-string">'AC[CSO:SidechainConjugation]AGGGK[LYS:SidechainConjugation]G'</span>)
<span class="hljs-keyword">print</span>(<span class="hljs-keyword">pose</span>.sequence())
CSO_idx = <span class="hljs-number">2</span>
CSO_atom_idx = <span class="hljs-keyword">pose</span>.residue(CSO_idx).atom_index(<span class="hljs-string">'OD'</span>)
LYS_idx = <span class="hljs-number">7</span>
LYS_atom_idx = <span class="hljs-keyword">pose</span>.residue(LYS_idx).atom_index(<span class="hljs-string">'NZ'</span>)
pyrosetta.rosetta.core.util.add_covalent_linkage(<span class="hljs-keyword">pose</span>=<span class="hljs-keyword">pose</span>,
resA_pos=CSO_idx,
resB_pos=LYS_idx,
resA_At=CSO_atom_idx,
resB_At=LYS_atom_idx,
remove_hydrogens=False)
assert <span class="hljs-string">'LINK'</span> <span class="hljs-keyword">in</span> ph.get_pdbstr(<span class="hljs-keyword">pose</span>)
</code></pre><p>Minimise:</p>
<pre><code><span class="hljs-keyword">scorefxn=pyrosetta.create_score_function('ref2015_cart')
</span>cycles=<span class="hljs-number">3</span>
relax = pyrosetta.rosetta.protocols.relax.FastRelax(<span class="hljs-keyword">scorefxn, </span>cycles)
<span class="hljs-keyword">movemap </span>= pyrosetta.<span class="hljs-keyword">MoveMap()
</span><span class="hljs-keyword">movemap.set_chi(True)
</span><span class="hljs-keyword">movemap.set_bb(True)
</span><span class="hljs-keyword">movemap.set_jump(True)
</span>relax.set_movemap(<span class="hljs-keyword">movemap)
</span>relax.minimize_bond_angles(True)
relax.minimize_bond_lengths(True)
relax.apply(pose)
</code></pre><p>View:</p>
<pre><code>import nglview <span class="hljs-keyword">as</span> nv
<span class="hljs-keyword">view</span> = nv.show_rosetta(pose)
<span class="hljs-keyword">view</span>.add_representation(<span class="hljs-string">'hyperball'</span>, <span class="hljs-keyword">f</span><span class="hljs-string">'{CSO_idx} or {LYS_idx}'</span>)
<span class="hljs-keyword">view</span>.download_image(<span class="hljs-string">'NOS.png'</span>)
<span class="hljs-keyword">view</span>
</code></pre>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-52491293632942998382022-01-11T12:42:00.001-08:002022-01-13T05:00:59.523-08:00ggplot colours in PythonIn <a href="https://blog.matteoferla.com/2021/02/multiple-poses-in-nglview.html">my post about multiple poses in NGLView</a> I mentioned using ggplot colours in Python; here I revisit the topic in a bit more detail. Warning: Contains colour theory which may befuddle some viewers.<span><a name='more'></a></span><p style="text-align: left;">
Computer screens generally use three colour channels, each encoded by 8 bits, resulting in 24-bit colours, called "True color"*. As a result there are 256 shades of grey, which is why colour theory is a lot saucier than those books.</p><p style="text-align: left;">*) In images you may also have an alpha channel, which is the opacity, and older systems had different colour schemes, such as 8-bit (256) colours in an Atari. CMYK is a colourspace similar to RGB, but to do with subtractive colours: mixing paints and mixing coloured lights give different results (black and white respectively). But these are beside the point here.</p><p style="text-align: left;">Humans see in three primary colours (red, green, blue), so the most common colourspace, i.e. way to represent colours, is a system of three integers, one for each. A simple way to write these 24 bits is as a six-digit hexadecimal, where each pair of digits represents one of the channels. The hexadecimal 0xff is 255 in base 10, for example. On the web, these hexadecimals are written prefixed with a hash, so, for example, <span style="color: #f8766d;">#F8766D</span> represents <span style="color: #f80000;">red=248</span>, <span style="color: #007600;">green=118</span>, <span style="color: #00006d;">blue=109</span> and goes by the name of salmon. 
However, stuff gets complicated quickly, to the point that there is even an "International Commission on Illumination" (CIE) to standardise things...</p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/HueScale.svg/320px-HueScale.svg.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="69" data-original-width="320" height="69" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ad/HueScale.svg/320px-HueScale.svg.png" width="320" /></a></div>Firstly, most operations don't quite work as intended in RGB colourspace, so one has to use a different colourspace instead. A frequently used colourspace for transformations is HSV (or its close relative HSL), where the colours are encoded as hue (the rainbow position, see figure), saturation and value (or lightness in HSL). Interestingly, <a href="https://en.wikipedia.org/wiki/Hue#24_hues_of_HSL/HSV" target="_blank">Wikipedia</a> has a lovely table naming the colours in the hue wheel in 15° increments, which is way nicer than the silly seven kindergarten names of the colours of the rainbow. 
Complementary colours, discrete colours and the lot are basically operations in a colourspace with hue.<p></p><p style="text-align: left;"></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Helmholtz-Kohlrausch_effect_visualized_improved.png/320px-Helmholtz-Kohlrausch_effect_visualized_improved.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="132" data-original-width="320" height="132" src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Helmholtz-Kohlrausch_effect_visualized_improved.png/320px-Helmholtz-Kohlrausch_effect_visualized_improved.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://en.wikipedia.org/wiki/Helmholtz%E2%80%93Kohlrausch_effect">Helmholtz–Kohlrausch effect</a><br /><span style="color: #999999;">Top row colours with equal luminance<br />Bottom row, the same but desaturated</span></td></tr></tbody></table><br />Unfortunately, humans do not see colours across the hue wheel with equal strength (<i>cf.</i> figure). Therefore extra corrections are required to have a colourspace that does not give dull colours. As a result there are lots of other colourspaces. The R ggplot colours are rotations in <a href="https://en.wikipedia.org/wiki/HCL_color_space">HCL colourspace</a>, where the C stands for chroma, which is a projection to account for this. Salmon in this scheme is hue=15°, 100% chroma and 65% luminance.<p></p><p style="text-align: left;">Now, for Python, the standard library module colorsys is handy for conversions between RGB and HSV or HSL, but does not do HCL colourspace. 
I had no success with the module "colorio", but I did with <a href="https://github.com/retostauffer/python-colorspace">colorspace</a>, which is not pip released (due to a package name conflict), so needs installing via GitHub:</p><p style="text-align: left;"><br /></p><pre><code>pip install git+https://github.com/retostauffer/python-colorspace</code></pre>
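<p>As an aside, the hexadecimal notation and the hue rotation described earlier need nothing beyond the standard library. The sketch below splits a web colour into its three channels and rotates its hue in HSV with <code>colorsys</code>; the helper names <code>hex_to_rgb</code>, <code>rgb_to_hex</code> and <code>rotate_hue</code> are made up for illustration, and HSV is used here only because <code>colorsys</code> offers it, not because ggplot does.</p>

```python
import colorsys


def hex_to_rgb(hex_colour):
    """Split a web colour such as '#F8766D' into three 0-255 integers."""
    digits = hex_colour.lstrip('#')
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Inverse of hex_to_rgb: three 0-255 integers to '#RRGGBB'."""
    return '#' + ''.join(f'{channel:02X}' for channel in rgb)


def rotate_hue(hex_colour, degrees):
    """Rotate the hue in HSV space (not HCL), keeping saturation and value."""
    r, g, b = (channel / 255 for channel in hex_to_rgb(hex_colour))
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # h, s, v are all in the range 0-1
    h = (h + degrees / 360) % 1.0
    rotated = colorsys.hsv_to_rgb(h, s, v)
    return rgb_to_hex(tuple(round(channel * 255) for channel in rotated))


salmon = '#F8766D'
print(hex_to_rgb(salmon))       # (248, 118, 109)
print(rotate_hue(salmon, 180))  # the HSV complementary colour
```

<p>Rotating in HSV keeps the arithmetic simple, but is subject to the uneven perceived brightness discussed above. HCL itself still needs colorspace, as installed above.</p>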
<p>With colorspace installed, the following gives an R-like colour palette starting from salmon:</p>
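<pre><code>import numpy as np
from colorspace.colorlib import HCL

n = 3  # number of colours requested
# evenly spaced hues around the wheel, starting at 15° (salmon)
hues : np.ndarray = np.linspace(0, 360, n+1) + 15
hues[hues >= 360] -= 360
# the last hue is a full turn from the first, so it is dropped
colors = HCL(H = hues[:-1], C = [100]*n, L = [65]*n)
colors.to('hex')
colors.colors()</code></pre>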
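<p>The hue spacing in the snippet above can be reproduced without NumPy or colorspace, which makes the logic easy to check by hand. This is a minimal sketch; <code>ggplot_hues</code> is a made-up helper name, and it only computes the hues, not the HCL to hexadecimal conversion.</p>

```python
def ggplot_hues(n, start=15.0):
    """Return n hues in degrees, evenly spaced around the wheel from `start`."""
    return [(start + 360.0 * i / n) % 360.0 for i in range(n)]


print(ggplot_hues(3))  # [15.0, 135.0, 255.0]
print(ggplot_hues(4))  # [15.0, 105.0, 195.0, 285.0]
```

<p>The first hue is always 15° (salmon), and each extra colour requested shrinks the step between neighbouring hues.</p>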
The colours produced by the colorspace code, listed by the number of colours requested, are:
<ol>
<li><span style="color: #f8766d;">#F8766D</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #00bfc4;">#00BFC4</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #00ba38;">#00BA38</span>
<span style="color: #619cff;">#619CFF</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #7cae00;">#7CAE00</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #c77cff;">#C77CFF</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #a3a500;">#A3A500</span>
<span style="color: #00bf7d;">#00BF7D</span>
<span style="color: #00b0f6;">#00B0F6</span>
<span style="color: #e76bf3;">#E76BF3</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #b79f00;">#B79F00</span>
<span style="color: #00ba38;">#00BA38</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #619cff;">#619CFF</span>
<span style="color: #f564e3;">#F564E3</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #c49a00;">#C49A00</span>
<span style="color: #53b400;">#53B400</span>
<span style="color: #00c094;">#00C094</span>
<span style="color: #00b6eb;">#00B6EB</span>
<span style="color: #a58aff;">#A58AFF</span>
<span style="color: #fb61d7;">#FB61D7</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #cd9600;">#CD9600</span>
<span style="color: #7cae00;">#7CAE00</span>
<span style="color: #00be67;">#00BE67</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00a9ff;">#00A9FF</span>
<span style="color: #c77cff;">#C77CFF</span>
<span style="color: #ff61cc;">#FF61CC</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #d39200;">#D39200</span>
<span style="color: #93aa00;">#93AA00</span>
<span style="color: #00ba38;">#00BA38</span>
<span style="color: #00c19f;">#00C19F</span>
<span style="color: #00b9e3;">#00B9E3</span>
<span style="color: #619cff;">#619CFF</span>
<span style="color: #db72fb;">#DB72FB</span>
<span style="color: #ff61c3;">#FF61C3</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #d89000;">#D89000</span>
<span style="color: #a3a500;">#A3A500</span>
<span style="color: #39b600;">#39B600</span>
<span style="color: #00bf7d;">#00BF7D</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00b0f6;">#00B0F6</span>
<span style="color: #9590ff;">#9590FF</span>
<span style="color: #e76bf3;">#E76BF3</span>
<span style="color: #ff62bc;">#FF62BC</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #db8e00;">#DB8E00</span>
<span style="color: #aea200;">#AEA200</span>
<span style="color: #64b200;">#64B200</span>
<span style="color: #00bd5c;">#00BD5C</span>
<span style="color: #00c1a7;">#00C1A7</span>
<span style="color: #00bade;">#00BADE</span>
<span style="color: #00a6ff;">#00A6FF</span>
<span style="color: #b385ff;">#B385FF</span>
<span style="color: #ef67eb;">#EF67EB</span>
<span style="color: #ff63b6;">#FF63B6</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #de8c00;">#DE8C00</span>
<span style="color: #b79f00;">#B79F00</span>
<span style="color: #7cae00;">#7CAE00</span>
<span style="color: #00ba38;">#00BA38</span>
<span style="color: #00c08b;">#00C08B</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00b4f0;">#00B4F0</span>
<span style="color: #619cff;">#619CFF</span>
<span style="color: #c77cff;">#C77CFF</span>
<span style="color: #f564e3;">#F564E3</span>
<span style="color: #ff64b0;">#FF64B0</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e18a00;">#E18A00</span>
<span style="color: #be9c00;">#BE9C00</span>
<span style="color: #8cab00;">#8CAB00</span>
<span style="color: #24b700;">#24B700</span>
<span style="color: #00be70;">#00BE70</span>
<span style="color: #00c1ab;">#00C1AB</span>
<span style="color: #00bbda;">#00BBDA</span>
<span style="color: #00acfc;">#00ACFC</span>
<span style="color: #8b93ff;">#8B93FF</span>
<span style="color: #d575fe;">#D575FE</span>
<span style="color: #f962dd;">#F962DD</span>
<span style="color: #ff65ac;">#FF65AC</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e38900;">#E38900</span>
<span style="color: #c49a00;">#C49A00</span>
<span style="color: #99a800;">#99A800</span>
<span style="color: #53b400;">#53B400</span>
<span style="color: #00bc56;">#00BC56</span>
<span style="color: #00c094;">#00C094</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00b6eb;">#00B6EB</span>
<span style="color: #06a4ff;">#06A4FF</span>
<span style="color: #a58aff;">#A58AFF</span>
<span style="color: #df70f8;">#DF70F8</span>
<span style="color: #fb61d7;">#FB61D7</span>
<span style="color: #ff66a8;">#FF66A8</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e58700;">#E58700</span>
<span style="color: #c99800;">#C99800</span>
<span style="color: #a3a500;">#A3A500</span>
<span style="color: #6bb100;">#6BB100</span>
<span style="color: #00ba38;">#00BA38</span>
<span style="color: #00bf7d;">#00BF7D</span>
<span style="color: #00c0af;">#00C0AF</span>
<span style="color: #00bcd8;">#00BCD8</span>
<span style="color: #00b0f6;">#00B0F6</span>
<span style="color: #619cff;">#619CFF</span>
<span style="color: #b983ff;">#B983FF</span>
<span style="color: #e76bf3;">#E76BF3</span>
<span style="color: #fd61d1;">#FD61D1</span>
<span style="color: #ff67a4;">#FF67A4</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e68613;">#E68613</span>
<span style="color: #cd9600;">#CD9600</span>
<span style="color: #aba300;">#ABA300</span>
<span style="color: #7cae00;">#7CAE00</span>
<span style="color: #0cb702;">#0CB702</span>
<span style="color: #00be67;">#00BE67</span>
<span style="color: #00c19a;">#00C19A</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00b8e7;">#00B8E7</span>
<span style="color: #00a9ff;">#00A9FF</span>
<span style="color: #8494ff;">#8494FF</span>
<span style="color: #c77cff;">#C77CFF</span>
<span style="color: #ed68ed;">#ED68ED</span>
<span style="color: #ff61cc;">#FF61CC</span>
<span style="color: #ff68a1;">#FF68A1</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e7851e;">#E7851E</span>
<span style="color: #d09400;">#D09400</span>
<span style="color: #b2a100;">#B2A100</span>
<span style="color: #89ac00;">#89AC00</span>
<span style="color: #45b500;">#45B500</span>
<span style="color: #00bc51;">#00BC51</span>
<span style="color: #00c087;">#00C087</span>
<span style="color: #00c0b2;">#00C0B2</span>
<span style="color: #00bcd6;">#00BCD6</span>
<span style="color: #00b3f2;">#00B3F2</span>
<span style="color: #29a3ff;">#29A3FF</span>
<span style="color: #9c8dff;">#9C8DFF</span>
<span style="color: #d277ff;">#D277FF</span>
<span style="color: #f166e8;">#F166E8</span>
<span style="color: #ff61c7;">#FF61C7</span>
<span style="color: #ff689e;">#FF689E</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e88526;">#E88526</span>
<span style="color: #d39200;">#D39200</span>
<span style="color: #b79f00;">#B79F00</span>
<span style="color: #93aa00;">#93AA00</span>
<span style="color: #5eb300;">#5EB300</span>
<span style="color: #00ba38;">#00BA38</span>
<span style="color: #00bf74;">#00BF74</span>
<span style="color: #00c19f;">#00C19F</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00b9e3;">#00B9E3</span>
<span style="color: #00adfa;">#00ADFA</span>
<span style="color: #619cff;">#619CFF</span>
<span style="color: #ae87ff;">#AE87FF</span>
<span style="color: #db72fb;">#DB72FB</span>
<span style="color: #f564e3;">#F564E3</span>
<span style="color: #ff61c3;">#FF61C3</span>
<span style="color: #ff699c;">#FF699C</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #e9842c;">#E9842C</span>
<span style="color: #d69100;">#D69100</span>
<span style="color: #bc9d00;">#BC9D00</span>
<span style="color: #9ca700;">#9CA700</span>
<span style="color: #6fb000;">#6FB000</span>
<span style="color: #00b813;">#00B813</span>
<span style="color: #00bd61;">#00BD61</span>
<span style="color: #00c08e;">#00C08E</span>
<span style="color: #00c0b4;">#00C0B4</span>
<span style="color: #00bdd4;">#00BDD4</span>
<span style="color: #00b5ee;">#00B5EE</span>
<span style="color: #00a7ff;">#00A7FF</span>
<span style="color: #7f96ff;">#7F96FF</span>
<span style="color: #bc81ff;">#BC81FF</span>
<span style="color: #e26ef7;">#E26EF7</span>
<span style="color: #f863df;">#F863DF</span>
<span style="color: #ff62bf;">#FF62BF</span>
<span style="color: #ff6a9a;">#FF6A9A</span></li>
<li><span style="color: #f8766d;">#F8766D</span>
<span style="color: #ea8331;">#EA8331</span>
<span style="color: #d89000;">#D89000</span>
<span style="color: #c09b00;">#C09B00</span>
<span style="color: #a3a500;">#A3A500</span>
<span style="color: #7cae00;">#7CAE00</span>
<span style="color: #39b600;">#39B600</span>
<span style="color: #00bb4e;">#00BB4E</span>
<span style="color: #00bf7d;">#00BF7D</span>
<span style="color: #00c1a3;">#00C1A3</span>
<span style="color: #00bfc4;">#00BFC4</span>
<span style="color: #00bae0;">#00BAE0</span>
<span style="color: #00b0f6;">#00B0F6</span>
<span style="color: #35a2ff;">#35A2FF</span>
<span style="color: #9590ff;">#9590FF</span>
<span style="color: #c77cff;">#C77CFF</span>
<span style="color: #e76bf3;">#E76BF3</span>
<span style="color: #fa62db;">#FA62DB</span>
<span style="color: #ff62bc;">#FF62BC</span>
<span style="color: #ff6a98;">#FF6A98</span></li>
</ol>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-40610843791640972352021-10-31T06:51:00.002-07:002024-02-09T00:17:06.255-08:00Multiple sequence alignments<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5UhUH2pDOwPC_62bl8MWv1-zvRFMfuCQbKxjM149Saax32WXSzJKf91wkJeneCNK-xyf6qCjjghtgLGLQNMeYmTkqpGsA8qi6VSjcJPGHYc16tFT3hAWOQD-p3riJ0ZOpM-RP_eYR_vvSICF-lEn_gxOO9BlOTHs_qL8xDRtUvvNlW_flsb-7z17uizM/s1024/DALL%C2%B7E%202024-02-09%2008.15.29%20-%20Imagine%20a%20unique%20and%20adorable%20creature%20that%20combines%20the%20best%20features%20of%20a%20wombat%20and%20a%20corgi.%20This%20animal%20has%20the%20sturdy,%20compact%20body%20of%20a%20corgi,%20c.webp" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="1024" data-original-width="1024" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5UhUH2pDOwPC_62bl8MWv1-zvRFMfuCQbKxjM149Saax32WXSzJKf91wkJeneCNK-xyf6qCjjghtgLGLQNMeYmTkqpGsA8qi6VSjcJPGHYc16tFT3hAWOQD-p3riJ0ZOpM-RP_eYR_vvSICF-lEn_gxOO9BlOTHs_qL8xDRtUvvNlW_flsb-7z17uizM/w200-h200/DALL%C2%B7E%202024-02-09%2008.15.29%20-%20Imagine%20a%20unique%20and%20adorable%20creature%20that%20combines%20the%20best%20features%20of%20a%20wombat%20and%20a%20corgi.%20This%20animal%20has%20the%20sturdy,%20compact%20body%20of%20a%20corgi,%20c.webp" width="200" /></a></div>A sequence alignment is a rather important tool.<div><ul style="text-align: left;"><li>Sequence conservation is a key ingredient in most nucleotide mutation severity predictors.</li><li>The covariance within it powers the AlphaFold2 Evoformer and other <i>de novo</i> structure predictors.</li><li>The phylogeny extracted from it tells the evolutionary tale of the protein</li></ul>However, on the very basic level,<i> i.e. 
</i>getting a nice figure, far from the world of covariance matrices, it is a slight nuisance.</div><div>Therefore I would like to share some pointers on choosing species and on two Python operations, namely getting the equivalent residue in a homologue and making a figure in Plotly. Just like with docking, where careful and diligent human choices make all the difference, rational choices help greatly with clarity for sequence alignments.<br /><span><a name='more'></a></span>
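<p>To give a flavour of the first operation, getting the equivalent residue in a homologue comes down to counting non-gap characters along two rows of the alignment. The following is a minimal sketch under that assumption, with a made-up function name (<code>equivalent_position</code>) and toy sequences; it is not necessarily how it is done later in this post.</p>

```python
def equivalent_position(aligned_query, aligned_target, query_position, gap='-'):
    """
    Map a 1-based residue number in the ungapped query onto the 1-based
    residue number in the ungapped target, using two rows of an alignment.
    Returns None if the query residue is aligned to a gap in the target.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError('Aligned rows must have the same length')
    query_count = 0
    target_count = 0
    for query_char, target_char in zip(aligned_query, aligned_target):
        if query_char != gap:
            query_count += 1
        if target_char != gap:
            target_count += 1
        # stop at the column holding the requested query residue
        if query_char != gap and query_count == query_position:
            return target_count if target_char != gap else None
    raise IndexError(f'The query has fewer than {query_position} residues')


# Toy rows, invented for illustration only
human = 'MKT-AYIAK'
other = 'M-TQAYLAK'
print(equivalent_position(human, other, 3))  # 2: T3 aligns with T2
print(equivalent_position(human, other, 2))  # None: K2 faces a gap
```

<p>Numbering is 1-based to match how residues are normally reported, and a gap is returned as <code>None</code> rather than the nearest neighbour so that the caller decides what to do with it.</p>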
<h3 style="text-align: left;">Forenote</h3><div><div>This post isn't aimed at going through all the basics of making an MSA, but rather focuses on the three details mentioned, and in particular on animals and not other groups. But briefly, to make an MSA, Muscle (<a href="https://www.ebi.ac.uk/Tools/msa/muscle/" target="_blank">web</a>, <a href="https://www.drive5.com/muscle/downloads.htm" target="_blank">binary</a>) is very robust. If there are too many sequences, <a href="https://www.ebi.ac.uk/Tools/msa/clustalo/">Clustal Omega</a> is a good choice and <a href="http://weizhong-lab.ucsd.edu/cd-hit/">cd-hit</a> is good at thinning out near-identical sequences (akin to using a uniref70, 50 or 30). For tree inference, there are many different options, using different approaches whose discussion has filled textbooks. <a href="https://cme.h-its.org/exelixis/web/software/raxml/" target="_blank">RAxML</a> is really good, but if a quick neighbour-joining tree is required, EBI has a <a href="https://www.ebi.ac.uk/Tools/phylogeny/simple_phylogeny/" target="_blank">web tool </a>for it.<br /><h3 style="text-align: left;">Footnote on lead</h3></div><div>In the lead I mention nucleotide mutation severity predictors, e.g. CADD scores. These just spit out a mysterious number, whereas it is a lot more informative to look at where the residue is in the structure (I may be biased as I wrote <a href="https://venus.sgc.ox.ac.uk/" target="_blank">Venus</a>, a tool to investigate exactly that). But generally buried core residues are conserved, whereas surface residues are less so, unless they are interface residues, conformationally important or involved in some other mechanism!</div><div><br /></div><div>In terms of mapping sequence conservation on structure, <a href="https://consurfdb.tau.ac.il/">Consurf</a> is the tool to use. 
However, even though I prefer mapping conservation onto the structure, an MSA figure is the preferred option for showing sequence conservation in publications, as it states explicitly how far back a residue is conserved, even though it does not show secondary structure or surface exposure.</div><div><br /></div></div><h3 style="text-align: left;">Making the MSA itself: humans</h3>
<div>A rather key ingredient in making the multiple sequence alignment (MSA) itself is choosing the species.</div><div>For high-throughput projects this is easy: BLAST and remove redundancy. For MSAs aimed at human readers, it is not. Ideally the species should be well spaced out and recognisable to the reader.</div><div><br /></div><div>Animal protein trees near-universally follow the species tree, with the exception of gene duplication events.</div><div>As a result, assembling the group of sequences requires going back along the evolutionary tree and picking species whose ancestor branched off earlier and earlier, with a strong focus on genome duplication events, as these may be when the gene was duplicated.<br />Doing a BLAST search will return an endless troop of monkeys and conspiracy of lemurs, which lacks divergence, so often a manual selection is more useful for figures. This has four points to note.</div><div><br /></div><h4 style="text-align: left;">1. Genome quality</h4><div>A big problem is that the genomes of many phylogenetically useful animals (e.g. coral, sea urchin, wallaby, dog etc.) may not be as well curated as one would hope, resulting in shorter proteins (missing exons). Mice (rodents) and rabbits (lagomorphs) are a sister clade to monkeys and lemurs, so are not as distant as one would think. However, mice are a model organism with a well-curated genome, so they are often included in MSAs, albeit often pointlessly.</div><div>The annotated name is highly variable, therefore doing a BLAST search filtered for the particular species of interest is the better approach.</div><div><br /></div><div><h4 style="text-align: left;">2. NCBI or Uniprot?</h4>I normally use Uniprot because the curation is better for model organisms, but NCBI is way better curated than Uniprot for non-model organisms. 
In NCBI, there may be multiple entries with different lengths. These may be different isoforms or a paralogue: in a BLAST search the identity percentage will be similar for the former, but not for the latter.</div><div><br /></div><h4 style="text-align: left;">3. Sidenote: common name or scientific name?</h4><div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3TXZcuTEbKDoxSVIav27mpv0tpWLH-DuBdHPZo3YBZn7Ct8O0t6ZB8QDn6vbDmZTBAdyjRX7IhtePaIZpLyLYELbvr5CenTTrToszEpcu-BERUxfroSCeD7NCET2Y8CC7-jzGmBOHFXo/s1198/Screenshot+2021-10-31+at+10.19.37.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1198" data-original-width="986" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3TXZcuTEbKDoxSVIav27mpv0tpWLH-DuBdHPZo3YBZn7Ct8O0t6ZB8QDn6vbDmZTBAdyjRX7IhtePaIZpLyLYELbvr5CenTTrToszEpcu-BERUxfroSCeD7NCET2Y8CC7-jzGmBOHFXo/w262-h320/Screenshot+2021-10-31+at+10.19.37.png" width="262" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><div style="text-align: left;"><span style="color: #666666;">The scientific names in a Blast search:</span></div><span style="color: #666666;"><ul><li style="text-align: left;"><span style="color: #666666;">Sus → suine = pig</span></li><li style="text-align: left;"><span style="color: #666666;">Trichechus → manatee, not walrus, even though a walrus is a «tricheco» in Italian</span></li><li style="text-align: left;"><span style="color: #666666;">Vombatus → Wombat. Cute Scientific name!</span></li><li style="text-align: left;"><span style="color: #666666;">Nyctibius → nyx is night so... bat? No, 
it's a bird</span></li></ul></span></td></tr></tbody></table><br />I <u>personally</u> prefer using the common name of the species or, better, the single-word name of a parent taxon, because to me most scientific names are gibberish. I can speak Italian, which makes guessing some less common Latin names easier, for example a blackbird is a «merlo» and its Latin name is <i>Turdus merula*</i>, but most are still alien, for example a red kite is a «nibbio reale», but its Latin name is <i>Milvus milvus</i>.<br /><br />*) In the above sentence, I should have said "Eurasian blackbird" if I wanted to be pedantically precise, but for most applications this is just pointless noise...</div><div><br />I am also of the opinion that using kelvin in an enzymology paper is utterly ridiculous, but I have been told this is a sign I am a rubbish scientist. So, as said, the preference for intelligible common names is my personal opinion and I may be a bad scientist for it!<br /><h4 style="text-align: left;">4. Animal phylogeny</h4></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://i.natgeofe.com/n/4749d875-8c53-463e-a6bf-49b8022b253b/36320_3x4.jpg" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="600" height="136" src="https://i.natgeofe.com/n/4749d875-8c53-463e-a6bf-49b8022b253b/36320_3x4.jpg" width="102" /></a></div>Choosing non-model animals requires a crash course on animal phylogeny and in particular on when genome duplication events happened. Some organisms that would be useful do not have a sequenced genome, so the choice is limited. Also, if a gene appears to be absent in one of them, it may be an annotation or coverage issue, so checking sister species is ideal if possible. The number of unsequenced animals is large: the adorable Mexican axolotl is the sole sequenced salamander, while no newt is. 
</div><div><br /></div><div><div>Some watery organisms stand out:</div></div><div><ul style="text-align: left;"><li><b>Lampreys</b> (<i>Petromyzon marinus</i>) are the stuff of nightmares, but predate the double genome duplication event in jawed fish, so are very nice phylogenetically as they often end up being the outgroup. When a gene family is already duplicated in lampreys one can go back further:</li><li><b>Sea urchin</b> (<i>Strongylocentrotus purpuratus</i>) is basal to vertebrates. It is an invertebrate, but not of the insect, mollusc and worm kind (it is a deuterostome, not a protostome), so it is more closely related to us than insects, molluscs and worms are.</li><li><b>Coral</b> (<i>Acropora millepora</i>) is an animal that is basal to all others.</li><li><b>Zebrafish </b>(<i>Danio rerio</i>) is a ray-finned fish with a <u>curated</u> genome. Most ray-finned fish, however, descend from an ancestor that underwent a genome duplication event. Tetrapods (land vertebrates) evolved from lobe-finned fish, whose extant examples are lungfish and coelacanths. </li><li><b>Coelacanth</b> (<i>Latimeria chalumnae</i>) is not a Pokémon found in New Zealand (that's Relicanth), but a fish basal to tetrapods. Together ray-finned fish and lobe-finned fish form the bony fish, whose common ancestors had two genome duplication events, one after lampreys diverged and the second possibly after sharks (jawed cartilaginous fish).</li><li><b>Great white shark</b> (<i>Carcharodon carcharias</i>) is a sequenced shark along with two others.</li><li><b>Sturgeon</b> (<i>Acipenser ruthenus</i>) is a basal ray-finned fish that predates the genome duplication seen in zebrafish. For the purpose of diversity of a gene that duplicated in lampreys, fish are really useful as they are diverse despite all looking the same, so picking some fish near randomly would work. 
Do note that zebrafish is a minnow, which is in the same family as carp (sequenced).</li><li><b>Axolotl</b> (<i>Ambystoma mexicanum</i>) is an amphibian, and amphibians are basal tetrapods.<br /><br /></li></ul><div>Tetrapod phylogeny after amphibians is straightforward as there are two clades, one with reptiles and birds, the other with mammals. As a result, using one representative for the former, say chicken (<i>Gallus gallus</i>), would suffice. Parenthetically, I should say that all extant species have evolved to the present day, even if they look primitive, so there's no difference in terms of divergence distance from humans between an iguana and an eagle.<br />In mammals, the phylogeny is well known. The basal monotremes, platypus (<i>Ornithorhynchus anatinus</i>) and echidna (not sequenced), lay eggs, and marsupials, such as wallabies (<i>Notamacropus eugenii</i>, no big kangaroos sequenced) and wombats (<i>Vombatus ursinus</i>), have pouches. Then there's a radiation of different placental mammals, which means that pinning down the precise relationship between the clades has been tricky and that the sequence divergence on show is not going to suffer if the chosen representative species is not quite basal. For the details of mammalian phylogeny see <a href="https://www.cell.com/trends/ecology-evolution/fulltext/S0169-5347(04)00142-9" target="_blank">this review</a>. Also, if someone is interested in showing a protein MSA, the best example of sequence conservation/divergence is further down the tree. 
Two things need noting: mammalian carnivores form a single clade and many mammals that come to mind are from this clade, while cloven-hoofed animals and cetaceans form a clade (Cetartiodactyla). This is not important, but I really wanted to mention that hippos and whales had a common ancestor called a "whippo" (hence the clade name <i>Whippomorpha</i>), which is a fantastic name.</div></div><h3 style="text-align: left;">Bacterial MSA</h3><div>For bacteria, things get messy due to widespread horizontal gene transfer, so gene trees rarely match species trees. In their favour is the fact that their genomes are smaller and far easier to annotate, so there's a much larger pool of sequenced species to choose from. However, they are much more diverse. As a result I cannot give pointers...</div></div><h3 style="text-align: left;">Python and MSAs</h3><div>R has market dominance over Python when it comes to genetic bioinformatics, but generally the operations are simple, so there's not too much of an issue. The most complete MSA plotting tool would be TeXshade in LaTeX. I dislike it as it is highly complicated, I always seem to want some near-impossible corner case that would have been easier to do from scratch, and I am addicted to Jupyter notebooks...</div>
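<p>As a small aside before the code proper, the species suggested in the phylogeny section can be kept in a lookup table and turned into an organism filter for a BLAST or Entrez search. This is only a sketch: the dictionary and the function name <code>organism_filter</code> are made up for illustration, and the <code>[Organism]</code> field tag is assumed to be what the search box expects.</p>

```python
from typing import Dict, List

# Representative animals from the phylogeny section (common name -> scientific name).
# The selection is illustrative; swap species in and out as the gene family requires.
REPRESENTATIVES: Dict[str, str] = {
    'coral': 'Acropora millepora',
    'sea urchin': 'Strongylocentrotus purpuratus',
    'lamprey': 'Petromyzon marinus',
    'great white shark': 'Carcharodon carcharias',
    'sturgeon': 'Acipenser ruthenus',
    'zebrafish': 'Danio rerio',
    'coelacanth': 'Latimeria chalumnae',
    'axolotl': 'Ambystoma mexicanum',
    'chicken': 'Gallus gallus',
    'platypus': 'Ornithorhynchus anatinus',
    'wombat': 'Vombatus ursinus',
}


def organism_filter(common_names: List[str]) -> str:
    """Join the scientific names into an Entrez-style organism filter (hypothetical helper)."""
    terms = [f'"{REPRESENTATIVES[name]}"[Organism]' for name in common_names]
    return ' OR '.join(terms)


print(organism_filter(['lamprey', 'zebrafish']))
```

<p>Keeping the common names as keys also gives ready-made, readable labels for the figure later on.</p>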
<p>Starting from the basics, retrieving sequences from NCBI is easy with BioPython:</p>
<pre><code>from Bio import Entrez
Entrez.email = "👻@👻.👻"
Entrez.api_key = "👾👾👾👾" # you get this by registering
# target is an NCBI protein accession
with Entrez.efetch(db="protein", id=target, rettype="fasta", retmode="text") as handle:
    fasta = handle.read()</code></pre>
<p>Likewise, with UniProt it is even easier:</p><pre><code>import requests

url = f'https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta'
reply = requests.get(url)
assert reply.status_code == 200, f'{reply.status_code}: {url}'
fasta = reply.text</code></pre><p>One thing to note is that Uniprot and NCBI have different header formats. For example, here are two functions to get information from them (without using BioPython):</p>
<pre><code>import re

def split_ncbi_fasta(fasta:str) -> dict: # keys: ['accession', 'description', 'species', 'sequence']
    return re.match(r'>(?P<accession>\S+)\s+(?P<description>.*)\s+\[(?P<species>.*)\](?P<sequence>\w*)', fasta.replace('\n','')).groupdict()</code></pre><pre><code>def split_uniprot_fasta(fasta:str) -> dict: # keys: ['type', 'accession', 'name', 'description', 'args', 'OS' = species, 'OX', 'GN', 'PE', 'sequence']
    match = re.match(r'\>(?P<type>\w+)\|(?P<accession>.*?)\|(?P<name>[\w\_]+)\s(?P<description>.*?)\s(?P<args>\w+\=.*)', fasta).groupdict()
    details = dict(re.findall(r'(\w+)\=(.*?)\s+\n',  # there is a way to split this more cleanly...
                              re.sub(r'(\w+\=)', r'\n\1', match['args']+'\n')
                              )
                   )
    return {**match, **details, 'sequence': ''.join(fasta.split('\n')[1:])}</code></pre>
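<p>With the headers parsed, writing a FASTA file with cleaner labels for the aligner and reading the aligned output back takes little code. The following is a minimal sketch using plain dictionaries; the function names <code>to_fasta</code> and <code>from_fasta</code> are made up here, and the label cleaning (spaces to underscores) is just one of many possible personal choices.</p>

```python
from typing import Dict, Optional


def to_fasta(seqs: Dict[str, str], width: int = 60) -> str:
    """Write label -> sequence pairs as FASTA text, wrapping sequences at `width`."""
    blocks = []
    for label, seq in seqs.items():
        clean = label.strip().replace(' ', '_')  # aligners tend to cut labels at whitespace
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        blocks.append('>' + clean + '\n' + '\n'.join(lines))
    return '\n'.join(blocks) + '\n'


def from_fasta(text: str) -> Dict[str, str]:
    """Read FASTA text (aligned or not) into a label -> sequence dictionary."""
    seqs: Dict[str, str] = {}
    label: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            label = line[1:].split()[0]  # keep only the first word of the header
            seqs[label] = ''
        elif label is not None:
            seqs[label] += line
    return seqs
```

<p>Since gaps are ordinary characters, the same reader works on the aligner's output, giving the dictionary of aligned sequences used further down.</p>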
<p>The next few operations (i.e. writing a fasta file with headers of one's choice, aligning locally via a muscle3 system call or remotely via the Muscle server or any other aligner, and reading the aligned sequences again) are very straightforward, but involve a few personal choices, e.g. how to clean the labels, whether to use pandas or dictionaries etc.</p><p>One operation that is less straightforward is getting the equivalent residue position between two sequences. It is otherwise simple, namely creating a list where each index corresponds to the zero-indexed ungapped position of the sequence, while the elements are the indices of these in the alignment, but one has to remember to correct the positions, as residue numbers are not zero-indexed:</p>
<pre><code>from typing import List

def get_mapping(seq: str) -> List[int]:
    """Given a sequence with gaps return a zero-indexed
    list mapping ungapped position --> gapped position """
    return [i for i, p in enumerate(seq) if p != '-']

def convert(query_seq: str, target_seq: str, position: int) -> int:
    """
    Given a query sequence (aligned) and target sequence (aligned)
    and an ungapped query position, return the ungapped target position

    :param query_seq: aligned
    :param target_seq: aligned
    :param position: query residue number counted from 1
    :return: target residue number counted from 1
    """
    query_mapping = get_mapping(query_seq)
    target_mapping = get_mapping(target_seq)
    mp = query_mapping[position - 1]
    # raises ValueError if the target has a gap at this alignment column
    return target_mapping.index(mp) + 1</code></pre>
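<p>When many positions are needed, or the target may have a gap at the position of interest, it can be handier to walk the two aligned sequences once and store every pair. The sketch below is an alternative to the functions above rather than a replacement; the name <code>map_all</code> and the choice of <code>None</code> for "no equivalent residue" are my own.</p>

```python
from typing import Dict, Optional


def map_all(query_seq: str, target_seq: str) -> Dict[int, Optional[int]]:
    """Map every 1-indexed query residue to its 1-indexed target residue.

    Both sequences are aligned (gapped). A query residue facing a gap
    in the target maps to None.
    """
    if len(query_seq) != len(target_seq):
        raise ValueError('Aligned sequences must have the same length')
    mapping: Dict[int, Optional[int]] = {}
    q = t = 0  # running residue numbers of query and target
    for q_char, t_char in zip(query_seq, target_seq):
        if t_char != '-':
            t += 1
        if q_char != '-':
            q += 1
            mapping[q] = t if t_char != '-' else None
    return mapping


print(map_all('MK-TAY', 'MRLT-Y'))  # {1: 1, 2: 2, 3: 4, 4: None, 5: 5}
```

<p>Here the fourth query residue (A) faces a gap in the target, so it gets <code>None</code> instead of raising an error.</p>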
<p>The next thing is the figure. This is actually really straightforward and can be done with a scatterplot in text mode, where each datapoint is a residue shown as a text label with no marker, and the x and y positions simply lay the letters out on a grid. For example:</p><pre><code>import plotly.graph_objects as go
import numpy as np
from collections import Counter
from typing import *


def make_fig(al_seqs: Dict[str, str], position: int, names: Optional[List[str]] = None) -> go.Figure:
    """
    This makes a fake scatter plot with letters for each position in Plotly.

    :param al_seqs: aligned seq dictionary of names to seqs w/ gaps (aligned)
    :param position: position to show based on the first name
    :param names: order of the seqs to show top to bottom. If a subset, only these will be shown
    :return:
    """
    # --------------------------------------------------
    # prep
    if names is None:
        names = list(al_seqs.keys())
    elif len(names) != len(al_seqs):
        # subset: drop the columns that are gaps in all of the chosen sequences
        sub_al_seqs = {name: [] for name in names}
        for i in range(len(al_seqs[names[0]])):
            if any([al_seqs[name][i] != '-' for name in names]):
                for name in names:
                    sub_al_seqs[name].append(al_seqs[name][i])
        for species in sub_al_seqs:
            sub_al_seqs[species] = ''.join(sub_al_seqs[species])
        al_seqs = sub_al_seqs
    # mapping
    first = al_seqs[names[0]]
    n = len(first)
    mapping = get_mapping(first)
    position = position - 1
    mid = mapping[position]
    win_start, win_stop = mid - 10, mid + 10  # 21-column window, which may overhang the alignment
    # out of bounds prevention: clip the slice and pad the text so it stays 21 letters long
    start = max(win_start, 0)
    stop = min(win_stop, n - 1)
    forepadding = ['>'] * (start - win_start)
    aftpadding = ['<'] * (win_stop - stop)
    # --------------------------------------------------
    # Add letters
    scatters = [go.Scatter(x=np.arange(win_start, win_stop + 1),
                           # y=[-i/yzoom] * 20,
                           y=[name] * 21,
                           text=forepadding + list(al_seqs[name][start:stop + 1]) + aftpadding,
                           textposition="middle center",
                           # textfont_size=10,
                           textfont=dict(
                               family="monospace",
                               size=12,
                               color="black"
                           ),
                           name=name,
                           mode='text') for i, name in enumerate(names)
                ]
    # --------------------------------------------------
    # Add red box
    fig = go.Figure(data=scatters)
    fig.add_shape(type="rect",
                  x0=mid - 0.5, x1=mid + 0.5,
                  # y0=0.5/yzoom, y1=(0.5 -len(al_seqs))/yzoom,
                  y0=0 - 0.5, y1=len(names) - 1 + 0.5,  # the len is off-by-one
                  line=dict(color="crimson"),
                  )
    fig.update_shapes(dict(xref='x', yref='y'))
    # --------------------------------------------------
    # Add turquoise squares
    for i in range(start, stop + 1):
        residues = [al_seqs[name][i] for name in names]
        mc = Counter(residues).most_common()[0][0]
        for j, r in enumerate(residues):
            if mc == '-':
                continue
            elif r == mc:
                fig.add_shape(type="rect",
                              x0=i - 0.5, x1=i + 0.5,
                              # y0=(-j -.5)/yzoom, y1=(-j +.5)/yzoom,
                              # y0=names[j], y1=names[j],
                              y0=j - 0.5, y1=j + 0.5,
                              fillcolor="aquamarine", opacity=0.5,
                              layer="below", line_width=0,
                              )
    # --------------------------------------------------
    # Correct
    fig.update_layout(template='none',
                      showlegend=False,
                      title=f'Residue {position + 1}',
                      xaxis={
                          'showgrid': False,
                          'zeroline': False,
                          'visible': False,
                          # 'range': [start,stop],
                      },
                      yaxis={
                          'showgrid': False,
                          'zeroline': False,
                          'visible': True,
                          'autorange': 'reversed'
                      }
                      )
    # --------------------------------------------------
    return fig</code></pre><p>This would make something like:</p><div class="separator" style="clear: both;"><a href="https://user-images.githubusercontent.com/11302983/125954151-a869067f-88b1-4b3e-9558-3905257350b7.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="500" data-original-width="700" src="https://user-images.githubusercontent.com/11302983/125954151-a869067f-88b1-4b3e-9558-3905257350b7.png" width="320" /></a></div><p>By doing it this way one can easily change how the plot is made, such as colouring only the residues that match the first sequence or making the labels slanted (<code>fig.update_layout(yaxis=dict(tickangle = -45) )</code>). One thing to note, however, is that the above function draws only a window. If the whole sequences are plotted instead and the view is restricted via the range argument of the xaxis layout, one can zoom and scroll around, but the figure becomes very heavy due to the added shapes (turquoise background).</p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-88052871924358565112021-10-17T06:06:00.002-07:002021-10-17T07:28:33.863-07:00Filling missing loops by cannibalising AlphaFold2<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGVct0K5-r9BnAxRSVkTEwhfDHo2O8OuMiU9oCMc18YYrqwpDsneFcdRtVQxNMbGOq0j9O2m_AE0iP_E9WdbqOcezvIQ4GwsoWOrq2v0VtRgg6kMZzC7x37v0YleiyT-fjgWM5IISmwS4/s1043/ripped_out.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="548" data-original-width="1043" height="105" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGVct0K5-r9BnAxRSVkTEwhfDHo2O8OuMiU9oCMc18YYrqwpDsneFcdRtVQxNMbGOq0j9O2m_AE0iP_E9WdbqOcezvIQ4GwsoWOrq2v0VtRgg6kMZzC7x37v0YleiyT-fjgWM5IISmwS4/w200-h105/ripped_out.png" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="color: #999999; font-size: x-small;">I could not resist this Photoshop.<br />But the process is not as dramatic<br />and the results not as bad as Temple of Doom...<br />If done right.</span></td></tr></tbody></table>AlphaFold2 models have a complete sequence, whereas the crystal structure of the protein is, for innumerable reasons, better but may have missing spans. As a result one may want, for illustrative purposes only, to rip out the required parts from the AlphaFold2 models (as fragments) and have them built into the target structure. Here is how to do it by threading.<p></p><a name='more'></a><h2 style="text-align: left;">Cannibalistic threading</h2><div>Note: I have made a Colab notebook that does the operations described below (<a href="https://colab.research.google.com/github/matteoferla/pyrosetta_help/blob/main/colabs/colabs-thread_by_AF2_cannibalism.ipynb">here</a>).</div><div><br /></div><div>In two previous blog posts, I went through <a href="http://blog.matteoferla.com/2020/07/filling-missing-loops-proper-way.html" target="_blank">how to add missing loops in Rosetta or PyRosetta</a> and <a href="https://blog.matteoferla.com/2017/10/hacking-pdbs-for-fusion-protein.html" target="_blank">how to add missing loops hackishly in PyMOL</a>. Here I'll go through how one can add them by threading onto the structure and using fragments stolen from another, cannibalised, structure to fill the missing parts. </div><div>Given a structure of a homologue (the template) and a target sequence one can "thread" the positions of the latter onto the former. There are many tools out there; here I will use PyRosetta with the RosettaCM threading mover. 
As threading is actually meant for homologues, I will cover those too.</div><div>The reason why I am misusing it for filling missing loops is that the infrastructure is already there: a good coder is a lazy one! Other crystal structures can also be used as the fragment donors and I would strongly suggest this (instead of, or in addition to, the AlphaFold2 model). The AlphaFold2 model is mentioned not because of the hashtag-trendiness, but because these have a complete sequence, albeit <a href="https://blog.matteoferla.com/2021/07/what-to-look-out-for-with-alphafold2.html">with spaghetti loops of low quality as previously discussed</a>, and do not blow up as SwissModels do (<a href="https://blog.matteoferla.com/2021/08/tweaking-alphafold2-models-with.html">discussed here</a>). A nice benefit is that the number offset is also corrected.</div><div><br />In my <a href="https://pypi.org/project/pyrosetta-help/" target="_blank">pyrosetta_help</a> module, there are several simple functions to deal with making a threaded model.<br />This includes a function <code>threaded_pose, threader, unaltered = thread(unaligned_target_sequence, template_pose)</code>, which does all the operations below (alignment, threading, sidechains and ligand theft), but most times there is something technical in the way, which is why I will go through the steps. Otherwise, here is how to do it cryptically with that module:</div>
<pre><code>import pyrosetta_help as ph  # PyRosetta needs to have been initialised first (pyrosetta.init)

uniprot = 'P08684'
template_pose = ph.parameterised_pose_from_file(ph.download_pdb('5A1R'), overriding_params=['HEM.params'])
af_pose = ph.pose_from_alphafold2(uniprot)
fragsets = ph.make_fragment_sets(af_pose)
threaded, threader, threadites = ph.thread(target_sequence=af_pose.sequence(),
                                           template_pose=template_pose,
                                           fragment_sets=fragsets)
ph.steal_ligands(template_pose, threaded)
threaded.dump_pdb('threaded.pdb')</code></pre>
<p>Here P08684 is the Uniprot accession and 5A1R is the PDB code. This has one oddity in that the SMILES provided by the PDB for the residue HEM (<code>CC1=C(CCC(O)=O)C2=Cc3n4[Fe]5|6|N2=C1C=c7n5c(=CC8=N|6C(=Cc4c(C)c3CCC(O)=O)C(=C8C=C)C)c(C)c7C=C</code>) fails to parse, so I provided the HEM.params file found in one of the Rosetta tests (it lacks the inter-residue axial donor bonds with his/cys and an oxygen, but haem is a nightmare for another time). This illustrates that a one-size-fits-all solution rarely works. Anyway this is the result of the above:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtHlZ9-l1c2PuzNHm4gT6-Wt8vpIlVcrOod2oaubzrU1fs-VplWdLTVrWqUSaP40QiEZDZMrmJ74rtBo4GtXLQH3b-itmZkBeg5mYrvgms3IPLFg0IZd1EZmxvaBou4sVG5OHlZPTx7I4/s1408/process.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1408" data-original-width="1255" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtHlZ9-l1c2PuzNHm4gT6-Wt8vpIlVcrOod2oaubzrU1fs-VplWdLTVrWqUSaP40QiEZDZMrmJ74rtBo4GtXLQH3b-itmZkBeg5mYrvgms3IPLFg0IZd1EZmxvaBou4sVG5OHlZPTx7I4/s320/process.png" width="285" /></a></div><br /><p>Before the pitchforks come out, I should say that (a) yes, the image is from PyMOL because it's prettier, but for checking stuff in a Jupyter notebook I always opt for NGLView, and (b) the missing bits are added from fragments, which is why there is not a 1:1 match. This is a good thing, as otherwise there would be nastily closed loops.</p><h2 style="text-align: left;">Steps explained</h2><h3 style="text-align: left;">Alignment</h3><div>An important feature is having a decent pairwise alignment. In Biopython, there is a handy, albeit quirky, set of pairwise alignment functions in <a href="https://biopython.org/docs/1.75/api/Bio.pairwise2.html" target="_blank">the module pairwise2</a>.</div><div></div><p></p><pre><code>from Bio import pairwise2
alignments = pairwise2.align.globalxs(target,
                                      template,
                                      -1,  # open
                                      -0.1  # extend
                                      )</code></pre>
<p>Refreshingly, target and template are strings, not <code>Bio.Seq.Seq</code> instances, but there are a few oddities:</p><p></p><ul style="text-align: left;"><li>The global and local functions are not overloaded; instead the different modes are controlled by two letters (the xs above does not mean extra-small).</li><li>The functions do not accept named arguments, making them confusing ten minutes after use.</li><li>Non-amino acid letters are aligned with no penalty: w, X and Z will align happily to the twenty AAs.</li><li>The output <code>alignments</code> above is a list of tuples whose elements have different meanings. Here is an example making it more sensible:</li></ul><pre><code>alignments = [dict(zip(['target', 'template', 'score', 'begin', 'end'], alignment)) for alignment in alignments]</code></pre><p>Parenthetically, there is a nice function (<code>format_alignment</code>) to add dots for matches <i>etc.</i>, but if one is using a Jupyter notebook, a scroll bar is a must to display the output in a sensible way:</p><pre><code>from Bio import pairwise2
formatted:str = pairwise2.format_alignment(*alignment)
a, gap, b, score = formatted.strip().split('\n')
gap = ''.join(['.' if c == '|' else '*' for c in gap]).replace(" ", "*")
from IPython.display import display, HTML
display(HTML(f'<div style="display: inline-block; font-family: monospace; white-space: nowrap;">'+
             f'{a}<br />{gap}<br />{b}<br />{score}</div>'))</code></pre><p>In the case of distant proteins, it may be best to align multiple sequences and trim down. Biopython cannot make multiple sequence alignments itself, but there are several tools: Clustal was the original kid on the block, but Muscle is more accurate according to benchmarks.</p><p>For the RosettaCM framework, alignments are stored in the "<a href="https://www.rosettacommons.org/docs/latest/rosetta_basics/file_types/Grishan-format-alignment" target="_blank">Grishin format</a>". In my aforementioned helper module, there is a function to write a Grishin file, so here is how to go from a pairwise2 alignment to a Grishin file.</p>
<pre><code>
import pyrosetta_help as ph
alignment = ph.get_alignment(target_sequence, template_pose.sequence())
aln_file = f'{target_name}.aln'
ph.write_grishin(target_name,
                 alignment['target'],
                 template_name,
                 alignment['template'],
                 aln_file)
</code></pre>
<h3 style="text-align: left;">Movers</h3><p>Obviously, if a Muscle alignment was done, the first step is different. Once a Grishin alignment file has been written, the fun can start:</p>
<pre><code>align = pyrosetta.rosetta.core.sequence.read_aln(format='grishin', filename=aln_file)
threader = pyrosetta.rosetta.protocols.comparative_modeling.ThreadingMover(align=align[1],
                                                                           template_pose=template_pose)
target_pose = pyrosetta.Pose()
pyrosetta.rosetta.core.pose.make_pose_from_sequence(target_pose, target_sequence, 'fa_standard')
threader.apply(target_pose)</code></pre>
<p>This has four things to note:</p>
<ul style="text-align: left;">
<li>If the template has termini these will be maintained</li>
<li>Mismatching residues and gaps will result in some residues not being placed</li>
<li>Sidechains have not been mimicked</li><li>Ligands have not been stolen and should not have been present in the template to start with (even for the alignment step)</li>
</ul>
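<p>To get a feel for the second point before running the mover, one can list which target residues face a gap in the template, as these have no coordinates to copy. This is a plain-Python sketch on the two aligned strings, not something the threading mover exposes; the function name <code>unplaced_positions</code> is hypothetical and mismatches are deliberately not considered.</p>

```python
from typing import List


def unplaced_positions(aligned_target: str, aligned_template: str) -> List[int]:
    """Return the 1-indexed target residues that are aligned to a gap in the template."""
    unplaced = []
    target_index = 0  # running residue number in the ungapped target
    for t_char, m_char in zip(aligned_target, aligned_template):
        if t_char == '-':
            continue  # a template-only column: nothing to place
        target_index += 1
        if m_char == '-':
            unplaced.append(target_index)
    return unplaced


print(unplaced_positions('MKLTAY', 'MK--AY'))  # [3, 4]
```

<p>Long runs in the output are the loops that will need building from fragments.</p>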
<h3 style="text-align: left;">Termini</h3><p>A peptide's N-terminus is protonated, while the C-terminus has an oxygen called <code>OXT</code>. In Rosetta termini are a patch, like phosphorylations <i>etc.</i> To remove these:</p>
<pre><code>clean_template_pose = template_pose.clone()
pyrosetta.rosetta.core.pose.remove_nonprotein_residues(clean_template_pose)
### find
lowers = pyrosetta.rosetta.utility.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
uppers = pyrosetta.rosetta.utility.vector1_std_pair_unsigned_long_protocols_sic_dock_Vec3_t()
pyrosetta.rosetta.protocols.sic_dock.get_termini_from_pose(clean_template_pose, lowers, uppers)
### remove
rm_upper = pyrosetta.rosetta.core.conformation.remove_upper_terminus_type_from_conformation_residue
rm_lower = pyrosetta.rosetta.core.conformation.remove_lower_terminus_type_from_conformation_residue
for upper, _ in uppers:
    rm_upper(clean_template_pose.conformation(), upper)
for lower, _ in lowers:
    rm_lower(clean_template_pose.conformation(), lower)</code></pre>
<h3 style="text-align: left;">Fragments</h3><p>The residues that were not placed (mismatches and gaps) stay in their original places. So if an AlphaFold2 pose was given as the target these would be comically left behind.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD70KLuaXi_s-XGFoCL_bBKJQiel28H_cag1YJ-dm6qPhPs8xGnDkzqFF-iLtDZzGnw_S2ooAbLu-AmqqljKc7RDFTmh8Jw6TeGpoPRCyU72DFDATNq5Zt18GZ9mbHMFPMYjlKDz6lkGE/s2899/AF2_threaded.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1085" data-original-width="2899" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD70KLuaXi_s-XGFoCL_bBKJQiel28H_cag1YJ-dm6qPhPs8xGnDkzqFF-iLtDZzGnw_S2ooAbLu-AmqqljKc7RDFTmh8Jw6TeGpoPRCyU72DFDATNq5Zt18GZ9mbHMFPMYjlKDz6lkGE/s320/AF2_threaded.png" width="320" /></a></div><br /><div>In the case of creating a pose from a sequence, these will be in the weird 12Å ring structure centred around the origin, which actually is nicer to work with.<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwTskFF-ruekPgKNn_IqEpghvghgwcUGdr44RtmkIStTUu8RVHStzU8eNP2NqQpjf4oRo846PjQ-6vpfna4EBEjqmIQptwT-d4xlaiNhH1O3B0ENDvy08Y4ROf_mwg74zpYtzbtUQqsgA/s1280/thread.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="951" data-original-width="1280" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwTskFF-ruekPgKNn_IqEpghvghgwcUGdr44RtmkIStTUu8RVHStzU8eNP2NqQpjf4oRo846PjQ-6vpfna4EBEjqmIQptwT-d4xlaiNhH1O3B0ENDvy08Y4ROf_mwg74zpYtzbtUQqsgA/s320/thread.png" width="320" /></a></div></div><p>Parenthetically, if for another application one is making a pose from sequence, but is in need of a specific secondary structure, iterating across the residues and changing the ψ and φ angles is the way to do it. In <code>pyrosetta_help</code> the functions 
<code>make_ss</code> (ψ and φ to be specified), <code>make_alpha_helical</code> (φ=-57.8°, ψ=-47.0°), <code>make_310_helical</code> (φ=-74.0°, ψ=-4.0°), <code>make_pi_helical</code> (φ=-57.1°, ψ=-69.7°), <code>make_sheet</code> (φ=-139°, ψ=+135°) do this.</p>
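<p>The idea behind those helpers can be sketched in a few lines. The following is not the <code>pyrosetta_help</code> implementation, just an illustration with a made-up function name (<code>set_secondary_structure</code>): it relies only on the <code>Pose</code> methods <code>total_residue</code>, <code>set_phi</code>, <code>set_psi</code> and <code>set_omega</code>, and setting ω to 180° (trans) is my assumption.</p>

```python
from typing import Dict, Tuple

# (phi, psi) in degrees, as quoted in the text above
BACKBONE_ANGLES: Dict[str, Tuple[float, float]] = {
    'alpha_helix': (-57.8, -47.0),
    '310_helix': (-74.0, -4.0),
    'pi_helix': (-57.1, -69.7),
    'sheet': (-139.0, 135.0),
}


def set_secondary_structure(pose, kind: str, start: int = 1, stop: int = -1) -> None:
    """Set the backbone torsions of residues start..stop (1-indexed, inclusive) in place.

    `pose` is expected to be a pyrosetta.Pose, but any object with the
    same four methods will do.
    """
    phi, psi = BACKBONE_ANGLES[kind]
    if stop == -1:
        stop = pose.total_residue()
    for r in range(start, stop + 1):
        pose.set_phi(r, phi)
        pose.set_psi(r, psi)
        pose.set_omega(r, 180.0)  # assumption: trans peptide bonds throughout
```

<p>With a real pose this would be called as, say, <code>set_secondary_structure(pose, 'alpha_helix')</code> straight after making the pose from sequence.</p>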
<p>To solve the unthreaded bits, one can use fragments stolen from other poses.</p>
<pre><code>confragset = pyrosetta.rosetta.core.fragment.ConstantLengthFragSet(3)
pyrosetta.rosetta.core.fragment.steal_constant_length_frag_set_from_pose(cannibilised_pose, confragset) # does not blank pre-existing values
fragsets = pyrosetta.rosetta.utility.vector1_std_shared_ptr_core_fragment_FragSet_t(1) # or however many
fragsets[1] = confragset # .append(confragset) to add to the end, i.e. zero in the above.
</code></pre>
<p>A vector of sets is used because one may want different constant lengths (e.g. 3 &amp; 9), or sets read from different files <i>etc</i>. Here <code>cannibilised_pose</code> is the pose that is cannibalised for fragments. It can even have ligands. It can also be an AlphaFold2 model, which has a full sequence: AlphaFold2 models do not blow up, unlike SwissModel models, so the backbone angles must be <i>decent enough</i>.</p><pre><code>cannibilised_pose = ph.pose_from_alphafold2('👾👾👾')</code></pre>
<p>These can be saved too, although this is not required for threading. </p><pre><code>fragio = pyrosetta.rosetta.core.fragment.FragmentIO()
fragio.write_data('👾👾👾.frags', confragset)</code></pre>
<p>To use in threading, these need to be passed to the threading mover before applying to the pose.</p><pre><code>threader.build_loops(True)
threader.randomize_loop_coords(True) # default
threader.frag_libs(fragsets)</code></pre>
<h3 style="text-align: left;">Sidechains</h3><div>The next issue is that the sidechains need stealing, which is where <code>StealSideChainsMover</code> comes in. First, this needs to be told the equivalence of the two poses. The threader mover has an object called <code>qt_mapping</code>, an instance of <code>pyrosetta.rosetta.core.id.SequenceMapping</code>, which is different from the element in the alignment vector obtained by reading the grishin file (<code>pyrosetta.rosetta.utility.vector1_core_sequence_SequenceAlignment</code>), as the former maps only the good pairs and not everything including the rubbish ones. I mention this as it can be handy to know which residues are a perfect vs. imperfect match.</div>
<pre><code>qt : pyrosetta.rosetta.core.id.SequenceMapping = threader.get_qt_mapping(target_pose)
steal = pyrosetta.rosetta.protocols.comparative_modeling.StealSideChainsMover(template_pose, qt)
steal.apply(target_pose)
# boolean vector of the target residues that have a match in the template
mapping : pyrosetta.rosetta.utility.vector1_unsigned_long = qt.mapping()
vector = pyrosetta.rosetta.utility.vector1_bool(target_pose.total_residue())
for r in range(1, target_pose.total_residue() + 1):
    if mapping[r] != 0: # match!
        vector[r] = True
</code></pre>
<h3>Ligands</h3>
<p>The next part is stealing ligands.</p><p>The <code>pose.sequence()</code> method outputs non-polymer residues too (w, X, Z), which PairWise2 aligns with disastrous consequences (missing span). There are many useful functions in the <code>pose</code> submodule that are not methods of <code>Pose</code>; one of these is:</p>
<pre><code>pyrosetta.rosetta.core.pose.remove_nonprotein_residues(pose)</code></pre>
<p>This operates in place, so obviously one would do it to a clone or else there'd not be anything to steal.</p>
<p>There is a function, <code>StealLigandMover</code> with the following arguments:</p><pre><code>pyrosetta.rosetta.protocols.comparative_modeling.StealLigandMover(source: pyrosetta.rosetta.core.pose.Pose,
anchor_atom_dest: pyrosetta.rosetta.core.id.NamedAtomID,
anchor_atom_source: pyrosetta.rosetta.core.id.NamedAtomID,
ligand_indices: pyrosetta.rosetta.utility.vector1_core_id_NamedAtomID)</code></pre>
<p>To be honest I don't quite get the benefit of this as it makes everything more complicated, but briefly the anchor business is for coordinate referencing (not connections or similar), while the ligand indices argument is a vector1 of NamedAtomID, with one atom per residue. A NamedAtomID can be converted from an AtomID via <code>pyrosetta.rosetta.core.conformation.atom_id_to_named_atom_id</code>, but it requires a residue instance (<code>pose.residue(res_index)</code>) anyway. The major issue is that one has to know what the residue indices of interest are anyway. If that is the case, simply using <code>pyrosetta.rosetta.core.pose.append_subpose_to_pose</code> is quicker.</p>
<p>As a result, one way to do this is by blindly taking all non-protein residues:</p>
<pre><code>PROTEIN = pyrosetta.rosetta.core.chemical.ResidueProperty.PROTEIN
prot_sele = pyrosetta.rosetta.core.select.residue_selector.ResiduePropertySelector(PROTEIN)
not_sele = pyrosetta.rosetta.core.select.residue_selector.NotResidueSelector(prot_sele)
rv = pyrosetta.rosetta.core.select.residue_selector.ResidueVector(not_sele.apply(donor_pose))
for res in rv:
    pyrosetta.rosetta.core.pose.append_subpose_to_pose(acceptor_pose, donor_pose, res, res, True)</code></pre>
<p>I should mention that <code>append_subpose_to_pose</code>, unlike <code>pyrosetta.Pose(donor_pose, start_res, end_res)</code>, copies the residue types so it works fine, but it does not like copying more than one ligand at a time. For example, doing:
</p><pre><code>for from_res, to_res in ph.rangify(rv):
    pyrosetta.rosetta.core.pose.append_subpose_to_pose(acceptor_pose, donor_pose, from_res, to_res, True)
</code></pre><p>will give the error <code>Can't create a polymer bond after residue n due to incompatible type: XXX</code>. Assuming one is certain the residues are there, one could use the <code>ResidueNameSelector</code> (if the ligand residue does not exist it will raise an error):</p><pre><code>resn_sele = pyrosetta.rosetta.core.select.residue_selector.ResidueNameSelector()
resn_sele.set_residue_name3(','.join(wanted_ligands))</code></pre>
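<p>As an aside on the range-based loop that raised the polymer bond error: the point of <code>ph.rangify</code> there is to turn a list of residue indices into contiguous (first, last) spans. The sketch below is my guess at that behaviour written from scratch for illustration (the name <code>rangify_sketch</code> is made up, and the real function may differ in details such as its return type).</p>

```python
from typing import Iterable, Iterator, Tuple


def rangify_sketch(indices: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Yield (first, last) for each run of consecutive integers in ``indices``."""
    ordered = sorted(set(indices))
    if not ordered:
        return
    first = previous = ordered[0]
    for index in ordered[1:]:
        if index != previous + 1:  # the run is broken: emit it and start anew
            yield first, previous
            first = index
        previous = index
    yield first, previous
```

<p>Two ligands with adjacent residue indices end up in the same span, which is exactly the case where <code>append_subpose_to_pose</code> tries to make a polymer bond between them and fails, hence the one-residue-at-a-time loop further up.</p>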
<h3 style="text-align: left;"><span style="font-family: Times; white-space: normal;">Houston, we have a problem</span></h3><code><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXJWSR11eV-8APDjINmnTZ6XH5XF2KOk-fQLy9Ct5YW0mpBUn-euofNQQZggcHKPsFbJWa9Q4YGuVTjX2_99yFdtzq-QEb9FkTjwR_5RA8oJ0ANoPFjhKMPYNa6kVvCIlHSw17NnVUdsA/s640/bleww.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="476" data-original-width="640" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXJWSR11eV-8APDjINmnTZ6XH5XF2KOk-fQLy9Ct5YW0mpBUn-euofNQQZggcHKPsFbJWa9Q4YGuVTjX2_99yFdtzq-QEb9FkTjwR_5RA8oJ0ANoPFjhKMPYNa6kVvCIlHSw17NnVUdsA/s320/bleww.png" width="320" /></a></div><br />In some cases, an error may have occurred but a glitched structure is outputted nonetheless.<br />
Generally it is wiser to try to figure out the why than to hack it, but if one were in a pinch one could simply align and steal the coordinates. For example, say we know everything after residue 498 is bad, then we could align on the backbone of this residue and copy over the coordinates:<p></p><pre><code># Viewer discretion advised
mapping = pyrosetta.rosetta.std.map_core_id_AtomID_core_id_AtomID()
r = 498
assert threaded.residue(r).natoms() == af_pose.residue(r).natoms()
for a in threaded.residue(r).mainchain_atoms(): # should always be 1,2,3
    mapping[pyrosetta.AtomID(a, r)] = pyrosetta.AtomID(a, r)
# superimpose the AlphaFold2 pose onto the threaded pose using residue 498 only
pyrosetta.rosetta.core.scoring.superimpose_pose(af_pose, threaded, mapping)
# copy the coordinates of every atom from residue 498 onwards
for r in range(r, af_pose.total_residue()+1):
    assert threaded.residue(r).natoms() == af_pose.residue(r).natoms()
    print(r)
    for a in range(1, threaded.residue(r).natoms() + 1):
        threaded.residue(r).set_xyz(a, af_pose.residue(r).xyz(a))</code></pre><p>If an in-between bit is the problem one would have to close the gap as discussed <a href="http://blog.matteoferla.com/2020/07/filling-missing-loops-proper-way.html" target="_blank">here</a>.</p></code>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-90840527173996503062021-08-23T09:33:00.009-07:002022-09-01T01:43:44.319-07:00Tweaking AlphaFold2 models with PyRosetta<p>In <a href="https://blog.matteoferla.com/2021/07/what-to-look-out-for-with-alphafold2.html">a previous post</a>
I explored the pitfalls of an AlphaFold2 model from EBI.
Here I thought I'd share some PyRosetta methods that may be handy to use with AlphaFold2 models.</p>
<a name='more'></a>
<h2 id="overview">Overview</h2>
<p>This is meant to be a handy collection of functions, wherein one can search for what one wants.</p>
<p>It is not meant to be a primer for PyRosetta, but I hope these notes may be helpful also for those who have never used PyRosetta, but would like a taster as it is a framework well worth learning to use! As a result I will try and be as comprehensive as is reasonable, so I apologise if any parts are too basic for advanced PyRosetta users.</p>
<p>If you want to use Colabs, here is a <a href="https://colab.research.google.com/github/matteoferla/pyrosetta_help/blob/main/colabs-pyrosetta.ipynb">Colabs notebook with PyRosetta</a>
</p><h2 id="download-or-make">Download or make</h2>
<p>There are two options for obtaining an AlphaFold2 model: one is to download it off the EBI and the other is to make it from scratch with relevant mods on a Colabs notebook. The Twittersphere contains lots of useful information about doing the latter. The original Colab notebook now has a few siblings, with different functionalities, but the best are in the <a href="https://github.com/sokrypton/ColabFold">ColabFold GitHub repo</a> by Sergey "Krypton" Ovchinnikov (@sokrypton, a PI at Harvard). Whereas the EBI downloads provide a single model, creating the model oneself allows an output of multiple models, 5 by default, and these often represent different conformations. These can also be oligomers, protein-protein complexes, etc.<br />But unless you are using the ruse of high-throughput to not polish, generally more work is required, such as adding ligands, stretching, mutagenesis etc. etc.</p>
<h2 id="helper-module">Helper module</h2>
<p><em>Shameless advertisement</em>. I keep putting general snippets of PyRosetta code in my <a href="https://pypi.org/project/pyrosetta-help/">pyrosetta_help module</a>, which consists of a collection of functions that do diverse things, and the functions I mention here are there too —so no need to copy-paste.</p>
<div>For example, for starting PyRosetta I like to initialise like so:</div>
<pre><code><span class="hljs-built_in">import</span> pyrosetta
from pyrosetta_help <span class="hljs-built_in">import</span> make_option_string, configure_logger, get_log_entries
<span class="hljs-comment"># capture to log</span>
<span class="hljs-attr">logger</span> = configure_logger()
<span class="hljs-comment"># give CLI attributes in a civilised way</span>
pyrosetta.distributed.maybe_init(<span class="hljs-attr">extra_options=make_option_string(no_optH=False,</span>
<span class="hljs-attr">ex1=None,</span>
<span class="hljs-attr">ex2=None,</span>
<span class="hljs-attr">mute='all',</span>
<span class="hljs-attr">ignore_unrecognized_res=False,</span>
<span class="hljs-attr">load_PDB_components=False,</span>
<span class="hljs-attr">ignore_waters=False)</span>
)
</code></pre><h2 id="model">Model</h2>
<h3 id="download">Download</h3>
<p>The most obvious function is an AF2-EBI downloader:</p>
<pre><code><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> pyrosetta
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">pose_from_alphafold2</span><span class="hljs-params">(uniprot:str)</span> -> pyrosetta.Pose:</span>
    reply = requests.get(f<span class="hljs-string">'https://alphafold.ebi.ac.uk/files/AF-{uniprot}-F1-model_v1.pdb'</span>)
    <span class="hljs-keyword">assert</span> reply.status_code == <span class="hljs-number">200</span>
    pdbblock = reply.text
    pose = pyrosetta.Pose()
    pyrosetta.rosetta.core.import_pose.pose_from_pdbstring(pose, pdbblock)
    <span class="hljs-keyword">return</span> pose
</code></pre><p>As of writing, only one model can be downloaded and there is only version 1. The address accepts a Uniprot accession, which is a letter plus digits, e.g. P01112, and should not be confused with a Uniprot name, RASH_HUMAN. Understandably only key species have their proteome modelled. Here is a dictionary of species available and their NCBI taxon identifiers (taxids).</p>
<pre><code>alphamodels = { # name: taxon id
<span class="hljs-symbol">'Arabidopsis</span> thaliana': <span class="hljs-number">3702</span>,
<span class="hljs-symbol">'Caenorhabditis</span> elegans': <span class="hljs-number">6239</span>,
<span class="hljs-symbol">'Candida</span> albicans': <span class="hljs-number">5476</span>,
<span class="hljs-symbol">'Danio</span> rerio': <span class="hljs-number">7955</span>,
<span class="hljs-symbol">'Dictyostelium</span> discoideum': <span class="hljs-number">44689</span>,
<span class="hljs-symbol">'Drosophila</span> melanogaster': <span class="hljs-number">7227</span>,
<span class="hljs-symbol">'Escherichia</span> coli': <span class="hljs-number">562</span>,
<span class="hljs-symbol">'Glycine</span> max': <span class="hljs-number">3847</span>,
<span class="hljs-symbol">'Homo</span> sapiens': <span class="hljs-number">9606</span>,
<span class="hljs-symbol">'Leishmania</span> infantum': <span class="hljs-number">5671</span>,
<span class="hljs-symbol">'Methanocaldococcus</span> jannaschii': <span class="hljs-number">243232</span>,
<span class="hljs-symbol">'Mus</span> musculus': <span class="hljs-number">10090</span>,
<span class="hljs-symbol">'Mycobacterium</span> tuberculosis': <span class="hljs-number">1773</span>,
<span class="hljs-symbol">'Oryza</span> sativa': <span class="hljs-number">4530</span>,
<span class="hljs-symbol">'Plasmodium</span> falciparum': <span class="hljs-number">5833</span>,
<span class="hljs-symbol">'Rattus</span> norvegicus': <span class="hljs-number">10116</span>,
<span class="hljs-symbol">'Saccharomyces</span> cerevisiae': <span class="hljs-number">4932</span>,
<span class="hljs-symbol">'Schizosaccharomyces</span> pombe': <span class="hljs-number">4896</span>,
<span class="hljs-symbol">'Staphylococcus</span> aureus': <span class="hljs-number">1280</span>,
<span class="hljs-symbol">'Trypanosoma</span> cruzi': <span class="hljs-number">5693</span>}
</code></pre><h3 id="visualise">Visualise</h3>
<p>The next requirement is to visualise it. The PyMOLMover is great, but I generally work on a remote notebook (as discussed <a href="https://blog.matteoferla.com/2020/10/rosettapyrosetta-on-cluster-or-in-cloud.html">here</a> and <a href="https://blog.matteoferla.com/2020/11/remote-notebooks-and-jupyter-themes.html">here</a>) so I prefer to use NGL, which is embedded in the Jupyter notebook as a widget. An alternative plugin is py3Dmol (as used by colabfold notebooks as the former does not work in Colabs), but I am more familiar with NGL so I use that!<br />Here is a class that adds an <code>add_rosetta</code> bound method as discussed in the <a href="http://blog.matteoferla.com/2021/02/multiple-poses-in-nglview.html">Multiple poses in NGLView post</a>, but colouring by b-factor by default, and a method to show a pose in salmon (#F8766D) and the original pose in turquoise (#00B4C4).</p>
<pre><code>import nglview as nv
from io import StringIO
import pyrosetta

class ModNGLWidget(nv.NGLWidget):
    def add_rosetta(self, pose: pyrosetta.Pose, color_by_bfactor: bool = True):
        # dump the pose to an in-memory PDB block and hand it to NGL
        buffer = pyrosetta.rosetta.std.stringbuf()
        pose.dump_pdb(pyrosetta.rosetta.std.ostream(buffer))
        fh = StringIO(buffer.str())
        c = self.add_component(fh, ext='pdb')
        if color_by_bfactor:
            c.update_cartoon(color='bfactor')
        return c

    def make_pose_comparison(self, pose: pyrosetta.Pose, original: pyrosetta.Pose) -> 'ModNGLWidget':
        c0 = self.add_rosetta(original)
        c0.update_cartoon(color='#00B4C4', smoothSheet=True)
        c1 = self.add_rosetta(pose)
        c1.update_cartoon(color='#F8766D', smoothSheet=True)
        return self
</code></pre><p>Note how the bfactor is a colour (<code>component.update_cartoon(color='bfactor')</code>), this is a special case, similar to the colour "atomic" in PyMOL (<code>colorValue</code> argument sets carbon only in NGL).
Parenthesis: in my <a href="https://pypi.org/project/pyrosetta-help/">pyrosetta_help module</a> I have these methods, but as a monkey patch to the <code>NGLWidget</code> class. So importing pyrosetta_help alters nglview —I should state that adding methods to a class in a different module (monkeypatching) is heavily frowned upon.</p>
<h2 id="error">Error</h2>
<p>However, one problem is the spaghetti loops, that is to say not all residues are predicted equally well. Luckily, EBI or colabfold provide two useful pieces of information: the pLDDT and the distance errors in Ångström for each residue pair. The AF2-EBI site explains these very well, so I will not go into detail, but pLDDT is a 0-100 value stored as the isotropic temperature factor in the PDB files that represents the confidence (high is good) of the residue, while the pairwise errors are the errors associated with each pair of residues. The latter is stored in a JSON as a list whose sole element is a dictionary with keys 'residue1', 'residue2' and 'distance'.</p>
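<p>Since the pLDDT sits in the temperature factor column, it can be read without any structural library at all. The following is a minimal sketch under the assumption that the file follows the fixed-column PDB format (residue number in columns 23-26, temperature factor in columns 61-66); the function name <code>plddt_from_pdbblock</code> is made up for this example.</p>

```python
from typing import Dict


def plddt_from_pdbblock(pdbblock: str) -> Dict[int, float]:
    """
    Per-residue pLDDT read from the temperature factor column of the CA atoms.
    Keys are PDB residue numbers, values are the pLDDT (0-100).
    """
    plddts = {}
    for line in pdbblock.splitlines():
        # atom name is in columns 13-16 of an ATOM record
        if line.startswith('ATOM') and line[12:16].strip() == 'CA':
            resi = int(line[22:26])           # columns 23-26
            plddts[resi] = float(line[60:66])  # columns 61-66
    return plddts
```

<p>The <code>pdbblock</code> here is the same text that is fed to <code>pose_from_pdbstring</code> in the downloader above, so one can for example list the residues below a pLDDT of 70 before deciding what to trust.</p>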
<h3 id="download">Download</h3>
<p>The dictionary format is a bit naff, which is why the following function, <code>get_alphafold2_error</code>, returns a normal numpy <em>n</em> × <em>n</em> matrix after calling <code>reshape_errors</code>.</p>
<pre><code>import requests
from typing import Dict, List, Union
import numpy as np

def get_alphafold2_error(uniprot: str, reshaped: bool = True) -> Union[np.ndarray, list]:
    """
    Returns the distances errors either as numpy matrix (``reshaped=True``)
    or as the weird format from AF2-EBI (see ``help(reshape_errors)`` for more).
    Remember that the matrix is zero indexed and that these values are in Ångström
    and are not pLDDT, which are stored as b-factors.
    """
    # https://alphafold.ebi.ac.uk/files/AF-Q00341-F1-predicted_aligned_error_v1.json
    reply = requests.get(f'https://alphafold.ebi.ac.uk/files/AF-{uniprot}-F1-predicted_aligned_error_v1.json')
    assert reply.status_code == 200
    errors = reply.json()
    if reshaped:
        return reshape_errors(errors)
    else:
        return errors

def reshape_errors(errors: List[Dict[str, list]]) -> np.ndarray:
    """
    The JSON from AF2 has a single element list.
    The sole element is a dictionary with keys 'residue1', 'residue2' and 'distance'.
    This method returns a matrix of distances reshaped based on the stated residue indices.
    This is rather unlikely to differ from a regular reshape...
    but, idiotically, I am not taking chances by assuming it is always sorted.
    """
    n_residues = int(np.sqrt(len(errors[0]['distance'])))
    error_matrix = np.full((n_residues, n_residues), np.nan)
    for i, d in enumerate(errors[0]['distance']):
        # residue numbers are 1-indexed, matrix indices are 0-indexed
        error_matrix[errors[0]['residue1'][i] - 1, errors[0]['residue2'][i] - 1] = d
    return error_matrix
</code></pre><p>If using a local saved JSON:</p>
<pre><code>import json

filename = f'{folder}/rank_{rank}_model_{model}_ptm_seed_{seed}_pae.json'
with open(filename, 'r') as fh:
    errors = reshape_errors(json.load(fh))
</code></pre><p>It is always wise to check that there is no misunderstanding involved in the downloaded data, so here is a function that returns a Plotly figure of the PAE in the same colour scale as the EBI site, but using a nice interactive graphing library.</p>
<pre><code>import numpy as np
import plotly.express as px
import plotly.graph_objects as go

def make_pae_plot(errors: np.ndarray) -> go.Figure:
    """
    Make AlphaFold2-EBI like PAE plot
    """
    fig = px.imshow(errors, color_continuous_scale=[(0,'green'), (1, 'white')])
    return fig
</code></pre><h3 id="constrain">Constrain</h3>
<p>For a crystal structure, minimising against the ccp4 map is a wise call (see <a href="https://blog.matteoferla.com/2020/04/how-to-set-up-electron-density.html">past post</a>). Likewise, these errors can serve a similar function, as one can make a constraint* set with them, thus preventing nice parts from "<a href="http://www.gromacs.org/Documentation_of_outdated_versions/Terminology/Blowing_Up">blowing up</a>", to use the Gromacs term for it.</p>
<p>(* In Rosetta a "constraint" is a crystallography refinement "restraint").</p>
<pre><code>def constrain_distances(pose: pyrosetta.Pose,
                        errors: np.ndarray,
                        cutoff: float = 5,
                        weight: float = 1,
                        blank: bool = True):
    """
    Add constraints to the pose based on the errors matrix.
    NB. this matrix is a reshaped version of what AF2 returns.
    A harmonic function is added to CA atoms that are in residues with the error under a specified cutoff.
    The mu is the current distance and the standard deviation of the harmonic is the error times ``weight``.
    ``blank`` as False keeps the current constraints.
    To find out how many were added:
    >>> len(pose.constraint_set().get_all_constraints())
    """
    get_ca = lambda r, i: pyrosetta.AtomID(atomno_in=r.atom_index('CA'), rsd_in=i)
    HarmonicFunc = pyrosetta.rosetta.core.scoring.func.HarmonicFunc
    AtomPairConstraint = pyrosetta.rosetta.core.scoring.constraints.AtomPairConstraint
    cs = pyrosetta.rosetta.core.scoring.constraints.ConstraintSet()
    if not blank:
        # carry over the constraints already present in the pose
        previous = pose.constraint_set().get_all_constraints()
        for con in previous:
            cs.add_constraint(con)
    for r1_idx, r2_idx in np.argwhere(errors < cutoff):
        if r1_idx == r2_idx:
            continue  # a residue paired with itself carries no information
        d_error = errors[r1_idx, r2_idx]
        # the matrix is 0-indexed, the pose is 1-indexed
        residue1 = pose.residue(r1_idx + 1)
        ca1_atom = get_ca(residue1, r1_idx + 1)
        residue2 = pose.residue(r2_idx + 1)
        ca2_atom = get_ca(residue2, r2_idx + 1)
        ca1_xyz = residue1.xyz(ca1_atom.atomno())
        ca2_xyz = residue2.xyz(ca2_atom.atomno())
        d = (ca1_xyz - ca2_xyz).norm()
        apc = AtomPairConstraint(ca1_atom, ca2_atom, HarmonicFunc(x0_in=d, sd_in=d_error * weight))
        cs.add_constraint(apc)
    setup = pyrosetta.rosetta.protocols.constraint_movers.ConstraintSetMover()
    setup.constraint_set(cs)
    setup.apply(pose)
    return cs
</code></pre><p>In the above the pose's constraint set gets replaced. This is not really required and is simply playing it safe: the ConstraintSet returned by <code>pose.constraint_set()</code> is not a copy, so its method <code>add_constraint</code> will work fine, unlike <code>pose.constraint_set().get_all_constraints().append</code>, which adds to a vector1-derived object and does not modify the set.</p>
<p>To see how many constraints a pose has, one can do <code>len(pose.constraint_set().get_all_constraints())</code>.</p>
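<p>To get a feel for what the error-derived standard deviation does to each restraint, here is the harmonic penalty written out in plain Python. This assumes the form given in the Rosetta constraint documentation as I understand it, f(x) = ((x − x0) / sd)², and the function name <code>harmonic_penalty</code> is mine. The value still gets multiplied by the scorefunction weight discussed next.</p>

```python
def harmonic_penalty(x: float, x0: float, sd: float) -> float:
    """
    Unweighted penalty of a harmonic restraint centred on ``x0``
    with standard deviation ``sd`` for a pair of atoms at distance ``x``.
    Assumed form: ((x - x0) / sd) ** 2
    """
    return ((x - x0) / sd) ** 2


# a pair that drifts 2 Å from its starting distance of 10 Å...
tight = harmonic_penalty(12.0, 10.0, sd=1.0)  # ...with a predicted error of 1 Å
loose = harmonic_penalty(12.0, 10.0, sd=4.0)  # ...with a predicted error of 4 Å
print(tight, loose)  # the confident pair is penalised 16 times more
```

<p>So pairs that AlphaFold2 is confident about are held firmly, while the dodgier ones are left free to wander, which is the behaviour one wants.</p>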
<p>Also, adding constraints to a pose is only half the job, as one needs to set the weight on the scorefunction for that constraint.</p>
<pre><code><span class="hljs-keyword">scorefxn </span>= pyrosetta.get_fa_scorefxn() <span class="hljs-comment"># default is ref2015</span>
ap_st = pyrosetta.rosetta.core.<span class="hljs-keyword">scoring.ScoreType.atom_pair_constraint
</span><span class="hljs-keyword">scorefxn.set_weight(ap_st, </span><span class="hljs-number">1</span>)
</code></pre><p>To check how much the constraints contribute to the score, one can use the functions <code>print_constraint_scores</code> or <code>constraints2pandas</code> in my <code>pyrosetta_help</code> module (see <a href="https://github.com/matteoferla/pyrosetta_help/blob/main/pyrosetta_help/common_ops/constraints.py">code</a>). Also, to visualise the constraints in a pose, check out the <code>add_constraints</code> method in the monkeypatched NGLView (may crash the browser for thousands of constraints):</p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiK0xpyiIWiQ7sY_in951Qht8f8YbOCSgS7vF5YFfrz_FdRj6t9C30YKDWmIYdOjMuFaFum908pYhgu53e2yn6zqc0XE1TSAkBQgDvC_kwwN5fIqabqYEvuXrkyEPv5iWXJqXxF-kVd7b0/s2899/spider.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1085" data-original-width="2899" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiK0xpyiIWiQ7sY_in951Qht8f8YbOCSgS7vF5YFfrz_FdRj6t9C30YKDWmIYdOjMuFaFum908pYhgu53e2yn6zqc0XE1TSAkBQgDvC_kwwN5fIqabqYEvuXrkyEPv5iWXJqXxF-kVd7b0/s320/spider.png" width="320" /></a></div><br /><p><br /></p><p>As mentioned this is to prevent odd things such as residues moving apart due to a clash etc. in a first instance. After which, it is best to remove them, or the operations will lag.</p>
<h3 id="long-distance">Long distance</h3>
<p>Traditionally, a multidomain protein is cut up and solved in chunks, leaving a problem of whether the domains interact. With AlphaFold2, this is not the case as long distance interactions are revealed.
To find the longest distance of interacting residues, one can cheat and simply use the error matrix. To do so, two rounds of <code>np.argwhere</code> (one for an error-distance cutoff, the second for a residue separation) will do the trick:</p>
<pre><code>import numpy as <span class="hljs-built_in">np</span>
distance_cutoff = <span class="hljs-number">5</span> # <span class="hljs-number">5</span>Å
residue_cutoff = <span class="hljs-number">170</span>
closer = <span class="hljs-built_in">np</span>.argwhere(errors < distance_cutoff) # x,y <span class="hljs-built_in">indices</span> with an error less than ``distance_cutoff``
resi_distances = <span class="hljs-built_in">np</span>.<span class="hljs-built_in">abs</span>(closer[:,<span class="hljs-number">0</span>] - closer[:,<span class="hljs-number">1</span>])
# the following +<span class="hljs-number">1</span> <span class="hljs-built_in">is</span> because <span class="hljs-built_in">indices</span> are C <span class="hljs-built_in">style</span>, <span class="hljs-keyword">while</span> <span class="hljs-built_in">residue</span> index <span class="hljs-built_in">is</span> human/Fortran <span class="hljs-built_in">style</span>
distant_pairs = <span class="hljs-built_in">np</span>.squeeze(closer[<span class="hljs-built_in">np</span>.argwhere(resi_distances > residue_cutoff)] + <span class="hljs-number">1</span>)
</code></pre><p>Alternatively, to see the most sequence-distant, spatially close (according to error) residue pairs:</p>
<pre><code><span class="hljs-built_in">print</span>(f'{<span class="hljs-built_in">np</span>.<span class="hljs-built_in">max</span>(resi_distances)} residues apart:')
closer[<span class="hljs-built_in">np</span>.reshape(<span class="hljs-built_in">np</span>.argwhere(resi_distances == <span class="hljs-built_in">np</span>.<span class="hljs-built_in">max</span>(resi_distances)), -<span class="hljs-number">1</span>)] + <span class="hljs-number">1</span>
</code></pre><p>(If fairly unfamiliar with numpy, don't worry about the <code>np.squeeze</code> or <code>np.reshape(array, -1)</code> calls: they simply remove singleton dimensions and flatten to a vector respectively, because <code>np.argwhere</code>, the command to get the indices where a condition is met, likes to return extra dimensions).</p>
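<p>If the numpy indexing feels opaque, the same two-step filter can be spelt out with plain loops, which is handy for checking the off-by-one business on a toy matrix. This is only an illustrative equivalent (the name <code>long_range_pairs</code> is made up) and it will be far slower than numpy on a real protein.</p>

```python
from typing import List, Sequence, Tuple


def long_range_pairs(errors: Sequence[Sequence[float]],
                     distance_cutoff: float = 5.0,
                     residue_cutoff: int = 170) -> List[Tuple[int, int]]:
    """
    Residue pairs (1-indexed) whose error is below ``distance_cutoff``
    and that are more than ``residue_cutoff`` residues apart in sequence.
    """
    pairs = []
    for i, row in enumerate(errors):
        for j, error in enumerate(row):
            if error < distance_cutoff and abs(i - j) > residue_cutoff:
                pairs.append((i + 1, j + 1))  # C-style index to residue number
    return pairs


# toy 3-residue example: only residues 1 and 3 have a low mutual error
toy = [[0, 9, 2],
       [9, 0, 9],
       [2, 9, 0]]
print(long_range_pairs(toy, distance_cutoff=5, residue_cutoff=1))
```

<p>Note that each pair appears twice, once in each direction, just as with the <code>np.argwhere</code> version.</p>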
<p>In the above I should emphasise I am using the errors as a proxy for distance. In reality the cartesian distance in the model may differ. To get a matrix of that:</p>
<pre><code>import numpy <span class="hljs-keyword">as</span> np
distances = np.zeros((pose.total_residue(),pose.total_residue()))
ca_xyzs = [pose.residue(r).xyz('<span class="hljs-keyword">CA</span>') <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> <span class="hljs-keyword">range</span>(1, pose.total_residue()+1)]
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-keyword">range</span>(len(ca_xyzs)):
    <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> <span class="hljs-keyword">range</span>(i):
        <span class="hljs-keyword">d</span> = ca_xyzs[i].distance(ca_xyzs[j])
        distances[i, j] = <span class="hljs-built_in">d</span>
        distances[j, i] = <span class="hljs-built_in">d</span>
</code></pre><p>Applying the same numpy operations to this matrix gives the residue pairs that are close in the model. The difference between the error and distance matrices shows how misleading the modelled position is.</p>
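<p>For large models the double loop can be slow. A possible alternative, sketched here with plain numpy broadcasting, works on an (N, 3) array of coordinates. The function name <code>pairwise_distances</code> and the three example coordinates are hypothetical, chosen only to make the snippet self-contained.</p>

```python
import numpy as np

def pairwise_distances(coords: np.ndarray) -> np.ndarray:
    """coords: (N, 3) array of CA coordinates in Å; returns an (N, N) distance matrix."""
    # broadcasting gives an (N, N, 3) array of coordinate differences
    deltas = coords[:, np.newaxis, :] - coords[np.newaxis, :, :]
    return np.linalg.norm(deltas, axis=-1)

# hypothetical coordinates: three atoms on a line, 3.8 Å apart
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0]])
toy_matrix = pairwise_distances(coords)
print(toy_matrix)
```

<p>With a pose, the coordinate array could be built as <code>np.array([[v.x, v.y, v.z] for v in ca_xyzs])</code>, assuming each vector exposes <code>x</code>, <code>y</code> and <code>z</code> attributes.</p>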
<p>I like to stretch models to avoid confusion, which is discussed in the "Stretch" section below.</p>
<h3 id="minimisation">Minimisation</h3>
<p>In order to cover the basics, I ought to mention that to get a ∆G for a model (say, to compare with the <em>same</em> sequence folded differently) one needs to minimise, <em>i.e.</em> make minor changes to the pose to make it more energetically favourable according to the forcefield.</p>
<pre><code><span class="hljs-attr">cycles</span> = <span class="hljs-number">3</span> <span class="hljs-comment"># equivalent to quick mode in the binary</span>
<span class="hljs-comment"># cycles = 15 # equivalent to thorough mode in the binary</span>
<span class="hljs-attr">relax</span> = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, cycles)
relax.apply(pose)
</code></pre><p>Now, a thing to note is that <code>scorefxn</code> is the one from above and has <code>atom_pair_constraint</code> set to a non-zero value and the pose has many constraints. </p>
<pre><code>ap_st = pyrosetta<span class="hljs-selector-class">.rosetta</span><span class="hljs-selector-class">.core</span><span class="hljs-selector-class">.scoring</span><span class="hljs-selector-class">.ScoreType</span><span class="hljs-selector-class">.atom_pair_constraint</span>
<span class="hljs-function"><span class="hljs-title">print</span><span class="hljs-params">(<span class="hljs-string">'Weight:'</span>, scorefxn.get_weight(ap_st)</span></span>)
<span class="hljs-function"><span class="hljs-title">print</span><span class="hljs-params">(<span class="hljs-string">'N_constraints:'</span>, len(pose.constraint_set()</span></span>.get_all_constraints()))
</code></pre><p>This is unsuitable for comparing one model with another even if they contain the same residues. For that a scorefunction without the <code>atom_pair_constraint</code> weight is required.</p>
<pre><code>vanilla_scorefxn = pyrosetta.get_fa_scorefxn()
dG = vanilla_scorefxn(pose)
<span class="hljs-function"><span class="hljs-title">print</span><span class="hljs-params">(f<span class="hljs-string">'{dG} kcal/mol'</span>)</span></span>
</code></pre><p>Obviously, to dump a scored PDB, the unweighted one would make more sense.</p>
<pre><code><span class="hljs-selector-tag">pose</span><span class="hljs-selector-class">.dump_scored_pdb</span>(<span class="hljs-string">'foo.pdb'</span>, vanilla_scorefxn)
</code></pre><p>If you want to see where the models differ, you may want to check out my other post about <a href="http://blog.matteoferla.com/2021/07/per-residue-rmsd.html">per residue RMS using PyRosetta</a>.</p>
<p>If you want to analyse the energetic contribution of the residues of a minimised model, you may want to check out the pose per residue energy scores to pandas function I wrote:</p>
<pre><code>from pyrosetta_help import pose2pandas
scores = pose2pandas(pose)
<span class="hljs-section"># example: get the worst</span>
scores.loc[<span class="hljs-string">scores.total_score > 5</span>][<span class="hljs-symbol">['residue', 'total_score'</span>]]
</code></pre><h2 id="plddt">pLDDT</h2>
<p>Another useful value is the pLDDT, stored as b-factors in the PDB file. These are not b-factors, as the many warnings for molecular replacement remind us. They are a percentage value of how good the residue location is, where 100 is perfect and 0 is awful, so the inverse of b-factors.</p>
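<p>As a sketch of where these values live, the following reads the B-factor column of the CA atoms straight from PDB text, without PyRosetta. The function name <code>read_ca_plddt</code>, the two made-up residues and the cutoff of 70 are assumptions for illustration; the column positions follow the fixed-width ATOM record layout.</p>

```python
def read_ca_plddt(pdb_block: str) -> dict:
    """Map residue number to the pLDDT held in the B-factor column of CA atoms."""
    plddt = {}
    for line in pdb_block.splitlines():
        # columns 13-16 hold the atom name, 23-26 the residue number, 61-66 the B-factor
        if line.startswith('ATOM') and line[12:16].strip() == 'CA':
            plddt[int(line[22:26])] = float(line[60:66])
    return plddt

# two made-up residues written with the fixed-width ATOM layout
template = ('ATOM  {serial:>5d} {name:<4s} {resn:>3s} {chain}{resi:>4d}    '
            '{x:>8.3f}{y:>8.3f}{z:>8.3f}{occ:>6.2f}{b:>6.2f}')
atoms = [(1, ' N  ', 'MET', 1, 91.5),
         (2, ' CA ', 'MET', 1, 91.5),
         (3, ' CA ', 'LYS', 2, 42.0)]
pdb_block = '\n'.join(template.format(serial=s, name=n, resn=r, chain='A', resi=i,
                                      x=0.0, y=0.0, z=0.0, occ=1.0, b=b)
                      for s, n, r, i, b in atoms)

plddt = read_ca_plddt(pdb_block)
low_confidence = [resi for resi, value in plddt.items() if value < 70]
print(plddt, low_confidence)
```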
<h3 id="superimposition">Superimposition</h3>
<p>To superimpose two structures ("align" to use a nasty PyMOL term), one would preferably want to superimpose the non-spaghetti parts, so this function, <code>superimpose_by_pLDDT</code>, (modified from <a href="https://gist.github.com/asford/c2404c8b045700f016fda8893325c807">this github gist</a>) can help:</p>
<pre><code><span class="hljs-keyword">import</span> difflib
<span class="hljs-keyword">import</span> pyrosetta
<span class="hljs-keyword">from</span> pyrosetta.rosetta.std <span class="hljs-keyword">import</span> map_core_id_AtomID_core_id_AtomID
<span class="hljs-keyword">from</span> pyrosetta.rosetta.core.id <span class="hljs-keyword">import</span> AtomID

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">common_residue_indices</span><span class="hljs-params">(a: pyrosetta.Pose, b: pyrosetta.Pose)</span>:</span>
    <span class="hljs-string">"""Get paired (0-based) indices of the longest common stretch of residues in two structures."""</span>
    aseq = a.sequence()
    bseq = b.sequence()
    astart, bstart, align_len = difflib.SequenceMatcher(a=aseq, b=bseq).find_longest_match(<span class="hljs-number">0</span>, len(aseq), <span class="hljs-number">0</span>, len(bseq))
    <span class="hljs-keyword">return</span> [(astart + i, bstart + i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(align_len)]

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">superimpose_by_pLDDT</span><span class="hljs-params">(pose: pyrosetta.Pose,
                         original: pyrosetta.Pose,
                         cutoff=<span class="hljs-number">70</span>,
                         pose_range=None)</span> -> map_core_id_AtomID_core_id_AtomID:</span>
    <span class="hljs-string">"""
    Superimpose two poses, based on residues with pLDDT above a given threshold.
    :param pose:
    :param original:
    :param cutoff: %
    :param pose_range: optional argument to subset (start:int, end:int)
    :return:
    """</span>
    ca_map = map_core_id_AtomID_core_id_AtomID()
    <span class="hljs-comment"># w and a are 0-based sequence indices, hence the + 1 for Rosetta numbering</span>
    <span class="hljs-keyword">for</span> w, a <span class="hljs-keyword">in</span> common_residue_indices(pose, original):
        <span class="hljs-keyword">if</span> original.pdb_info().bfactor(a + <span class="hljs-number">1</span>, <span class="hljs-number">1</span>) <= cutoff:
            <span class="hljs-keyword">continue</span>
        <span class="hljs-keyword">if</span> pose_range <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span> <span class="hljs-keyword">and</span> (w < pose_range[<span class="hljs-number">0</span>] <span class="hljs-keyword">or</span> w > pose_range[<span class="hljs-number">1</span>]):
            <span class="hljs-keyword">continue</span>
        ca_map[AtomID(pose.residue(w + <span class="hljs-number">1</span>).atom_index(<span class="hljs-string">"CA"</span>), w + <span class="hljs-number">1</span>)] = AtomID(
            original.residue(a + <span class="hljs-number">1</span>).atom_index(<span class="hljs-string">"CA"</span>), a + <span class="hljs-number">1</span>
        )
    <span class="hljs-keyword">assert</span> len(ca_map), <span class="hljs-string">'No atoms greater than cutoff'</span>
    pyrosetta.rosetta.core.scoring.superimpose_pose(pose, original, ca_map)
    <span class="hljs-keyword">return</span> ca_map
</code></pre><p>However, this will not work well when there are independent domains. In that case, one has to pick which domain to superimpose on (<code>pose_range</code> argument).</p>
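<p>The residue pairing relies on <code>difflib.SequenceMatcher.find_longest_match</code>, which returns only the single longest ungapped block shared by the two sequences. A small sketch with two made-up sequences (the second carries a hypothetical N-terminal tag and a C-terminal truncation) shows the 0-based pairs it produces. A point mutation or insertion in the middle would cut the block short, so this approach suits poses of the same sequence with different termini.</p>

```python
import difflib

aseq = 'MKTAYIAKQRQISFVKSHFSRQ'
bseq = 'GSHMKTAYIAKQRQISFVK'  # made-up: tagged and truncated version of aseq

matcher = difflib.SequenceMatcher(a=aseq, b=bseq, autojunk=False)
astart, bstart, size = matcher.find_longest_match(0, len(aseq), 0, len(bseq))
pairs = [(astart + i, bstart + i) for i in range(size)]
print(astart, bstart, size)
print(pairs[0], pairs[-1])
```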
<h3 id="selector">Selector</h3>
<p>Unfortunately, there is no <code>ResidueSelector</code> that selects by b-factors, so one has to iterate over each residue in the PDBInfo:</p>
<pre><code><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_bfactor_vector</span><span class="hljs-params">(pose: pyrosetta.Pose, cutoff: float, above=True)</span> -> pyrosetta.rosetta.utility.vector1_bool:</span>
    <span class="hljs-string">"""
    Return a selection vector based on b-factors.
    ``above`` = select everything at or above the cutoff.
    So to select bad (high) b-factors, ``above`` is ``True``,
    but to select bad (low pLDDT) AF2 residues, ``above`` is ``False``.
    """</span>
    pdb_info = pose.pdb_info()
    vector = pyrosetta.rosetta.utility.vector1_bool(pose.total_residue())
    <span class="hljs-keyword">for</span> r <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, pose.total_residue() + <span class="hljs-number">1</span>):
        <span class="hljs-keyword">try</span>:
            atom_index = pose.residue(r).atom_index(<span class="hljs-string">'CA'</span>)
        <span class="hljs-keyword">except</span> AttributeError:
            atom_index = <span class="hljs-number">1</span>
        bfactor = pdb_info.bfactor(r, atom_index)
        <span class="hljs-keyword">if</span> above <span class="hljs-keyword">and</span> bfactor >= cutoff:
            vector[r] = <span class="hljs-keyword">True</span>
        <span class="hljs-keyword">elif</span> <span class="hljs-keyword">not</span> above <span class="hljs-keyword">and</span> bfactor <= cutoff:
            vector[r] = <span class="hljs-keyword">True</span>
        <span class="hljs-keyword">else</span>:
            <span class="hljs-keyword">pass</span> <span class="hljs-comment"># vector[r] = False</span>
    <span class="hljs-keyword">return</span> vector
</code></pre><h2 id="stretch">Obsolete pdb_info</h2><div>One thing to watch out for is the <code>pose.pdb_info().obsolete()</code> value. This is set to true when certain operations are done and, as a result, the PDBInfo gets ignored when dumping to file. Setting it back to false, or adding the PDBInfo of an unmodified pose, will circumvent this.</div><h2 id="stretch">Stretch</h2>
<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Solar_System_true_color.jpg/640px-Solar_System_true_color.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="syzygy" border="0" data-original-height="360" data-original-width="640" height="113" src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Solar_System_true_color.jpg/640px-Solar_System_true_color.jpg" width="200" /></a></div><br />Independent domains should ideally be spaced out to avoid ambiguity for humans, and they make for better pictures, like planets in syzygy as mentioned in the previous post. Adding a constraint that is not too strenuous on the scorefunction will do this:<p></p>
<pre><code>from typing import Optional

def add_stretch_constraint(pose: pyrosetta.Pose,
                           weigh<span class="hljs-variable">t:</span> float = <span class="hljs-number">5</span>,
                           slope_in: float = -<span class="hljs-number">0.05</span>,
                           residue_index_A: <span class="hljs-keyword">int</span> = <span class="hljs-number">1</span>,
                           residue_index_B: <span class="hljs-keyword">int</span> = -<span class="hljs-number">1</span>,
                           distance: Optional[float] = None
                           ) -> pyrosetta.rosetta.core.scoring.constraints.AtomPairConstrain<span class="hljs-variable">t:</span>
    <span class="hljs-string">""</span><span class="hljs-comment">"</span>
    Add <span class="hljs-keyword">a</span> constraint <span class="hljs-keyword">to</span> <span class="hljs-string">"stretch out"</span> the model, because ``slope_in`` <span class="hljs-keyword">is</span> negative.
    :param pose: Pose <span class="hljs-keyword">to</span> <span class="hljs-built_in">add</span> constraint <span class="hljs-keyword">to</span>
    :param weigh<span class="hljs-variable">t:</span> strength of the constraint (an unweighted ``SigmoidFunc`` plateaus at <span class="hljs-number">0.5</span>)
    :param slope_in: negative <span class="hljs-keyword">number</span> <span class="hljs-keyword">to</span> stretch
    :param residue_index_A: <span class="hljs-keyword">first</span> residue by default
    :param residue_index_B: <span class="hljs-keyword">last</span> residue <span class="hljs-keyword">is</span> <span class="hljs-string">"-1"</span>
    :param distance: <span class="hljs-keyword">if</span> omitted, the midpoint of Sigmoid will <span class="hljs-keyword">be</span> the current distance
    :<span class="hljs-keyword">return</span>:
    <span class="hljs-string">""</span><span class="hljs-comment">"</span>
    # <span class="hljs-built_in">get</span> current length
    first_ca = pyrosetta.AtomID(atomno_in=pose.residue(residue_index_A).atom_index(<span class="hljs-string">'CA'</span>),
                                rsd_in=residue_index_A)
    <span class="hljs-keyword">if</span> residue_index_B == -<span class="hljs-number">1</span>:
        residue_index_B = pose.total_residue()
    last_ca = pyrosetta.AtomID(atomno_in=pose.residue(residue_index_B).atom_index(<span class="hljs-string">'CA'</span>),
                               rsd_in=residue_index_B)
    # use the chosen residues, not always the first and last of the pose
    first_ca_xyz = pose.residue(residue_index_A).xyz(first_ca.atomno())
    last_ca_xyz = pose.residue(residue_index_B).xyz(last_ca.atomno())
    <span class="hljs-keyword">if</span> distance <span class="hljs-keyword">is</span> None:
        distance = (first_ca_xyz - last_ca_xyz).<span class="hljs-keyword">norm</span>()
    # <span class="hljs-keyword">make</span> & <span class="hljs-built_in">add</span> <span class="hljs-keyword">con</span>
    <span class="hljs-keyword">sf</span> = pyrosetta.rosetta.core.scoring.func
    AtomPairConstraint = pyrosetta.rosetta.core.scoring.constraints.AtomPairConstraint
    fun = <span class="hljs-keyword">sf</span>.ScalarWeightedFunc(weight, <span class="hljs-keyword">sf</span>.SigmoidFunc(x0_in=distance, slope_in=slope_in))
    <span class="hljs-keyword">con</span> = AtomPairConstraint(first_ca, last_ca, fun)
    pose.constraint_set().add_constraint(<span class="hljs-keyword">con</span>)
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">con</span>
</code></pre><p>Note that the <code>slope_in</code> is negative. On many of the functions, including the harmonic function, setting a negative sigma/slope will result in repulsion and not attraction... Obviously, a negative sigma on a harmonic function (<code>HarmonicFunc</code>) is just an explosion, whereas a flat harmonic (<code>FlatHarmonicFunc</code>) is a better choice. A sigmoid (<code>SigmoidFunc</code>) with a negative slope as used above has the nice thing that below the distance x0 the score is positive (bad), while above it it is negative (good) and it plateaus at ±0.5, hence the additional <code>ScalarWeightedFunc</code> to beef it up.</p>
<p>NB. Setting constraints too high makes them dominate the score. Ideally, they should contribute less than 5% of the final score.</p>
<p>Let's see what the function looks like:</p>
<pre><code><span class="hljs-built_in">import</span> plotly.express as px
<span class="hljs-built_in">import</span> numpy as np
<span class="hljs-comment"># example data</span>
<span class="hljs-attr">slope_in</span> = -<span class="hljs-number">0.05</span>
<span class="hljs-attr">distance</span> = <span class="hljs-number">30</span>
<span class="hljs-attr">weight</span> = <span class="hljs-number">5</span>
<span class="hljs-attr">x</span> = np.linspace(<span class="hljs-number">0</span>, <span class="hljs-number">200</span>, <span class="hljs-number">100</span>)
<span class="hljs-comment"># y = f(x) via PyRosetta</span>
<span class="hljs-attr">sf</span> = pyrosetta.rosetta.core.scoring.func
<span class="hljs-attr">fun</span> = sf.ScalarWeightedFunc(weight, sf.SigmoidFunc(<span class="hljs-attr">x0_in=distance,</span> <span class="hljs-attr">slope_in=slope_in))</span>
<span class="hljs-comment"># fun.func is a function that given a float returns another one. No extra values required.</span>
<span class="hljs-attr">y</span> = np.vectorize(fun.func)(x)
<span class="hljs-comment"># equivalent to:</span>
<span class="hljs-comment"># y = (np.reciprocal(np.exp(( x-distance ) * (-slope_in)) + 1) - 0.5) * weight</span>
<span class="hljs-comment"># plot</span>
<span class="hljs-attr">fig</span> = px.line(<span class="hljs-attr">x=x,</span> <span class="hljs-attr">y=y,</span> <span class="hljs-attr">title=f'Sigmoid</span> function <span class="hljs-attr">x_0={distance},</span> <span class="hljs-attr">m={slope_in}')</span>
fig.add_vline(<span class="hljs-attr">x=distance,</span> <span class="hljs-attr">line_color="gainsboro")</span>
fig.update_layout(<span class="hljs-attr">xaxis=dict(title='Distance</span> [Å]'), <span class="hljs-attr">yaxis=dict(title='Score'))</span>
fig
</code></pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-mo56nLjXeV_N7ZO-IUApsdAOKfXHT0gqXmtLv80yhPWuuZF6BQ8XNKC0N-2mPXUL_JMrrkARzLFxSPmh_amgyw4O8Kppntbt2zjlTX9SPhGmaRqZh4fXS37yo7UmPPpwERp36Roc2PY/s871/sigmoid.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="525" data-original-width="871" height="193" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-mo56nLjXeV_N7ZO-IUApsdAOKfXHT0gqXmtLv80yhPWuuZF6BQ8XNKC0N-2mPXUL_JMrrkARzLFxSPmh_amgyw4O8Kppntbt2zjlTX9SPhGmaRqZh4fXS37yo7UmPPpwERp36Roc2PY/s320/sigmoid.png" width="320" /></a></div><br /><p><br /></p>
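<p>The sign change at x0 and the plateau can also be checked numerically without PyRosetta. This is only a sketch: it assumes the closed form given in the comment of the plotting code, and the function name <code>weighted_sigmoid</code> is made up for the example.</p>

```python
import numpy as np

def weighted_sigmoid(x, x0=30.0, slope=-0.05, weight=5.0):
    """weight * (1 / (1 + exp(-slope * (x - x0))) - 0.5), the assumed closed form."""
    x = np.asarray(x, dtype=float)
    return weight * (1.0 / (1.0 + np.exp(-slope * (x - x0))) - 0.5)

# below, at and far above the midpoint x0
sigmoid_scores = weighted_sigmoid([0.0, 30.0, 200.0])
print(sigmoid_scores)
```

<p>The first value is positive (penalised), the second is zero and the third is negative, approaching but never exceeding half the weight in magnitude.</p>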
<h3 id="example">Example</h3>
<p>Let's take spectrin-alpha as an example. This protein forms scaffolding, whereas in EBI-AF2 the model is curled up on itself.</p>
<pre><code>pose = ph.pose_from_alphafold2(<span class="hljs-string">'Q13813'</span>)
import nglview <span class="hljs-keyword">as</span> nv
<span class="hljs-keyword">view</span> = nv.show_rosetta(pose)
<span class="hljs-keyword">view</span>.update_cartoon(color=<span class="hljs-string">'bfactor'</span>)
<span class="hljs-keyword">view</span>
</code></pre><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1n3ktckbw5-QqMXZvyN9mhigRvyAhidlsdEtT3l4UFWs5QALy-dfJwnlzyjaK5uesxwQHawWBTD6cC6sApnWtVbUEbWLFkvJkEVHtidg64p29mn8FsfpYVGGXrESZctga1Ko_RgKZP3c/s2899/coiled_spectrin.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1085" data-original-width="2899" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1n3ktckbw5-QqMXZvyN9mhigRvyAhidlsdEtT3l4UFWs5QALy-dfJwnlzyjaK5uesxwQHawWBTD6cC6sApnWtVbUEbWLFkvJkEVHtidg64p29mn8FsfpYVGGXrESZctga1Ko_RgKZP3c/s320/coiled_spectrin.png" width="320" /></a></div>Then we could add the errors as constraints:<p></p>
<pre><code><span class="hljs-keyword">error </span>= ph.get_alphafold2_error('Q13813')
ph.add_pae_constraints(pose, error)
</code></pre><p>Or check to see if there are any long distance interactions (viz. "Long distance" section above).</p>
<p>Having seen that, we definitely want to stretch it isoenergetically:</p>
<pre style="text-align: left;"><code>ph.add_stretch_constraint(<span class="hljs-built_in">pose</span>, <span class="hljs-number">10</span>)
# scorefxn is weighted <span class="hljs-keyword">for</span> atom pairs
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, <span class="hljs-number">3</span>)
relax.<span class="hljs-built_in">apply</span>(<span class="hljs-built_in">pose</span>)
</code></pre><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcguojWifrd0sbgG5sbb4NOx8RRO055LizuE1vdKxT5yoRVOS-mzXKSh-Ga9AWkE_qtqf0BJB63DItQhORDGGaaYu7JC0nr4vhVD_ZtNv4KFV7S6oA75DdINvnuShMjhaCiajmtvHq92E/s2899/stretched_spectrin.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1085" data-original-width="2899" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcguojWifrd0sbgG5sbb4NOx8RRO055LizuE1vdKxT5yoRVOS-mzXKSh-Ga9AWkE_qtqf0BJB63DItQhORDGGaaYu7JC0nr4vhVD_ZtNv4KFV7S6oA75DdINvnuShMjhaCiajmtvHq92E/s320/stretched_spectrin.png" width="320" /></a></div>Often, however, it may be necessary to remove the stretching constraint (or leave it) and add a new one, because beyond a certain point the slightest unfavourable Ramachandran angle may outweigh the stretching constraint.<h3 id="parenthesis-about-stretching-with-fastrelax">Parenthesis about stretching with FastRelax</h3>
<p>Now, FastRelax is not really meant for this and may not be as fast as hoped. One could use FastRelax, but with only backbone movement allowed in the movemap. One could use a more basic mover in Monte Carlo trials. And one may be tempted to do something in centroid mode. The last of these is not a great idea.</p>
<p>But first, taking a step back, I'd like to mention that whereas we think of atoms moving in 3D Cartesian space along X, Y and Z coordinates, the majority of operations in FastRelax consist of rotating bonds, which may result in a large change in Cartesian space, and this is exploited here. So just tweaking the settings of FastRelax makes sense.</p>
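<p>To put a number on that lever-arm effect, here is a toy calculation in plain numpy (not PyRosetta): two points are rotated by the same 10° about a bond placed along the z axis, one 4 Å and one 100 Å away from it. The function name and the geometry are made up for illustration.</p>

```python
import numpy as np

def rotate_about_axis(point, origin, axis, degrees):
    """Rodrigues' rotation of `point` about the line through `origin` along `axis`."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    origin = np.asarray(origin, dtype=float)
    v = np.asarray(point, dtype=float) - origin
    theta = np.radians(degrees)
    rotated = (v * np.cos(theta)
               + np.cross(k, v) * np.sin(theta)
               + k * np.dot(k, v) * (1 - np.cos(theta)))
    return rotated + origin

bond_origin, bond_axis = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]
near, far = np.array([4.0, 0.0, 0.0]), np.array([100.0, 0.0, 0.0])

near_shift = np.linalg.norm(rotate_about_axis(near, bond_origin, bond_axis, 10) - near)
far_shift = np.linalg.norm(rotate_about_axis(far, bond_origin, bond_axis, 10) - far)
print(f'{near_shift:.2f} Å vs {far_shift:.2f} Å')
```

<p>The displacement scales linearly with the distance from the rotated bond, so a small torsion change in a linker moves a distant domain a long way.</p>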
<p>We could minimise only the bad residues, using the bfactor selector mentioned above:</p>
<pre><code><span class="hljs-comment"># select only the bad bits.</span>
<span class="hljs-keyword">bv </span>= ph.get_bfactor_vector(pose, <span class="hljs-number">70</span>, False)
<span class="hljs-keyword">movemap </span>= pyrosetta.rosetta.core.kinematics.<span class="hljs-keyword">MoveMap()
</span><span class="hljs-keyword">movemap.set_chi(bv)
</span><span class="hljs-keyword">movemap.set_bb(bv)
</span><span class="hljs-comment"># scorefxn is weighted for atom pairs</span>
relax = pyrosetta.rosetta.protocols.relax.FastRelax(<span class="hljs-keyword">scorefxn, </span><span class="hljs-number">3</span>)
relax.set_movemap(<span class="hljs-keyword">movemap)
</span>relax.apply(pose)
</code></pre><p>One could simply not touch the sidechains:</p>
<pre><code>...
movemap.set_chi(<span class="hljs-literal">False</span>)
...
</code></pre><p>Regarding the other options mentioned, I'll just touch on them for completeness. The simple mover that does rotations is <code>ShearMover</code>, which applied as a MonteCarlo mover would look something like:</p>
<pre><code>kT = <span class="hljs-number">1.0</span> <span class="hljs-comment"># Monte Carlo temperature: an assumed value, tune as needed</span>
<span class="hljs-keyword">shear_mover </span>= pyrosetta.rosetta.protocols.simple_moves.<span class="hljs-keyword">ShearMover(movemap_in=movemap,
</span>                                                                  temperature_in=<span class="hljs-number">0</span>.<span class="hljs-number">5</span>,
                                                                  nmoves_in=<span class="hljs-number">6</span>)
<span class="hljs-keyword">shear_mover.angle_max(30) </span><span class="hljs-comment"># degrees</span>
<span class="hljs-comment"># set up the monteCarlo sampler</span>
<span class="hljs-keyword">scorefxn(pose)
</span>monégasque = pyrosetta.MonteCarlo(pose, <span class="hljs-keyword">scorefxn, </span>kT)
<span class="hljs-comment"># this is pyrosetta.rosetta.protocols.moves.MonteCarlo not </span>
<span class="hljs-comment"># pyrosetta.rosetta.protocols.monte_carlo.GenericMonteCarloMover,</span>
<span class="hljs-comment"># which has the loop/trial thing within and is fancier and faster</span>
trial = pyrosetta.rosetta.protocols.moves.TrialMover(<span class="hljs-keyword">shear_mover, </span>monégasque)
<span class="hljs-comment"># the trial accepts/rejects a pose based upon the MonteCarlo object</span>
for i in range(<span class="hljs-number">10</span>):
    <span class="hljs-keyword">scorefxn(pose) </span><span class="hljs-comment"># update energies</span>
    monégasque.recover_low(pose)
    trial.apply(pose)
monégasque.<span class="hljs-keyword">show_counters()
</span>monégasque.recover_low(pose)
</code></pre><p>Whereas the centroid option (coarse grain) revolves around replacing the sidechain with a <code>CEN</code> atom and recovering them after all is done.</p>
<pre><code>pr_sm = pyrosetta.rosetta.protocols.simple_moves
original_pose = <span class="hljs-built_in">pose</span>.clone()
cen_switch = pr_sm.SwitchResidueTypeSetMover(<span class="hljs-string">"centroid"</span>)
cen_switch.<span class="hljs-built_in">apply</span>(<span class="hljs-built_in">pose</span>)
... # <span class="hljs-built_in">do</span> something
# switch back to full atom, then copy the sidechains from the untouched clone
fa_switch = pr_sm.SwitchResidueTypeSetMover(<span class="hljs-string">"fa_standard"</span>)
fa_switch.<span class="hljs-built_in">apply</span>(<span class="hljs-built_in">pose</span>)
recover_sidechains = pr_sm.ReturnSidechainMover(original_pose)
recover_sidechains.<span class="hljs-built_in">apply</span>(<span class="hljs-built_in">pose</span>)
</code></pre><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRc-JKCdRPgRpA5d4d8HSF7-AI3-iZzXCJIHNbCRUs4yH8x39-U8xEeQKIil_ARS2pjmsGxoHdcyII8v0QEuEisy8MT8j1p_veeaDGsUfry5EQHWroan6d3AUrk8ef4_XWMvpgUv67FRU/s2899/centroid+%25281%2529.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1085" data-original-width="2899" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRc-JKCdRPgRpA5d4d8HSF7-AI3-iZzXCJIHNbCRUs4yH8x39-U8xEeQKIil_ARS2pjmsGxoHdcyII8v0QEuEisy8MT8j1p_veeaDGsUfry5EQHWroan6d3AUrk8ef4_XWMvpgUv67FRU/s320/centroid+%25281%2529.png" width="320" /></a></div><br /><p><br /></p><p>NB. The switch happens in place, so it is best to avoid marking in the variable name whether the pose is centroid or not.</p>
<p>Centroid is good for some applications, but it is not really intended for this. It also requires a centroid scorefunction and its own variant of <code>ClassicRelax</code>, <code>CentroidRelax</code>.</p>
<pre><code>scorefxn_smooth = pyrosetta.create_score_function(<span class="hljs-string">'cen_std_smooth'</span>)
cen_relax = pyrosetta.rosetta.protocols.relax.CentroidRelax()
print(cen_relax.get_scorefxn().get_name()) # <span class="hljs-string">'cen_std'</span>
#cen_relax.set_movemap(movemap)
cen_relax.set_scorefxn(scorefxn_smooth)
#cen_relax.set_rounds(<span class="hljs-number">5</span>)
# cen_relax.min_type() # <span class="hljs-string">'lbfgs_armijo_nonmonotone'</span>
cen_relax.apply(pose)
</code></pre><p>This may occasionally crash when CEN atoms are too close, without recovering the best pose, giving the errors "NAN occurred in H-bonding calculations!" or "AtomTree torsion_angle_dof_id angle range error".</p>
<h2 id="best-of-five">Best of five</h2>
<p>So far, I have discussed methods to analyse a specific model. However, AlphaFold2 returns multiple models and it is worth checking them all. Here are some methods I used to look at the interface.</p>
<h3 id="exploring-the-folder-of-goodies">Exploring the folder of goodies</h3>
<p>From a <a href="https://github.com/sokrypton/ColabFold">ColabFold notebook</a> run one gets the following files:</p>
<ul>
<li>msa.pickle</li>
<li>msa_coverage.png</li>
<li>rank_{👾}_model_{🗿}_ptm_seed_{🌰}.png</li>
<li>rank_{👾}_model_{🗿}_ptm_seed_{🌰}_pae.json</li>
<li>rank_{👾}_model_{🗿}_ptm_seed_{🌰}_relaxed.pdb</li>
<li>rank_1_model_{🗿}_ptm_seed_{🌰}_unrelaxed.pdb</li>
<li>settings.txt</li>
</ul>
<p>Where 👾 is the rank (value between 1 and N, where N=5 is the default) and 🗿 is the model number and 🌰 is the seed value, default is zero.
The JSON is the same as downloaded (see above for reshaping function). <code>settings.txt</code> contains the summary scores in addition to the settings.</p>
<p>Rank 1 is the best according to the chosen metric, which can be misleading, especially if the top models score closely.</p>
<h3 id="pandas">Pandas</h3>
<p>To keep the data in a tidy way, I used pandas via the following method (which does not require pyrosetta):</p>
<pre><code>import os
import re
import pandas as pd

def make_AF2_dataframe(folder:str) -> pd.DataFrame:
    <span class="hljs-string">""</span><span class="hljs-comment">"</span>
    Given <span class="hljs-keyword">a</span> folder from ColabFold <span class="hljs-keyword">return</span> <span class="hljs-keyword">a</span> pandas dataframe
    with one row per rank.
    This <span class="hljs-keyword">is</span> convoluted, because the folder may have been altered by <span class="hljs-keyword">a</span> human.
    <span class="hljs-string">""</span><span class="hljs-comment">"</span>
    filenames = os.listdir(folder)
    # group <span class="hljs-keyword">files</span>
    ranked_filenames = {}
    <span class="hljs-keyword">for</span> filename in filename<span class="hljs-variable">s:</span>
        <span class="hljs-keyword">if</span> not re.<span class="hljs-keyword">match</span>(r<span class="hljs-string">'rank_\d+.*\.pdb'</span>, filename):
            <span class="hljs-keyword">continue</span>
        rex = re.<span class="hljs-keyword">match</span>(r<span class="hljs-string">'rank_(\d+)_model_(\d+).*seed_(\d+)(.*)\.pdb'</span>, filename)
        rank = <span class="hljs-keyword">int</span>(rex.group(<span class="hljs-number">1</span>))
        model = <span class="hljs-keyword">int</span>(rex.group(<span class="hljs-number">2</span>))
        seed = <span class="hljs-keyword">int</span>(rex.group(<span class="hljs-number">3</span>))
        other = rex.group(<span class="hljs-number">4</span>)
        # a relaxed model takes precedence over an unrelaxed one of the same rank
        <span class="hljs-keyword">if</span> rank in ranked_filenames <span class="hljs-built_in">and</span> ranked_filenames[rank][<span class="hljs-string">'relaxed'</span>] == True:
            <span class="hljs-keyword">continue</span>
        data = dict(name=filename,
                    path=os.path.<span class="hljs-keyword">join</span>(folder, filename),
                    rank=rank,
                    model=model,
                    seed=seed,
                    relaxed=<span class="hljs-string">'_relaxed'</span> in other
                    )
        ranked_filenames[rank] = data
    # <span class="hljs-keyword">make</span> dataframe
    details = pd.DataFrame(<span class="hljs-keyword">list</span>(ranked_filenames.<span class="hljs-built_in">values</span>()))
    # <span class="hljs-built_in">add</span> data from settings.
    pLDDTs = {}
    pTMscores = {}
    with <span class="hljs-keyword">open</span>(os.path.<span class="hljs-keyword">join</span>(folder, <span class="hljs-string">'settings.txt'</span>), <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fh:
        <span class="hljs-keyword">for</span> <span class="hljs-keyword">match</span> in re.findall(r<span class="hljs-string">'rank_(\d+).*pLDDT:(\d+\.\d+) pTMscore:(\d+\.\d+)'</span>, fh.<span class="hljs-keyword">read</span>()):
            pLDDTs[<span class="hljs-keyword">int</span>(<span class="hljs-keyword">match</span>[<span class="hljs-number">0</span>])] = float(<span class="hljs-keyword">match</span>[<span class="hljs-number">1</span>])
            pTMscores[<span class="hljs-keyword">int</span>(<span class="hljs-keyword">match</span>[<span class="hljs-number">0</span>])] = float(<span class="hljs-keyword">match</span>[<span class="hljs-number">2</span>])
    details[<span class="hljs-string">'pLDDT'</span>] = details[<span class="hljs-string">'rank'</span>].<span class="hljs-keyword">map</span>(pLDDTs)
    details[<span class="hljs-string">'pTMscore'</span>] = details[<span class="hljs-string">'rank'</span>].<span class="hljs-keyword">map</span>(pTMscores)
    <span class="hljs-keyword">return</span> details
</code></pre><p>So running <code>scores = make_AF2_dataframe(folder = 'prediction_38cac')</code> gives me:</p>
<table>
<thead>
<tr>
<th style="text-align: right;">rank</th>
<th style="text-align: right;">model</th>
<th style="text-align: right;">seed</th>
<th style="text-align: left;">relaxed</th>
<th style="text-align: right;">pLDDT</th>
<th style="text-align: right;">pTMscore</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">1</td>
<td style="text-align: right;">5</td>
<td style="text-align: right;">0</td>
<td style="text-align: left;">True</td>
<td style="text-align: right;">78.21</td>
<td style="text-align: right;">0.6023</td>
</tr>
<tr>
<td style="text-align: right;">2</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">0</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">77.82</td>
<td style="text-align: right;">0.5788</td>
</tr>
<tr>
<td style="text-align: right;">3</td>
<td style="text-align: right;">2</td>
<td style="text-align: right;">0</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">77.33</td>
<td style="text-align: right;">0.5772</td>
</tr>
<tr>
<td style="text-align: right;">4</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">0</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">77.05</td>
<td style="text-align: right;">0.5866</td>
</tr>
<tr>
<td style="text-align: right;">5</td>
<td style="text-align: right;">3</td>
<td style="text-align: right;">0</td>
<td style="text-align: left;">False</td>
<td style="text-align: right;">76.87</td>
<td style="text-align: right;">0.5711</td>
</tr>
</tbody>
</table>
<p>Then I fetch the errors and poses, but without putting them in the table:</p>
<pre><code>import json
import pyrosetta_help as ph

def get_folder_errors(scores: pd.DataFrame, path_column: str = 'path') -> dict:
    """Load the predicted aligned error matrix of each model, keyed by rank."""
    errors = dict()
    for i, row in scores.iterrows():
        # swap the extension and the (un)relaxed tag to get the PAE json filename
        filename = row[path_column].replace('.pdb', '.json')\
                                   .replace('_unrelaxed', '_pae')\
                                   .replace('_relaxed', '_pae')
        with open(filename, 'r') as fh:
            errors[row['rank']] = ph.reshape_errors(json.load(fh))
    return errors

def get_folder_poses(df: pd.DataFrame, path_column: str = 'path') -> dict:
    """Load each model as a pose, keyed by rank."""
    poses = dict()  # Not pyrosetta.rosetta.utility.vector1_core_pose_Pose(len(df))
    for i, row in df.iterrows():
        poses[row['rank']] = pyrosetta.pose_from_file(row[path_column])
    return poses

# run (only once both functions are defined):
poses = get_folder_poses(scores)
errors = get_folder_errors(scores)
</code></pre><p>In an earlier implementation I was using <code>pyrosetta.rosetta.utility.vector1_core_pose_Pose</code> as opposed to a dictionary, but retrieving an item from that vector actually returns a clone.</p><p>Having done that, I got some measurements of the complexes:</p>
<pre><code>import json
from functools import partial
import numpy as np
import pyrosetta_help as ph
pr_rs = pyrosetta.rosetta.core.select.residue_selector

def _get_interaction_vectors(pose: pyrosetta.Pose, chain_id: int, threshold: float = 3.) -> pr_rs.ResidueVector:
    """
    For ``analyse_complex``.
    Residues of chain ``chain_id`` in close contact with any other chain.
    """
    assert pose.num_chains() > 1, 'Single chain!'
    chain_sele = pr_rs.ChainSelector(chain_id)
    other_chains_sele = pr_rs.NotResidueSelector(chain_sele)
    cc_sele = pr_rs.CloseContactResidueSelector()
    cc_sele.central_residue_group_selector(other_chains_sele)
    cc_sele.threshold(float(threshold))
    other_cc_sele = pr_rs.AndResidueSelector(chain_sele, cc_sele)
    return pr_rs.ResidueVector(other_cc_sele.apply(pose))

def _get_errors(row, i):
    """
    For ``analyse_complex``.
    Lowest predicted error of each interface residue of chain ``i``.
    """
    residues = row[f'interchain_residues_{i}']
    error = errors[row['rank']]  # the global ``errors`` dictionary made above
    # pose indices are 1-based, the matrix is 0-based
    return [np.round(np.min(error[r - 1, :]), 1) for r in residues]

def analyse_complex(poses, details: pd.DataFrame):
    # the helper is called ``_get_interaction_vectors``
    details['interchain_residues_1'] = details['rank'].apply(lambda rank: _get_interaction_vectors(poses[rank], 1))
    details['interchain_residues_2'] = details['rank'].apply(lambda rank: _get_interaction_vectors(poses[rank], 2))
    details['N_interchain_residues_1'] = details['interchain_residues_1'].apply(len)
    details['N_interchain_residues_2'] = details['interchain_residues_2'].apply(len)
    details['errors_interchain_residues_1'] = details.apply(partial(_get_errors, i=1), axis=1)
    details['errors_interchain_residues_2'] = details.apply(partial(_get_errors, i=2), axis=1)
</code></pre><p>Namely, <code>analyse_complex</code> adds to the dataframe a column for the interface residues in chain 1, one for those in chain 2, two columns for the number of these, and two columns for the lowest predicted error of each of these residues.</p>
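<p>One thing to keep in mind is that <code>_get_errors</code> takes the minimum over the whole row of the error matrix, which also covers positions in the residue's own chain. A stricter variant would only look at the columns of the partner chain. The following is a minimal sketch of that idea in plain Python with made-up numbers: <code>min_interchain_error</code> and <code>chain_ranges</code> are hypothetical names for illustration and are not part of <code>pyrosetta_help</code>.</p>

```python
from typing import Dict, Sequence, Tuple


def min_interchain_error(error: Sequence[Sequence[float]],
                         residue: int,
                         chain_ranges: Dict[int, Tuple[int, int]],
                         own_chain: int) -> float:
    """
    Lowest predicted error between ``residue`` (1-based pose index)
    and any residue that belongs to a different chain.
    ``chain_ranges`` maps a chain number to its inclusive
    (first, last) pose indices (a hypothetical helper structure).
    """
    row = error[residue - 1]  # the matrix is 0-based, pose indices are 1-based
    partner_values = [row[j - 1]
                      for chain, (first, last) in chain_ranges.items()
                      if chain != own_chain
                      for j in range(first, last + 1)]
    return round(min(partner_values), 1)


# toy 4-residue complex: chain 1 is residues 1-2, chain 2 is residues 3-4
toy_error = [[0.3, 1.2, 9.5, 14.0],
             [1.1, 0.3, 7.5, 12.0],
             [9.0, 8.0, 0.3, 1.5],
             [15.0, 13.0, 1.4, 0.3]]
toy_ranges = {1: (1, 2), 2: (3, 4)}
print(min_interchain_error(toy_error, residue=1, chain_ranges=toy_ranges, own_chain=1))  # 9.5
```

<p>The same indexing works on the NumPy matrix stored in <code>errors</code>, as a row of the array can be indexed just like a list.</p>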
<p>Note that the residue indices are pose residue indices, not PDB residue numbers, so chain B will continue on from the numbering of chain A.</p>
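<p>To make the numbering point concrete, here is a small sketch that converts a pose index back into a chain number and a position within that chain. It assumes that each chain is numbered from 1 without gaps, which is an assumption about the file and not a guarantee; <code>pose_to_chain_numbering</code> and <code>chain_lengths</code> are hypothetical names. With an actual pose, <code>pose.pdb_info().pose2pdb(i)</code> reports the numbering stored in the PDB file.</p>

```python
from typing import Sequence, Tuple


def pose_to_chain_numbering(pose_index: int, chain_lengths: Sequence[int]) -> Tuple[int, int]:
    """
    Convert a 1-based pose index into (chain number, position within that chain),
    assuming every chain is numbered from 1 without gaps.
    """
    if pose_index not in range(1, sum(chain_lengths) + 1):
        raise ValueError(f'pose index {pose_index} is outside the complex')
    offset = 0  # number of residues in the chains already skipped
    for chain_number, length in enumerate(chain_lengths, start=1):
        if length >= pose_index - offset:
            return chain_number, pose_index - offset
        offset += length
    raise ValueError('unreachable for valid input')


# toy complex: chain 1 has 120 residues, chain 2 has 85
print(pose_to_chain_numbering(120, (120, 85)))  # (1, 120)
print(pose_to_chain_numbering(121, (120, 85)))  # (2, 1)
```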
<p>The median pLDDT for the interface residues can be calculated, as AlphaFold2 models store the pLDDT in the B-factor column:</p>
<pre><code>def get_median_interface_bfactor(pose, residues) -> float:
    pdb_info = pose.pdb_info()
    bfactors = [pdb_info.bfactor(r, pose.residue(r).atom_index('CA')) for r in residues]
    return float(np.median(bfactors))  # a numpy array has no ``.median`` method

def get_ibf(row):
    residues = list(row.interchain_residues_1) + list(row.interchain_residues_2)
    pose = poses[row['rank']]  # ``row.rank`` is the Series method, not the column, hence the brackets
    return get_median_interface_bfactor(pose, residues)

# run:
details['interface_pLDDT'] = details.apply(get_ibf, axis=1)
</code></pre><p>For the interface strength, I need to minimise first:</p>
<pre><code>scorefxn = pyrosetta.get_fa_scorefxn()
ap_st = pyrosetta.rosetta.core.scoring.ScoreType.atom_pair_constraint
scorefxn.set_weight(ap_st, 1)
for rank, pose in poses.items():  # both dictionaries are keyed by rank
    ph.constrain_distances(pose, errors[rank])
    relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, 3)
    relax.apply(pose)
# run:
details['dG'] = details['rank'].apply(lambda rank: scorefxn(poses[rank]))
</code></pre><p>Then I can calculate the interface strength:</p>
<pre><code>def calculate_interface(details, interface='A_B'):
    newdata = []
    for rank in details['rank']:
        ia = pyrosetta.rosetta.protocols.analysis.InterfaceAnalyzerMover(interface)
        ia.apply(poses[rank])
        newdata.append({'complex_energy': ia.get_complex_energy(),
                        'separated_interface_energy': ia.get_separated_interface_energy(),
                        'complexed_sasa': ia.get_complexed_sasa(),
                        'crossterm_interface_energy': ia.get_crossterm_interface_energy(),
                        'interface_dG': ia.get_interface_dG(),
                        'interface_delta_sasa': ia.get_interface_delta_sasa()})
    # adding multiple columns, hence why not the apply route.
    # the order is not changed, so all good; reusing the index keeps the rows aligned
    newdata = pd.DataFrame(newdata, index=details.index)
    for column in newdata.columns:
        details[column] = newdata[column]
</code></pre><p></p>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com1tag:blogger.com,1999:blog-9015174234871442237.post-79462902301634743082021-07-27T01:26:00.004-07:002022-08-15T08:53:19.250-07:00What to look out for with an AlphaFold2 model<p>There is nothing more disheartening than telling someone "Sorry, I cannot help you with your protein, because no homologue structures of your protein are solved and any model will be rubbish". Now, with the AlphaFold2 proteome release, this is no longer the case. Or mostly: in fact there are several pitfalls and issues that need to be looked at, because the algorithm does not account for three things: binding partners and ligands, oligomerisation and alternate conformations.</p><a name='more'></a><p></p><h2 style="text-align: left;">Gamechanger</h2><div>Before starting, I ought to address whether AlphaFold2 is actually a gamechanger.</div><div>AlphaFold2 stunned the world with its accuracy in the CASP competition and now DeepMind have teamed up with <a href="https://alphafold.ebi.ac.uk/" target="_blank">EMBL EBI </a>to provide structures for the whole proteome of key model organisms. So what does this mean for science? And am I out of a job now?</div><div><br /></div><div>Prior to this, the PDB already contained many proteins, in different conformations and bound states, while Swiss-Model provided threaded models for parts of any human protein that are homologous to a solved structure. The former is actually rather tricky to navigate as there may be a given human protein domain in complex with two bovine proteins, all expressed in <i>E. coli</i>, in two different states. Even if I am used to the PDB, I often look at the <a href="https://michelanglo.sgc.ox.ac.uk/name" target="_blank">name route in Michelanglo</a> because it tells me what is bound and what the ranges are. 
Swiss-Model has a very clean interface nowadays and does have many models, but it is not as well-known a resource as the PDB or, as of this week, the EBI's AlphaFold2 repository. Therefore, it is clear why many are excited.</div><h2 style="text-align: left;">Model making: threading</h2><p>Although the AlphaFold2 structures are better than models from other online servers or databases, there are some caveats.</p><p>There are several ways to make a model. A model can be a one-to-one threaded model, an <i>ab initio</i> model or a hybridisation of threaded/<i>ab initio</i> models.</p><p style="text-align: left;">In threading, a structure is taken and the target sequence is mapped onto it based on a pairwise alignment. ModBase (database), Swiss-Model (database), Phyre2 one-to-one threading (on request) and Rosetta's ThreadingMover (local calculations) are examples of threading applications. This performs poorly when the sequences are divergent, but when they are close it has several benefits.</p><p>First, if one has a protein in two conformations, both conformations can be threaded separately. For example, a carrier protein generally has an open and a closed face on either side of the membrane: if there are two structures of close homologues (>30–50% sequence identity) in these two conformations, one can thread twice to get both states and better understand how the protein operates. Unfortunately, AlphaFold2 does not take into account that proteins can have alternate conformations, so if one state has a stronger evolutionary signal the Evoformer will disregard the evolutionary contacts that favour the other states.</p><p>Second, one can migrate the bound ligands and/or proteins after threading from the template to the model. This is incredibly useful, but runs on the assumption that the target has the same binding partners as the template. 
Evolution often takes measures to prevent crosstalk, resulting in divergent binding patterns, but it is a mostly safe assumption that similar binding partners bind in a similar manner except for the divergent parts. In the case of small molecules, I-Tasser, an <i>ab initio + </i>threading hybridisation online algorithm, suggests the possibility of migrating these from the templates. Unfortunately, AlphaFold2 does not work by templates, or at least it is unaware of what specific model was used as a template in its training, so it does not make any such suggestion, although one can find out what may be useful (see supplementary notes below).</p><h2 style="text-align: left;">Loops</h2><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7dsZ7ccYum5D8TeU-L_PfzPNP-dkBvjtpDLCIwfVjoWqAL4XyzTjl5ts0vptrx1fljCzv64jLq6h8_ULjfLOphKy4WoiwT7Q9sBwsSE3S4VbFuCoK-IB1B_TLMfO7l0SoK_5ETi6UUpg/s1280/spaghetti.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="951" data-original-width="1280" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7dsZ7ccYum5D8TeU-L_PfzPNP-dkBvjtpDLCIwfVjoWqAL4XyzTjl5ts0vptrx1fljCzv64jLq6h8_ULjfLOphKy4WoiwT7Q9sBwsSE3S4VbFuCoK-IB1B_TLMfO7l0SoK_5ETi6UUpg/w234-h174/spaghetti.png" width="234" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="color: #999999;">Not a bowl of spaghetti,<br />but a poor Phyre2 model</span></td></tr></tbody></table>In <i>ab initio</i> models and hybrids based on these, loops that do not have an intrinsic structure appear as spaghetti. These may not have been modelled correctly (cf. picture) or may be bound as a flexible peptide to a target protein or may wrap around DNA (cf. 
picture).<p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgawPEelg6WTlq3KeAa_6MXiaE0a9X0sbdXJxcph9ZuII7W8eYFf5hILlYhoMZV2_zElnqa5zAx__0dxbM-e4OYhQ2hGrCKkbV6yytSR3HqOURorE7I3nKIyVN5ObVjkfYYhQLYvDW12sg/s1066/cadherin.png" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="526" data-original-width="1066" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgawPEelg6WTlq3KeAa_6MXiaE0a9X0sbdXJxcph9ZuII7W8eYFf5hILlYhoMZV2_zElnqa5zAx__0dxbM-e4OYhQ2hGrCKkbV6yytSR3HqOURorE7I3nKIyVN5ObVjkfYYhQLYvDW12sg/w320-h158/cadherin.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="color: #999999;">PDB:3L6X.<br />E-cadherin peptide (turquoise)<br />bound to catenin.</span></td></tr></tbody></table><br /><p>In a crystal these unbound disordered loops lack electron density and are therefore absent from the solved structure. I blogged about their addition <a href="https://blog.matteoferla.com/2020/07/filling-missing-loops-proper-way.html?m=0">in a past post</a> and enumerated many reasons why this operation can be a bad idea. Many human AlphaFold2 models present with long spaghetti loops. Let's take MEF2C as an example: <a href="https://alphafold.ebi.ac.uk/entry/Q06413">https://alphafold.ebi.ac.uk/entry/Q06413</a>. This is not a random choice, but a protein that I have worked on before, and my threaded model looked very different: <a href="https://michelanglo.sgc.ox.ac.uk/r/mef2c">https://michelanglo.sgc.ox.ac.uk/r/mef2c</a>. 
Spot the difference:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC5bIEUn_4_W-ZpYEpZ_MSptz-X2m4ZhKKcUdeUtMk9Omrz8a5koqyq-G6n7oJ0eXcRNCUCBo710jeJ_MfJLJVuQFPtggaa2wZsnXf8d3y5KWyKi6bXFWtkAG6FIEfmCVzdmjeaES9V8I/s923/AF-Q06413-F1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="460" data-original-width="923" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC5bIEUn_4_W-ZpYEpZ_MSptz-X2m4ZhKKcUdeUtMk9Omrz8a5koqyq-G6n7oJ0eXcRNCUCBo710jeJ_MfJLJVuQFPtggaa2wZsnXf8d3y5KWyKi6bXFWtkAG6FIEfmCVzdmjeaES9V8I/s320/AF-Q06413-F1.png" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVwGaGn3i1vbRa4vm5q_zNn-XATVWf_6jngM5Pe5Uds2V2X5JzEp1ro-wtaFcOWwl7gKcszLkLilFmSEkjIuBvoWTV-cgfYQ_FMOgys9PZ0l8DQXWd81wHFEyW5zRwvTnwfJv5WXznmYs/s613/Screenshot+2021-07-26+at+17.05.10.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="393" data-original-width="613" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVwGaGn3i1vbRa4vm5q_zNn-XATVWf_6jngM5Pe5Uds2V2X5JzEp1ro-wtaFcOWwl7gKcszLkLilFmSEkjIuBvoWTV-cgfYQ_FMOgys9PZ0l8DQXWd81wHFEyW5zRwvTnwfJv5WXznmYs/w213-h136/Screenshot+2021-07-26+at+17.05.10.png" width="213" /></a></div><br /><p>Three and a half things are to be noted.</p><p></p><ol style="text-align: left;"><li>My model is a dimer, although in the Twittersphere there are already Colab notebooks that can model oligomers with AlphaFold2.</li><li>My model includes DNA.</li><li>I do not have the loops with low pLDDT.</li><li>The difference in the information one learns from them is staggering!</li></ol><div>However, it should be said that low pLDDT loops are still likely to have a role. 
In fact, in this protein, if the C-terminal spaghetti were junk, there would be many homozygous truncations in gnomAD, yet there are none. Looking at <a href="https://www.phosphosite.org/proteinAction.action?id=999&showAllSites=true" target="_blank">PhosphoSitePlus</a> reveals that the spaghetti is peppered with post-translational modifications, including C-terminal phosphorylations (a common nuclear protein regulatory mechanism) and several lysine acetylations (affecting DNA binding). One residue is commonly found phosphorylated, Ser222. Googling this residue reveals it is a residue with <a href="https://pubmed.ncbi.nlm.nih.gov/29431698/" target="_blank">a studied effect</a>. It is part of a spaghetti loop in the AlphaFold2 model, but in reality it must wrap around something, recruiting things. In reality, having something bind DNA does not make the RNA polymerase magically appear next to it ready to transcribe; instead a complicated interplay of proteins, involving not only the RNA polymerase complex but also the mediator complex, needs to happen. Namely, the eukaryotic RNA polymerase is recruited to the transcription factor indirectly: first the transcription factor binds the DNA, then it binds a certain part of the mediator complex (evidence suggests different parts may be bound by different transcription factor tails (cf. 
<a href="https://link.springer.com/article/10.1007/s00018-013-1265-9" target="_blank">this review</a>), thus integrating the signal with other factors), which then recruits the RNA polymerase.</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAOOizbhFWTKcgRCz4j0C0XFb4NrWC5LeanDzcK_PxLqrV4JG5kzDLZqaYq8UZWxV1UaGMY5hI-2CrHeuvg3dj6dDkzc_NtaShNOQ2yy86mZuQONyJDimp1-UOcjoHiSh1vCBiRpRvqFo/s1600/PhosphoSitePlus_MEF2C.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="800" data-original-width="1600" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAOOizbhFWTKcgRCz4j0C0XFb4NrWC5LeanDzcK_PxLqrV4JG5kzDLZqaYq8UZWxV1UaGMY5hI-2CrHeuvg3dj6dDkzc_NtaShNOQ2yy86mZuQONyJDimp1-UOcjoHiSh1vCBiRpRvqFo/s320/PhosphoSitePlus_MEF2C.png" width="320" /></a></div><br /><h2 style="text-align: left;">Beads on a string</h2><p></p><p>Even when there are no spaghetti loops, some multidomain proteins may have structured domains, which do their thing independently of the other domains, like beads on a string. Namely, they are flexible in solution, but have an arrangement when bound. For example, whereas it is easy to predict what a KH domain looks like, it is not easy to predict how several KH domains interact with each other. 
For instance, KSRP <a href="https://alphafold.ebi.ac.uk/entry/Q92945" target="_blank">in AlphaFold2</a> looks like:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyrwHGOBqlKyVsAr0AQ_CdY5Y2ljmlFDb9i5vWXtshlGAY3T9FymmWiH4d7Bu6BwG8SWLENNhr-1zQmKjxTG1uTnxly1WlkKQviwI0ohtXHCwdA5-MKQqlJgUbgzOmQJFV678k9Uadm5k/s923/AF-Q92945-F1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="460" data-original-width="923" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyrwHGOBqlKyVsAr0AQ_CdY5Y2ljmlFDb9i5vWXtshlGAY3T9FymmWiH4d7Bu6BwG8SWLENNhr-1zQmKjxTG1uTnxly1WlkKQviwI0ohtXHCwdA5-MKQqlJgUbgzOmQJFV678k9Uadm5k/s320/AF-Q92945-F1.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="color: #999999;">In blue are the KH domains,<br />which have high pLDDT</span></td></tr></tbody></table>The KH domains appear to take on a certain arrangement, but the loops in between have low confidence and actually no residues of the different domains are in contact with each other. Looking further at the loops between the domains shows they are flexible linkers rich in glycine, proline and charged residues. 
When unbound these will move about, while when bound to RNA these will align along it.<div>For visual purposes I personally prefer to have these proteins in a line, like an illustration of the planets, which are actually very very rarely in syzygy (for a script that can do this, see <a href="https://github.com/matteoferla/protein_fuser" target="_blank">my GitHub</a>).</div><div><h2 style="text-align: left;">Transmembrane</h2><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY8CNKw15ZC21PMGLRgmfafptbQTD7y1YNjLIzh6wzstf5dffD5L91Oz-l-QbdEHdaI89_ayZie-WaY9dbEeyd1dc9cdMYn6o3j2zCIYZtmFf_zHWkB2yJdHId0aOTefttclkitvygg2o/s923/AF-Q8K1P8-F1.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="460" data-original-width="923" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY8CNKw15ZC21PMGLRgmfafptbQTD7y1YNjLIzh6wzstf5dffD5L91Oz-l-QbdEHdaI89_ayZie-WaY9dbEeyd1dc9cdMYn6o3j2zCIYZtmFf_zHWkB2yJdHId0aOTefttclkitvygg2o/s320/AF-Q8K1P8-F1.png" width="320" /></a></div>A common problem with models of transmembrane proteins from I-Tasser, Phyre2 and AlphaFold2 is that the bells-and-whistles parts that decorate the protein on either side of the membrane may be placed looping back into where one would picture the membrane. This needs correcting, as the algorithms do not really see the membrane as a plane to cross (see footnote about adding markers for the membrane). The segment invading the membrane plane can be deleted, or moved with the autosculpting feature in PyMOL discussed in <a href="https://blog.matteoferla.com/2017/10/hacking-pdbs-for-fusion-protein.html">the previous blog post about filling missing density the hacky way</a>, as the span will have low confidence anyway.</div><h2 style="text-align: left;">Docking</h2><div><p>There is a real potential in discovering new drugs that bind to the AlphaFold2 models. 
However, it is not a simple task and, worse, it may backfire. One problem that, in my opinion, is going to arise from this release is a swathe of docking studies that will not really lead anywhere (pun intended). This is because docking is a complicated toolset that requires scepticism, human attention to detail and, above all, validation of a large subset of targets.</p><p>Covid19 is a good example of the issues with docking. With Covid19 the Zhang group (I-Tasser) released models of the viral proteins and shortly afterwards there were two effects. The milder effect was a deluge of questions in Stack Exchange (SE) Bioinformatics, SE Chemistry and even Stack Overflow by users trying their hand at docking clearly without understanding what they were doing and without understanding the system at hand (e.g. setting the grid to the size of the protein, not protonating catalytic residues correctly, improper data analysis <i>etc</i>.). The more problematic effect was a deluge of papers in BioRxiv and ChemRxiv, which later translated to actual publications, claiming that a drug could be repurposed based solely on a docking screen. In the subsequent sporadic publications in which these were tested the compounds generally proved to be near ineffective, but enough for a publication with the word Covid19 in the title. This proved to be such a major drain on funding that apparently most funding bodies were refusing to consider grants related to drug discovery for Covid19...</p><p>Another usage is discovering which proteins are bound by certain drugs that cause off-target effects, by doing a longitudinal docking study with a given compound to find its target protein amidst the human proteome. This is however problematic because a key element is determining which parts of a protein are of interest: binding somewhere on the surface is unlikely to have any effect on an enzyme, for example. 
There are several algorithms that are trained to spot active sites, so it is not an impossible task, but it is far from as trivial as it sounds.</p><p>A further usage, and one less riddled with issues, is that AlphaFold2 models could reveal the binding mechanism of certain drugs whose targets could not be crystallised. As a result, better variants could be designed.</p></div><h2 style="text-align: left;">Interpretation</h2><p>A recurring concept in this blog post has been "interpretation". A structure or model is a great piece of information that nevertheless needs to be interpreted. Triosephosphate isomerase and alanine racemase are both TIM barrels, but catalyse completely different reactions. Therefore, an extra step is required to understand how the protein functions and, if required, design an inhibitor or engineer it.</p><p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgK0EWBYFaULRhnRam2cjTt2Y4ARPJYGRNpLo1cF58CdjCNrRDoRVrl8IkgcS_Lbdhupm7UGeH1EtTnB9Sb7W7_pKdb2nj5LHFtb-5NMIatthI1eRF3ebADzHEFtwkKAb4B9qBXWeEia4/s2048/Screenshot+2021-07-24+at+09.48.30.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1198" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgK0EWBYFaULRhnRam2cjTt2Y4ARPJYGRNpLo1cF58CdjCNrRDoRVrl8IkgcS_Lbdhupm7UGeH1EtTnB9Sb7W7_pKdb2nj5LHFtb-5NMIatthI1eRF3ebADzHEFtwkKAb4B9qBXWeEia4/s320/Screenshot+2021-07-24+at+09.48.30.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="color: #999999;">A protein in Michelanglo.</span><br style="color: #999999;" /><span style="color: #999999;">The green links are "prolinks", which change</span><br style="color: #999999;" /><span style="color: #999999;">the protein 
representation.</span></td></tr></tbody></table><br />In order to help in the dissemination of protein information, I developed <a href="https://michelanglo.sgc.ox.ac.uk/" target="_blank">Michelanglo</a>, a tool to share webpages with user-written descriptions that can control an interactive protein view. All too often when making a page to share, I realise something about the protein when annotating it!</p><p>A paper about a crystal structure generally has a lot of useful information and is not simply an inane description of what the authors can see. Okay, some papers are unfortunately like that, but often there are insightful connections that aren't obvious and biochemical assays probing the function.</p><p>Therefore, the AlphaFold2 release is a great tool for biochemists and actually gives them more work and not less, because a lot of work is needed to interpret this treasure trove of data and convert it into knowledge.</p><p>The AlphaFold2 models do not replace crystallographers and actually will be used to solve crystal structures by molecular replacement without the need for the many tricks that have popped up over the years, allowing them to crystallise the protein in different states with different ligands. The phase problem is a problem, to the point that in the past decade, several structures, such as <a href="https://www.rcsb.org/structure/3sqf">M-PMV retroviral protease</a>, had to be solved by citizen scientists with FoldIt.</p><p>In fact, it is rather misleading to hail AlphaFold2 as a gamechanger for medicine (i.e. benefitting humans) rather than as a gamechanger for our understanding of biochemistry (i.e. fundamental science), even if a better understanding of how proteins function ultimately is translated into medicine. 
AlphaFold2 does not circumvent the fundamental science part.</p><h2 style="text-align: left;">Supplementary notes</h2><h3 style="text-align: left;">Finding partners</h3><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR42X_IhPJLI7946c0rTyf985GrDhaLpOnVMHAFBAMVSM_bbmez-HUEhvEgeNdIayQxD0pck1bf19ELt6m6qReiCoMdhxQpKcV4aA7B4bhydh29_fcpMF_YEVN9YPrMk-bgUGCPLsRLps/s816/Screenshot+2021-07-26+at+15.12.15.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="332" data-original-width="816" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR42X_IhPJLI7946c0rTyf985GrDhaLpOnVMHAFBAMVSM_bbmez-HUEhvEgeNdIayQxD0pck1bf19ELt6m6qReiCoMdhxQpKcV4aA7B4bhydh29_fcpMF_YEVN9YPrMk-bgUGCPLsRLps/s320/Screenshot+2021-07-26+at+15.12.15.png" width="320" /></a></div>To migrate binding partners or ligands, one must know what the donor structures are.<br />To find out what is similar, go to NCBI Blast Protein and choose PDB as the database in "Choose search set".</div><div>The results should contain four-letter PDB codes followed by an underscore and a chain letter.<br />If no results are found, a two-step process will help. First, a PSI-Blast search ("Program selection") against RefSeq, preferably with the exclusion of over-represented taxa (with an included row with All taxon to avoid an occasional glitch), needs to be run for a few iterations, after which the PSSM matrix needs exporting (old view, export PSSM matrix). Lastly, a new search is run, this time against the PDB dataset and with the downloaded PSSM matrix uploaded in algorithm parameters.</div><div>The candidate donors can be inspected in the PDB (<a href="https://www.rcsb.org/">https://www.rcsb.org/</a>) to see what else is there. Do note that DNA is not marked in the description (e.g. 
<a href="https://www.rcsb.org/structure/1N6J">https://www.rcsb.org/structure/1N6J</a>).</div><div>Also note that the numbering in NCBI corresponds to the position in the sequence in the SEQRES record of a PDB. It is not the PDB residue index, nor what is called the pose index in Rosetta, nor the residue position in the canonical Uniprot isoform.</div><div><h4 style="text-align: left;">Migrating partners</h4></div><div>First determine what part of the donor protein is homologous to your AlphaFold2 model. Then in PyMOL align this to your model. Assuming you want to move your donor onto the model, the command <code>align donor, alphafold</code> may be too broad, so try <code>align donor and chain 👽 and resi 👹-👹, alphafold and resi 👾-👾</code>, where 👽 is the chain letter that is homologous, 👹-👹 is the range of PDB residue indices in the donor (see note from previous section) and 👾-👾 is the matching range in the model. If this does not work due to excessive sequence divergence, <code>cealign alphafold and resi 👾-👾, donor and chain 👽 and resi 👹-👹</code> (inverted syntax) will do it. Then make sure no chains that are to be migrated are chain A (if they are, do <code>alter donor and chain A, chain='X'</code> and then <code>sort</code>). Then do <code>create combo, alphafold or (donor and not chain 👽)</code> to make a new combined structure.<br /></div><h4 style="text-align: left;">Membrane</h4><div>A special case is adding membranes. This can be done in two ways:</div><div><ul style="text-align: left;"><li>adding an actual membrane. CHARMM GUI (<a href="https://www.charmm-gui.org">https://www.charmm-gui.org</a>/) is a great tool for input generation for MD and can add a custom-defined membrane.</li><li>adding dots as a membrane marker. 
The site OPM (orientation of protein in membrane, <a href="https://opm.phar.umich.edu/">https://opm.phar.umich.edu/</a>) is a great resource which contains membrane structures oriented with a membrane marked as two leaves of dummy atoms (DUM residue).</li></ul><div>The former is nice, but does make for a very heavy and rather tiresome to display model, the latter is nice and simple, but does have the issue that some viewers assume the dummy atoms are proximity bonded to the structure. Given the choice, I'd go with the latter.</div></div></div>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-61710465539967092292021-07-07T10:54:00.006-07:002021-09-26T05:04:44.727-07:00Per residue RMSD<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh0EJby0nwyPnPf8WEa4NVPhkG6M-bNRjzXXhWVAL-9_fzOHXl2JQiW4ENLDPh13Nggwjb1IYRE0y_WAgulJ-bxnTx445yUz2gsvhiR5dyqzh8ILIyoCZ7eCr8NB6asSdOL53mOfqaAcY/s1016/Screenshot+2021-07-07+at+19.28.12.png" style="display: block; padding: 1em 0; text-align: center; clear: right; float: right;"><img alt="" border="0" height="200" data-original-height="1016" data-original-width="782" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh0EJby0nwyPnPf8WEa4NVPhkG6M-bNRjzXXhWVAL-9_fzOHXl2JQiW4ENLDPh13Nggwjb1IYRE0y_WAgulJ-bxnTx445yUz2gsvhiR5dyqzh8ILIyoCZ7eCr8NB6asSdOL53mOfqaAcY/s200/Screenshot+2021-07-07+at+19.28.12.png"/></a></div><p style="text-align: left;">Recently I calculated the local RMSD caused by each residue and I thought I'd share the methods I used using PyRosetta —it is nothing at all novel, but I could not find a suitable implementation. 
The task is simple: given two poses, find out which residue's backbone changes the most by scanning along the sequence and comparing a short peptide window from each.</p><span><a name='more'></a></span><h4 style="text-align: left;"><span>Premise</span></h4><div><span>First, I should preface this by saying that this is the most basic approach and contact area difference (CAD) is actually preferred, but for some tasks it is nice to keep it simple.</span><h4 style="text-align: left;"><span>Model</span></h4></div><div>A key part of science is doing appropriate controls. A lovely control for this is the tryptophan-cage mini protein NMR structure (<a href="https://www.rcsb.org/structure/1L2Y" target="_blank">PDB:1L2Y</a>), because it is very small and has thirty-something poses.</div><div><br /></div><div>Loading a multi-model structure in PyRosetta is rather bizarre as it gets loaded as one pose (unless there is a secret function I am missing). So it needs to be split:</div>
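<pre><code>import pyrosetta
import pyrosetta.toolbox.rcsb
pyrosetta.init()  # required once per session before any pose is loaded
original = pyrosetta.toolbox.rcsb.pose_from_rcsb('1L2Y')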
mini = pyrosetta.rosetta.protocols.grafting.return_region(original, 1,20)
nmr = pyrosetta.rosetta.utility.vector1_core_pose_Pose()
n = [pyrosetta.rosetta.protocols.grafting.return_region(original, 1+i,20+i) for i in range(0, original.total_residue(),20)]
nmr.extend(n)</code></pre>
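<p>The slicing arithmetic above is easy to get off by one, so here is a minimal sketch of just the boundary calculation in plain Python (no PyRosetta needed). The function name <code>model_boundaries</code> is made up for this illustration, and the count of 38 models is an assumption about 1L2Y used only as an example.</p>

```python
def model_boundaries(total_residues, model_length):
    """Return 1-based, inclusive (start, end) pose-index pairs, one per model."""
    if total_residues % model_length != 0:
        raise ValueError('total length is not a whole number of models')
    return [(1 + offset, model_length + offset)
            for offset in range(0, total_residues, model_length)]

# 20 residues per model; 38 models is an assumed count used for illustration
bounds = model_boundaries(38 * 20, 20)
print(len(bounds), bounds[0], bounds[-1])
```

<p>Each pair can be passed as the start and end of <code>return_region</code>, which is what the list comprehension above does with <code>1+i</code> and <code>20+i</code>.</p>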
<h4 style="text-align: left;">RMSD</h4><p>RMSD (root mean square deviation) is a metric in Ångström of how much, on average, the atom positions differ between two conformations (<a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation" target="_blank">wiki</a>). If the comparison is between a conformation and the imaginary centroid of an ensemble, it is called RMSF (fluctuation). This is a very simple metric: it gives the most weight to atoms that change the most and does not take into account several things, such as functional groups, for which there are many "shapes-and-colours" metrics. But RMSD is simple and clear.</p><p>In <code>pyrosetta.rosetta.core.scoring</code> there are several functions that calculate different forms. Most do a superposition beforehand, but it is always wise to check.</p><p>One can check manually or simply do a test. So let's copy the first pose and translate it:</p>
<pre><code>import numpy as np
x, y, z = (10, 0, 0)
rototrans = np.array([[1, 0, 0, x], [0, 1, 0, y],[0, 0, 1, z],[0, 0, 0, 1]])
copy = nmr[1].clone()
copy.apply_transform(rototrans)
pyrosetta.rosetta.core.scoring.CA_rmsd(nmr[1], copy, 1, 3),\
pyrosetta.rosetta.core.scoring.CA_rmsd(nmr[1], copy),\
pyrosetta.rosetta.core.scoring.bb_rmsd_including_O(nmr[1], copy),\
pyrosetta.rosetta.core.scoring.all_atom_rmsd(nmr[1], copy),\
pyrosetta.rosetta.core.scoring.all_atom_rmsd_nosuper(nmr[1], copy)</code></pre>
<p>As hoped all bar the last give near zero values —if the rototranslation part seems gibberish, you may want to read up on 4x4 rototranslational matrices: the learning curve is very steep, but worth it.</p>
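<p>To see why the translated copy behaves that way, here is a minimal sketch in plain Python of RMSD with and without removing the translation. It is not what PyRosetta does internally: centring on the centroid only removes the translation, whereas a full superposition also fits a rotation. The coordinates are made up.</p>

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples, no superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError('coordinate lists differ in length')
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))

def centre(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(atom[axis] for atom in coords) / n for axis in range(3)]
    return [tuple(atom[axis] - centroid[axis] for axis in range(3)) for atom in coords]

# made-up coordinates, shifted by 10 along x like the pose above
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 1.0)]
shifted = [(x + 10.0, y, z) for x, y, z in coords]

print(rmsd(coords, shifted))                  # translation left in
print(rmsd(centre(coords), centre(shifted)))  # translation removed
```

<p>The first value is the size of the shift, which is what a "nosuper" function reports, while the second is zero within rounding.</p>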
<h4 style="text-align: left;">Simple</h4><div><p>With <code>CA_rmsd</code> one can specify a window and, if the poses have the same residues, the code is very simple. </p><br /><pre><code>def calculate_simple(ref:pyrosetta.Pose, target: pyrosetta.Pose, window: int= 3):
    assert (window - 1) % 2 == 0, 'Odd number windows only'
    half_window = int((window - 1)/2)
    # the first and last half_window residues have no full window: pad with NaN
    residue_rsmds = [float('nan')] * half_window
    for off_i in range(half_window, ref.total_residue() - half_window):
        i = off_i + 1  # pose indices are 1-based
        residue_rsmds.append(pyrosetta.rosetta.core.scoring.CA_rmsd(ref, target, i-half_window, i+half_window))
    residue_rsmds.extend([float('nan')] * half_window)
    return residue_rsmds</code></pre></div>
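<p>The windowing and NaN padding can be tried without PyRosetta. The following is a minimal sketch on plain coordinate lists, assuming one (x, y, z) tuple per residue (say, the Cα). Unlike <code>CA_rmsd</code>, it does no superposition of the window, so it only illustrates the bookkeeping. The name <code>window_rmsd</code> is made up.</p>

```python
import math

def window_rmsd(coords_a, coords_b, window=3):
    """Per-position RMSD over a centred odd window; ends are padded with NaN."""
    if window % 2 != 1:
        raise ValueError('Odd number windows only')
    if len(coords_a) != len(coords_b):
        raise ValueError('coordinate lists differ in length')
    half = (window - 1) // 2
    values = [float('nan')] * len(coords_a)
    for centre in range(half, len(coords_a) - half):
        squared = sum((a - b) ** 2
                      for i in range(centre - half, centre + half + 1)
                      for a, b in zip(coords_a[i], coords_b[i]))
        values[centre] = math.sqrt(squared / window)
    return values

# six made-up residues on a line; the fourth one is displaced by 3 along y
ref = [(float(i), 0.0, 0.0) for i in range(6)]
moved = [(x, 3.0, 0.0) if i == 3 else (x, 0.0, 0.0) for i, (x, _, _) in enumerate(ref)]
print(window_rmsd(ref, moved, window=3))
```

<p>A single displaced residue raises the value of every window that contains it, which is why larger windows give smoother but blurrier profiles.</p>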
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_05aFVNaxGd4xDMnXa1xPn4-rpJF63TF-hb8i38MqU3RuOBYMVKJ4bNCXNOx4g2ydRkaLRc03LLFrBKPFHIAA7qnu6y-LwBrr_AWv-J2PicqnE3O7hmFt8tZu4oJgVwDowioz-nBfwRs/s700/trpcage_wobble_3.jpg" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="700" height="229" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_05aFVNaxGd4xDMnXa1xPn4-rpJF63TF-hb8i38MqU3RuOBYMVKJ4bNCXNOx4g2ydRkaLRc03LLFrBKPFHIAA7qnu6y-LwBrr_AWv-J2PicqnE3O7hmFt8tZu4oJgVwDowioz-nBfwRs/w320-h229/trpcage_wobble_3.jpg" width="320" /></a>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQzHV3pdxbhk6AAgUTegXje-ev6a2hKbTJ-elttzI9UTv3vrzI90Duxi3LVwu83fMa0l27H37hjlpf3pIkrw0py6TE4P-dVp8XNANHV9cGV25tCJFSHB2Oco78DJFXLBlhLY6LkZp7H6s/s700/trpcage_wobble_windows.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="500" data-original-width="700" height="229" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQzHV3pdxbhk6AAgUTegXje-ev6a2hKbTJ-elttzI9UTv3vrzI90Duxi3LVwu83fMa0l27H37hjlpf3pIkrw0py6TE4P-dVp8XNANHV9cGV25tCJFSHB2Oco78DJFXLBlhLY6LkZp7H6s/w320-h229/trpcage_wobble_windows.jpg" width="320" /></a>
<p>Here I am comparing to the first model because, whereas my control is an ensemble, I am interested in comparing conformational switches between two or more poses. If I did care about an ensemble, I could use an averaged pose (generated with ProDy, say) to get the RMSF for each atom or residue.<br />Generally, the window size used is 5. But one could use a larger one.</p></div><h4 style="text-align: left;">Real world</h4><p>This is all nice, however, my conformers have different residues due to missing density and I would like all the backbone. So the code instantly becomes more complex, because the PDB residues need to be compared and one has to make sure only a given chain is compared. As a result, slicing the peptide out of the pose based on PDB numbering has to be done.</p><pre><code>from typing import *
def slice_for_comparison(pose:pyrosetta.Pose,
                         from_res: Tuple[int, str],
                         to_res: Tuple[int, str]) -> Union[None, pyrosetta.Pose]:
    """
    Returns None if there is missing density in the range or different chains
    """
    # convert PDB numbering to pose numbering
    from_resi, from_chain = from_res
    to_resi, to_chain = to_res
    pdb2pose = pose.pdb_info().pdb2pose
    from_r = pdb2pose(res=from_resi, chain=from_chain)
    to_r = pdb2pose(res=to_resi, chain=to_chain)
    # validity checks.
    if from_chain != to_chain:  # different chain. user's doing...
        return None
    elif from_r==0 or to_r == 0:  # missing density
        return None
    elif to_r - from_r != to_resi - from_resi:  # gap
        return None
    else:  # slice
        return pyrosetta.Pose(pose, from_r, to_r)

def calculate_window(ref:pyrosetta.Pose,
                     target: pyrosetta.Pose,
                     from_res: Tuple[int, str],
                     to_res: Tuple[int, str],
                     rmsd_fx: Callable= pyrosetta.rosetta.core.scoring.CA_rmsd
                     ):
    sliced_ref = slice_for_comparison(ref, from_res, to_res)
    sliced_target = slice_for_comparison(target, from_res, to_res)
    # only CA_rmsd accepts a residue range, hence the slicing
    if sliced_ref is None or sliced_target is None:
        return float('nan')
    else:
        return rmsd_fx(sliced_ref, sliced_target)

def calculate(ref:pyrosetta.Pose,
              target: pyrosetta.Pose,
              chain: str='A',
              window: int= 3,
              rmsd_fx: Callable= pyrosetta.rosetta.core.scoring.CA_rmsd):
    assert (window - 1) % 2 == 0, 'Odd number windows only'
    half_window = int((window - 1)/2)
    # get the boundaries of the pose residue indices making that chain.
    ref_pdb = ref.pdb_info().number
    chain_id = pyrosetta.rosetta.core.pose.get_chain_id_from_chain(chain, ref)
    from_resi = ref_pdb(ref.chain_begin(chain_id))
    to_resi = ref_pdb(ref.chain_end(chain_id))
    # calculate, keyed by the PDB number of the central residue
    residue_rsmds = {i: calculate_window(ref,
                                         target,
                                         (i-half_window, chain),
                                         (i+half_window, chain),
                                         rmsd_fx) for i in range(from_resi+half_window, to_resi-half_window+1)}
    return residue_rsmds</code></pre><p>But despite the complexity caused by the slicing per PDB numbering, the method is nice and simple really...</p><h4 style="text-align: left;">NGL</h4><p>As an added bonus, one can use NGL to show multiple poses, as seen in a <a href="http://blog.matteoferla.com/2021/02/multiple-poses-in-nglview.html">previous blog post</a>.</p><pre><code>import nglview as nv
from io import StringIO
import time
class ModNGLWidget(nv.NGLWidget):
    def add_rosetta(self, pose: pyrosetta.Pose):
        # dump the pose as a PDB block into an in-memory buffer
        buffer = pyrosetta.rosetta.std.stringbuf()
        pose.dump_pdb(pyrosetta.rosetta.std.ostream(buffer))
        fh = StringIO(buffer.str())
        # add to this widget, not to a global `view`
        self.add_component(fh, ext='pdb')
view = ModNGLWidget()
view.add_rosetta(nmr[1])
view.add_rosetta(nmr[21])
view.component_0.update_cartoon(color='turquoise')
view.component_1.update_cartoon(color='salmon')
view.component_0.add_hyperball(selection='15-18', colorValue='turquoise')
view.component_1.add_hyperball(selection='15-18', colorValue='salmon')
view</code></pre>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0tag:blogger.com,1999:blog-9015174234871442237.post-2179173940099831932021-04-26T11:17:00.005-07:002021-09-26T05:04:57.983-07:00Remodel in Pyrosetta<p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdgqwSxFK40-074WnKERdMasz1_8QMHIkSNMDx1cAYoMBLzmxKRsIFXjFzTfYjxKeJopGzZpArs7aChoXKChR43bj9rZ1m-qYutNpbrHv5GMql1IJ3Lhbw-mSuOnPenUGfS8NYI3_Ko7Y/s2048/IMG_2932.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="1536" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdgqwSxFK40-074WnKERdMasz1_8QMHIkSNMDx1cAYoMBLzmxKRsIFXjFzTfYjxKeJopGzZpArs7aChoXKChR43bj9rZ1m-qYutNpbrHv5GMql1IJ3Lhbw-mSuOnPenUGfS8NYI3_Ko7Y/s320/IMG_2932.jpg" width="320" /></a></div><br />The Rosetta binary Remodel is a great tool as it allows interesting designs to be made. However, it is rather incompatible with Rosetta Scripts and PyRosetta as it is heavily dependent on command line options for customisation and repeats some of the processes internally. Despite this, it can be coerced rather effectively to work in PyRosetta with some convenience and this is how.
<p></p><a name='more'></a><p></p>
<p>Remodel is a tool that modifies a model based on a blueprint file, allowing residues to be added, removed and whole swathes of secondary structure changed. Using it is a bit of an art, but it is very powerful. For smaller tasks, such as designing a better neighbourhood of sidechains for a ligand, FastRelax (with appropriate packers) will work fine, but to design radical backbone changes Remodel is still the go-to application.</p>
<p>The mover that is called by the binary is, unsurprisingly, <code>pyrosetta.rosetta.protocols.forge.remodel.RemodelMover()</code>. Like other movers, it has the bound method <code>.apply(pose)</code> that accepts a pose to work on.
It has three issues that need to be circumvented:</p>
<ul>
<li>command line options</li>
<li>writing a blueprint</li>
<li>PDB vs. pose numbering</li>
</ul>
<h3>Helping hand</h3>
<p>To make life easier, I have written a python class to make blueprint files, <code>Blueprinter</code>. It and a few more things can be found <a href="https://github.com/matteoferla/pyrosetta_help">here</a>.</p>
<pre><code>pip3 install pyrosetta-help
</code></pre>
<p>In Python3</p>
<pre><code>from pyrosetta_help import Blueprinter
</code></pre>
<p>The major advantage of this class is that it helps write the blueprint file, which is normally done by editing a file by hand.
Namely, given a pose one can do the following:</p>
<pre><code>blue = Blueprinter.from_pose(pose)
blue[20:22] = 'PIKAA G' # change residues 20, 21 and 22 (inclusive) to Glycine
blue.wobble_span(20,25) # remodel keeping original amino acid
del blue[15:20] # requires preceding and succeeding residues to be 'wobbled' though.
blue.del_span(15, 20) # same as above, but wobbles the preceding and succeeding 1 residues
blue[22] = 'PIKAA W'
blue.pick_native(21)
blue.pick_native(23)
blue.mutate(22, 'W') # same as above, but wobbling adjacent residues.
blue.expand_loop_wobble()
blue.set('mutant.blu')
rm = blue.get_remodelmover(dr_cycles=5, max_linear_chainbreak=0.1)
rm.apply(pose)
blue.show_aligned(pose) # jupyter notebook (requires Biopython)
# if all goes wrong there is `blue.correct_and_relax(pose)`
</code></pre>
<p>I should say that "wobble" is my made-up term for remodelling the residues adjacent to an indel or mutation while keeping their original amino acids.</p>
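<p>To make the wobble idea concrete, here is a minimal sketch in plain Python that writes blueprint lines for point mutations. It is not how <code>Blueprinter</code> is implemented. The function name <code>blueprint_lines</code> is hypothetical, the line layout (pose index, residue, then either <code>.</code> or a secondary structure letter plus <code>PIKAA</code>) is assumed from typical Remodel blueprints, and the default letter <code>H</code> is only a placeholder.</p>

```python
def blueprint_lines(sequence, mutations, ss='H'):
    """
    sequence: one-letter string; mutations: dict of 1-based pose index to new residue.
    Mutated positions and their immediate neighbours get a PIKAA line,
    every other position is left untouched ('.').
    """
    valid = range(1, len(sequence) + 1)
    picks = {}
    for index in mutations:
        if index not in valid:
            raise ValueError(f'index {index} is outside the sequence')
        for neighbour in (index - 1, index + 1):
            if neighbour in valid:
                # "wobble": rebuild the neighbour but keep its original residue
                picks.setdefault(neighbour, sequence[neighbour - 1])
    picks.update(mutations)  # a mutation takes precedence over a wobble
    lines = []
    for index, residue in enumerate(sequence, start=1):
        if index in picks:
            lines.append(f'{index} {residue} {ss} PIKAA {picks[index]}')
        else:
            lines.append(f'{index} {residue} .')
    return lines

# made-up heptapeptide with the glutamine at pose index 5 mutated to glutamate
print('\n'.join(blueprint_lines('NLYIQWL', {5: 'E'})))
```

<p>The neighbours at indices 4 and 6 are picked as themselves, which is the wobble, while the remaining residues are left alone.</p>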
<h3>Command line options</h3>
<p>A regular remodel mover command is</p>
<pre><code>rm = pyrosetta.rosetta.protocols.forge.remodel.RemodelMover()
# ...
rm.register_options()
rm.apply(pose)
</code></pre>
<p>At the start of the <code>apply</code> call, various command line options are read, including the blueprint file, resulting in an instance of a <code>pyrosetta.rosetta.protocols.forge.remodel.RemodelData</code> that guides the design.
Unfortunately, whereas it is theoretically possible to make this object, there is no way to supply it to the mover and the <code>apply</code> method is a rather monolithic piece so rewriting with a few PyRosetta calls isn't an option. So one is stuck with command line options.
Likewise, all the classes in the <a href="https://graylab.jhu.edu/PyRosetta.documentation/pyrosetta.rosetta.protocols.forge.remodel.html">documentation</a> are called by apply, so as far as I know none are useful.
To set a command line option one calls:</p>
<pre><code>pyrosetta.rosetta.basic.options.set_boolean_option('remodel:design:find_neighbors', value)
pyrosetta.rosetta.basic.options.set_file_option('remodel:blueprint', filename)
pyrosetta.rosetta.basic.options.set_string_option('remodel:generic_aa', value)
</code></pre>
<p>Here the function is the one from the options module/namespace (<a href="https://graylab.jhu.edu/PyRosetta.documentation/pyrosetta.rosetta.basic.options.html">docs</a>) that matches the argument type, and the first parameter is the option name as seen in the <a href="https://new.rosettacommons.org/docs/latest/full-options-list#remodel">list of Rosetta commands</a>. It's both a getter (no second parameter) and a setter (the second parameter is the value to be passed). Do remember that if an option is set after instantiation, one must call the <code>.register_options()</code> method of the instance.</p>
<pre><code>rm = pyrosetta.rosetta.protocols.forge.remodel.RemodelMover()
pyrosetta.rosetta.basic.options.set_file_option('remodel:blueprint', filename)
rm.register_options()
</code></pre>
<p>This is a bit simpler with my <code>Blueprinter</code> class:</p>
<pre><code>blue = Blueprinter.from_pose(pose)
blue.blueprint = 'mutant.blu'
blue.find_neighbors = True
blue.generic_aa = 'G'
blue.quick_and_dirty = True
</code></pre>
<h3>PDB residue numbering</h3>
<p>In Rosetta there are two residue numberings. In the pose, each residue gets a sequential index, starting from 1 and without skipped residues. This is a totally sensible system as it makes working with the various arrays straightforward. RDKit does the same, albeit starting from 0. Additionally, there is the PDB residue info. This reflects missing residues, different chains, segments and (shudder) insertion codes. RDKit and PyRosetta store this as a separate object, albeit differently (atom level and pose level).</p>
<pre><code>r = pose.pdb_info().pdb2pose(res=18, chain='A')
print(pose.residue(r).name3())
</code></pre>
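<p>The relationship between the two numberings can be mimicked with a toy class in plain Python. This is only a sketch of the idea, not PyRosetta's <code>PDBInfo</code>: the class name <code>NumberingMap</code> and its methods are made up, except that <code>pdb2pose</code> is assumed to return 0 for an absent residue, as the checks in my RMSD post rely on.</p>

```python
class NumberingMap:
    """Toy stand-in: pose index (1-based, gapless) versus PDB number plus chain."""

    def __init__(self, pdb_residues):
        # pdb_residues: ordered (pdb_number, chain) pairs for residues with coordinates
        self._pdb = list(pdb_residues)
        self._pose = {key: i for i, key in enumerate(self._pdb, start=1)}

    def pdb2pose(self, res, chain):
        """Pose index of a PDB residue, or 0 if it is absent (e.g. missing density)."""
        return self._pose.get((res, chain), 0)

    def pose2pdb(self, index):
        """(pdb_number, chain) of a pose index."""
        if index not in range(1, len(self._pdb) + 1):
            raise IndexError(f'pose index {index} is out of range')
        return self._pdb[index - 1]

    def is_contiguous(self, from_res, to_res, chain):
        """True if both ends exist and no residue is missing between them."""
        start, stop = self.pdb2pose(from_res, chain), self.pdb2pose(to_res, chain)
        return start != 0 and stop != 0 and stop - start == to_res - from_res

# made-up chain A: PDB residues 10-14, a gap, then 18-20
numbering = NumberingMap([(n, 'A') for n in list(range(10, 15)) + list(range(18, 21))])
print(numbering.pdb2pose(18, 'A'), numbering.pdb2pose(16, 'A'))
```

<p>Residue 18 ends up as pose index 6 because the three missing residues do not get an index, which is exactly why a PDB number cannot be used where a pose index is expected.</p>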
<p>Generally, in PyRosetta, an integer is a pose index and a string is a PDB residue number+chain (<code>\d+\w</code>). In a few inputs, for example constraints, an isolated number is interpreted as a pose index and a number plus letter is a PDB index. This is without spaces, so different from <code>pose.pdb_info().pose2pdb(r)</code> output, which has white spaces for insertion codes and segment ids. Parenthetically, I personally like the NGL viewer's selection algebra strings, where <code>12:A.NZ</code> would be atom <code>NZ</code> of residue <code>12</code> of chain <code>A</code> and <code>[LYS]12^C:A.NZ%B/0</code> would be the same atom of a lysine with insertion code <code>C</code> and alternate location <code>B</code> in model <code>0</code>, so I end up writing functions to interconvert from this glorious shorthand format.</p>
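<p>As an illustration of that kind of interconversion, here is a minimal sketch in plain Python. The three function names are hypothetical, only the simple number+chain case is covered (no insertion codes or segments), and the <code>pose2pdb</code>-style input is assumed to be a whitespace-separated number and chain.</p>

```python
import re

def parse_compact(label):
    """Split a compact PDB label such as '18A' into (18, 'A')."""
    match = re.fullmatch(r'(\d+)(\w)', label.strip())
    if match is None:
        raise ValueError(f'{label!r} is not a number followed by a chain letter')
    return int(match.group(1)), match.group(2)

def to_ngl(label, atom=None):
    """Turn '18A' into the NGL selection '18:A', optionally with an atom name."""
    resi, chain = parse_compact(label)
    selection = f'{resi}:{chain}'
    return selection if atom is None else f'{selection}.{atom}'

def from_pose2pdb(text):
    """Collapse a whitespace-separated 'number chain' string into the compact form."""
    number, chain = text.split()[:2]
    return f'{number}{chain}'

print(to_ngl(from_pose2pdb('12 A '), atom='NZ'))
```

<p>Keeping the parsing in one place means a stricter pattern can be swapped in later without touching the callers.</p>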
<p>Unfortunately, Remodel works slightly differently and really cannot handle PDB indices. For example, given a pose with the first residue as 10, in the blueprint file, to mutate Q15E, the following, in pose numbering, works:</p>
<pre><code>3 Y .
4 I H PIKAA I
5 Q H PIKAA E
6 W H PIKAA W
7 L .
</code></pre>
<p>while in PDB numbering</p>
<pre><code>13A Y .
14A I H PIKAA I
15A Q H PIKAA E
16A W H PIKAA W
17A L .
</code></pre>
<p>has the very curious effect of correctly applying Q15E, but also changes pose index 15 to the generic amino acid (valine below).</p>
<pre><code>NLYIEWLKDGGPCSGRPPPSZ
....*........***.....
NLYIQWLKDGGPCVVVPPPSZ
</code></pre>
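<p>The middle row of a comparison like the one above is simple to generate. A minimal sketch in plain Python, assuming two sequences of equal length with no gaps (the function name <code>difference_row</code> is made up):</p>

```python
def difference_row(first, second, same='.', different='*'):
    """Return a marker string with one character per position of the two sequences."""
    if len(first) != len(second):
        raise ValueError('sequences must be the same length (gaps are not handled)')
    return ''.join(same if a == b else different for a, b in zip(first, second))

top = 'NLYIEWLKDGGPCSGRPPPSZ'
bottom = 'NLYIQWLKDGGPCVVVPPPSZ'
print(top)
print(difference_row(top, bottom))
print(bottom)
```

<p>For indels a proper pairwise alignment would be needed instead, which is why <code>blue.show_aligned(pose)</code> requires Biopython.</p>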
<p>Actually, if the mutated residue did not have a pose number, the <code>ResfileReaderException</code> would be raised when the mutated line is tokenised.</p>
<pre><code>File: /Volumes/MacintoshHD3/benchmark/W.fujii.release/rosetta.Fujii.release/_commits_/main/source/src/core/pack/task/ResfileReader.cc:1577
On line 3, the pose does not have residue with chain=A, PDBnum=8.
</code></pre>
<p>This issue also plays a part in N-terminal deletions, where the <code>ResfileReaderException</code> is raised even if pose numbering is provided. However, <code>RemodelMover</code> cannot do terminal deletions anyway (vide infra).</p>
<p>Consequently, blanking the PDB info is required. Confusingly, setting <code>.obsolete(True)</code> on the PDBInfo does not work (yet it is set after Remodel), so a new PDBInfo may need passing:</p>
<pre><code>pdb_mover = pyrosetta.rosetta.protocols.simple_moves.AddPDBInfoMover()
pdb_mover.apply(pose)
</code></pre>
<p>As a result, the <code>Blueprinter</code> uses pose numbering and <code>AddPDBInfoMover</code> is assumed to be called on the pose. One can play with pdb numbering thusly (not recommended):</p>
<pre><code>with open('mut.blu', 'w') as w:
w.write(blue.to_pdb_str(pose))
blue.blueprint = 'mut.blu
</code></pre>
<p>Once remodel is done, it sets the obsolete flag of the PDBInfo to True.
This means that if one does <code>pose.dump_pdb(filename)</code> the residues will be renumbered so that pose and PDB numbers match. However, if there was a third connection (a LINK record) it will have the obsoleted numbering, which is a bug. To undo this, <code>pose.pdb_info().obsolete(False)</code> works if the pose was not reset before the remodel. Blueprinter has a wee function to help restore the original numbering.</p>
<pre><code>mutant = original_pose.clone()
pdb_mover = pyrosetta.rosetta.protocols.simple_moves.AddPDBInfoMover()
pdb_mover.apply(mutant)
remodel_mover = blue.get_remodelmover()
remodel_mover.apply(mutant)
blue.copy_pdb_info(original_pose, mutant)
</code></pre>
<h3>Other observations</h3>
<ul>
<li>Ligands and other chains are tolerated (omitted from blueprint). There is a <code>RemodelLigandHandler</code>, which I have not used, but I assume it might play a role if the ligand were part of the remodel.</li>
<li>Unlike with the binary, terminal deletions are not applied (a glitch, I assume). However, these residues can be deleted with <code>pyrosetta.rosetta.protocols.grafting.delete_region(pose, start, stop)</code> or <code>pose.delete_residue_slow(1)</code></li>
<li>For an indel it makes sense to remodel the adjacent residues to themselves (else how would the gap be closed?), but with mutations this is a more confusing requirement. However, most often it is compulsory.</li>
<li>To mutate to a non-canonical amino acid, the blueprint file used to be able to contain the line <code>EMPTY NC XXX</code> where XXX is a three letter code. However, this has recently changed and is no longer usable (see <a href="https://www.rosettacommons.org/node/11134">this forum question</a>).</li>
<li><code>NATAA</code> is meant to choose the native amino acid. However, this results in the generic amino acid being chosen. <code>PIKAA X</code>, with X being the residue's own amino acid, is the way forward. For the Blueprinter, the <code>.pick_native(res)</code> command does this.</li>
<li>The generic amino acid is valine. This is great for structured elements. However, glycine is a better choice for loops.</li>
<li>If remodel failed to converge you get a structure with the generic amino acids and potentially a gap that needs closing (see post about <a href="http://blog.matteoferla.com/2020/07/filling-missing-loops-proper-way.html">loop fixing</a>). The blueprinter method <code>.correct_and_relax(pose)</code> circumvents it in a pinch, but a correct blueprint is a better solution.</li>
<li>Always run a positive control for a remodel (say, when modelling an indel with a wobbled span, also run a few models with the same wobbled span).</li>
<li>The angles and bond lengths may not be great, so it may be wise to minimise with the cartesian scorefunction.</li>
</ul>
<p>Namely:</p>
<pre><code>scorefxn = pyrosetta.create_score_function('ref2015_cart')
cycles = 15
relax = pyrosetta.rosetta.protocols.relax.FastRelax(scorefxn, cycles)
relax.cartesian(True)
relax.minimize_bond_angles(True)
relax.minimize_bond_lengths(True)
relax.apply(pose)
</code></pre>Matteo Ferlahttp://www.blogger.com/profile/04090452288769979595noreply@blogger.com0