Wednesday, 4 September 2019

PDB numbering rollercoaster

Residue positions in a crystal structure and in the protein sequence rarely match. In fact, there are four different start-end ranges to keep track of:
  • position in whole protein,
  • position in extracted sequence,
  • position in residues stated in the PDB/mmCIF structure and
  • position which actually has coordinates.
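
As a rough illustration of how the four ranges relate, here is a minimal sketch. Every number in it is hypothetical, and it assumes the simplest case, where each numbering differs from the next by a constant offset (real files with insertion codes or renumbered segments will not be this tidy).

```python
# Hypothetical construct: residues 25-180 of the full-length protein,
# numbered from 1 in the PDB file, with the termini and one loop unresolved.
PROTEIN_START = 25   # position in the whole protein of the first construct residue
PDB_START = 1        # residue number the PDB file assigns to that same residue
RESOLVED = set(range(4, 60)) | set(range(71, 150))  # PDB numbers with coordinates


def protein_to_sequence_index(protein_pos: int) -> int:
    """1-based position within the extracted sequence."""
    return protein_pos - PROTEIN_START + 1


def protein_to_pdb(protein_pos: int) -> int:
    """Residue number as stated in the PDB file (constant offset assumed)."""
    return protein_pos - PROTEIN_START + PDB_START


def has_coordinates(protein_pos: int) -> bool:
    """True if the residue was actually modelled in the structure."""
    return protein_to_pdb(protein_pos) in RESOLVED


print(protein_to_pdb(30), has_coordinates(30), has_coordinates(25))
```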

Thursday, 8 August 2019

Jupyter notebook progressbar

I have this rather handy wee piece of code I'd like to share: a Jupyter notebook Progress bar.

Saturday, 3 August 2019

When will the PDB run out of 4-letter codes?

The PDB ids are really nice and short: 4-letter codes. But when will all the combinations run out? Actually, not for a long, long time.
The current total is 155,618 structures and new ones are added at a rate of 12,000 structures per year, which means that, assuming constant growth, the PDB will run out of codes to allocate in about 127 years: (36^4 - 155,618) / 12,000.
That is 2146, a few years after the setting of Kim Stanley Robinson's New York 2140, where New York is a flooded super-Venice, so I am guessing the RCSB PDB, in San Diego, will have long been flooded and a lack of 4-letter codes will not be top of their concerns.
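
A quick check of the arithmetic, assuming that every one of the 36^4 alphanumeric combinations is usable and that the deposition rate stays flat at 12,000 per year (with these inputs the expression comes to roughly 127 years):

```python
import string

# Assumption: all four positions can hold any digit or uppercase letter.
alphabet = string.digits + string.ascii_uppercase   # 36 symbols
total_codes = len(alphabet) ** 4                    # 1,679,616

deposited = 155_618   # structures in the PDB at the time of writing (2019)
per_year = 12_000     # assumed constant deposition rate

years_left = (total_codes - deposited) / per_year
print(f"{total_codes:,} codes, about {years_left:.0f} years left, "
      f"running out around {2019 + round(years_left)}")
```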

Tuesday, 2 July 2019

Wikipedia datamining

There are several online sites that can be data-mined to reveal really nice trends, top-10s and top-down summaries. Twitter is the archetype site for this, thanks to hashtags making it an easy job for anyone wanting to investigate trends. I prefer Reddit for datamining specific trends as it is powered by folk having arguments on topics they are passionate about, as opposed to the ideas of celebrities, corporate spokespeople and ФСБ agents. eBay is also fun as it reveals what people are willing to pay for things. But the best source of data, even for other datasets, is Wikipedia. Not only to read up on things, but also to get data for things within a given "category".
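
One way to list the pages in a category is the MediaWiki API's `categorymembers` query. The following is a minimal sketch rather than the approach used in the post itself: the function names and the `Enzymes` example category are made up for illustration, and the fetch step needs network access plus a descriptive User-Agent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint


def category_query_url(category, limit=50, cmcontinue=None):
    """Build the query URL for one page of category members."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue  # token for the next page of results
    return f"{API}?{urlencode(params)}"


def titles_from_response(payload):
    """Extract the page titles from a decoded API response."""
    return [m["title"] for m in payload.get("query", {}).get("categorymembers", [])]


def fetch_category_titles(category):
    """Follow continuation tokens until the whole category is listed."""
    titles, token = [], None
    while True:
        request = Request(category_query_url(category, cmcontinue=token),
                          headers={"User-Agent": "category-sketch/0.1 (example)"})
        with urlopen(request) as response:
            payload = json.load(response)
        titles += titles_from_response(payload)
        token = payload.get("continue", {}).get("cmcontinue")
        if token is None:
            return titles
```

Calling `fetch_category_titles("Enzymes")` would return the titles of the pages and subcategories directly in that category; walking into the subcategories is left out of the sketch.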

Friday, 28 June 2019

Exporting Jupyter notebooks with Plotly graphs

If it is a small project or analysis, I opt for a Jupyter notebook rather than an IDE such as PyCharm, which is great for large projects, but not so much for a small analyse-as-you-go project. Plotly is my go-to for graphs (I proselytise about it). The advantage is that it is a wrapper for a JS library, which allows interactivity. However, on my system at least, using the plotly.offline.iplot plotter, when I export the notebook as HTML an error is thrown due to require not being set up correctly. This is easily fixed.

Friday, 31 May 2019

A note on the PyMOL 1.8 C01 atom oddity

This weird bug has been haunting me for ages. The PyMOL 1.8 (not 2) builder creates residues with a Cα called C01 as opposed to CA. If any operation is done to these (e.g. Rosetta Relax), they will be discarded during the reading of the file. That is, they will not be fixed and, worse, if Rosetta Remodel is used, it will assume that the residue never existed because, annoyingly, Remodel does not understand PDB numbering. Simply substituting all 'C01' to 'CA' fixes the problem.
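
The substitution can be sketched in a few lines of Python. This version assumes a standard fixed-column PDB file, where the atom name sits in columns 13-16, and it only touches ATOM records so that a ligand atom legitimately named C01 in a HETATM record is left alone. The function name is made up for illustration.

```python
def rename_c01_to_ca(pdb_text: str) -> str:
    """Rename atoms called C01 to CA in the ATOM records of a PDB file."""
    fixed = []
    for line in pdb_text.splitlines(keepends=True):
        # Columns 13-16 (string indices 12 to 15) hold the atom name.
        if line.startswith("ATOM") and line[12:16].strip() == "C01":
            line = line[:12] + " CA " + line[16:]
        fixed.append(line)
    return "".join(fixed)
```

A typical use would be to read the file written by PyMOL, pass its contents through `rename_c01_to_ca` and write the result to a new file before handing it to Rosetta.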

Thursday, 16 May 2019

The secondary metabolism of pineberry strawberries

For an upcoming open day we will extract DNA from strawberries. For this I made a slide that explains how DNA mutations lead to protein variants, which in turn lead to different phenotypes (redness in the strawberry's case). In doing this, I got fascinated by a strawberry cultivar called "Pineberry". Not because it is unpigmented, but because the reviews online say it is bland, which means that a rather early enzyme in the pathway is missing, resulting in both an unpigmented and a bland phenotype.