Monday 28 September 2015

Pythonic spinner

Python is fun: it has lovely libraries, is a beauty to type and there are constant surprises. I only recently found out about defaultdict() (collections library, in the standard library since 2.5), which is phenomenal. On the web there are three options:

  • It can be used server-side with a WSGI library or the Django framework. OpenShift is a great freemium script hosting service; I have used it here, for example.
  • There are also some attempts to make JS parse Python in the browser, namely Skulpt and Brython. As they convert the Python code into JS code, they are not amazingly fast (1), but you are showing the world Python code.
  • One can transpile Python to JavaScript, which is faster and less buggy, but that is unethical as you'd be serving JS and not a Python script; CoffeeScript gets a lot of bad rep for that reason.
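
As a rough illustration of the first option, here is a minimal WSGI sketch that uses only the standard library. The names app, hits and call are made up for this example, and the per-path hit counter is just an excuse to show defaultdict; it is not how any particular host expects an application to be laid out.

```python
from collections import defaultdict
from wsgiref.util import setup_testing_defaults

# Hits per path; defaultdict(int) starts every unseen key at 0.
hits = defaultdict(int)


def app(environ, start_response):
    """Minimal WSGI application: report how often a path was requested."""
    path = environ.get("PATH_INFO", "/")
    hits[path] += 1
    body = f"{path} requested {hits[path]} time(s)\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call(path):
    """Hypothetical helper: call the app directly, without a server."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys WSGI requires
    environ["PATH_INFO"] = path
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], b"".join(chunks).decode("utf-8")
```

To try it locally, wsgiref.simple_server.make_server("", 8000, app).serve_forever() serves the same app on port 8000.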

Thanks to CSS3 there are a lot of cool spinners out there to mark content that is loading, but none cater for Python users. Therefore I made my own spinner icon.
The code is hosted in my dropbox:
<link href="" rel="stylesheet"></link>
<span class=pyspinner></span>

(1) I tried Brython and liked that it had a mighty comprehensive series of libraries and that it had DOM interactions similar to jQuery. However, I could not get over the fact that, for me at least, changes to DOM elements were not committed until the code finished or crashed, which brought back bad Perl memories. Also, the lack of CSS changes and the nightmare of binding functions to events make me think I might try other options.

Publication-driven complexification

"Any sufficiently advanced technology is indistinguishable from magic."
(Arthur C. Clarke's Third Law)
I like knowing what I am doing: it helps me figure out if something is wrong. However, there is a trend for accepted operations to become "mathemagical", either out of necessity... or out of a publication race.

RNA-Seq is often cheered for giving absolute read counts and thus requiring less normalisation than microarrays. Anyone who has tried to understand Loess or Lowess takes that as a blessing.
However, the situation is not that simple, as different samples have different total amounts of DNA and, with publications trying to outdo each other, it quickly gets complicated, to the point where acceptant nodding is the best option lest one wants to spend an evening reading the methods section and the appendix of a paper.
An arithmetic average is discouraged as the highly expressed genes will wreak havoc with other genes. Along those lines, Mortazavi et al. 2008 (PMID: 18516045) introduced a per mille CDS-length normalised scale, which was called RPKM (reads per kilobase per million mapped reads). The letters feel like a weird backronym, which I don't get, although had I been them I would have made people use the ‰ glyph. The underlying logic is clear and for a while it was popular. RPKM got ousted by Anders and Huber 2010 (PMID:20979621) and Robinson et al. 2010 (PMID:19910308) with DESeq and edgeR, which do many calculations. The normalisation in DESeq is done by taking, for each sample, the median across genes of the ratio of that sample's count for a gene over the geometric mean of that gene's counts across the samples. That makes sense, albeit cumbersome to grasp. For the test of significance, the improvement on a negative binomial GLM spans a page or two and works by magic, or more correctly mathemagic: maths that is so advanced it might as well be magic (Google shows the word is taken by some weird mail-order learn-maths course or somesuch, but shhh!). The sequel to DESeq, DESeq2, does everything automatically and in three lines everything is done. It is really good, although I do like to know what I am doing.
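
To make the two normalisations above less mathemagical, here is a plain-Python sketch of both. The function names and the toy layout (a list of genes, each a list of counts per sample) are my own assumptions, and skipping genes with a zero count is a simplification to keep the geometric mean defined; this is not the DESeq code itself.

```python
import math
from statistics import median


def rpkm(reads, cds_length_bp, total_mapped_reads):
    """Reads per kilobase of CDS per million mapped reads."""
    return reads * 1e9 / (cds_length_bp * total_mapped_reads)


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped (assumption made
    here so that the geometric mean is never zero).
    """
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        # Geometric mean of this gene across samples, computed via logs.
        geo_mean = math.exp(sum(math.log(c) for c in gene) / n_samples)
        for j, c in enumerate(gene):
            ratios[j].append(c / geo_mean)
    # The size factor of a sample is the median of its ratios.
    return [median(r) for r in ratios]


# Toy data: the second sample was sequenced exactly twice as deeply.
toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(toy)
normalised = [[c / f for c, f in zip(gene, factors)] for gene in toy]
```

Dividing each count by its sample's size factor puts the two toy samples on the same scale, which is all the normalisation step is meant to achieve.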
Long story aside, I did the analysis in parallel with the DESeq methods in MATLAB, using their agonising tutorial, which was tedious, but good to see how everything works. To quote Richard Feynman: "What I cannot create, I do not understand". It wasn't a true waste of time as it made me realise what graphs would be helpful, as the various protocols and co. try to outdo each other in terms of graph complexity. It really bugged me that an obvious graph, a double-log plot of each replicate against its counterpart with a Pearson's ρ thrown in, was omitted everywhere in favour of graphs such as the empirical cumulative distribution function of the variance against the χ-squared of the variance estimates: a great way of inspecting if there are any oddities in a distribution, but really not the simplest graph. It is used because it is sophisticated, which doesn't necessarily mean better (incorrect assumptions can go a long way in distorting data), but it means more publishable.
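
The number behind the "obvious graph" takes only a few lines. This sketch computes the Pearson correlation of two replicates on a log scale; the function name and the pseudocount of 1 (so that zero counts survive the logarithm) are my own choices, and the plotting itself is left out to keep it dependency-free.

```python
import math


def log_pearson(rep_a, rep_b, pseudocount=1.0):
    """Pearson correlation of two replicates after log10(count + pseudocount)."""
    xs = [math.log10(a + pseudocount) for a in rep_a]
    ys = [math.log10(b + pseudocount) for b in rep_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

The transformed xs and ys are also exactly what one would hand to a scatter plot to get the double-log graph.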
Curiously, once the alignment, normalisation and significance steps are done, it is cowboy territory, especially for bacteria. For example, for bacterial operon composition there is either a program, Rockhopper, which does not like my genome (I had to submit each replicon for realignment on separate runs to get operons for the plasmids, and it refuses to do the chromosome), or a paper where the supplementary zip file was labelled .docx and the docx was labelled zip, which goes to indicate the scarce interest.
The main use of differential expression is functional enrichment, which depends on annotations that are rough as nails...

I am in awe of the mathematics involved in the calculation, but I cannot help feeling annoyed that it is embarrassingly more sophisticated than it ought to be, especially since the complexity gives somewhat marginal improvements and the created dataset will be badly mishandled with clustering based on wildly guessed functions.

Saturday 12 September 2015

Unnatural amino acid biosynthesis

In synthetic biology experiments with an expanded genetic code, the biosynthesis of the unnatural amino acids is not taken into consideration as the system is rather rickety and the amino acids unusual.

Similar amino acids

A lot of introduced amino acids are dramatically different from the standard set. However, there are several amino acids that never made it to the final version of the genetic code as they are only subtly different from the canonical amino acids and must have been too hard for the highly promiscuous primordial systems to differentiate. Yet subtle differences would be useful for fine-tuning. Examples include aminobutyric acid (homoalanine), norvaline and norleucine, allo-threonine ("allonine"), ornithine or aminoadipate. If there was a way to introduce them and increase the fidelity it would be hugely beneficial. It would probably result in enzymes with higher fidelity and catalytic efficiency.
That Nature itself failed back then does not necessarily mean that scientists would fail with modern metabolism. The main drawback is the lack of a selection system. Current approaches to recoding rely on the new amino acid as a fill-in as opposed to something that makes the E. coli addicted to it. The latter would mean that the system could evolve to better handle the new amino acids. Phage with an unnatural amino acid have higher fitness (Hammerling et al., 2014). Unnatural RNA display (Josephson et al., 2005) could be used to generate a protein that is evolved to require the non-canonical amino acid, as nearby residues are evolved to best suit that enzyme, if one really wanted to go to all that trouble. Alternatively and less reliably, GFP with different residues does behave differently and position 65 could handle stuff like homoalanine, but the properties of S65A and S65V are not too different. So if a good and simple selection method were present it could be doable.

Novel amino acid biosynthesis

This leaves us with the biosynthesis of the novel amino acids, which is the main focus here. There are many possible amino acids to choose from, and a good source of information for that is the Wikipedia article on non-proteinogenic amino acids, which I (reluctantly) wrote many years ago to sort out the mess that there was, but subsequently left to the elements, so it has become a bit cluttered, like an unselected pseudogene. Why certain amino acids made it while others did not is discussed in a great paper by Weber and Miller in 1981.
Most of the amino acids that Nature can make would just be structural variants. Furthermore, mechanistic diversity mostly comes from cofactors (metals, PLP, biotin, thiamine, MoCo, FeS clusters etc.), which is a more sensible solution given that there is only one per certain type of enzyme. Some amino acids that are supplemented Nature cannot make with ease; in particular, chlorination and fluorination reactions in Nature can be counted on one hand.

Homoserine, homocysteine, ornithine and aminoadipate.

E. coli already makes homoserine for methionine and threonine biosynthesis and homocysteine via homoserine for methionine. Ornithine is from arginine biosynthesis and was kicked out of the genetic code by it. Gram positive bacteria, which do not require diaminopimelate, make lysine via aminoadipate ("homoglutamate").


The simplest novel amino acid is homoalanine (aminobutyrate). 2-Ketobutyrate is produced during isoleucine biosynthesis and, were it transaminated, it would produce homoalanine. Therefore it is likely that the branched-chain transaminase must go to some effort not to produce homoalanine (forbidden reaction). This amino acid is found in meteorites and is really simple, but it is very similar to alanine, which must be why it lost out.

Norvaline, norleucine and homonorleucine.

In the Weber and Miller paper the presence of branched-chain amino acids was mentioned as potentially a result of the frozen accident. From a biochemical point of view, the synthesis of valine and isoleucine from pyruvate + pyruvate and ketobutyrate + pyruvate follows a simple pattern (decarboxylative aldol condensation, reduction, dehydration, transamination). The leucine branch is slightly different and is actually a duplication of some of the TCA cycle enzymes (Jensen 1976): specifically, it condenses ketoisovalerate (valine sans amine) and acetyl-CoA, dehydrates, rehydrates, reduces, decarboxylates and transaminates. If the latter route is used with ketobutyrate and acetyl-CoA one would get norvaline; with ketopentanoate (norvaline sans amine) and acetyl-CoA one would get norleucine. These biosynthetic pathways were actually studied in the 80s. The interesting thing is that the route can be used further, making homonorleucine. Straight-chain amino acids make sturdier proteins, so there is a definite benefit there. The reason why norleucine is not in the genetic code is that it was evicted by methionine, which finds an additional use in the SAM cofactor (Ferla and Patrick, 2014). Parenthetically, as a result the AUA isoleucine codon is unusual (it is also a rare codon, 0.4%, in E. coli): to avoid being read as methionine it has an unusual tRNA (ileX), which would make it an easy target for recoding.
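
The bookkeeping of that elongation can be sketched as follows: each round of the leucine-type route adds two carbons from acetyl-CoA and loses one as CO2, so the straight-chain keto acid grows by one carbon per round. The names and the lookup table are mine and purely illustrative; only carbons are counted, not the actual chemistry.

```python
# Straight-chain 2-amino acids by number of carbons (after transamination
# of the corresponding 2-keto acid).
AMINO_ACID_BY_CARBONS = {
    4: "homoalanine",
    5: "norvaline",
    6: "norleucine",
    7: "homonorleucine",
}


def elongate(keto_acid_carbons):
    """One round: condense with acetyl-CoA (+2 C), then decarboxylate (-1 C)."""
    return keto_acid_carbons + 2 - 1


def products(start_carbons=4, rounds=3):
    """Amino acids obtained by transaminating after each elongation round.

    start_carbons=4 is 2-ketobutyrate, the isoleucine precursor.
    """
    carbons = start_carbons
    names = []
    for _ in range(rounds):
        carbons = elongate(carbons)
        names.append(AMINO_ACID_BY_CARBONS[carbons])
    return names
```

Starting from 2-ketobutyrate, three rounds give norvaline, norleucine and homonorleucine in turn.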


Threonine synthase is the sole determinant of the chirality of threonine's second centre.


Chorismate is rearranged to prephenate and then oxidatively decarboxylated and transaminated to make tyrosine, while the hydroxyl group of chorismate is swapped for an amine for folate biosynthesis. If the product of the latter followed the tyrosine pathway one would get aminophenylalanine.

Rethinking phenylalanine.

While on the topic, the whole route for phenylalanine biosynthesis is odd. It feels like an evolutionary remnant. If one were to draw up phenylalanine biosynthesis without knowing about the shikimate/chorismate pathway the solution would be different. If I were to design the phenylalanine pathway I would start with a tetraketide (terminal acetyl-CoA derived; 2,4,6,8-tetraoxononanoyl-CoA), cyclise it (aldol addition of C9 in enol form to the C4 ketone), follow with two rounds of reduction (6-oxo and 8-oxo) and three dehydrations (4,6,8-hydroxyl), and finish with a transamination (2-oxo): phenylalanine by polyketide synthesis!


Phenylalanine has a single aromatic ring, naphthylalanine has two. Using the logic for the rethought phenylalanine synthesis we get a synthesis from a hexaketide (2,4,6,8,10,12-hexaoxotridecanoyl-CoA): namely, cyclise (aldol addition of C11 in enol form to the C6 ketone and of C13 to C4), two rounds of reduction (8-oxo and 10 or 12-oxo) and five dehydrations (4,6,8,10,12-hydroxyl), followed by a transamination (2-oxo). Naphthylalanine has been added to the genetic code (Wang et al., 2002), but was added as a supplement. GFP doesn’t work with either form of naphthylalanine (Kajihara et al., 2005), therefore there isn’t a good selection marker where the amino acid itself is beneficial. So this is probably the worst of the sketched pathways to actually try to make.
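
A quick sanity check on the numbering in the two polyketide sketches: each proposed aldol bond should close a six-membered ring. The helper below treats the chain as a graph of carbons numbered from C1 (the CoA end) and reports the size of the ring a new bond would close. The function name and the graph representation are my own; it checks connectivity only and says nothing about whether the chemistry is feasible.

```python
from collections import deque


def ring_size(chain_length, existing_bonds, new_bond):
    """Atoms in the smallest ring closed by new_bond on a linear carbon chain."""
    neighbours = {i: set() for i in range(1, chain_length + 1)}
    # Backbone bonds C1-C2, C2-C3, ...
    for i in range(1, chain_length):
        neighbours[i].add(i + 1)
        neighbours[i + 1].add(i)
    # Ring-closing bonds formed in earlier cyclisation steps.
    for a, b in existing_bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start, goal = new_bond
    # Breadth-first search for the shortest existing path between the two ends.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # A path of n bonds plus the new bond closes a ring of n + 1 atoms.
            return distance[node] + 1
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    raise ValueError("the two carbons are not connected")


phe_ring = ring_size(9, [], (9, 4))                # tetraketide, C9 to C4
nap_first = ring_size(13, [], (11, 6))             # hexaketide, C11 to C6
nap_second = ring_size(13, [(11, 6)], (13, 4))     # then C13 to C4
```

All three come out as six-membered rings, with the second naphthyl ring relying on the C6 to C11 bond already being in place.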

Other aromatic compounds.

Secondary metabolites are often made by polyketide or isoprenoid biosynthesis, which are rather flexible so some extra compounds could be drawn up.

LEGO style.

All amino acid backbones are made in different ways and only cysteine, selenocysteine, homocysteine and tryptophan operate by a join-side-chain-on-with-backbone approach. Tryptophan synthase does this trick by aromatic electrophilic substitution, where the electrophile is phosphopyridoxyl-dehydroalanine (serine on PLP after the hydroxyl has left), while the sulfur/seleno amino acids are made by the similar Michael addition. So this trick could be extended to other aromatics, carbanions and enols, if one really wanted to, but that is far from a nice one-size-fits-all approach.