The art of blowing up protein

Wednesday, 28 October 2015

Pseudomonas, Ralstonia and Burkholderia: HGT buddies

Recently I was reminded of an interesting thing that intrigued me: Ralstonia and Burkholderia were formerly classed as Pseudomonas spp. and now are from different proteobacterial classes, but they share several close genes. One hypothesis that can be made is that a series of horizontal gene transfer events occurred at some point and the phenotypes of the three species became so close that they were mistakenly grouped together.

I came across the quirk while doing research for a paper from Dr. Monica Gerth and Prof. Paul Rainey:

Gerth ML, Ferla MP, Rainey PB. The origin and ecological significance of multiple branches for histidine utilization in Pseudomonas aeruginosa PAO1. Environ Microbiol. 2012 Aug;14(8):1929-40. doi: 10.1111/j.1462-2920.2011.02691.x. Epub 2012 Jan 9. PubMed PMID: 22225844.

The Ralstonia and Burkholderia are in different families within the Burkholderiales (Betaproteobacteria), while the Pseudomonas is in the Gammaproteobacteria. The taxonomic genus Burkholderia was coined when 7 Pseudomonas spp. were moved in 1993, while Ralstonia was created when two Burkholderia spp. were moved in 1996. The two are not sister genera.
In addition to the hut operon these three species are phylogenetically close to each other for many protein.

Using the Darkhorse server set to genus level detail, the top hits for Pseudomonas aeruginosa are

Azotobacter vinelandii, a bona fide pseudomonad, but given its own genus for silly reasons —okay, technically it is the Pseudomonas genus should be split.
Ralstonia metallidurans
Bermanella marisrubri
Chromohalobacter salexigens
Burkholderia xenovorans
Burkholderia thailandensis

With Pseudomonas fluorescens B. xenovorans is in third place. With Pseudomonas putida it's in second place, while R. metallidurans in fourth. While Ralstonia and Burkholderia pick each other up. The results for Ralstonia eutropha indicate that 2/3 of its large genome are from Cupriavidus taiwanensis (sister species of Burkholderia), which in reality means that there is some phylogenetic issue afoot.
Nevertheless, it does not explain the pseudomonad link, which is probably because the ancestors of Pseudomonas and of Burkholderia/Ralstonia got to know each other well and as a result today we have:

Burkholderia spp. have two chromosomes and lots of plasmids, R. eutropha has over six and half thousand genes and pseudomonads are gene collectors too.
They all have weird relationship with their sister genera or families.
The seem to have similar lifestyles.
The share lots of genes
They were mistaken as pseudomonads morphologically.

This is just a mix of speculation and quick checks, but there is nevertheless a link. Which is not only curiosity, it finds data in mistakes from the past and potentially tells of dangers that may assail genome-concatenation studies.

Wednesday, 14 October 2015

A note about serving static files with Python's wsgi

The best way to learn how to use python for the web is to make something simple in Flask, revisit the tutorial on how to set stuff up with more than a single file and then graduate to a framework like Pyramid or Django.
Starting vanilla and using wsgi is a different matter as one gets bogged down in stupid things and is not actually didactic. One issue I had was serving static files.

KEGG map colo(u)rs

KEGG has a handy map colouring feature, where one can feed it a list of KEGG IDs from whatever database of theirs and a colour, which it will use to colour the image of the pathway. KOBAS and other severs offer similar features, but this is the original.
Two things are frustrating. One is the classic strangeness of the maps that seem to have everything and more, except when you want a certain reaction, which inexplicably is not annotated on that map. The other are the colours, which are annoying but strangely addictive.
Basically, the colours can be RGB codes or names. The former scheme is fine, except the latter is more human readable and trying to figure out what is okay or not is rather addictive. Colours, or more correctly colors as the web is written in US English, can be described in many ways.
The code system ("hex triplet") is formed of three consecutive hexadecimal double-digit numerals prefixed with a hashtag (e.g. #c3a0d2), where each pair corresponds to Red, Green or Blue (RGB). KEGG accepts these fine.
In HTML and CSS colors have proper names —the most official source of the exact names of colours is Pantone, but this comes close. KEGG implements some single word ones, e.g. gray —not grey: an HTML mistake I always make—, red, blue, green, yellow, orange and gainsboro —a handy colour for padded div boxes, which seems like a mispelt Gainsborough—. Puce is not an HTML colour, so I am not surprised it does not like it, however, aqua is but it does not like it. It also accepts some of the two word names, like GreenYellow, but not others, like DarkGray. Funnily the capitalisation seems to matter: lower-case only for single word names, while CamelCase for the compound ones —HTML and CSS are case insensitive. "Brewer colors" can be excluded as dgray or lgray don't seem to work either. This all seems to point to the fact that some poor chap had to write the dictionary server side to covert the names to hex=triplets
One curious thing is that several names collapse into one colour. lavender is the same as gainsboro and lightgray.
All in all the hex-triplets and the smattering of names fulfil all needs, but the strange quirkiness is fun to unpick...
Here is a wee series of Reaction identifiers to colour down the glycolysis pathway if you want to try your luck at colour picking:


R00959 gray
R02740 purple
R02738 GreenYellow
R09084 orange
R04779 green
R01070 red
R01061 gray
R01512 gainsboro
R01518 #ddeedd

Tuesday, 6 October 2015

PrettyFastaJS

I wrote a small script to allow one to easily and prettily embed both nucleic acid and protein FASTA sequences on a webpage (it guesses based on EFILPQ which are not degerate bases, but residues). It was a mix of "I need this" and "I want to play with colours". The latter was rather painful. But the result is a JS modified and CSS coloured fasta files.
The files can be found in my GitHub repository for PrettyFastaJS and so are the explanations of how to use it. As a demo, here are two curious sequences. Two consecutive genes in the same operon that encode each a full length glycine dehydrogenases, which in other species is a homodimer, which suggests this may be a cool heterodimer.

>Gthg02641 Glycine dehydrogenase (EC 1.4.4.2) [Geobacillus thermoglucosidasius TM242] MLKNWGIAMHKDQPLIFELSKPGRIGYSLPELDVPAVRVEEVVPADYIRTEEPELPEVSELDIMRHYTALSKRNHGVDSGFYPLGSCTMKYNPKINENVARLAGFAHIHPLQPEETVQGALELMYDLQEHLKEITGMDAVTLQPAAGAHGEWTGLMMIRAYHEANGDFQRTKVIVPDSAHGTNPASATVAGFETITVKSTEDGLVDLEDLKRVVGPDTAALMLTNPNTLGLFEENILEMAKIVHDAGGKLYYDGANLNAVLSKARPGDMGFDVVHLNLHKTFTGPHGGGGPGSGPVGVKADLIPFLPKPVVEKGENGYYLDDDRPQSIGRVKPFYGNFAINVRAYTYIRSMGPEGLKAVAEYAVLNANYMMRRLAEYYDLPYDRHCKHEFVLSGKRQKKLGVRTLDIAKRLLDFGFHPPTVYFPLIVEECMMIEPTETESKETLDAFIDAMIQIAKEAEENPEIVQEAPHTTVVKRLDETKAARKPILRYQKQ >Gthg02642 Glycine dehydrogenase (EC 1.4.4.2) [Geobacillus thermoglucosidasius TM242] MLHRYLPMTEEDKREMLNVIGVDSIDDLFADIPESVRFRGELNIKKAKSEVELWKELSALAAKNADVKKYTSFLGAGVYDHYIPAIVDHVISRSEFYTAYTPYQPEISQGELQAIFEFQTMICELTGMDVANSSMYDGGTALAEAMLLSAAHTKKKKVLLAKTVHPEYRGVVKTYAKGQRLHVVEIPFQHGVTDLEALKAEMDDEVACVIVQYPNFFGQIEPLKDIEPVAHSCKSMFVVASNPLALGILAPPGKFGADIVVGDVQPFGIPMQFGGPHCGYFAVKAELMRKIPGRLVGQTTDEEGRRGFVLTLQAREQHIRRDKATSNICSNQALNALAASVAMTALGKKGIKEMATMNIQKAQYAKNELVKHGFAVPFTGPFFNEFVVCLAKPVAEVNKQLLRKGIIGGYDVGRDYPELQNHMLIAVTELRTKEEIDMFVKELGDCHA

Monday, 28 September 2015

Pythonic spinner

Python is fun: it has lovely libraries, is a beauty to type and there are constant surprises — I only recently found out that 3.4 had introduced defaultdict() (collections library), which is phenomenal. With the web there are three options:

It can be used on the server-side on the web with the wsgi library or the Danjo framework. Open-shift is a great fremium script hosting service —I have used it here for example.
There are also some attempts to make JS parse python in the browser, namely Skupt and Brython, but as they convert the python code into JS code, so they are not amazingly fast (1), but you are showing the world python code.
One can transpile python to javascript, which is faster and less buggy, but that is unethical as you'd be serving JS and not a python script —CoffeeScript gets a lot of bad rep for that reason.

Thanks to CSS3 there are a lot of cool spinners out there to mark code that is loading, but none cater for python users. Therefore I made my own spinner icon, specifically: .
The code is hosted in my dropbox: https://github.com/matteoferla/Pyspinner.

(1) I tried Brython and liked that it had a mighty comprehensive series of libraries and that it had DOM interactions similar to JQuery. However, I could not get over the fact that, for me at least, changes to DOM elements were not committed until the code finished or crashed —which brought back bad Perl memories. Also the lack of CSS changes and the nightmare of binding functions to events makes me think I might try other options.

Publication-driven complexification

"Any sufficiently advanced technology is indistinguishable from magic."

—A. Clarke's Third Law

I like knowing what I am doing, it helps me figure out if something is wrong. However, there is a trend for accepted operations to become "mathemagical" either out of necessity... or out of publication race.

RNASeq is often cheered for being absolute reads and thus requiring less normalisation than microarrays. Anyone that has tried to understand Loess or Lowess takes that as a blessing.
However, the situation is not that simple as different samples have different total amounts of DNA and, due to publications trying to out do the other it quickly gets complicated, where acceptant nodding is the best option lest one want to spend an evening reading a method's section and the appendix on a paper.
An arithmetic average is discouraged as the highly expressed genes will wreak havoc with other genes. On those lines, Mortazavi et al. 2008 (PMID: 18516045) introduced a per mille CDS-length normalised scale, which was called RPKM —the letters don't mean much and feel like a weird backronym, which I don't get, although had I been them I would have made people have to use the ‰ glyph. The underlying logic is clear and for a while it was popular. RPKM got ousted by Anders and Huber 2010 (PMID:20979621) and Robinson et al. 2010 (PMID:19910308) with DESeq and edgeR, which do many calculations. The normalisation in DESeq is done by taking the median for each sample of each of the ratios of the gene count for that samples over the geometric mean of the values of that genes across the samples. That makes sense, albeit cumbersome to grasp. For the test of significance, the improvement on a negative binomial GLM spans a page or two and works by magic, or more correctly mathemagic, maths that is so advanced it might as well be magic —Google shows the word is taken by some weird mail order learn maths course or somesuch, but shhh! The sequel to DESeq, DESeq2 does everything automatically and in 3 lines everything is done. It is really good although I do like to know what I am doing.
Long story aside, I did in parallel the analysis with DESeq methods in MatLab, using their agonising tutorial, which was tedious, but good to see how it everything works —to quote Richard Feynman: "what I cannot create I cannot understand". It wants a true waste of time as it made me realise what graphs would be helpful as the various protocols and co. try and out do each in terms of graph complexity. It really bugged me that an obvious graph, double-log plot of each replicate against its counterpart with a Pearson's ρ thrown in, was omitted everywhere in favour of graphs, such as the empirical cumulative function of the distribution of the variance against the χ-squared of the variance estimates —a great way of inspecting if there are any oddities in a distribution, but really not the simplest graph. It is used because it is sophisticated, which doesn't necessarily mean better (incorrect assumptions can go a long way in distorting data), but it means more publishable.
Curiously, once the alignment, normalisation and significance steps are done, it is cowboy territory especially for bacteria. For example, for bacterial operon composition there is either a program, Rockhopper, which does not like my genome (I had to submit for realignment each replicon on different runs to get operons for the plasmids and refuses to do the chromosome) and another paper where the supplementary zip file was labelled .docx and the docx was labelled zip, which goes to indicate the scarce interest.
The main use of differential expression is functional enrichment, which depends on annotations that are rough as nails...

I am in awe at the mathematics involved in the calculation, but I cannot help, but feel annoyed at the fact that they are embarrassingly more sophisticated than they ought to be, especially since the complexity gives somewhat marginal improvements and the created dataset will be badly mishandled with clustering based on wildy guessed functions.

Saturday, 12 September 2015

Unnatural amino acid biosynthesis

In the synthetic biology experiments with an expanded genetic code the biosynthesis of unnatural amino acid is not taken into consideration as the system is rather rickety and the amino acids unusual.

Similar amino acids

A lot of introduced amino acids are dramatically different from the standard set. However, there are several amino acids that never made it to the final version of the genetic code as they are subtly different from the canonical amino acids and must have been too hard for the high promiscuous primordial systems to differentiate. However, subtle differences would be useful for finetuning. Examples include aminobutyric acid (homoalanine), norvaline and norleucine, allo-threonine ("allonine"), ornithine or aminoadipate. If there was a way to introduce them and increase the fidelity it would be hugely beneficial. It would probably result in enzymes with higher fidelity and catalytic efficiency.
That Nature itself failed back then does not necessary mean that scientists would fail with modern metabolism. The main drawback is a selection system. Current approached to recoding rely on the new amino acid as fill in as opposed to something that makes the E. coli addicted to it. The latter would mean that the system could evolve to better handle the new amino acids. Phage with an unnatural amino acid have higher fitness (Hammerling et al., 2014). Unnatural RNA display (Josephson et al., 2005) could be used to generate a protein that is evolved to require the non-canonical amino acid as nearby residues are evolved to best suit that enzyme, if one really wanted to all that trouble. Alternatively and less reliably, GFP with different residues does behave differently and position 65 could handle stuff like homoalanine, but the properties between S65A or S65V are not too different. So if a good and simple selection method were present it could be doable.

Novel amino acid biosynthesis

This leaves with the biosynthesis of the novel amino acids, which is the main focus here. There are many possible amino acids to choose from, and a good source of information for that is the wikipedia article non-proteinogenenic amino acids — I (reticently) wrote many years ago to sort out the mess that there was, but I subsequently left to the elements and it has become a bit cluttered like an unselected psuedogene. Why certain amino acids made it while other did not is discussed in a great paper by Weber and Miller in 1981.
Most of the amino acids that nature can make would just make structural variants. Furthermore, mechanistic diversity mostly comes from cofactors (metals, PLP, biotin, thiamine, MoCo, FeS clusters etc.), which is a more sensible solution given that there only one per certain type of enzyme. Some amino acids that are supplemented Nature cannot make with ease, in particular chlorination and fluorination reactions in Nature can be counted with one hand.

Homoserine, homocysteine, ornithine and aminoadipate.

E. coli already makes homoserine for methionine and threonine biosynthesis and homocysteine via homoserine for methionine. Ornithine is from arginine biosynthesis and was kicked out of the genetic code by it. Gram positive bacteria, which do not require diaminopimelate, make lysine via aminoadipate ("homoglutamate").

Homoalanine.

The simplest novel amino acid is homoalanine (aminobutyrate). 2-ketobutyrate is produced during isoleucine biosynthesis, which if it were transaminated it would produce homoalanine. Therefore it is likely that the branched chain transaminase probably must go to some effort to not produce homoalanine (forbidden reaction). This amino acid is found in meteorites and is really simple, but its similarity to alanine, hence why it must have lost out.

Norvaline, norleucine and homonorleucine.

In the Weber and Miller paper the presence of branched chain amino acids was mentioned as potentially a result of the frozen accident. From a biochemical point of view, the synthesis of valine and isoleucine from pyruvate + pyruvate and ketobutyrate + pyruvate follows a simple pattern (decarboxylative aldol condensation, reduction, dehydration, transamination). The leucine branch is slightly different and is actually a duplication of some of the TCA cycle enzymes (Jensen 1976), specifcially it condenses ketoisovalerate (valine sans amine) and acetyl-CoA, dehydrates, rehydrates, reduces, decarboxylates and transaminates. If the latter route is used with ketobutyrate and acetyl-CoA one would get norvaline, with ketopentanoate (norvaline sans amine) and acetyl-CoA one would get norleucine. These biosynthetic pathway have actually been studied in the 80s. The interesting thing is that it can be used further making homonorleucine. Straight chain amino acids make sturdier protein, so there is a definite benefit there. The reason why norleucine is not in the genetic code is that it was evicted by methionine, which finds an additional use in SAM cofactor (Ferla and Patrick, 2014). Parenthetically, as a result the AUA isoleucine codon is unusal —it is also a rare codon (0.4%) in E. coli— as to avoid methionine has an unusual tRNA (ileX), which would make an easy target for recoding.

Allonine.

Threonine synthase is the sole determinant of the chirality of threonine's second centre.

Aminophenylanine.

Chorismate is rearranged to prephenate and then oxidatively decarboxylated and transaminated to make tyrosine, while the hydroxyl group of chorisate is swapped for amine for folate biosynthesis. If the product of the latter followed the tyrosine pathway one would aminophenylalanine.

Rethinking phenylalanine.

While on the topic, the whole route for phenylalanine biosynthesis is odd. It feels like an evolutionary remnant. If one were to draw up phenylalanine biosynthesis without knowing about the shikimate/chorismate pathway the solution would be different. If I were to design the phenylalanine pathway I would start with a tetraketide (terminal acetyl-CoA derived; 2,4,6,8-tetraoxononanoyl-CoA), cyclise (Aldol addition of C9 in enol form to C4 ketone), two rounds of reduction (6-oxo and 8-oxo) and three dehydrations (4,6,8-hydroxyl), followed by a transamination (2-oxo): phenylalanine by polyketide synthesis!

Naphthylalanine.

Phenylalanine has a single aromatic ring, naphthylalanine has two. Using the logic for the rethought phenylalanine synthesis we get a synthesis by hexaketide (2,4,6,8,10,12-hexaoxotridecanoyl-CoA). Namely cyclise (Aldol addition of C11 in enol form to C6 ketone and C13 to C4), two rounds of reduction (8-oxo and 10 or 12-oxo) and five dehydrations (4,6,8,10,12-hydroxyl), followed by a transamination (2-oxo). Naphthylalanine has been added to the genetic code (Wang et al., 2002), but was added as a supplement. GFP doesn’t work with either form of naphthylalanine (Kajihara et al., 2005), therefore there isn’t a good selection marker where the amino acid itself is beneficial. So probably the worst sketched pathway to possibly make.

Other aromatic compounds.

Secondary metabolites are often made by polyketide or isoprenoid biosynthesis, which are rather flexible so some extra compounds could be drawn up.

LEGO style.

All amino acid backbones are make in different ways and only cysteine, selenocysteine, homocysteine and tryptophan operate by a join-side-chain-on-with-backbone approach. Tryptophan synthase does this trick by aromatic electophilic substitution, where the electrophile is phosphopyridoxyl-dehydroalanine (serine on PLP after hydroxyl has left), while the sulfur/seleno amino acids are by the similar Micheal addition. So this trick could be extended to other aromatic, carbanions and enols, if one really wanted to, but that if far from a nice one size fits all approach.