Saturday, 5 December 2015

The future of enzymology?

EDIT: I called it! Turns out this was much closer to reality than I thought and a paper came out doing exactly this.
Assaying enzymes is rather laborious and, even though the data quality is much higher, it cannot compete in productivity with other fields of biochemistry and genetics. So I gave some thought to where enzymology will be in the future, and I have come to the conclusion that in vitro assays will for the most part be replaced by in vivo omics methods, but not any time soon, as both proteomics and metabolomics need to come a long way, along with systems biology modelling algorithms.

Uncompetitively laborious

Everyone who has assayed enzymes will tell you that a single table in a paper took years and years of assays. They will tell you horror stories: the enzyme did not express solubly, the substrate took months to make, the detection required a list of coupled enzymes, or the activity was so low that everything had to be meticulously calibrated and assayed individually. Personally, I had to assay Thermotoga maritima MetC at 37°C because for one reaction the indicator decomposed at 50°C, while for another activity the mesophilic coupled enzymes would otherwise denature. All while comparing it to homologues from Wolbachia and Pelagibacter ubique, which had to be monitored by kinetic assay (as they melted if you looked at them) and individually, as the Wolbachia MetC had a turnover of 0.01 s⁻¹ (vestigial activity; cf. my thesis). And I was lucky, as I did not have substrates that were unobtainable, unstable and bore acronyms as names.
The data from enzyme assays is really handy, but the question is how will it fare after 2020?
The D&D 3.5 expression "linear fighter, quadratic wizard", which encapsulates the problem that with level progression wizards left fighters behind, seems rather apt, as systems biology and synthetic biology seem to be just steaming ahead (quadratically), leaving enzymology behind.


Crystallography is another biochemical discipline that requires sweat and blood. But with automation, new technologies and a change of focus (top down), it is keeping up with the omics world.
Enzymology I feel isn't. There is no such field as enzymonics  —only a company that sells enzymes, Google informs me.
A genome-wide high-throughput protein expression and then crystallographic screen may work for crystallography, but it would not work for enzymology as each enzyme has its own substrates and the product detection would be a nightmare.
This leads me to a brief parenthesis: the curious case of Biolog plates in microbiology. They are really handy as they are a panel of 96-well plates with different substrates and toxins. These phenotype "micro"arrays are terribly underutilised, because each plate inexplicably costs $50-100. Assuming that someone made "EC array plates" where each well tested an EC reaction a similar or worse problem would arise.
That is fine, as a set of EC plates would be impossible to make: to work, each well would need a lyophilised thermophilic enzyme that was evolved to generate a detectable change (e.g. NADH or something better still) for a specific product, in order to do away with complex chains of coupled enzymes that may interfere with the reaction in question, along with the substrate, which is often unstable. Not to mention that EC numbers are rather frayed around the edges. I think the most emphatic example is the fact that the reduction of acetaldehyde to ethanol was the first reaction described (giving us the word enzyme) and has its own EC number, while the reduction of butanal to butanol is EC 1.1.1.– (as in, no number at present).
Therefore, screening cannot work in the same format as crystallography.

Parallel enzymology

Some enzyme assays are easy, especially for central metabolism. The enzymes are fast, the substrates purchasable and the reaction product detectable. As a result, with the reduction of gene synthesis costs (currently cheaper than buying the host bug from ATCC and way less disappointing than emailing authors), panels of homologues for central enzymes can be tested with ease. There are some papers that are starting to do that and I am sure that more will follow. That is really cool; however, it is the weird enzymes that interest scientists the most.

In silico modelling

Even if it would seem like a straightforward thing, it is currently near impossible to determine in silico the substrate of an enzyme, or the kinetic parameters given an enzyme structure and its substrate. Protein structure predictions are poor at best and in silico docking to find the substrates is not always reliable, although a few papers have found the correct substrate starting from crystal structures of the enzymes. Predicting the kinetic parameters requires computationally very heavy quantum-mechanical molecular dynamics simulations and the result would be an approximation at best. What is worse is that all these programs, from Autodock to Gaussian, are challenging to use, not because they present cerebral challenges, but because they are simply very buggy. Furthermore, the picture would be only partial.

Deconvoluted in vivo data

Genetic engineering, metabolomics and proteomics might come to the rescue.
Currently, metabolomics is more hipster avant-garde than mainstream. The best way to estimate the intracellular concentration of something in the micromolar range is to get the Michaelis constant of the enzyme that uses it  —Go enzymology!—. But it is just a matter of time before one can detect even nanomolar compounds arising from spontaneous degradation or promiscuous reactions —"dark metabolome" if you really wanted to coin a word for it and write a paper about it.
Also, currently, flux balance analysis can be constrained with omics data (in order of quality: transcriptomics, proteomics and metabolomics data). Even if the latter two datasets were decent, systems biology models would need to come a long way before one could estimate, from a range of conditions, a rough guess of the kinetic parameters of all enzymes in the genome. The current models are not flexible or adaptive: one builds a model and the computer finds the best fitting equation, and to do that they require fancy solvers. Then again, the data is lacking there, and the models are not as CPU heavy as phylogeny or MD simulations. Consequently, they are poor benchmarks: if perfect proteomics and metabolomics data were available, it would take Matlab milliseconds to find the reaction velocity (and as a consequence the catalytic efficiency) of all the enzymes in the model. Add a second condition (say a different carbon source or a knockout) and, yes, one could get better guesstimates, but issues would pop up, like negative catalytic efficiencies. The catch is that some enzymes are inhibited, others are impostors in the model, and other unmarked enzymes catalysing those reactions may result in subpar fits. Each enzyme may be inhibited by one or more of more than five thousand proteins or small compounds in a variety of fashions, and any enzyme may catalyse the same reaction secretly.
The maths would get combinatorially crazy quite quickly, but constraints and weights could be added, such as previously obtained kinetic data, the similarity of Michaelis constant and substrate concentration, or even extrapolation from known turnover rates for known reactions of that subclass.
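To make the one-condition versus two-condition point concrete, here is a minimal sketch for a single enzyme, assuming plain Michaelis-Menten kinetics (v = kcat·[E]·[S] / (Km + [S])). The function names and all the numbers are made up for illustration and the units are arbitrary but consistent; a real model would fit thousands of enzymes at once with a proper solver.

```python
def catalytic_efficiency(v, enzyme, substrate):
    """Apparent kcat/Km from a single condition (only valid if [S] << Km)."""
    return v / (enzyme * substrate)


def fit_two_conditions(cond_a, cond_b):
    """Solve v = kcat*E*S / (Km + S) for kcat and Km from two conditions.

    Each condition is a (v, E, S) tuple. Rearranged, the equation is linear
    in the unknowns: kcat*(E*S) - Km*v = v*S, so two conditions give a
    2x2 linear system, solved here with Cramer's rule.
    """
    (v1, e1, s1), (v2, e2, s2) = cond_a, cond_b
    a1, b1, c1 = e1 * s1, -v1, v1 * s1
    a2, b2, c2 = e2 * s2, -v2, v2 * s2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the two conditions are not independent")
    kcat = (c1 * b2 - c2 * b1) / det
    km = (a1 * c2 - a2 * c1) / det
    return kcat, km


# Hypothetical data: (velocity, enzyme concentration, substrate concentration).
glucose = (5.0, 1.0, 100.0)
acetate_clean = (15.0, 2.0, 300.0)      # consistent with kcat = 10, Km = 100
acetate_inhibited = (4.0, 2.0, 300.0)   # same enzyme, but secretly inhibited

clean = fit_two_conditions(glucose, acetate_clean)
inhibited = fit_two_conditions(glucose, acetate_inhibited)
print("clean fit (kcat, Km):", clean)
print("inhibited fit (kcat, Km):", inhibited)
```

With the consistent pair the fit recovers the parameters used to generate the data, whereas the unmodelled inhibition in the second pair pushes the fitted Km below zero, which is the kind of nonsensical value that would flag an enzyme for closer inspection.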
Questioning gene annotation would open up a whole new bag of worms as unfortunately genome annotation does not have a standardised "certainty score" —it would be diabolically hard to devise as annotations travel like Chinese whispers—, so every gene would have to be equally likely to be cast in doubt, unless actual empirical data were used. So in essence it would have to be a highly interconnected process, reminiscent of the idealistic vision of systems biology.
Nevertheless, despite the technical challenges, it is possible that with a superb heuristic model-adapting algorithm and near-perfect omics profiles under different conditions, pretty decent kinetic parameters for all the enzymes in a cell could be obtained, also giving a list of genes to verify by other means. When such a scenario will be mainstream is anyone's guess; mine is within the next ten to fifteen years.

Saturday, 28 November 2015

ABS biosynthesis

Lego, rumour has it, wants to biosynthesise acrylonitrile butadiene styrene (ABS), the resin that gives their blocks their firm hold and transgenerational lifespan. This is cool for three reasons:
  1. metabolic engineering is cool by definition,
  2. Lego is cool by definition and
  3. one or two steps link back to a cool gene I found in Geobacillus


So what might they do to biosynthesise their resin? The processes are rather straightforward and one has to go out of one's way to dream up a cool route. In fact, there is a lot of repetition.
The three monomers for the polymerisation are styrene, acrylonitrile and butadiene. These would be made separately. But there are several commonalities, such as the terminal ene group.
There are a few ways to get a terminal ene group:
  1. Have a 2,3-ene and tautomerise it
  2. Have a 2,3-ene and terminal carboxyl and eliminate the carboxyl
  3. Reversible dehydration
  4. Irreversible dehydration via phosphorylated intermediate
  5. Oxidative decarboxylation (oleT encoded p450-dependent fatty acid decarboxylase from Jeotgalicoccus sp.)

My guess is that their major challenge is that they will have to extensively modify a few enzymes and will be plagued with detection and screening. Nevertheless, I am still going to talk about the chemistry as it is a good excuse to sneak in a cool set of genes from Geobacillus.


There are two ways to biosynthesise styrene. The simplest is decarboxylating cinnamic acid, while the more interesting one is dehydrating phenylethanol.

The tourist route

Phenylethanol (also, uglily, called phenylethyl alcohol) is in turn made from phenylacetate, which is made from phenylpyruvate.
Recently, while analysing a transcriptomic dataset for Prof. D. Leak, which resulted in an awesome website, I stumbled across a really cool enzyme encoded among phenylalanine degradation genes, that I speculate is a phenylpyruvate dehydrogenase. This is a homologue of pyruvate dehydrogenase and follows the same mechanism, namely a decarboxylative oxidation followed by CoA attack.

There are other ways to make phenylacetate, but none allow such a shameless plug for my site —in fact, I should have talked about the 2-phenylethylamine biosynthetic route instead.
In nature the phenylacetate will go down the phenylacetate degradation pathway (paa genes), but it could be forced to go backwards and twice reduce the carboxyl group. Phenylacetaldehyde dehydrogenase is a common enzyme, which even E. coli has (feaB), but the phenylethanol dehydrogenase is not. I found no evidence that anyone has characterised one, but I am fairly certain that Gthg02251 in Geobacillus thermoglucosidasius is one, as it is an alcohol dehydrogenase guiltily encoded next to feaB, which in turn is not next to phenylethylamine deaminase (tynA).
So, that is how one makes phenylethanol. The dehydration part is problematic. A dehydratase would be reversible, but offers the cool advantage that it can be evolved by selecting for better variants that allow a bug with the paa genes and all these genes to survive on styrene as a carbon source. The alternative is phosphorylation and then dehydration as happens with several irreversible metabolic steps.

The actual route

That is the interesting way of doing it, whereas the simple way is rather stereotypical. In plants there are really few secondary metabolites that are not derived from polyketides, isoprenoids, cinnamate/coumarate or a combination of these. Cinnamic acid is deaminated phenylalanine, made via a curious elimination reaction (catalysed by PAL). In the post metabolic engineering breaking bad I discuss how nature makes ephedrine, which is really complex and ungainly, and then suggest a quicker way. Here the cinnamic acid route is actually way quicker, as a simple decarboxylation does the trick. To defend itself from cinnamic acid, S. cerevisiae has an enzyme, PAD1p, that decarboxylates it. Therefore, all that is needed is PAL and PAD1.


Previously I listed the possible routes to a terminal alkene, which were:
  1. Tautomerise a 2,3-ene
  2. Decarboxylate a 2,3-ene with terminal carboxyl
  3. Dehydrate reversibly
  4. Dehydrate irreversibly via phosphorylated intermediate
  5. Decarboxylate oxidatively
In the case of butadiene, it is a 4 carbon molecule already, which forces one's hand in route choice. Aminoadipate is used to make lysine when diaminopimelate and dihydropicolinate are not needed. That means that a similar trick to the styrene biosynthetic route could be taken, namely the amine of aminoadipate is eliminated by a PAL mutant, the product is decarboxylated by a PAD1 mutant and then oxidatively decarboxylated by a mutant OleT. But that requires changing the substrate a lot for three steps, and the cells went to a lot of effort to make aminoadipate, so it is a rather wasteful route.
Another way is to co-opt the butanol biosynthetic pathway to make butenol and dehydrate that.
A better way is to twice dehydrate butanediol.

As mentioned for styrene, a reversible dehydration means that selection could be done backwards. However, pushing the reaction to that route would require product clearance, otherwise there will be as much alcohol as the alkene. With butanediol and butanol there is a production and a degradation pathway, which would mean that selection could be done with the degradation route, while the actual production with the production route.


Acrylonitrile is a curious molecule to biosynthesise. There are nitrile degrading bacteria and some pathways make nitriles, so it is not wholly alien. preQ0 in queuosine biosynthesis is the first I encountered. QueC performs an ATP-powered reaction where a carboxyl is converted to a nitrile. I am not sure why, but a cyano group seems (=Google) less susceptible to hydrolysis than a ketimine (methylcyanoacrylate, i.e. superglue, follows a different reaction). Beta-alanine could be the starting compound, but it would require so many steps that it is a bad idea.
Substituting carboxyl for nitrile (nitrilating?) on acrylic acid with a QueC like enzyme would be better. Acrylic acid is small so it can be made by dehydration of lactic acid, oxidative decarboxylation of succinate or decarboxylation of fumarate. The latter sounds like the easiest solution as there are many decarboxylases that use similar molecules, such as malate or tartrate decarboxylase.


Basically, even if it seems like a crazy idea at first, the processes are rather straightforward (one or two engineered enzymes for each pathway), but the chemistry is pretty hardcore, so the few engineered enzymes will have to be substantially altered. Given that the compounds are small, quantifying yields will be their main challenge. How one goes about designing a selection system for these is an even bigger challenge, as evolving repressors to respond to small and solely hydrophobic compounds would be nearly impossible... So they will have to do this most likely by rational design alone, which makes it seem like a crazy idea after all.

Sunday, 22 November 2015

Noah's ark dilemma in phylogeny

When looking at bacterial diversity, be it for a conserved gene or the organism itself, a common problem is the wealth of sister strains and sister species: this is a bother as often one would like a balanced representation. This leads to a Noah's ark dilemma of having to pick one.

Saturday, 14 November 2015

Pooled ORFome library by multiplex linear PCR

Some time back I dreamed up a method of making a pooled orfome library by multiplex linear amplification. I never did submit it for a grant as it is way too ambitious and, well, expensive for the end result. I really like the idea so I feel bad it never went anywhere, nor will it, but that is science for you and only tenured professors can do what they like.
Here is a sketch of the idea. The prices might be slightly out of date.
In essence, the science is sound, albeit risky and expensive, but the generation of a pooled orfome library of a given species is not a technology that would have much demand, because few people care about pathway holes anymore and a transcriptome or fragmented genome library does the same thing without the price tag, just with orders of magnitude lower efficiency.


Aim. The creation of a pooled ASKA-like (orfome) library by doing a multiplex linear amplification reaction against genomic DNA with a primer pool generated by gene synthesis methods (custom GeneArt Strings job).
Use. The generated pooled library of amplicons can be used to spot unknown non-homologous genes that fill a pathway hole or promiscuous activities. Not only is identifying non-homologous genes a big challenge, but there is great novelty in the method itself.
Cost. In this proposal I have not got a cost due to the caginess of the Invitrogen rep, but I estimate it to be around $5k.


One use of the Aska library and other orfome libraries is to pool them to find rescuers of a knockout (Patrick et al., 2007), revealing the whole set of promiscuous enzymes for that activity (Dan Anderson’s definition of promiscuome means all of the promiscuous activities in a genome). This works as well (or better) with main activities, except that with E. coli there are no easy pathway holes left to be probed. With species that are not E. coli this would be useful, but the available orfome libraries are few, incomplete or problematic, such as that of the T. maritima structural biology consortium (split on 3 different plasmids).


PCR marathon. The Aska and other orfomes are made by separate PCRs in a multiwell plate. If one is to pool them the effort seems wasted as it is a Herculean task.
Fragmented genome. Shelley Copley is using a fragmented genome library as a proxy for an orfome library. The problem is that most of the library will be composed of partial genes, out of frame genes and genes under their own promoter.
Depleted transcriptome. Nobody (that I know) has used a transcriptome library. The major problem is that the genes are not represented one of each (as in an orfome), but are present in certain concentrations, where the major transcripts are rRNA and ribosomal peptides. Depleting the former is doable, albeit not perfect, but to get a good coverage the library would have to be titanic.
Multiplex PCR. The PCR approach is best as it is targeted. If one wanted a pool, one might be able to pool the primers and do one or more multiplex PCRs. Three problems need to be addressed:
  • Making of primers
  • Maximum n-plex PCR
  • Avoiding biases
In light of the extremely high-plex nature, a linear amplification is a better solution as will be discussed.


One way to have a pool of primers is to make them as GeneArt gene strings. At QMB I cornered the rep, who said that making up to ten thousand different strings was easily doable (4300 genes in E. coli times two, assuming no redundancy), but would not discuss pricing. For them it is easy as they skip a step and simply deliver the oligonucleotide synthesis product. However, everything is secret (or the rep didn’t know). He could not confirm that they make the oligos by lithography. Nor did he disclose the length distribution. However, Affymetrix ten years ago was making 25-base oligos (probes) for its GeneChip, so it is higher than 25. Illumina’s BeadArray uses 23+50 base probes, but it might be made in a weird way. Consequently, the primers will be longer than 25 bases, which is perfect. The primers should have high melting temperatures, preferably at 72 °C, so should be 30 bases long.
The real problem I assume is that the completeness decreases exponentially with product length.

Let’s assume they obey the IDT’s graph on the right and GeneArt actually varies the length.

Consequently, a PAGE purification step is required no matter what.
But the yield would decrease. >200 ng is the guaranteed yield of gene strings.
30 × 8.6k = 258,000 bases. The size limit of the sequence is 3,000 bases, but I think that that is due to the overlap, as an Affymetrix GeneChip has a genome repeated multiple times. Consequently, if the overlap is a shifting of one base pair it should cost the same as three 3k bp gene string reactions (assuming no discount for the skipped ligase step). If it is a 5 base window, it would be the same as 15 3kb gene string reactions. If the window is 10, 30 3kb reactions. The latter is nonsensical as it would mean only 300 unique sequences, which is in stark contradiction to the diversity offered in their library options. Regardless, even at $10k it is ten times cheaper than one-by-one sequences. The first option seems the right one, so let’s say $3k for 600 ng, or 400 ng after PAGE (HPLC preferable).
The yield is also okay. 400 ng divided by (327g/mol per base times 30 bases per oligo) is 40 pmol. 200 nM reaction concentration means 200 µl PCRs. The maximum yield would be 1 µg amplicon.
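The back-of-the-envelope yield above can be checked with a few lines of Python. This is only a sketch that reuses the figures from the text (327 g/mol per base, 30-mers, 400 ng after PAGE, 200 nM primers); the variable names are mine.

```python
BASE_MASS = 327.0        # g/mol per base, the figure used in the text
OLIGO_LENGTH = 30        # bases per primer
mass_ng = 400.0          # oligo pool left after PAGE purification
primer_conc_nM = 200.0   # total primer concentration wanted in the PCR

oligo_mass = BASE_MASS * OLIGO_LENGTH                 # g/mol per 30-mer
pool_pmol = mass_ng * 1e-9 / oligo_mass * 1e12        # ng -> g -> mol -> pmol
reaction_ul = pool_pmol * 1e-12 / (primer_conc_nM * 1e-9) * 1e6  # litres -> µl

print(f"primer pool: {pool_pmol:.1f} pmol")
print(f"total reaction volume at {primer_conc_nM:.0f} nM: {reaction_ul:.0f} µl")
```

So the whole purified pool is enough for roughly 200 µl of reaction in total, which is why the primers are the limiting reagent in everything that follows.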


The other major issue is the limit of a multiplex PCR. The Kapa multiplex PCR website talks about 4-plex, 8-plex and 12-plex, but it mentions in the FAQ that 30 is the highest reported (high-plex: 30 amplicons, ≤1000 bp). 30-plex is nowhere close to 4300-plex. However, this does not mean it is impossible.
Multiplex PCR is often used in conjunction with detection methods (TaqMan etc.), and not for subsequent cloning. So some issues may not apply. Two issues however do apply, artefacts and biases.
On chip multiplex PCR. Parenthetically one paper makes a 100-plex PCR on chip to boost sensitivity (PMID: 21909519). This is a related system and a very plausible approach, however, it would require in situ DNA synthesis capabilities.
Primer dimers.  Primer dimers happen in normal PCR. The way to avoid it is using a higher annealing temperature (longer primers) and to avoid repetitive sequences. The oddity is that the Kapa manual gives an example annealing step as 60 °C. If the PCR were a two-step reaction, the reaction would be more efficient with less chance of noise. That means that AT-rich organisms, like P. ubique, are off. DMSO and betaine allow better PCR specificity especially with GC-rich sequences, so it might be good to go even more overkill with annealing length.
3’ trail primer binding. Here a new issue arises: the first copy of a sequence might have a primer binding site on its 3’ end for the next gene. This truncated gene would not amplify with the other primer, but would act as an inhibitor soaking up the first primer. Amplicons can act as megaprimers, so it might not be as bad. Nevertheless, it is worth worrying about. Furthermore, Taq polymerase does not have strand displacement activity but 5’ exonucleolytic activity. One way to overcome this is to have 5’ protected primers (IDT options), which may interfere with the plasmid construction set and may be problematic for overlapping genes. I would shy away from strand displacing enzymes as they may be non-optimal.
The amplification step. PCR amplification efficiencies differ between primers. I could model what may happen (simulate, BRENDA has all the constants I need). However, PCR reactions with low primer concentration might avoid sequences lagging behind. This combined with the previous problem raises the suspicion linear amplification may be best.
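As a toy version of that simulation, the sketch below compares how a difference in per-cycle efficiency between two amplicons plays out under exponential and under linear amplification. The efficiencies (0.9 and 0.7) and the cycle number are invented for illustration, and the model ignores primer depletion and plateau effects entirely.

```python
def exponential_copies(efficiency, cycles):
    """Copies per template when every product is itself a template."""
    return (1.0 + efficiency) ** cycles


def linear_copies(efficiency, cycles):
    """Copies per template when only the original template is copied."""
    return 1.0 + efficiency * cycles


CYCLES = 25
GOOD, POOR = 0.9, 0.7   # hypothetical per-cycle efficiencies of two primers

exponential_bias = exponential_copies(GOOD, CYCLES) / exponential_copies(POOR, CYCLES)
linear_bias = linear_copies(GOOD, CYCLES) / linear_copies(POOR, CYCLES)

print(f"exponential: good amplicon over-represented {exponential_bias:.1f}-fold")
print(f"linear:      good amplicon over-represented {linear_bias:.2f}-fold")
```

Under these assumptions the same efficiency gap that barely matters in a linear reaction grows to more than tenfold when compounded exponentially.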
Linear amplification. One thing to avoid is exponentially amplifying biases. One option is to do a single primer linear amplification. At first I would say no as it is really wasteful in terms of primers (which are limited) and the yield would not be enough. Plus it would need a second step in the opposite direction to produce correctly sized sequences. Plus the major issue with high-plex reactions is the low yield of product, so this would be four-thousand fold worse when linearly amplified. However, what is not being considered is the fact that a 4300plex PCR isn’t amplifying exponentially. In 25 µl reaction with 100 ng template and with 200 nM primer pool, there are about 2.5 pM (for 4.6Mbp) genome strands and 50 pM of each primer (of 4300), which only allows for a twenty fold amplification.
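The concentrations quoted above can be reproduced roughly as follows. The only number not taken from the text is the average mass of a base pair (about 650 g/mol), which is my assumption, so the output lands near the rounded figures in the text rather than exactly on them.

```python
BP_MASS = 650.0          # g/mol per base pair (assumed average, not from the text)
genome_bp = 4.6e6        # E. coli genome size
template_ng = 100.0      # genomic DNA in the reaction
volume_ul = 25.0         # reaction volume
primer_pool_nM = 200.0   # total concentration of the pooled primers
n_primers = 4300         # one primer per gene

genome_mol = template_ng * 1e-9 / (genome_bp * BP_MASS)
strands_pM = 2 * genome_mol / (volume_ul * 1e-6) * 1e12   # two strands per genome
per_primer_pM = primer_pool_nM * 1e-9 / n_primers * 1e12
max_fold = per_primer_pM / strands_pM   # each primer can only be used once

print(f"genome strands: {strands_pM:.1f} pM")
print(f"each primer:    {per_primer_pM:.1f} pM")
print(f"maximum amplification: about {max_fold:.0f}-fold")
```

Either way the primers run out after fewer than twenty or so copies per template, so the reaction cannot be exponential in any meaningful sense.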


How does one go about testing it? There will be smears if it is a nonspecific mess and if it worked. For the final product, NGS may be an option to see bias amplification, but not for routine troubleshooting. Transforming specific rescue strains is too laborious. The best option is RT-PCR of some genes (spread of sizes).


The next step once a pool has been made is to get them in plasmid form.
RE cloning, blunt. Messy products, low yield. Non-directional. Not worth it.
RE cloning, added 5’ site. It would require subsets to avoid cutting the amplicon, which would be a pain. Also it means adding to the primer.
Gibson assembly. It requires adding to the primer. This means longer primers, which is okay-ish, but I have an unscientific hunch it may mess up the multiplex primer.
TOPO clone. Directional TOPO is a good option, but has a drawback: there is no good plasmid. Also I am ignoring the 5’ primer protection possibility. The pET is not good as it requires T7 (cotransformation of the pTARA plasmid and a rescue screen has not been done). The pBAD and the pET have loads of crap at the end. The Aska collection is 10% non-functional due to the GFP tag. The thioredoxin and the hexahistidine tags may be cool, but they are in the way of a clean experiment. Adding stuff to the primers to override the crap is a no-go due to the 5’ tag in multiplex concern and is seriously inelegant for downstream processes. So the choice is TOPO ligation into a pENTR plasmid followed by Gateway into a custom plasmid. It would allow a brief amplification step. A custom TOPO would be ideal, but it costs more than the GeneStrings step itself. A good homebrew TOPO-plasmid has never been achieved. So, disappointingly, a two-step TOPO is the TOPO way (I’ll keep an eye out for new TOPO kits).

Test subject

P. ubique is not an option as it is too AT-rich and its genes cannot be expressed even with pRARE. Even though this whole idea arose to tackle the pathway holes issue in P. ubique.
I want to try it for T. maritima to find if MetC is the sole racemase, but as there is the structural genomics consortium’s library, it makes it seem a tad redundant.
If I were to do this independently of my project, it would be easy to find a good candidate that:
  • has a solvable pathway hole for the field test of the library,
  • has a small genome,
  • has proteins that are expressible in E. coli and
  • has a 50% GC-content.
EDIT. After analysing the transcriptome of Geobacillus thermoglucosidasius and knowing how popular it is getting I would say a Geobacillus spp. is a perfect candidate.

Parenthesis: Name

For now, I’ll call it ORFome library by multiplex linear amplification. I have been calling it Aska-like library, although I’d like a better name. Yes, a name is a semantic handle, but there is nothing worse than a bad name. Delitto Perfetto is a cool method name, CAGE isn’t, and I would rather keep away from backronyms. Portmanteaux are where it’s at, especially for European grants. Geneplex is already taken, but -plex means -fold anyway. Genomeplex? Mass ORF conjuration spell? I would love to coin a name, but I should hold off so as not to take the fun away from later on.

Sunday, 8 November 2015

Diaminopurine in cyanophage S-2L DNA

There is a Nature paper from 1977 reporting a cyanobacterial phage, S-2L, that uses diaminopurine in its DNA instead of adenine. Diaminopurine has two amine groups as opposed to one, so it can bind more tightly to thymine, thus changing the behaviour of its DNA and tricking its host.

The strange thing is that this cool discovery went nowhere even though it would have some really interesting applications. In fact, this paper was followed by two others, then a hiatus and then two papers in the 90s about melting temperature. No cloning, no nothing.
The phage was not lost, as there is a patent that gives its sequence. The sequence is unhelpfully unannotated, but the patent among the really boring bits has claim 270, which says:
On the other hand, it seems very likely that the D-base is formed by semi-replicative modification. Between the two biosynthesis routes of dDTP formation described above, the identification of a succinyladenylate synthetase gene homologue called ddbA (deoxyribodiaminopurine biosynthetic gene A) leads to the conclusion that it is the second route which is probably taken during phage infection (FIG. 2).
Several tests have been carried out in order to determine the activity of the corresponding protein. The results suggest that the expression of ddbA allows restoration of the growth of a strain of E. coli expressing the yaaG gene of Bacillus subtilis [yaaG (now dgk) encodes  deoxyguanosine kinase] in the presence of a high concentration of dG (10 mM). On the other hand, 2,6-diaminopurine becomes toxic (10 mM) to E. coli when it is in phosphorylated form (which has been tested in the same strain of E. coli expressing the yaaG gene of Bacillus subtilis i.e. MG1655 pSU yaaG) which makes it possible to have a screen in order to identify in vivo the complete biosynthesis route of the D-base.
So they have a purA homologue, but not a purB homologue or a kinase. This means that either Synechococcus purB and adk promiscuously do the reactions, which seems odd given that the phage would give them a negative selection, or that an analogue eluded them. Parenthetically, the people that filed the patent were not the people who found the virus and no paper was published about it, even though the first author of the patent studies odd nucleobases.
One interesting thing from the genome is the presence of many polymerases and of DNA gyrase, which is due to the odd DNA.
In conclusion, it is a real shame research into this curious pathway was minimal and has not progressed as it would be a perfect toolkit for synthetic biology.

Saturday, 7 November 2015

How shall I name my variables?

Python and naming conventions

Clarity and simplicity are part of the Zen underlying Python (PEP 20). Simplicity in a large system requires consistency and as a result there are various rules and guidelines. Overall, that and the large number of well documented libraries is what makes Python fun. When making websites consisting of a pythonic server and a javascribal browser, the merits are most obvious. JavaScript is chunky and tedious and, while jQuery makes document interactions great, it feels like a completely different language. There are some frustrating Pythonic things that pop up, such as the lack of an autoincrementor, the lack of join as a list method and a few others, along with the annoyance of making blunders in writing JS objects and Python dictionaries in the other's style. But overall, it is very clean. One thing that is annoying is that the name styles are not consistent. There is a PEP that names the naming styles (PEP8), but it does not make good suggestions of when to use which. In my opinion this is a terrible shame, as this is where stuff starts crumbling.
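For anyone coming from the JavaScript side, the frustrations mentioned above look like this in practice. It is just a small illustration; the variable names are made up.

```python
counter = 0
counter += 1   # there is no counter++ in Python

words = ["lower", "case", "with", "underscores"]
joined = "_".join(words)   # join is a method of the separator string, not of the list

# A dictionary literal needs quoted keys; JS-style bare keys ({name: joined})
# would be looked up as variables and most likely raise a NameError.
settings = {"name": joined, "count": counter}
print(settings)
```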
In brief the problem arises with joining words and three solutions are seen:
  • lowercase
  • lowercase_with_underscore
  • CamelCase (or CapWords in PEP8)
The first is the nice case, but that never happens. I mean, if a piece of Python code more than ten lines long does not have a variable that holds something that cannot possibly be described in a word, it most likely should be rewritten in a more fun way with nested list comprehensions, some arcane trick from itertools and a lambda. So officially, joined_words_case is for all the variables and CamelCase is for classes. Except... PEP8 states: "mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility", aka. they gave up.

Discrepancies and trends

That a class and a method are different seems obvious except in some cases where it becomes insane.
In the collections library defaultdict and namedtuple are in lowercase, while OrderedDict is in CamelCase. Single word datatypes are equally inconsistent: Counter is in CamelCase, while deque is in lowercase. All main library datatypes are in lowercase, so it is odd that such a mix would arise, but the documentation blames how they were implemented in the C code. In the os library the function os.path.isdir() checks if a filepath (string) matches a directory, while in the iterator returned by os.scandir() the entries have is_dir() as a method, which is most likely a sloppy workaround to avoid masking. Outside of the standard library, the messiness continues. I constantly use Biopython, but I never remember what is underscored and what is not and keep having to check cheatsheets to the detriment of simplicity.
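To see the mix for oneself, here is a small sketch that labels the names mentioned above with the PEP 8 style they follow and calls both directory checks side by side. The helper naming_style is made up for this illustration and its rules are a deliberately crude assumption.

```python
import os
import tempfile
from collections import Counter, OrderedDict, defaultdict, deque, namedtuple


def naming_style(name):
    """Crudely classify a name (hypothetical helper, not a standard function)."""
    if "_" in name:
        return "lowercase_with_underscore"
    if name.islower():
        return "lowercase"
    if name[0].isupper():
        return "CapWords"
    return "mixedCase"


# The collections names discussed above, each with the style it follows.
styles = {
    obj.__name__: naming_style(obj.__name__)
    for obj in (Counter, OrderedDict, defaultdict, deque, namedtuple)
}

# Same question ("is this a directory?"), two spellings.
with tempfile.TemporaryDirectory() as folder:
    os.mkdir(os.path.join(folder, "sub"))
    via_path = os.path.isdir(os.path.join(folder, "sub"))  # no underscore
    with os.scandir(folder) as entries:
        via_entry = [entry.is_dir() for entry in entries]  # underscore

for name, style in sorted(styles.items()):
    print(name, style)
```

Both checks agree on the answer; only the spelling differs.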
There are some trends in the conventions nevertheless. CamelCase is a C thing, while underscores are a Ruby thing: this probably makes me feel safer using someone's library or script that uses CamelCase. Someone wrote a paper and found CamelCase to be more reliable in terms of errors. Personally, I like lowercase all the way, no camels or underscores and the standard Python library seems to be that way and it is really pleasant.
FULL_UPPERCASE variables are often global variables used as settings or from code written by capslock angry people —actually, if Whitespace language is a thing, why is there no capslocks language?
Single letter variables are either math related or written by an amateur or someone who gave up towards the end —such as myself all the time.

My two pence: inane word newfangling

The built-in methods of the mainspace and datatypes all are lowercase without underscores (e.g. open("file.txt").readline()), so there is consistency at the heart of it. Except that lowercase without underscores is not often recommended as it is the hardest to read of the three main ways, although it is the easiest to type and possibly to remember, with the exception of when a word is a verb and it could have been in the present, past or present participle forms. Plus open("file.txt").read_line() is ugly and I feel really anti_underscoring.
German and many other languages are highly constructive and words and affixes can be added together. I have never encountered German code, but I would guess the author would have had no qualms in using underscoreless lowercase. The problem is that English is not overly constructive with words of English origin as most affixes are from Latin. The microbiology rule of -o- linker for Greek and -i- for Latin and nothing for English does not really work as Anglo-Latin hybrids look horrendous. Also using Greek or Latin words for certain modern concepts is a mission and, albeit fun, lacks clarity. The Anglish moot has some interesting ideas if someone wanted a word fully stemming from Old English and free of Latin. Nevertheless, I like the idea of solely lowercase and coining new words is so fun, except that it quickly becomes hard to read. Whereas traditionally, getting the Graeco-Latin equivalents and joining them was the chosen way, nowadays portmanteaux are really trendy. In the collections module, deque is a portmanteau and I personally like it more as a name than defaultdict. How about defaultionary?
As a Hungarian notation for those variables that are just insane, I have taken to adding "bag" as a suffix for lists and sets (e.g. genebag) and "dex" for dictionaries (e.g. genedex), which I have found rather satisfying and actually has helped (until I have to type reduced_metagenedexbagdex).

Hungarian tangent

That leads me to a tangent, the Hungarian notation. I wrote in Perl for years, so the sigil notations for an object's type left a mark. Writing st_ for string and other forms of Hungarian notation would just be painful and wasteful in Python, but minor things can be done, such as lists as plural nouns and functions as verbs. Except it seems to go awry so quickly!
Lists, sets and dictionaries. Obviously, the elements should not be the singulars as that results in painful results, but I must admit I have done so myself too many times. Collective nouns are a curious case as they solve that problem and read poetically (for sheep in flock), but there are not that many cases where that works.
Methods. An obvious solution for methods is to have a verb. However, this clearly turns out to be a minefield. If you take the base form, many will also be nouns. If you take the present participle (-ing) the code will be horrendous. If you take the agent noun (-er, -ant), you end up with the silliest names that sound like an American submarine (e.g. the USS listmaker).
Metal notation. The true reason why I have opened this tangent is to mention metal notation. If one has deadkeys configured (default on a Mac) typing accents is easy. This made me think of the most brutal form of notation: the mëtäl notation. Namely, use as many umlauts as possible. I hope the Ikea servers use this. Although I am not overly sure why anyone would opt for the mëtäl notation. In MATLAB there are a ridiculous number of functions with very sensible names that may be masked, so the mëtäl notation would be perfect, except for the detail that MATLAB does not like Unicode in its variables. One day I will figure out a use…
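For what it is worth, Python 3 does accept non-ASCII identifiers (PEP 3131), so the mëtäl notation can be tried there. A tiny sketch, with made-up variable names, of using it to avoid masking built-ins:

```python
# Python 3 source files are UTF-8 by default, so umlauts in names are legal.
välües = [3, 1, 2]

# The sensible names max and sum stay unmasked because the variables are mëtäl.
mäx = max(välües)
süm = sum(välües)

print(mäx, süm)
```

Whether anyone reading the code afterwards will be grateful is a separate matter.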
Nevertheless, even though Hungarian notation is somewhat useful, Python seems to survive without it: I personally think that most of the time when issues happen is with instances of some weirdo class and not a standard datatype anyway. So there is no need to go crazy with these, it is just fun.


Nevertheless, even if there were a few exceptions, it is my opinion that a centralised Pythonic ruleset would have been better. The system that I would favo(u)r is compulsory lowercase, as is seen for the built-in names (parenthetically, American spelling is a given, it did not take me long to spell colour "color" and grey "gray"). The reason why lowercase is disfavoured is that it is hard to read when the words are long. In my opinion variable names should not be long in the first place. One way around this is making a sensible portmanteau or a properly coined word and just refraining from overly descriptive variables. At the end of the day, arguments for legibility at the cost of consistent and therefore easy usage make no sense. defaultdict takes a fraction of a second more to read, but looking up how a word is written can take minutes.

Saturday, 31 October 2015

Fluorescent fats

Desaturated fats generally have a methylene-interrupted configuration, where consecutive double bonds are separated by two single bonds. Conjugated fatty acids (aka. polyene fatty acids) are fluorescent when in a lipid bilayer (Sklar et al., 1977), but few examples are known.
One example is parinaric acid (octadecatetraenoic acid; pictured), found in the seeds of Impatiens balsamina and first isolated from Parinarium laurinum, with four double bonds (ex. 320 nm → em. 420 nm) and with a characterised conjugase (Cahoon et al., 1999). Other conjugases have been studied (Rawat et al., 2012) and other conjugated fatty acids are known, including one with 5 double bonds (bosseopentaenoic (eicosapentaenoic) acid).
A single enzyme that makes a fluorescent reportable signal is rather appealing for synthetic biology. 
The wavelength is rather limiting compared to aromatic compounds obviously. Namely conjugated butadiene, hexatriene, octatetraene, decapentaene and dodecahexaene absorb at 217 nm, 252 nm, 304 nm, 324 nm and 340 nm respectively. The fact that two or three conjugated ene bonds would not suffice as a tool raises the question of whether a conjugase could be evolved to make even more conjugated systems, such as one acting on cervonic acid (22:6(n-3)), which would make a fluorophore of seven conjugated ene bonds. As a fluorescent signal is easily selectable by FACS, it would definitely be doable (if the UV lasers were available). The only problem is that E. coli lacks variety when it comes to membrane fats, as its membrane is mainly composed of palmitic (hexadecanoic) acid, palmitoleic (cis-9-hexadecenoic) acid and cis-vaccenic (cis-11-octadecenoic) acid (Mansilla et al., 2004); therefore, the desaturation machinery to make the precursors would be needed. Nevertheless, it is rather interesting and I would love to see more colourful E. coli...

EDIT. Isoprenoids and their condensed derivatives (geranyl-PP, farnesyl-PP, geranylgeranyl-PP etc.) already have a methylene-interrupted configuration and geranylgeranyl is condensed into phytoene, which is desaturated by phytoene dehydrogenase to lycopene.

Archaea use diphytanylglyceryl phosphoglycerol in their cell membranes: phytanol is reduced geranylgeranol. So they reduce it as opposed to desaturating it; if they did, you'd get 5-7 conjugated ene bonds, which would absorb in the yellow region.

Wednesday, 28 October 2015

Pseudomonas, Ralstonia and Burkholderia: HGT buddies

Recently I was reminded of an interesting thing that intrigued me: Ralstonia and Burkholderia were formerly classed as Pseudomonas spp. and now are from different proteobacterial classes, but they share several close genes. One hypothesis that can be made is that a series of horizontal gene transfer events occurred at some point and the phenotypes of the three species became so close that they were mistakenly grouped together.

I came across the quirk while doing research for a paper from Dr. Monica Gerth and Prof. Paul Rainey:
Gerth ML, Ferla MP, Rainey PB. The origin and ecological significance of multiple branches for histidine utilization in Pseudomonas aeruginosa PAO1. Environ Microbiol. 2012 Aug;14(8):1929-40. doi: 10.1111/j.1462-2920.2011.02691.x. Epub 2012 Jan 9. PubMed PMID: 22225844.

The Ralstonia and Burkholderia are in different families within the Burkholderiales (Betaproteobacteria), while the Pseudomonas is in the Gammaproteobacteria. The taxonomic genus Burkholderia was coined when 7 Pseudomonas spp. were moved in 1993, while Ralstonia was created when two Burkholderia spp. were moved in 1996. The two are not sister genera.
In addition to the hut operon, these three species are phylogenetically close to each other for many proteins.

Using the Darkhorse server set to genus level detail, the top hits for Pseudomonas aeruginosa are

  1. Azotobacter vinelandii, a bona fide pseudomonad, but given its own genus for silly reasons (okay, technically it is the Pseudomonas genus that should be split).
  2. Ralstonia metallidurans
  3. Bermanella marisrubri
  4. Chromohalobacter salexigens
  5. Burkholderia xenovorans
  6. Burkholderia thailandensis
With Pseudomonas fluorescens, B. xenovorans is in third place. With Pseudomonas putida it's in second place, while R. metallidurans is in fourth. Meanwhile, Ralstonia and Burkholderia pick each other up. The results for Ralstonia eutropha indicate that 2/3 of its large genome are from Cupriavidus taiwanensis (sister species of Burkholderia), which in reality means that there is some phylogenetic issue afoot.
Nevertheless, it does not explain the pseudomonad link, which is probably because the ancestors of Pseudomonas and of Burkholderia/Ralstonia got to know each other well and as a result today we have:

  • Burkholderia spp. have two chromosomes and lots of plasmids, R. eutropha has over six and half thousand genes and pseudomonads are gene collectors too.
  • They all have a weird relationship with their sister genera or families.
  • They seem to have similar lifestyles.
  • They share lots of genes.
  • They were mistaken as pseudomonads morphologically.
This is just a mix of speculation and quick checks, but there is nevertheless a link. Which is not only curiosity, it finds data in mistakes from the past and potentially tells of dangers that may assail genome-concatenation studies.

Wednesday, 14 October 2015

A note about serving static files with Python's wsgi

OpenShift is great and making a Python server is really straightforward. The only catch is that wsgi tries to serve static files. The solution server-side is to add a folder wsgi/static, which works a treat; locally it is a different matter. Some Python subversions ago, I could get the localhost to serve from static (not wsgi/static), which is fine given that the localhost is __main__, while on OpenShift it isn't, so I just had two copies of static. Since some upgrade, my local run hates serving static files. I looked on the web and nothing was the quick botch I needed. This is what I added to the part where a GET method is processed.
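The snippet itself is not shown on this page, so what follows is only a minimal sketch of that sort of botch, not the original code. It assumes static files are requested under /static/ and live in a folder called static; the function names serve_static and application and the folder layout are all assumptions made for illustration.

```python
import mimetypes
import os


def serve_static(environ, start_response, static_root="static"):
    """Answer GET /static/<file> from static_root; return None for anything else."""
    path = environ.get("PATH_INFO", "")
    if environ.get("REQUEST_METHOD") != "GET" or not path.startswith("/static/"):
        return None
    root = os.path.abspath(static_root)
    target = os.path.abspath(os.path.join(root, path[len("/static/"):]))
    # Refuse anything that escapes the static folder (e.g. /static/../secret).
    if os.path.commonpath([root, target]) != root or not os.path.isfile(target):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    mime = mimetypes.guess_type(target)[0] or "application/octet-stream"
    with open(target, "rb") as handle:
        body = handle.read()
    start_response("200 OK", [("Content-Type", mime),
                              ("Content-Length", str(len(body)))])
    return [body]


def application(environ, start_response):
    """WSGI entry point: try static files first, then the dynamic part."""
    response = serve_static(environ, start_response)
    if response is not None:
        return response
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from the dynamic part"]
```

Locally this can be run with wsgiref.simple_server.make_server("localhost", 8051, application).serve_forever(), where the port number is arbitrary.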

Sunday, 11 October 2015

KEGG map colo(u)rs

KEGG has a handy map colouring feature, where one can feed it a list of KEGG IDs from whatever database of theirs and a colour, which it will use to colour the image of the pathway. KOBAS and other servers offer similar features, but this is the original.
Two things are frustrating. One is the classic strangeness of the maps that seem to have everything and more, except when you want a certain reaction, which inexplicably is not annotated on that map. The other are the colours, which are annoying but strangely addictive.
Basically, the colours can be RGB codes or names. The former scheme is fine, except the latter is more human readable and trying to figure out what is okay or not is rather addictive. Colours, or more correctly colors as the web is written in US English, can be described in many ways.
The code system ("hex triplet") is formed of three consecutive hexadecimal double-digit numerals prefixed with a hashtag (e.g. #c3a0d2), where each pair corresponds to Red, Green or Blue (RGB). KEGG accepts these fine.
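As a quick illustration of how a hex triplet breaks down into its three channels, here is a small sketch; the function names hex_to_rgb and rgb_to_hex are made up for this example and have nothing to do with KEGG's own code.

```python
import re

# '#' followed by three two-digit hexadecimal numbers.
HEX_TRIPLET = re.compile(r"#([0-9a-fA-F]{2})([0-9a-fA-F]{2})([0-9a-fA-F]{2})")


def hex_to_rgb(code):
    """Split a hex triplet such as '#c3a0d2' into (red, green, blue) integers."""
    match = HEX_TRIPLET.fullmatch(code)
    if match is None:
        raise ValueError("not a hex triplet: {!r}".format(code))
    return tuple(int(pair, 16) for pair in match.groups())


def rgb_to_hex(red, green, blue):
    """Inverse operation: three 0-255 integers to a lowercase hex triplet."""
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)


print(hex_to_rgb("#c3a0d2"))  # (195, 160, 210)
```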
In HTML and CSS colors have proper names (the most official source of the exact names of colours is Pantone, but this comes close). KEGG implements some single word ones, e.g. gray (not grey: an HTML mistake I always make), red, blue, green, yellow, orange and gainsboro (a handy colour for padded div boxes, which seems like a misspelt Gainsborough). Puce is not an HTML colour, so I am not surprised it does not like it; however, aqua is, but it does not like it. It also accepts some of the two word names, like GreenYellow, but not others, like DarkGray. Funnily the capitalisation seems to matter: lower-case only for single word names, while CamelCase for the compound ones, even though HTML and CSS are case insensitive. "Brewer colors" can be excluded as dgray or lgray don't seem to work either. This all seems to point to the fact that some poor chap had to write the dictionary server side to convert the names to hex triplets.
One curious thing is that several names collapse into one colour. lavender is the same as gainsboro and lightgray.
All in all the hex-triplets and the smattering of names fulfil all needs, but the strange quirkiness is fun to unpick...
Here is a wee series of Reaction identifiers to colour down the glycolysis pathway if you want to try your luck at colour picking:

R00959 gray
R02740 purple
R02738 GreenYellow
R09084 orange
R04779 green
R01070 red
R01061 gray
R01512 gainsboro
R01518 #ddeedd

Tuesday, 6 October 2015


I wrote a small script to allow one to easily and prettily embed both nucleic acid and protein FASTA sequences on a webpage (it guesses based on EFILPQ, which are not degenerate bases, but residues). It was a mix of "I need this" and "I want to play with colours". The latter was rather painful. But the result is JS-modified and CSS-coloured FASTA files.
The files can be found in my GitHub repository for PrettyFastaJS and so are the explanations of how to use it. As a demo, here are two curious sequences: two consecutive genes in the same operon that each encode a full-length glycine dehydrogenase, which in other species is a homodimer, which suggests this may be a cool heterodimer.
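The EFILPQ guess mentioned above can be sketched in a few lines. This is an illustration in Python and not the PrettyFastaJS code (which is JavaScript); the function name guess_alphabet is made up. It relies on E, F, I, L, P and Q not being IUPAC nucleotide codes.

```python
# Letters that are amino acid residues but not IUPAC (degenerate) base codes.
PROTEIN_ONLY = set("EFILPQ")


def guess_alphabet(sequence):
    """Return 'protein' if any residue-only letter occurs, else 'nucleic acid'."""
    letters = {char for char in sequence.upper() if char.isalpha()}
    return "protein" if letters & PROTEIN_ONLY else "nucleic acid"


print(guess_alphabet("ATGCRYN"))
print(guess_alphabet("MKVLAAGIV"))
```

A very short peptide lacking all six letters (say MAGTC) would be misread as nucleic acid, which is the price of a quick guess.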

Monday, 28 September 2015

Pythonic spinner

Python is fun: it has lovely libraries, is a beauty to type and there are constant surprises: I only recently found out about defaultdict() (collections library, around since Python 2.5), which is phenomenal. With the web there are three options:

  • It can be used on the server-side on the web with the wsgi library or the Django framework. OpenShift is a great freemium script hosting service (I have used it here for example).
  • There are also some attempts to make JS parse python in the browser, namely Skulpt and Brython, but as they convert the python code into JS code they are not amazingly fast (1), but you are showing the world python code.
  • One can transpile python to javascript, which is faster and less buggy, but that is unethical as you'd be serving JS and not a python script (CoffeeScript gets a lot of bad rep for that reason).

Thanks to CSS3 there are a lot of cool spinners out there to mark code that is loading, but none cater for python users. Therefore I made my own spinner icon, specifically: .
The code is hosted in my dropbox:
<link href="" rel="stylesheet"></link>
<span class=pyspinner></span>

(1) I tried Brython and liked that it had a mighty comprehensive series of libraries and that it had DOM interactions similar to JQuery. However, I could not get over the fact that, for me at least, changes to DOM elements were not committed until the code finished or crashed —which brought back bad Perl memories. Also the lack of CSS changes and the nightmare of binding functions to events makes me think I might try other options.

Publication-driven complexification

"Any sufficiently advanced technology is indistinguishable from magic."
—A. Clarke's Third Law
I like knowing what I am doing, it helps me figure out if something is wrong. However, there is a trend for accepted operations to become "mathemagical" either out of necessity... or out of publication race.  

RNASeq is often cheered for being absolute reads and thus requiring less normalisation than microarrays. Anyone that has tried to understand Loess or Lowess takes that as a blessing.
However, the situation is not that simple as different samples have different total amounts of DNA and, due to publications trying to outdo each other, it quickly gets complicated, to the point where acceptant nodding is the best option lest one want to spend an evening reading the methods section and the appendix of a paper.
An arithmetic average is discouraged as the highly expressed genes will wreak havoc with other genes. On those lines, Mortazavi et al. 2008 (PMID: 18516045) introduced a per mille CDS-length normalised scale, which was called RPKM (reads per kilobase per million mapped reads; it feels like a weird backronym, which I don't get, although had I been them I would have made people have to use the ‰ glyph). The underlying logic is clear and for a while it was popular. RPKM got ousted by Anders and Huber 2010 (PMID:20979621) and Robinson et al. 2010 (PMID:19910308) with DESeq and edgeR, which do many calculations. The normalisation in DESeq is done by taking, for each sample, the median of the ratios of each gene's count in that sample over the geometric mean of that gene's counts across the samples. That makes sense, albeit cumbersome to grasp. For the test of significance, the improvement on a negative binomial GLM spans a page or two and works by magic, or more correctly mathemagic: maths that is so advanced it might as well be magic (Google shows the word is taken by some weird mail order learn maths course or somesuch, but shhh!). The sequel to DESeq, DESeq2, does everything automatically and in 3 lines everything is done. It is really good, although I do like to know what I am doing.
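Since the two normalisations are easier to grasp written out than described, here is a minimal sketch of both in plain Python. It is not the DESeq code: the function names are made up, the counts are a toy list of lists (genes as rows, samples as columns), and skipping genes with a zero count is a simplifying assumption.

```python
import math
from statistics import median


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of CDS per million mapped reads."""
    return count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)


def size_factors(counts):
    """Median-of-ratios size factor for each sample (column) of counts."""
    n_samples = len(counts[0])
    # Genes with a zero count have a geometric mean of zero, so skip them.
    usable = [row for row in counts if all(c > 0 for c in row)]
    geo_means = [math.exp(sum(math.log(c) for c in row) / n_samples)
                 for row in usable]
    return [median(row[j] / g for row, g in zip(usable, geo_means))
            for j in range(n_samples)]


# Toy data: the second sample was sequenced twice as deeply as the first.
toy = [[10, 20], [100, 200], [5, 10]]
print(size_factors(toy))
print(rpkm(10, 2000, 1000000))
```

Dividing each column by its size factor puts the samples on a common scale, which is all the normalisation step does; the mathemagic is in the significance test that follows.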
Long story aside, I did in parallel the analysis with DESeq methods in MATLAB, using their agonising tutorial, which was tedious, but good to see how everything works (to quote Richard Feynman: "What I cannot create, I do not understand"). It wasn't a true waste of time as it made me realise what graphs would be helpful, as the various protocols and co. try to outdo each other in terms of graph complexity. It really bugged me that an obvious graph, a double-log plot of each replicate against its counterpart with a Pearson's ρ thrown in, was omitted everywhere in favour of graphs such as the empirical cumulative function of the distribution of the variance against the χ-squared of the variance estimates, which is a great way of inspecting if there are any oddities in a distribution, but really not the simplest graph. It is used because it is sophisticated, which doesn't necessarily mean better (incorrect assumptions can go a long way in distorting data), but it means more publishable.
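The number behind that obvious graph is quick to compute. Below is a sketch of the Pearson correlation between two replicates on a log scale; the function name and the pseudocount of 1 (to cope with zero counts) are assumptions for illustration, and the plotting itself is left out.

```python
import math


def log_pearson(rep_a, rep_b, pseudocount=1.0):
    """Pearson correlation of log10(count + pseudocount) between two replicates."""
    xs = [math.log10(a + pseudocount) for a in rep_a]
    ys = [math.log10(b + pseudocount) for b in rep_b]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy replicates: counts per gene in two runs of the same condition.
print(log_pearson([9, 99, 999, 50], [12, 80, 1100, 40]))
```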
Curiously, once the alignment, normalisation and significance steps are done, it is cowboy territory, especially for bacteria. For example, for bacterial operon composition there is either a program, Rockhopper, which does not like my genome (I had to submit each replicon for realignment on different runs to get operons for the plasmids and it refuses to do the chromosome), or a paper where the supplementary zip file was labelled .docx and the docx was labelled zip, which goes to indicate the scarce interest.
The main use of differential expression is functional enrichment, which depends on annotations that are rough as nails...

I am in awe at the mathematics involved in the calculation, but I cannot help but feel annoyed at the fact that they are embarrassingly more sophisticated than they ought to be, especially since the complexity gives somewhat marginal improvements and the created dataset will be badly mishandled with clustering based on wildly guessed functions.

Saturday, 12 September 2015

Unnatural amino acid biosynthesis

In the synthetic biology experiments with an expanded genetic code the biosynthesis of unnatural amino acids is not taken into consideration as the system is rather rickety and the amino acids unusual.

Similar amino acids

A lot of introduced amino acids are dramatically different from the standard set. However, there are several amino acids that never made it to the final version of the genetic code as they are subtly different from the canonical amino acids and must have been too hard for the highly promiscuous primordial systems to differentiate. However, subtle differences would be useful for finetuning. Examples include aminobutyric acid (homoalanine), norvaline and norleucine, allo-threonine ("allonine"), ornithine or aminoadipate. If there was a way to introduce them and increase the fidelity it would be hugely beneficial. It would probably result in enzymes with higher fidelity and catalytic efficiency.
That Nature itself failed back then does not necessarily mean that scientists would fail with modern metabolism. The main drawback is a selection system. Current approaches to recoding rely on the new amino acid as a fill-in as opposed to something that makes the E. coli addicted to it. The latter would mean that the system could evolve to better handle the new amino acids. Phage with an unnatural amino acid have higher fitness (Hammerling et al., 2014). Unnatural RNA display (Josephson et al., 2005) could be used to generate a protein that is evolved to require the non-canonical amino acid as nearby residues are evolved to best suit that enzyme, if one really wanted to go to all that trouble. Alternatively and less reliably, GFP with different residues does behave differently and position 65 could handle stuff like homoalanine, but the properties of S65A and S65V are not too different. So if a good and simple selection method were present it could be doable.

Novel amino acid biosynthesis

This leaves us with the biosynthesis of the novel amino acids, which is the main focus here. There are many possible amino acids to choose from, and a good source of information for that is the Wikipedia article on non-proteinogenic amino acids, which I (reticently) wrote many years ago to sort out the mess that there was, but subsequently left to the elements, so it has become a bit cluttered like an unselected pseudogene. Why certain amino acids made it while others did not is discussed in a great paper by Weber and Miller in 1981.
Most of the amino acids that nature can make would just make structural variants. Furthermore, mechanistic diversity mostly comes from cofactors (metals, PLP, biotin, thiamine, MoCo, FeS clusters etc.), which is a more sensible solution given that there is only one per certain type of enzyme. Some of the supplemented amino acids Nature cannot make with ease: in particular, chlorination and fluorination reactions in Nature can be counted on one hand.
Homoserine, homocysteine, ornithine and aminoadipate. E. coli already makes homoserine for methionine and threonine biosynthesis and homocysteine via homoserine for methionine. Ornithine is from arginine biosynthesis and was kicked out of the genetic code by it. Gram positive bacteria, which do not require diaminopimelate, make lysine via aminoadipate ("homoglutamate").
Homoalanine. The simplest novel amino acid is homoalanine (aminobutyrate). 2-ketobutyrate is produced during isoleucine biosynthesis and, if it were transaminated, it would produce homoalanine. Therefore it is likely that the branched chain transaminase must go to some effort to not produce homoalanine (forbidden reaction). This amino acid is found in meteorites and is really simple, but it is very similar to alanine, which is why it must have lost out.
Norvaline, norleucine and homonorleucine. In the Weber and Miller paper the presence of branched chain amino acids was mentioned as potentially a result of the frozen accident. From a biochemical point of view, the synthesis of valine and isoleucine from pyruvate + pyruvate and ketobutyrate + pyruvate follows a simple pattern (decarboxylative aldol condensation, reduction, dehydration, transamination). The leucine branch is slightly different and is actually a duplication of some of the TCA cycle enzymes (Jensen 1976), specifically it condenses ketoisovalerate (valine sans amine) and acetyl-CoA, dehydrates, rehydrates, reduces, decarboxylates and transaminates. If the latter route is used with ketobutyrate and acetyl-CoA one would get norvaline, with ketopentanoate (norvaline sans amine) and acetyl-CoA one would get norleucine. These biosynthetic pathways were actually studied in the 80s. The interesting thing is that it can be used further, making homonorleucine. Straight chain amino acids make sturdier protein, so there is a definite benefit there. The reason why norleucine is not in the genetic code is that it was evicted by methionine, which finds an additional use in the SAM cofactor (Ferla and Patrick, 2014). Parenthetically, as a result the AUA isoleucine codon is unusual (it is also a rare codon, 0.4%, in E. coli) as, to avoid being read as methionine, it has an unusual tRNA (ileX), which would make an easy target for recoding.
Allonine. Threonine synthase is the sole determinant of the chirality of threonine's second centre.
Aminophenylalanine. Chorismate is rearranged to prephenate and then oxidatively decarboxylated and transaminated to make tyrosine, while the hydroxyl group of chorismate is swapped for an amine for folate biosynthesis. If the product of the latter followed the tyrosine pathway one would get aminophenylalanine.
Rethinking phenylalanine. While on the topic, the whole route for phenylalanine biosynthesis is odd. It feels like an evolutionary remnant. If one were to draw up phenylalanine biosynthesis without knowing about the shikimate/chorismate pathway the solution would be different. If I were to design the phenylalanine pathway I would start with a tetraketide (terminal acetyl-CoA derived; 2,4,6,8-tetraoxononanoyl-CoA), cyclise (Aldol addition of C9 in enol form to C4 ketone), two rounds of reduction (6-oxo and 8-oxo) and three dehydrations (4,6,8-hydroxyl), followed by a transamination (2-oxo): phenylalanine by polyketide synthesis!
Naphthylalanine. Phenylalanine has a single aromatic ring, naphthylalanine has two. Using the logic for the rethought phenylalanine synthesis we get a synthesis by hexaketide (2,4,6,8,10,12-hexaoxotridecanoyl-CoA). Namely cyclise (Aldol addition of C11 in enol form to C6 ketone and C13 to C4), two rounds of reduction (8-oxo and 10 or 12-oxo) and five dehydrations (4,6,8,10,12-hydroxyl), followed by a transamination (2-oxo). Naphthylalanine has been added to the genetic code (Wang et al., 2002), but was added as a supplement. GFP doesn’t work with either form of naphthylalanine (Kajihara et al., 2005), therefore there isn’t a good selection marker where the amino acid itself is beneficial. So probably the worst sketched pathway to possibly make.
Other aromatic compounds. Secondary metabolites are often made by polyketide or isoprenoid biosynthesis, which are rather flexible so some extra compounds could be drawn up.
LEGO style. All amino acid backbones are made in different ways and only cysteine, selenocysteine, homocysteine and tryptophan operate by a join-side-chain-on-with-backbone approach. Tryptophan synthase does this trick by aromatic electrophilic substitution, where the electrophile is phosphopyridoxyl-dehydroalanine (serine on PLP after hydroxyl has left), while the sulfur/seleno amino acids are made by the similar Michael addition. So this trick could be extended to other aromatics, carbanions and enols, if one really wanted to, but that is far from a nice one-size-fits-all approach.

Thursday, 9 July 2015

Thesis wordle

EDIT: The correct term is word cloud, while wordle is a website that runs on Flash and does not work in most browsers anymore. I would recommend Tagul instead. Here is the Wordle for my thesis:

I studied the enzyme MetC from Thermotoga maritima, which had alanine racemising activity in addition to a β-eliminating activity. I also studied the enzymes from Wolbachia, Pelagibacter ubique and E. coli. So far so good, all those words appear.
Unfortunately Wordle breaks words up at non-breaking spaces, underscores and interpuncts, but not at hyphens. It removes common words, but it does not collapse grammatical number, hence the enzymes/enzyme, genes/gene and activities/activity. In this version, hyphens were added and common plurals collapsed.
My appendix features a large Perl script, which affects the Wordle: else, elsif, foreach, print, file, sub and the name of a function (input). I assume "if" and "for" are weeded out by the filter against common words. Also YP and NP feature as there are many genbank accession identifiers.
If I remove the filter of common words the "the" is so prevalent that it squashes everything.

Next I wanted the raw data. So I pasted the thesis into an online word counter, saved the output, imported it into MATLAB and plotted the power-law distribution.

Not much else can be done with the data because many of the single-appearance words are actually sequences and there is no way to cluster the words by meaning. All the possible analyses will give generic results (e.g. more frequent words are shorter). If one were to go overboard, one option would be to track the progress of certain key words across the text. A Twitter trending equivalent. The problem is that I already know which words will appear more frequently where. So there isn't much point and it is probably best stopping at the Wordle step, which shows what one already knows, but is pretty.
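For anyone wanting to skip the online word counter, the same two summaries (rank versus frequency, and mean word length per frequency) can be sketched in Python. This is not what was used here; the function names and the crude definition of a word (runs of letters and apostrophes) are assumptions.

```python
import re
from collections import Counter, defaultdict


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def mean_length_by_count(ranked):
    """Average word length for each observed number of occurrences."""
    lengths = defaultdict(list)
    for word, count in ranked:
        lengths[count].append(len(word))
    return {count: sum(vals) / len(vals) for count, vals in lengths.items()}


ranked = rank_frequency("the enzyme the gene the enzyme")
print(ranked)
print(mean_length_by_count(ranked))
```

Plotting count against rank on double-log axes gives the power-law graph.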

Appendix, Scripts:
%first graph
xlabel('Rank of the unique word');
title('Word frequency in Matteo''s thesis');
offset= 300;
hold on;
hold off;

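%second graph (not shown, frequent words are shorter).
% Assumed inputs (not defined in this listing): x holds the count of each unique word, z its length in letters.
xprime=unique(x);
zprime=[];
yprime=[];
for i=1:numel(xprime)
    zprime=[zprime mean(z(x==xprime(i)))];
    yprime=[yprime numel(z(x==xprime(i)))];
end
hold on;
plot(xprime,zprime,'.'); % assumed plot call, inferred from the axis labels
hold off;
xlabel('number of counts of word');
ylabel('average number of letters for words of that frequency')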