The art of blowing up protein: KEGG

Monday, 28 September 2015

Publication-driven complexification

"Any sufficiently advanced technology is indistinguishable from magic."

—A. Clarke's Third Law

I like knowing what I am doing: it helps me figure out if something is wrong. However, there is a trend for accepted operations to become "mathemagical", either out of necessity or out of the publication race.

RNASeq is often cheered for giving absolute read counts and thus requiring less normalisation than microarrays. Anyone who has tried to understand Loess or Lowess takes that as a blessing.
However, the situation is not that simple, as different samples have different total amounts of DNA and, with publications trying to outdo one another, it quickly gets complicated, to the point where nodding in acceptance is the best option unless one wants to spend an evening reading the methods section and the appendix of a paper.
An arithmetic average is discouraged, as highly expressed genes will wreak havoc with the others. Along those lines, Mortazavi et al. 2008 (PMID: 18516045) introduced a per mille, CDS-length normalised scale, which was called RPKM (reads per kilobase per million mapped reads); the letters don't mean much to me and feel like a weird backronym, although had I been them I would have made people use the ‰ glyph. The underlying logic is clear and for a while it was popular. RPKM got ousted by DESeq (Anders and Huber 2010, PMID: 20979621) and edgeR (Robinson et al. 2010, PMID: 19910308), which do many more calculations. The normalisation in DESeq is done by taking, for each sample, the median of the ratios of each gene's count in that sample over the geometric mean of that gene's counts across all samples. That makes sense, albeit cumbersome to grasp. For the test of significance, the improvement on a negative binomial GLM spans a page or two and works by magic, or more correctly mathemagic: maths that is so advanced it might as well be magic (Google shows the word is taken by some weird mail-order maths course or somesuch, but shhh!). The sequel to DESeq, DESeq2, does everything automatically, and in three lines everything is done. It is really good, although I do like to know what I am doing.
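To make the two normalisations above less mathemagical, here is a minimal sketch in plain Python. It is not the DESeq or Mortazavi code: the function names `rpkm` and `size_factors` are made up for illustration, the count table is assumed to be a list of per-gene rows with one raw count per sample, and genes with a zero count in any sample are simply skipped (their geometric mean would be zero).

```python
import math
from statistics import median


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    # count / (length in kb) / (library size in millions)
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of rows (one per gene), each row a list of raw counts
            (one per sample). Hypothetical layout chosen for this sketch.
    """
    n_samples = len(counts[0])
    usable = []          # genes with no zero counts
    log_geo_means = []   # log of each usable gene's geometric mean
    for row in counts:
        if all(c > 0 for c in row):
            usable.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")

    factors = []
    for j in range(n_samples):
        # ratio of each gene's count in sample j over its geometric mean,
        # computed in log space; the median ratio is the size factor
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its size factor puts the samples on a common scale. Because the median is taken over genes, a handful of very highly expressed genes cannot drag the factor around the way they would with an average.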
Long story short, I did the analysis in parallel with the DESeq method in MATLAB, using their agonising tutorial, which was tedious, but good for seeing how everything works (to quote Richard Feynman: "What I cannot create, I do not understand"). It wasn't a true waste of time, as it made me realise which graphs would be helpful, given that the various protocols and co. try to outdo each other in terms of graph complexity. It really bugged me that an obvious graph, a double-log plot of each replicate against its counterpart with a Pearson's ρ thrown in, was omitted everywhere in favour of graphs such as the empirical cumulative function of the distribution of the variance against the χ-squared of the variance estimates: a great way of inspecting whether there are any oddities in a distribution, but really not the simplest graph. It is used because it is sophisticated, which doesn't necessarily mean better (incorrect assumptions can go a long way in distorting data), but it does mean more publishable.
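The number behind that obvious graph takes a dozen lines. The following is a rough sketch, not taken from any of the protocols: `log_pearson` is a hypothetical helper, and the pseudocount of 1 added before taking the log is my own assumption so that zero counts do not blow up.

```python
import math


def log_pearson(rep_a, rep_b, pseudocount=1.0):
    """Pearson correlation of two replicates on a log10 scale.

    rep_a, rep_b: per-gene counts for the two replicates, in the same order.
    pseudocount:  added before the log so zero counts stay finite (assumption).
    """
    if len(rep_a) != len(rep_b) or len(rep_a) < 2:
        raise ValueError("replicates must have the same length (at least 2)")
    xs = [math.log10(a + pseudocount) for a in rep_a]
    ys = [math.log10(b + pseudocount) for b in rep_b]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sums of squares and cross-products around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a replicate with constant counts has no correlation")
    return sxy / math.sqrt(sxx * syy)
```

Scatter-plotting the same log-transformed values against each other, with this coefficient in the title, gives the graph in question; a replicate that went wrong shows up as a cloud drifting off the diagonal.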
Curiously, once the alignment, normalisation and significance steps are done, it is cowboy territory, especially for bacteria. For example, for bacterial operon composition there is either a program, Rockhopper, which does not like my genome (I had to submit each replicon for realignment in separate runs to get operons for the plasmids, and it refuses to do the chromosome), or a paper whose supplementary zip file was labelled .docx and whose docx was labelled .zip, which goes to show how scarce the interest is.
The main use of differential expression is functional enrichment, which depends on annotations that are rough as nails...

I am in awe of the mathematics involved in the calculation, but I cannot help but feel annoyed that the methods are embarrassingly more sophisticated than they ought to be, especially since the complexity gives only marginal improvements and the resulting dataset will be badly mishandled by clustering based on wildly guessed functions.

Wednesday, 6 May 2015

Speculations about methionine biosynthesis genes

Last year I wrote a review on the diversity of bacterial methionine biosynthesis:

Ferla MP, Patrick WM. Bacterial methionine biosynthesis. Microbiology. 2014 Aug;160(Pt 8):1571-84.
PMID: 24939187, doi: 10.1099/mic.0.077826-0 and pdf.

It was crammed with facts and a couple of deductions that in my opinion are correct. However, there were a lot of hypotheses and conjectures, from plausible to wild, that did not make it into the paper. Here I thought I might mention a few.

The MetCombo

It is my opinion that a bifunctional enzyme that catalyses both the MetC and the MetB reaction is impossible. I have come to call this hypothetical enzyme the MetCombo. The data at hand are:

MetC and MetB are close homologues and it is really hard to tell them apart in a phylogram, with the bold assumption that the uncharacterised genes are what they have been guessed to be.
Both KEGG and EcoCyc take the close homology to mean that a bifunctional enzyme is present in several organisms: basically all those with MetB, which is a lot, as you know from the met biosynthesis paper.
They are in the same pathway
Nobody has ever seen a metCombo
Papers that try to evolve MetC ↔ MetB are not really successful
Personal results: E. coli metC cannot rescue metB
Personal results: Thermotoga maritima "metB" is actually a metC and it has no in vivo or in vitro MetB activity (check out my thesis)
Catalytically a metC and metB in a single active site would be a disaster.

Catalytic profligacy

Cystathionine is a cysteine/alanine and a homocysteine/homoalanine joined together with a thioether. It has a short side (S is on the β) and a long side (S on the γ).
MetC is cystathionine β-lyase: it eliminates cystathionine at the thioether bond, on the shorter (β) side.
MetB is cystathionine γ-synthase: it eliminates O-acetyl-homoserine at the ester bond and then attacks it with cysteine's thiol, making cystathionine.
The two PLP enzymes hold cystathionine at some point but in radically different ways, one on the β side (MetC), the other on the γ (MetB).
Taking a step back, we have two types of cystathionine lyase, and what controls the specificity between a β-lyase and a γ-lyase is not known. There have been a few papers looking into making MetC into a MetB and vice versa, but unfortunately nothing tackling this simpler issue. Cystathionine looks nearly identical from both sides: the sulfur bridge is hard to tell apart from a methylene group, as there is only a slight size and charge difference. Methionine can be substituted with norleucine in proteins with only minimal effect. It is therefore intriguing how the enzymes bind it tightly in a specific way. My theory is that sulfur-π interactions may be involved, as there are several tyrosines in the active site of MetC. Additionally, a β-elimination might be easier than a γ-elimination, so it is a shame that there is a decent amount of data on the lack of γ-elimination activity in the β-lyase, but not vice versa. It would therefore seem more likely to have a powerful bifunctional β-/γ-cystathionine lyase than a restrained one that is strongly specific. However, this bifunctional enzyme is not a good idea evolutionarily, owing to the number of round trips it would make. Specifically, cysteine or homocysteine would go into making cystathionine, which the uncommitted lyase would either correctly transform or turn back into a starting substrate, at the cost of ATP. The point of this fascinating parenthesis is that a cystathionine synthase/lyase combo that could do both reactions would be an equally bad idea: it would work, thanks to flux from excess substrate to product in demand, but it would be extremely inefficient.

Conflicting results

Some methionine gene rescue experiments go differently than expected, and there are occasionally concentration-dependent oddities. My opinion is that this is due to one of the following:

It is dominating the threonine branch point and there is not enough threonine being produced.
It is depleting all the cysteine or homocysteine

I like the idea of enzymes being repressed from doing something easy but bad as a side reaction (e.g. MetC can eliminate serine), but I don't think that is the case here. Unfortunately, the only way to find out is to actually test whether the oddity persists when threonine and cysteine are added.

The ancestral PLP-dependent methionine gene

TBA

Methionine and norleucine

TBA

Alignment file

Here is the alignment file of manually aligned genes: various metB, metC, etc.
