Publication-driven complexification

Monday 28 September 2015

Publication-driven complexification

"Any sufficiently advanced technology is indistinguishable from magic."
—A. Clarke's Third Law
I like knowing what I am doing, it helps me figure out if something is wrong. However, there is a trend for accepted operations to become "mathemagical" either out of necessity... or out of publication race.  

RNASeq is often cheered for being absolute reads and thus requiring less normalisation than microarrays. Anyone that has tried to understand Loess or Lowess takes that as a blessing.
However, the situation is not that simple as different samples have different total amounts of DNA and, due to publications trying to out do the other it quickly gets complicated, where acceptant nodding is the best option lest one want to spend an evening reading a method's section and the appendix on a paper.
An arithmetic average is discouraged as the highly expressed genes will wreak havoc with other genes. On those lines, Mortazavi et al. 2008 (PMID: 18516045) introduced a per mille CDS-length normalised scale, which was called  RPKM —the letters don't mean much and feel like a weird backronym, which I don't get, although had I been them I would have made people have to use the ‰ glyph. The underlying logic is clear and for a while it was popular. RPKM got ousted by Anders and Huber 2010 (PMID:20979621) and Robinson et al. 2010 (PMID:19910308) with DESeq and edgeR, which do many calculations. The normalisation in DESeq is done by taking the median for each sample of each of the ratios of the gene count for that samples over the geometric mean of the values of that genes across the samples. That makes sense, albeit cumbersome to grasp. For the test of significance, the improvement on a negative binomial GLM spans a page or two and works by magic, or more correctly mathemagic, maths that is so advanced it might as well be magic —Google shows the word is taken by some weird mail order learn maths course or somesuch, but shhh! The sequel to DESeq, DESeq2 does everything automatically and in 3 lines everything is done. It is really good although I do like to know what I am doing.
Long story aside, I did in parallel the analysis with DESeq methods in MatLab, using their agonising tutorial, which was tedious, but good to see how it everything works  —to quote Richard Feynman: "what I cannot create I cannot understand". It wants a true waste of time as it made me realise what graphs would be helpful as the various protocols and co. try and out do each in terms of graph complexity. It really bugged me that an obvious graph, double-log plot of each replicate against its counterpart with a Pearson's ρ thrown in, was omitted everywhere in favour of graphs, such as the empirical cumulative function of the distribution of the variance against the χ-squared of the variance estimates —a great way of inspecting if there are any oddities in a distribution, but really not the simplest graph. It is used because it is sophisticated, which doesn't necessarily mean better (incorrect assumptions can go a long way in distorting data), but it means more publishable.
Curiously, once the alignment, normalisation and significance steps are done, it is cowboy territory especially for bacteria. For example, for bacterial operon composition there is either a program, Rockhopper, which does not like my genome (I had to submit for realignment each replicon on different runs to get operons for the plasmids and refuses to do the chromosome) and another paper where the supplementary zip file was labelled .docx and the docx was labelled zip, which goes to indicate the scarce interest.
The main use of differential expression is functional enrichment, which depends on annotations that are rough as nails...

I am in awe at the mathematics involved in the calculation, but I cannot help, but feel annoyed at the fact that they are embarrassingly more sophisticated than they ought to be, especially since the complexity gives somewhat marginal improvements and the created dataset will be badly mishandled with clustering based on wildy guessed functions.

No comments:

Post a Comment