A simple hack for a phylogenetic Noah's ark dilemma

Tuesday 11 April 2017

Ever had an endless list of bacterial names that needed a trim?
Ever seen a tree where the bacterium chosen is not the famous one, but its cousin? Or a tree where you don't recognise a single name?
The issue of picking bacteria from a list is what I call the Noah's ark dilemma. The term is generally used for the biblical problem of how big the boat would have to be to hold all the animals in existence (except dinosaurs). Here I mean picking the most meaningful bacteria from a list. In the past year, I have come to rely on a simple solution: PubMed popularity.

My problems are not as epic.
But I did once add a silhouette of a dinosaur to a Circos plot to make it more dramatic...

The best way to make a gene tree is to do a PSI-BLAST (with troublesome pathogens removed) a few times until you exceed the capacity of Safari to handle the large dataset (Chrome crashes first). Then you run a custom script to prettify the headers, run cd-hit to prune the set down, align it with muscle, infer a tree with RAxML and you are done. Except that the tree is composed of unknown names.
I used to like the hack of converting the names to NCBI taxon IDs, so that when the tree was submitted to the iTOL website the inner nodes would be labelled with the higher taxa; collapsing the branches and renaming the clades based on those names would then give a pretty tree... but that is a pain. The solution is to pick the best name from the list of headers in each cd-hit cluster. The default is the first in the list. My solution gets the species with the most PubMed citations. (I have a poorly annotated and slow script available if you want it.)
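
A minimal sketch of that picking step, assuming the usual layout of the `.clstr` file that cd-hit writes (a `>Cluster N` line followed by one line per member). It is not the script mentioned above. The header format (`Genus_species|accession`), the `species_of` helper and the citation counts are hypothetical placeholders; in practice the counts would come from PubMed. Note that cd-hit may truncate headers in the `.clstr` file, so the real headers may need to be recovered from the FASTA file.

```python
from typing import Callable, Dict, List


def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Group sequence headers by cluster from the text of a cd-hit .clstr file."""
    clusters: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None and ">" in line:
            # Member lines look like: "0\t250aa, >header... *"
            header = line.split(">", 1)[1].rsplit("...", 1)[0]
            clusters[current].append(header)
    return clusters


def species_of(header: str) -> str:
    """Hypothetical header cleaner: 'Genus_species|accession' -> 'Genus species'."""
    return header.split("|", 1)[0].replace("_", " ")


def pick_most_cited(headers: List[str], count_of: Callable[[str], int]) -> str:
    """Return the header whose species has the most citations.

    On a tie, max() keeps the earliest header, which matches the
    'first in the list' default.
    """
    return max(headers, key=lambda h: count_of(species_of(h)))


# Toy input; the counts are made-up placeholders, not real PubMed numbers.
EXAMPLE_CLSTR = """>Cluster 0
0\t250aa, >Obscurus_bacterium|seq1... *
1\t248aa, >Famous_bacterium|seq2... at 98.50%
>Cluster 1
0\t300aa, >Lonely_bacterium|seq3... *
"""
TOY_COUNTS = {"Obscurus bacterium": 10, "Famous bacterium": 1000, "Lonely bacterium": 3}

clusters = parse_clstr(EXAMPLE_CLSTR)
best = {name: pick_most_cited(members, lambda s: TOY_COUNTS.get(s, 0))
        for name, members in clusters.items()}
```

Here `best` maps each cluster to the header that would label its tip in the tree; swapping `TOY_COUNTS.get` for a real PubMed lookup is the only change needed.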

This also works for finding which species are the most interesting in a list. Using PubMed counts is also useful for figuring out lifestyle metadata. Manually porting the data from Bergey's Manual of Bacteriology is the best way, but it is too labour intensive: it took me 2 hours to get the list of all Firmicute thermophiles and even then I had made mistakes. Below is a Python 3 script that, given a list of species and a list of terms, will find the number of papers for those species. (EDIT: substitute genomeArk.binomialiser(species) with a function to clean names into a dictionary with the keys genus and species (epithet)).
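
A minimal sketch of such a count, not the original script. It assumes the NCBI E-utilities `esearch` endpoint queried with `db=pubmed` and `rettype=count`, and reads the `Count` element of the XML reply. The `binomial` function is a hypothetical stand-in for `genomeArk.binomialiser`, and the query shape (quoted binomial, `AND`, term) is an assumption about what makes a sensible search.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, Optional, Tuple

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def binomial(name: str) -> Dict[str, str]:
    """Hypothetical stand-in for genomeArk.binomialiser: keep genus and epithet."""
    parts = name.replace("_", " ").split()
    if not parts:
        raise ValueError("empty species name")
    return {"genus": parts[0], "species": parts[1] if len(parts) > 1 else ""}


def build_term(name: str, keyword: Optional[str] = None) -> str:
    """Build a PubMed query: the quoted binomial, optionally ANDed with a keyword."""
    b = binomial(name)
    phrase = '"' + " ".join(p for p in (b["genus"], b["species"]) if p) + '"'
    return f"{phrase} AND {keyword}" if keyword else phrase


def esearch_url(term: str) -> str:
    """URL asking esearch for the number of PubMed records matching the term."""
    query = urllib.parse.urlencode({"db": "pubmed", "term": term, "rettype": "count"})
    return f"{ESEARCH}?{query}"


def parse_count(xml_text: str) -> int:
    """Read the top-level <Count> element from an esearch XML reply."""
    return int(ET.fromstring(xml_text).findtext("Count"))


def pubmed_count(term: str) -> int:
    """Fetch the count over the network (needs internet access)."""
    with urllib.request.urlopen(esearch_url(term)) as reply:
        return parse_count(reply.read().decode("utf-8"))


def count_table(species: Iterable[str], keywords: Iterable[str],
                fetch: Callable[[str], int] = pubmed_count,
                delay: float = 0.4) -> Dict[Tuple[str, str], int]:
    """Count papers for every (species, keyword) pair.

    The pause keeps requests under NCBI's limit of three per second
    for clients without an API key.
    """
    keywords = list(keywords)
    table = {}
    for sp in species:
        for kw in keywords:
            table[(sp, kw)] = fetch(build_term(sp, kw))
            time.sleep(delay)
    return table


# Example call (queries NCBI, so it is left commented out):
# counts = count_table(["Bacillus subtilis", "Geobacillus kaustophilus"],
#                      ["thermophile", "pathogen"])
```

Passing a different `fetch` function makes it possible to cache results or to test the table logic offline.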

Despite the apparent shoddiness, it is actually a pretty good hack.
