Noah's ark dilemma in phylogeny

Sunday, 22 November 2015

When looking at bacterial diversity, be it for a conserved gene or the organism itself, a common problem is the wealth of sister strains and sister species, which is a bother because one would often like a balanced representation. This leads to a Noah's ark dilemma: having to pick just one.

Luckily, some phylogenetic tools (RAxML, ClustalΩ and some others) have clustering settings that allow them to trim a dataset down. The only problem is that when one does a BLAST search and reduces the dataset, one is left with the more sesquipedalian names, which mean nothing. I wrote the Wikipedia article on Bacterial taxonomy and I have been tasked with looking into the phylogeny of different groups of genes or species a few times, so I would say my familiarity with bacterial diversity is higher than average, yet I have seen in papers and elsewhere an endless number of trees whose leaves are filled with obscure species. This means that something better is probably in order. One solution to the dilemma of who goes in and who doesn't (the Noah's ark dilemma of the title) is picking the type strain/species/genus.

Going for type strains

In a paper with Cameron Thrash I got all the ribosomal sequences of sequenced members of the Alphaproteobacteria, removed contaminated rRNA sequences (which sounds like a futile exercise, but as mentioned in the paper a few bugs had contamination artefacts), then trimmed the diversity by picking the type species of each genus or the type strain of each species (script). I have given some thought to rewriting the Perl script in Python, as I keep needing it and I really hate Perl nowadays, but I never have the time, as a tad more grandiose effort would be needed for something publishable.
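
The trimming step could look roughly like the sketch below. It is not the Perl script from the paper, just a minimal illustration of the idea. The pipe-delimited header layout, the function names and the hand-curated `TYPE_STRAINS` table are all assumptions made for the example, and the accessions, sequences and the species "Exemplum fictum" are invented.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pick_type_strains(records, type_strains):
    """Keep one record per species, preferring its type strain.

    Assumes headers look like "accession|Genus species|strain" (a made-up
    layout) and that type_strains maps a species to its strain designation.
    """
    chosen = {}  # species -> (header, sequence, is_type)
    for header, seq in records:
        _accession, species, strain = header.split("|")
        is_type = type_strains.get(species) == strain
        # The first record seen is a fallback; a type strain replaces it.
        if species not in chosen or (is_type and not chosen[species][2]):
            chosen[species] = (header, seq, is_type)
    return [(h, s) for h, s, _ in chosen.values()]


# Toy input: accessions and sequences are invented.
EXAMPLE = """>acc1|Escherichia coli|K-12 MG1655
ACGTACGT
>acc2|Escherichia coli|ATCC 11775
ACGTACGA
>acc3|Exemplum fictum|X1
ACGGACGG
"""
TYPE_STRAINS = {"Escherichia coli": "ATCC 11775"}

for header, seq in pick_type_strains(parse_fasta(EXAMPLE), TYPE_STRAINS):
    print(">" + header)
    print(seq)
```

A species with no entry in the table keeps the first sequence seen for it, which is one arbitrary way of handling taxa without a known type strain.
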
Picking the type taxon has some drawbacks.

Effort

Picking the type strain requires information that can be gleaned from a name only when it carries a superscript T at the end, which NCBI and most other servers do not provide. In those cases, I take mine from LPSN, but it does require a lot of effort.
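
For the lucky cases where the marker survives in the name, spotting it is simple. The sketch below assumes three spellings of the superscript T (the Unicode modifier letter, an HTML `<sup>` tag and a caret); these are guesses at what an export might contain, and the function name is made up.

```python
import re

# Assumed spellings of the type marker at the end of a name:
# the Unicode modifier letter "ᵀ", the HTML "<sup>T</sup>", or "^T".
_TYPE_MARKER = re.compile(r"\s*(?:ᵀ|<sup>T</sup>|\^T)$")


def split_type_marker(name):
    """Return (name_without_marker, is_type_strain)."""
    name = name.strip()
    match = _TYPE_MARKER.search(name)
    if match:
        return name[:match.start()], True
    return name, False


print(split_type_marker("Escherichia coli ATCC 11775ᵀ"))
print(split_type_marker("Escherichia coli K-12 MG1655"))
```

A plain capital T at the end of a designation is deliberately not treated as a marker, since it cannot be told apart from part of the strain name.
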

Uncharted taxonomy

The requirements to describe a species are quite painfully out of date and impossible to meet for some species. Therefore there are many candidate species which will throw a spanner in the works. Intracellular parasites and scavengers cannot be grown to the satisfaction of the Bacteriological Code. Also, the Cyanobacteria are an utter mess, and I have no idea why. One workaround is to look at the frequency of each species in paper abstracts, but text mining a download of PubMed is a moderate amount of effort.
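
Once the abstracts are on disk, the counting itself is the easy part. Here is a minimal sketch that assumes the abstracts have already been extracted into a list of strings (downloading and parsing PubMed is not shown); the function names and the toy abstracts are invented for the example.

```python
import re
from collections import Counter


def mention_pattern(species):
    """Match "Genus epithet" or the abbreviated "G. epithet"."""
    genus, epithet = species.split()
    return re.compile(
        r"\b(?:%s|%s\.)\s+%s\b"
        % (re.escape(genus), re.escape(genus[0]), re.escape(epithet))
    )


def count_abstract_mentions(abstracts, species_names):
    """Count how many abstracts mention each species at least once."""
    patterns = {sp: mention_pattern(sp) for sp in species_names}
    counts = Counter()
    for text in abstracts:
        for species, pattern in patterns.items():
            if pattern.search(text):
                counts[species] += 1
    return counts


# Toy abstracts, written for this example.
ABSTRACTS = [
    "Escherichia coli K-12 was grown in rich medium.",
    "We compared E. coli with Bacillus subtilis.",
    "This abstract mentions no bacteria at all.",
]
counts = count_abstract_mentions(ABSTRACTS, ["Escherichia coli", "Bacillus subtilis"])
for species, n in counts.most_common():
    print(species, n)
```

The abbreviated form is ambiguous across genera sharing an initial and an epithet, so the counts are only a rough proxy for how famous a species is.
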

Not the famous one

The type strain in some cases is not the famous strain. E. coli K-12 MG1655 is not the type strain of Bacillus coli as described by Dr. Escherich. MG1655 is the de facto reference strain, as it was the strain sequenced first, in the Blattner et al. 1997 paper. (Topic of another blog post)

Bad choice for biomining

In the case of testing enzymes, picking the type species of a higher taxon might be a bad call. Wayne Patrick and I went for the metC gene of Pelagibacter ubique to see how it behaved in the Pelagibacterales. P. ubique has an optimal temperature of 16°C, which meant that the enzyme could only be expressed as a MalE fusion protein. As mentioned in my thesis, we should have gone for Pelagibacter bermudensis (SAR11 sp. HTCC7211), which prefers 24°C (IIRC), except it was not the type species.

Nevertheless, choosing the type species is by far the nicest approach. It is just a shame that nobody has yet made a server to sort the genes in a FASTA file based on type species/number of papers. One day...
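
The sorting such a server would do is not the hard part. A rough sketch of the ranking, assuming the type-species set and the per-species paper counts are already available (the hard part), is below; the function name and the species names are invented placeholders.

```python
def rank_records(records, type_species, paper_counts):
    """Sort (species, sequence) records: type species first, then by
    descending paper count, then alphabetically to break ties."""
    def sort_key(record):
        species = record[0]
        return (
            species not in type_species,      # False sorts before True
            -paper_counts.get(species, 0),    # more papers first
            species,
        )
    return sorted(records, key=sort_key)


# Toy data: species names, sequences and counts are invented.
RECORDS = [
    ("Exemplum fictum", "ACGG"),
    ("Exemplum alterum", "ACGA"),
    ("Exemplum tertium", "ACGT"),
]
TYPE_SPECIES = {"Exemplum tertium"}
PAPER_COUNTS = {"Exemplum fictum": 3, "Exemplum alterum": 40}

for species, seq in rank_records(RECORDS, TYPE_SPECIES, PAPER_COUNTS):
    print(species, seq)
```

Taking the top entry per genus from the ranked list would then give the type species where one is known and the most written-about species otherwise.
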

PS. Also check out the post on making up bacterial names by Markov chain.

The art of blowing up protein
