The art of blowing up protein: Uncultured bacterial majority? Digitally unannotated majority of the minority is worse

Aquifex, an exciting bacterium with unexciting BioSample data.

It has been remarked that the majority of bacterial diversity remains uncultured. Only a fraction of the cultured bacteria are genome-sequenced, and only a fraction of these have machine-readable data about them.

The tome of wisdom

For a new bacterial species to be coined, a lot of data and paperwork needs to be collected. The process is very old-fashioned. The genome does not need to be sequenced, but the melting temperature of the genome does need to be measured. As Cyanobacteria ("blue-green algae") were never part of the kingdom Monera, but of the kingdom Plantae, the rules are so mad that the group is basically abandoned taxonomically, and there are Synechococcus spp. interspersed throughout the phylum. A lot of information is collected in various papers and gets put in a book, Bergey's Manual of Bacteriology, which can be downloaded illegally or be found by going on a quest to the dusty paper sections of a library, I am told.

Enter the modern world

The data mining for genome mining consists of parsing all sequenced genomes and filtering them by certain genomic properties, such as the presence of certain other genes, what the neighbouring genes are (guilt by association) and so forth. The genomes can be clustered based on taxonomy with a bit of work. But when it comes to information about the species itself, the limited amount of data is shocking. When a genome gets submitted, NCBI recommends that some metadata be added, technically called BioSample data, about the environment of the bug, its optimal temperature and so forth. However, after collapsing sister strains, only about 0.5% of the bacterial species have information about temperature, relationship to oxygen and lifestyle. As a result, a simple question like "Are thermophilic species less likely to have this gene?" ends up getting an answer with poor statistical backing (if addressed properly).

Strangely, there is no other option outside of the BioSample data. I have asked around and all I got were "me too!" replies. There is no dataset that incorporates the rather useful data from the various descriptions of the species. Pathway hole prediction is an example of where the lack of this data hurts. The BioCyc suite of databases, for example, tries to predict what compounds a species makes, and in some cases, such as quinone length and type, it is near impossible, because the data on auxotrophies, lifestyle, quinones, lipids and so forth remains inaccessible.
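To make the thermophile question concrete, here is a minimal sketch of how it could be tested and why sparse metadata ruins the answer. Everything in it is made up for illustration: the `records` list, the species names and the "thermophilic"/"mesophilic" labels are hypothetical and not a real BioSample export. Species with no temperature annotation are simply dropped, and the rest go into a 2x2 table for a two-sided Fisher's exact test, written with the standard library only.

```python
from math import comb

# Hypothetical records: (species, temperature class or None if unannotated, has_gene).
# These are invented for illustration, not taken from real BioSample data.
records = [
    ("species_A", "thermophilic", False),
    ("species_B", "thermophilic", False),
    ("species_C", "mesophilic", True),
    ("species_D", "mesophilic", True),
    ("species_E", None, True),   # no temperature metadata: cannot be used
    ("species_F", None, False),  # no temperature metadata: cannot be used
]


def contingency(records):
    """Count (a, b, c, d) = thermophiles with/without gene, others with/without gene."""
    a = b = c = d = 0
    for _name, temperature, has_gene in records:
        if temperature is None:
            continue  # unannotated species contribute nothing to the test
        if temperature == "thermophilic":
            if has_gene:
                a += 1
            else:
                b += 1
        else:
            if has_gene:
                c += 1
            else:
                d += 1
    return a, b, c, d


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x gene-carrying thermophiles
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    table = contingency(records)
    print("annotated table:", table)
    print("p-value:", fisher_two_sided(*table))
    # Same perfect split, but with ten times as many annotated species.
    print("p-value at 10x:", fisher_two_sided(0, 20, 20, 0))
```

In this toy set the gene is absent from every annotated thermophile and present in every annotated mesophile, yet with only four usable species the p-value is 1/3. The same perfect split across forty annotated species gives a p-value far below any sensible threshold, which is the whole problem with having metadata for only a sliver of the sequenced species.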

I need it, but I am not going to spend months text-mining Bergey's manuals. My hope is therefore that someone needs it more than I do, does the hard work and releases it. Either that, or we wait until Google becomes sentient enough to do it. Strangely, the latter seems more likely.


Sunday, 17 January 2016
