
Saturday, 16 January 2016

The contagious ORF annotation error of 16S rRNA

Some time back, many genomes carried a few copies of a small hypothetical open reading frame, sometimes annotated as a quinone oxidase. The same genomes also had fewer annotated 16S rRNA genes than 23S rRNA genes. This is not some curious observation about enzyme evolution, with a duo of a promiscuous ribozyme activity of 23S rRNA and a small protein, that could lead to a Nature paper, though. In reality it is a sequence annotation error, an ORF called on 16S rRNA sequence, that seems to have gone quite viral in NCBI.



>Example of the digitally viral sequence
MLPTLSRLSVSYRPESRLRHWCSSTSLRISPLHVEFRSPLLHSSPFTPVSNDSPRLSRGLSHQT
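To check whether a freshly annotated genome has picked up this peptide, something like the following could be run over the predicted proteins. It is only a rough sketch: `read_fasta`, `find_suspects` and the 20-residue threshold are my own hypothetical choices, and an exact shared stretch is a crude stand-in for a proper BLAST search.

```python
from typing import Iterator, List, Tuple

# Peptide quoted above; all helper names below are hypothetical.
SUSPECT_PEPTIDE = "MLPTLSRLSVSYRPESRLRHWCSSTSLRISPLHVEFRSPLLHSSPFTPVSNDSPRLSRGLSHQT"


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_suspects(text: str, peptide: str = SUSPECT_PEPTIDE,
                  min_len: int = 20) -> List[str]:
    """Return headers of proteins that share at least min_len
    consecutive residues with the suspect peptide (exact match only)."""
    kmers = {peptide[i:i + min_len]
             for i in range(len(peptide) - min_len + 1)}
    hits = []
    for header, seq in read_fasta(text):
        seq = seq.upper()
        if any(seq[i:i + min_len] in kmers
               for i in range(len(seq) - min_len + 1)):
            hits.append(header)
    return hits
```

Calling `find_suspects(open("proteins.faa").read())` (the file name is just a placeholder) returns the headers worth a closer look. Diverged copies would slip through an exact match, so this is a first pass and not a replacement for a similarity search.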

Sequencing a genome is easy nowadays. The problem is that there seems to be a need for equally easy annotation. Prokka, for example, is advertised as being quicker than a coffee break. But the principle of junk in, junk out applies: there are rather poor annotations out there, they get propagated with ease, and the only way to stop that is to check manually what has gone on. A funny example is the sloppiness of gene symbols: species A has an enzyme encoded by aaa, species B has two isozymes, aaaA and aaaB, while species C has only one, but it blasts to aaaB, so the gene gets called aaaB even though there is no aaaA. Another craziness is that y-gene names (the placeholder names for uncharacterised genes) get propagated beyond B. subtilis and E. coli. One reason for all this junk-rich annotation is that it is nearly impossible to correct a record once it is on NCBI.
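Part of that manual checking can be automated. A minimal sketch, assuming the rRNA coordinates come from a separate rRNA prediction and that all features sit on the same contig: flag any CDS that lies mostly inside an rRNA gene. The `Feature` tuple, the `cds_inside_rrna` function and the 50% cut-off are hypothetical, and strand is ignored.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    kind: str    # e.g. "CDS" or "rRNA"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    label: str


def cds_inside_rrna(features: List[Feature],
                    min_fraction: float = 0.5) -> List[Tuple[str, str]]:
    """Return (CDS label, rRNA label) pairs where at least
    min_fraction of the CDS length overlaps the rRNA gene."""
    rrnas = [f for f in features if f.kind == "rRNA"]
    flagged = []
    for cds in (f for f in features if f.kind == "CDS"):
        cds_len = cds.end - cds.start + 1
        for rrna in rrnas:
            # Length of the shared interval; zero or negative means no overlap.
            overlap = min(cds.end, rrna.end) - max(cds.start, rrna.start) + 1
            if overlap > 0 and overlap / cds_len >= min_fraction:
                flagged.append((cds.label, rrna.label))
    return flagged


# Made-up coordinates, purely for illustration.
example = [
    Feature("rRNA", 1000, 2540, "16S rRNA"),
    Feature("CDS", 1200, 1394, "hypothetical protein"),
    Feature("CDS", 3000, 3900, "some real gene"),
]
print(cds_inside_rrna(example))
```

Anything flagged this way still needs a human look, but it narrows down where to look.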

I saw the strange peptide in question in lots of draft genomes last year. Recently I was curious to see whether it was everywhere by now, but it was not: it had actually gone down to only 20 or so sequences (generally called conserved hypothetical protein). I have often had accession numbers return errors (obsolete), but I never really thought that, behind the scenes, the people at NCBI actually clean up the junk. I was wrong...
