>Example of the digitally viral sequence
MLPTLSRLSVSYRPESRLRHWCSSTSLRISPLHVEFRSPLLHSSPFTPVSNDSPRLSRGLSHQT
Sequencing a genome is easy nowadays. The problem is that there seems to be a need for equally easy annotation. PROKKA, for example, is advertised as being quicker than a coffee break. The catch is that the junk-in, junk-out principle applies: there are rather poor annotations out there, they get propagated with ease, and the only way to stop that is to check the results manually. A funny example is the sloppiness of gene symbols: species A has an enzyme encoded by aaa, species B has two isozymes, aaaA and aaaB, while species C has only one, but its best BLAST hit is aaaB, so the gene gets called aaaB even though there is no aaaA. Another craziness is that y-genes (placeholder names for uncharacterized genes) get propagated beyond B. subtilis and E. coli. One reason for this junk-rich annotation is that it is nearly impossible to correct a record once it is on NCBI.
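The aaaB-without-aaaA case is the kind of thing a small script can flag for manual review. What follows is only a rough sketch under my own assumptions: the function name `orphan_symbols` is made up, symbols are assumed to be three lowercase letters plus one uppercase suffix, and "no A member for that stem" is a crude heuristic that will also flag perfectly legitimate names. It produces candidates to look at, not verdicts.

```python
import re
from collections import defaultdict

# Assumed symbol shape: three lowercase letters plus one uppercase suffix, e.g. "aaaB".
SYMBOL = re.compile(r"^([a-z]{3})([A-Z])$")


def orphan_symbols(symbols):
    """Return suffixed symbols whose stem has no 'A' member in this genome.

    Hypothetical helper for illustration; the heuristic is deliberately crude.
    """
    suffixes_by_stem = defaultdict(set)
    for symbol in symbols:
        match = SYMBOL.match(symbol)
        if match:  # unsuffixed symbols such as "ccc" are ignored
            stem, suffix = match.groups()
            suffixes_by_stem[stem].add(suffix)

    orphans = []
    for stem, suffixes in sorted(suffixes_by_stem.items()):
        if "A" not in suffixes:
            orphans.extend(stem + suffix for suffix in sorted(suffixes))
    return orphans


# Made-up gene list for "species C" from the example above.
species_c = ["aaaB", "bbbA", "bbbB", "ccc"]
print(orphan_symbols(species_c))  # ['aaaB']
```

Anything the script reports still needs a human to decide whether the suffix was inherited from a BLAST hit or is genuinely meaningful in that genome.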
I saw the strange peptide in question in lots of draft genomes last year. Recently I was curious to see whether it was everywhere by now, but it was not: it had actually gone down to only 20 or so sequences (generally called conserved hypothetical protein). I have often had accession numbers return errors (obsolete), but I never really thought that the people at NCBI were actually cleaning up the junk behind the scenes. I was wrong...