Getting the corresponding nucleotide sequences of protein sequences

Saturday, 16 January 2016

Getting the corresponding nucleotide sequences of protein sequences

Getting stuck on a really simple task is not a nice feeling.
A seemingly simple challenge I have faced a few times so far is getting the nucleotide sequence corresponding to a protein sequence: this seems really straightforward, yet it is not.

Manually, it is somewhat easy:
  • Go to the protein record corresponding to the GenBank ID of interest
  • Press Nucleotide on left side bar
  • Search for the GenBank ID in the genome record, if a scaffold click again
  • Try with the name or accession number if it fails
  • Take note for direction
  • Press Gene and then Extract
  • Reverse complement if needed
  • Fasta as text. Done.
However, this is a pain to automate, so I tried with different ways in python.
The FTP site has a lot of genomes, but not all proteins come from sequenced genomes or are RefSeqs. So Blast against the online database is the best option.
At first I thought that a degenerate back translate would work best. But it turns out those residues are masked and the blastn fails.
Instead I tried tblastn. For Blasts I was using the biopython module NCBIWWW, which simply got stuck —possibly because the NCBI server was rather busy. So after a few failed tries I simply submitted the fasta sequences myself on the web server against the "nt" database, forgetting that some may have come from the "wgs".
After fifteen minutes of Blasting away it gave the XML file I needed and I got the sequences. Of course at first I just verified by length of the translation as it hadn't crossed my mind that some were from the wgs database. So I had to repeat the failed ones with the wgs.
All in all, it took a few hours... probably more than it would have taken to get them by hand...

No comments:

Post a Comment