Here I present a nice trio of Python methods to locate a given key or value in a convoluted object of nested dictionary-like and list-like objects.
Code
Say you have an NCBI protein ID and you run:
handle = Entrez.esearch(db="protein", term=identifier, retmax=1)
d = Entrez.read(Entrez.efetch(db="protein", id=Entrez.read(handle)["IdList"][0], rettype="native"))
The dictionary d is a nightmare. Hence these three handy methods.
Two of them "float upwards" the path to a key,
float_key(dictionary,key)
or to a value,
float_value(dictionary,key)
in a given convoluted nested dictionary, while the last one,
get_value(dictionary,(keys/indices list))
follows a tuple of keys/indices into the dictionary and returns the value found there.
These are written for Python 3 and have not been tested under Python 2.
They also work for dictionary-like and list-like objects (cf. Biopython Entrez output).
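A minimal sketch of how three such functions could be written. It is not necessarily the original implementation: the helper `_children`, the parameter names, the depth-first search order and the choice to return `None` when nothing is found are all assumptions made for illustration.

```python
from collections.abc import Mapping, Sequence


def _children(obj):
    """Yield (key_or_index, child) pairs for dict-like and list-like objects."""
    if isinstance(obj, Mapping):
        yield from obj.items()
    elif isinstance(obj, Sequence) and not isinstance(obj, (str, bytes)):
        yield from enumerate(obj)
    # Strings and other leaves yield nothing, which ends the recursion.


def float_key(obj, key):
    """Return the list of keys/indices leading to the first match of key, or None."""
    for k, child in _children(obj):
        if isinstance(obj, Mapping) and k == key:
            return [k]
        sub_path = float_key(child, key)
        if sub_path is not None:
            return [k] + sub_path
    return None


def float_value(obj, value):
    """Return the list of keys/indices leading to the first match of value, or None."""
    for k, child in _children(obj):
        if child == value:
            return [k]
        sub_path = float_value(child, value)
        if sub_path is not None:
            return [k] + sub_path
    return None


def get_value(obj, path):
    """Follow a tuple or list of keys/indices and return what is found there."""
    for step in path:
        obj = obj[step]
    return obj
```

In this sketch the path returned by float_key or float_value can be passed straight to get_value, and only the first match in depth-first order is reported.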
Example
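from Bio import Entrez
Entrez.email = 'your.name@example.com' #placeholder: NCBI asks for a contact address
identifier='WP_013067760'
handle = Entrez.esearch(db="protein", term=identifier, retmax=1)
d = Entrez.read(Entrez.efetch(db="protein", id=Entrez.read(handle)["IdList"][0], rettype="native"))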
from pprint import PrettyPrinter
pprint=PrettyPrinter().pprint
pprint(d) #unintelligible even with pprint!!
So here comes the script:
float_key(d,'Textseq-id_accession') #gives a list of keys that can be used in get_value for future iterations.
get_value(d,('Bioseq-set_seq-set',0,'Seq-entry_seq','Bioseq','Bioseq_id',0,'Seq-id_other','Textseq-id','Textseq-id_accession'))
#full circle! WP_013067760
Uniprot and Pfam
If you are dealing with Uniprot or Pfam, their XML is not handled by the Biopython parser used above. If you want to do it strictly by the book, you would use a schema, which is the template an XML file normally follows. However, whereas Uniprot parses fine via its XML schema, Pfam fails even with validation='lax'.
If you want to do it properly, you really ought to be using ElementTree, which I discuss in a different post. But if you are in a rush and are rusty on your ElementTree syntax (which is easy to forget), or you want your data as JSON or a similar corner case, the following solutions work:
Uniprot XML
import xmlschema
import xml.etree.ElementTree as ET
etx = ET.XML(xml) #Uniprot xml
schema = xmlschema.XMLSchema('https://www.uniprot.org/docs/uniprot.xsd')
entry_dict = schema.to_dict(etx)['{http://uniprot.org/uniprot}entry'][0] #worked
Pfam XML
etx = ET.XML(xml)
schema = xmlschema.XMLSchema('https://pfam.xfam.org/static/documents/schemas/protein.xsd')
entry_dict = schema.to_dict(etx) #fails
Therefore I suggest using the etree_to_dict code from this Stack Overflow answer:
protoentry_dict = etree_to_dict(etx)
if '{https://pfam.xfam.org/}matches' in protoentry_dict['{https://pfam.xfam.org/}pfam']['{https://pfam.xfam.org/}entry']:
    entry_dict = protoentry_dict['{https://pfam.xfam.org/}pfam']['{https://pfam.xfam.org/}entry']['{https://pfam.xfam.org/}matches']['{https://pfam.xfam.org/}match']
else:
    entry_dict = []
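For reference, here is a minimal sketch of what an etree_to_dict function of this kind can look like. It follows a commonly used recipe and may differ in details from the linked answer; the '@' prefix for attributes and the '#text' key are conventions assumed here, not something the Pfam or Uniprot formats require.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def etree_to_dict(element):
    """Convert an ElementTree element into nested dicts, lists and strings."""
    children = list(element)
    node = {}
    # Group children by tag; a tag that repeats becomes a list of values.
    grouped = defaultdict(list)
    for child in children:
        for tag, value in etree_to_dict(child).items():
            grouped[tag].append(value)
    for tag, values in grouped.items():
        node[tag] = values[0] if len(values) == 1 else values
    # Attributes are stored next to the children, marked with an '@' prefix.
    for name, value in element.attrib.items():
        node['@' + name] = value
    text = (element.text or '').strip()
    if children or element.attrib:
        if text:
            node['#text'] = text
        return {element.tag: node}
    # Leaf element without attributes: just its text (None if empty).
    return {element.tag: text or None}
```

Note that with this sketch a tag that occurs only once gives a single value rather than a one-item list, so code that iterates over matches should check for both cases.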
However, for purposes of sanity, I clean my entries to remove the namespace on tags
{http://uniprot.org/uniprot}
{https://pfam.xfam.org/}
that appear everywhere. Using this script:
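A minimal sketch of such a clean-up, assuming the parsed entry is made of plain dicts and lists whose keys start with the namespace in curly brackets. The function name strip_namespace and the regular expression are illustrative choices, not the original script.

```python
import re

# Matches a leading '{namespace}' such as '{http://uniprot.org/uniprot}'.
NAMESPACE = re.compile(r'^\{[^}]*\}')


def strip_namespace(obj):
    """Return a copy of obj with the leading '{namespace}' removed from dict keys."""
    if isinstance(obj, dict):
        return {
            (NAMESPACE.sub('', key) if isinstance(key, str) else key): strip_namespace(value)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [strip_namespace(item) for item in obj]
    # Leaves (strings, numbers, None) are returned untouched.
    return obj
```

Keys that do not start with a curly-bracketed namespace, such as attribute keys with an '@' prefix, are left as they are.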