How to deal with horridly complex dictionaries in Python

Wednesday, 19 December 2018

NCBI and Uniprot data are complex, so they are understandably stored in byzantine data structures whose schemas are fairly consistent but hard to decipher. Depending on the workflow, XML files can be held as dictionaries or as ElementTree.Element instances. Here I will talk about dictionaries; elsewhere I discuss using ElementTree. Dictionaries are easier to deal with in some cases, but you can spend ages trying to find the series of keys and indices required to reach a given value.
Here I present a few handy Python functions to find and retrieve a given key or value in a convoluted object of nested dictionary-like and list-like objects.


Code

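Say you have an NCBI protein ID and you run:

from Bio import Entrez
Entrez.email = "you@example.com"  # placeholder address; NCBI asks for a contact email
handle = Entrez.esearch(db="protein", term=identifier, retmax=1)
d = Entrez.read(Entrez.efetch(db="protein", id=Entrez.read(handle)["IdList"][0], rettype="native"))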

The dictionary d is a nightmare. Hence these three handy functions.
Two of them "float upwards" the path to a key, float_key(dictionary, key), or to a value, float_value(dictionary, value), in a convoluted nested dictionary. The third, get_value(dictionary, keys), follows a tuple of keys/indices and returns the value found there.
They are written for Python 3 and have not been tested under Python 2.
They also work on dictionary-like and list-like objects (cf. Biopython Entrez output).
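
The functions themselves are not reproduced in this post. As a rough illustration of the idea, here is a minimal sketch of how three such functions could be written. The names match the ones above, but the bodies, the private helper _children and the choice to return a list of every matching path are assumptions of this sketch, not the original code.

```python
from collections.abc import Mapping, Sequence


def _children(obj):
    """Yield (key_or_index, child) pairs for dict-like and list-like objects."""
    if isinstance(obj, Mapping):
        yield from obj.items()
    elif isinstance(obj, Sequence) and not isinstance(obj, (str, bytes)):
        yield from enumerate(obj)
    # Anything else (strings, numbers, None) is a leaf and has no children.


def float_key(obj, key, _path=()):
    """Return a list of paths (tuples of keys/indices) to every occurrence of `key`."""
    paths = []
    for k, child in _children(obj):
        here = _path + (k,)
        # Only dictionary keys count as matches, not list indices.
        if isinstance(obj, Mapping) and k == key:
            paths.append(here)
        paths.extend(float_key(child, key, here))
    return paths


def float_value(obj, value, _path=()):
    """Return a list of paths to every item that compares equal to `value`."""
    paths = []
    for k, child in _children(obj):
        here = _path + (k,)
        if child == value:
            paths.append(here)
        paths.extend(float_value(child, value, here))
    return paths


def get_value(obj, path):
    """Follow a tuple of keys/indices down into `obj` and return what is there."""
    for step in path:
        obj = obj[step]
    return obj


# Toy data standing in for a nested Entrez-style result (not the real layout).
nested = {'a': [{'b': {'target': 'X'}}, {'target': 'Y'}], 'c': 'X'}
key_paths = float_key(nested, 'target')
value_paths = float_value(nested, 'X')
found = get_value(nested, key_paths[0])
```

On the toy data, key_paths holds two tuples, one per occurrence of 'target', and each of them can be passed straight to get_value. Returning every path rather than only the first is a design choice of this sketch: repeated keys are common in this kind of record.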
Example

identifier='WP_013067760'
handle = Entrez.esearch(db="protein", term=identifier, retmax=1)
d = Entrez.read(Entrez.efetch(db="protein", id=Entrez.read(handle)["IdList"][0], rettype="native"))
from pprint import PrettyPrinter
pprint=PrettyPrinter().pprint
pprint(d) #unintelligible even with pprint!!
So here comes the script:

float_key(d,'Textseq-id_accession') #gives a list of keys that can be used in get_value for future iterations.
get_value(d,('Bioseq-set_seq-set',0,'Seq-entry_seq','Bioseq','Bioseq_id',0,'Seq-id_other','Textseq-id','Textseq-id_accession'))
#full circle! WP_013067760

Uniprot and Pfam

If you are dealing with Uniprot or Pfam, the Biopython Entrez parser used above does not support their XML.
If you want to do it strictly by the book, you would use a schema, which is the template an XML file is expected to follow. However, whereas Uniprot XML parses fine via its schema, Pfam XML fails even with validation='lax'.
If you want to do it properly, you really ought to be using ElementTree, which I discuss in a different post. But if you are in a rush and rusty on your ElementTree syntax (which is easy to forget), or you want your data as JSON or some similar corner case, the following solutions work:
Uniprot XML

import xmlschema
import xml.etree.ElementTree as ET
etx = ET.XML(xml) #Uniprot xml
schema = xmlschema.XMLSchema('https://www.uniprot.org/docs/uniprot.xsd')
entry_dict = schema.to_dict(etx)['{http://uniprot.org/uniprot}entry'][0] #worked
Pfam XML

etx = ET.XML(xml)
schema = xmlschema.XMLSchema('https://pfam.xfam.org/static/documents/schemas/protein.xsd')
entry_dict = schema.to_dict(etx) #fails
Therefore I suggest using the etree_to_dict code from this Stack Overflow answer:
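
The linked answer is not reproduced in this post. For orientation, here is a minimal sketch of an etree_to_dict converter, modelled on the recipe that commonly circulates under that name (which answer was meant is an assumption). The '@' prefix for attributes and the '#text' key are conventions of this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def etree_to_dict(element):
    """Recursively convert an ElementTree element into nested dicts and lists."""
    children = list(element)
    # Leaves without attributes collapse to their text (or None if empty).
    node = {} if (children or element.attrib) else None
    if children:
        grouped = defaultdict(list)
        for child in children:
            for tag, value in etree_to_dict(child).items():
                grouped[tag].append(value)
        # A tag seen once maps to its value; a repeated tag maps to a list.
        for tag, values in grouped.items():
            node[tag] = values[0] if len(values) == 1 else values
    # Attributes are stored under '@name' keys.
    for name, value in element.attrib.items():
        node['@' + name] = value
    text = (element.text or '').strip()
    if text:
        if node is None:
            node = text
        else:
            node['#text'] = text
    return {element.tag: node}


# Small made-up document, only to show the resulting shape.
sample = ET.XML('<root a="1"><item>x</item><item>y</item><single>z</single><empty/></root>')
sample_dict = etree_to_dict(sample)
```

Note that with this approach a tag that occurs once becomes a single value while a repeated tag becomes a list, so downstream code has to cope with both shapes.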

protoentry_dict = etree_to_dict(etx)
if '{https://pfam.xfam.org/}matches' in protoentry_dict['{https://pfam.xfam.org/}pfam']['{https://pfam.xfam.org/}entry']:
    entry_dict = protoentry_dict['{https://pfam.xfam.org/}pfam']['{https://pfam.xfam.org/}entry']['{https://pfam.xfam.org/}matches']['{https://pfam.xfam.org/}match']
else:
    entry_dict = []


However, for the sake of sanity, I clean my entries to remove the namespace prefixes ({http://uniprot.org/uniprot} and {https://pfam.xfam.org/}) that appear on tags everywhere, using this script:
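
The script itself is missing from this copy of the post. As a stand-in, a minimal sketch of such a cleaner could look like the following. The function name strip_namespaces and the regular expression are assumptions of this sketch, and it only handles a leading {namespace} on dictionary keys.

```python
import re
from collections.abc import Mapping

# Matches a leading '{namespace-uri}' as ElementTree writes it on tags.
_NAMESPACE = re.compile(r'^\{[^}]*\}')


def strip_namespaces(obj):
    """Return a copy of `obj` with '{namespace}' prefixes removed from all keys."""
    if isinstance(obj, Mapping):
        return {
            (_NAMESPACE.sub('', key) if isinstance(key, str) else key): strip_namespaces(value)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [strip_namespaces(item) for item in obj]
    return obj  # leaves (strings, numbers, None) are returned unchanged


# Made-up fragment shaped like a namespaced entry dictionary.
raw = {'{http://uniprot.org/uniprot}entry': [
    {'{http://uniprot.org/uniprot}accession': ['P12345'], '@dataset': 'Swiss-Prot'}
]}
clean = strip_namespaces(raw)
```

One caveat of this approach: if two keys in the same dictionary differ only by namespace, they collapse to the same name and the later one wins.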
