How to deal with horrid XML dictionaries in Python

Wednesday, 19 December 2018

NCBI and Uniprot data are stored as byzantine XML files whose schemas are fairly consistent but hard to decipher. You can spend ages trying to find the series of keys and indices required to reach a given value.
Here I present a few handy Python functions to locate a given key or value in a convoluted object of nested dictionary-like and list-like objects, and to retrieve a value from its path.


Code

Say you have an NCBI protein ID (identifier) and you run:

from Bio import Entrez  # Biopython; NCBI also asks you to set Entrez.email
handle = Entrez.esearch(db="protein", term=identifier, retmax=1)
d = Entrez.read(Entrez.efetch(db="protein", id=Entrez.read(handle)["IdList"][0], rettype="native"))

The dictionary d is a nightmare. Hence these three handy functions.
Two of them "float upwards" the path to a given key, float_key(dictionary, key), or to a given value, float_value(dictionary, value), in a convoluted nested dictionary. The third, get_value(dictionary, path), follows a tuple of keys/indices and returns the value found there.
They are written for Python 3 and are not tested under Python 2.
They also work on dictionary-like and list-like objects (cf. Biopython Entrez output).
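
To give a rough idea of how such helpers can work, here is a minimal sketch. It is an illustration written from the description above, not necessarily the original implementation: the helper _children is hypothetical, and returning a list of every matching path (rather than only the first) is an assumption.

```python
def _children(obj):
    """Return (key_or_index, child) pairs for dict-like and list-like objects."""
    if hasattr(obj, 'items'):               # dictionary-like
        return list(obj.items())
    if isinstance(obj, (list, tuple)):      # list-like (strings are deliberately excluded)
        return list(enumerate(obj))
    return []                               # leaf: nothing to descend into


def float_key(obj, key, path=()):
    """Return the paths (tuples of keys/indices) leading to every occurrence of key."""
    paths = []
    for k, child in _children(obj):
        here = path + (k,)
        if hasattr(obj, 'items') and k == key:   # only dictionary keys count, not list indices
            paths.append(here)
        paths.extend(float_key(child, key, here))
    return paths


def float_value(obj, value, path=()):
    """Return the paths leading to every entry equal to value."""
    paths = []
    for k, child in _children(obj):
        here = path + (k,)
        if child == value:
            paths.append(here)
        paths.extend(float_value(child, value, here))
    return paths


def get_value(obj, path):
    """Follow a tuple of keys/indices and return what is found at the end."""
    for step in path:
        obj = obj[step]
    return obj


# Toy stand-in for a nested Entrez-style object (made-up structure)
toy = {'Bioseq_id': [{'Seq-id_other': {'Textseq-id_accession': 'WP_013067760'}},
                     {'Seq-id_gi': '501046680'}]}
path = float_key(toy, 'Textseq-id_accession')[0]
print(path)                   # (‘Bioseq_id’, 0, ‘Seq-id_other’, ‘Textseq-id_accession’) as a tuple
print(get_value(toy, path))   # WP_013067760
```

Because each path is a plain tuple, it can be stored and reused with get_value on other records that share the same layout.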
Example

from pprint import pprint
from Bio import Entrez

identifier = 'WP_013067760'
handle = Entrez.esearch(db="protein", term=identifier, retmax=1)
d = Entrez.read(Entrez.efetch(db="protein", id=Entrez.read(handle)["IdList"][0], rettype="native"))
pprint(d)  # unintelligible even with pprint!
So here comes the script:

float_key(d,'Textseq-id_accession') #gives a list of keys that can be used in get_value for future iterations.
get_value(d,('Bioseq-set_seq-set',0,'Seq-entry_seq','Bioseq','Bioseq_id',0,'Seq-id_other','Textseq-id','Textseq-id_accession'))
#full circle! WP_013067760

Uniprot and Pfam

If you are dealing with Uniprot or Pfam, Biopython does not provide a parser for their XML.
If you want to do it by the book, you would use a schema, which is the template an XML file normally follows. However, whereas Uniprot XML parses fine via its XML schema, Pfam fails even with validation='lax'.
Uniprot XML

import xmlschema
import xml.etree.ElementTree as ET

etx = ET.XML(xml)  # xml: the Uniprot XML as a string
schema = xmlschema.XMLSchema('https://www.uniprot.org/docs/uniprot.xsd')
entry_dict = schema.to_dict(etx)['{http://uniprot.org/uniprot}entry'][0] #worked
Pfam XML

etx = ET.XML(xml)
schema = xmlschema.XMLSchema('https://pfam.xfam.org/static/documents/schemas/protein.xsd')
entry_dict = schema.to_dict(etx) #fails
Therefore I suggest using the etree_to_dict function from this Stack Overflow answer:

ns = '{https://pfam.xfam.org/}'  # namespace prefix on every tag
protoentry_dict = etree_to_dict(etx)
entry = protoentry_dict[ns + 'pfam'][ns + 'entry']
if ns + 'matches' in entry:
    entry_dict = entry[ns + 'matches'][ns + 'match']
else:
    entry_dict = []
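
For reference, here is a sketch of an etree_to_dict function along the lines of the widely circulated recipe for converting an ElementTree element to nested dictionaries. The exact code in the linked answer may differ; the '@' prefix for attributes and the '#text' key for text are conventions assumed here, and the XML string is a made-up miniature, not real Pfam output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def etree_to_dict(t):
    """Convert an ElementTree element into nested dicts/lists (sketch)."""
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        grouped = defaultdict(list)
        for child_dict in map(etree_to_dict, children):
            for k, v in child_dict.items():
                grouped[k].append(v)
        # a tag seen once stays a single item, repeated tags become a list
        d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}}
    if t.attrib:
        # attributes are stored under '@name' (assumed convention)
        d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
                d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d


# Made-up miniature document using the Pfam namespace
sample = ('<pfam xmlns="https://pfam.xfam.org/"><entry accession="P1">'
          '<matches><match id="A"/><match id="B"/></matches></entry></pfam>')
ns = '{https://pfam.xfam.org/}'
result = etree_to_dict(ET.XML(sample))
matches = result[ns + 'pfam'][ns + 'entry'][ns + 'matches'][ns + 'match']
print(matches)  # [{'@id': 'A'}, {'@id': 'B'}]
```

One thing to watch with this kind of conversion: when a protein has a single match, the 'match' entry comes back as a dictionary rather than a one-item list, so downstream code may need to wrap it in a list.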


However, for the sake of sanity, I run a small script over my entries to remove the namespace tags {http://uniprot.org/uniprot} and {https://pfam.xfam.org/} that appear everywhere.
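
A cleanup of this kind can be sketched as a recursive walk that strips any {namespace} chunk from dictionary keys. The function name strip_namespaces is hypothetical, and the sketch assumes that no two keys in the same dictionary become identical once the namespace is removed.

```python
import re

NAMESPACE = re.compile(r'\{[^{}]*\}')  # matches '{http://uniprot.org/uniprot}' and similar


def strip_namespaces(obj):
    """Return a copy of obj with '{namespace}' removed from every dictionary key."""
    if isinstance(obj, dict):
        return {NAMESPACE.sub('', k) if isinstance(k, str) else k: strip_namespaces(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [strip_namespaces(item) for item in obj]
    return obj  # leaf values (strings, numbers, None) are left untouched


# Made-up fragment in the style of a parsed Uniprot entry
raw = {'{http://uniprot.org/uniprot}entry': [
    {'{http://uniprot.org/uniprot}accession': ['P12345'], '@dataset': 'Swiss-Prot'}]}
clean = strip_namespaces(raw)
print(clean)  # {'entry': [{'accession': ['P12345'], '@dataset': 'Swiss-Prot'}]}
```

Only keys are rewritten, so values that happen to contain curly braces are preserved as they are.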
