Tuesday, 19 February 2019

Uniprot XML and Python ElementTree

Biopython has only limited support for Uniprot entries. An entry holds so much data that any interface covering all of it would be a complex standard the user would have to try and remember, which would defeat the point. The best way is for users to choose for themselves which pieces of data they want.
Here I discuss the best way to deal with Uniprot XML files using ElementTree, which is really nice but awkward at times, so I also talk about a few monkeypatches that help. If you do not wish to deal with ElementTree (say you want a really quick, but messy, fix) see my post about complicated dictionaries.

ElementTree and XML

First, let's talk about XML and ElementTree.
XML, like HTML, has the following syntax, which you are undoubtedly familiar with:
  • "elements" marked by two parts (called "tags"), one opening and one closing: <ELEMENT ...>...</ELEMENT>
  • "attributes" within the opening tag: <ELEMENT attrib='value'>
  • "content", which is anything between the tags, including other elements.
A Python ElementTree.Element instance has three main attributes:
  • .tag: the name of the element. Note that it also includes the namespace of the schema in curly brackets; the schema tells a machine how to interpret the file, say to convert it into a database.
  • .attrib: a dictionary of attribute names and values.
  • .text: the text directly inside the element, before its first child element (not the text within child elements).
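
To make these three attributes concrete, here is a minimal sketch on a hand-written fragment. The fragment, its accession and its namespace URI are assumptions shaped like a Uniprot entry for illustration, not a real record.

```python
import xml.etree.ElementTree as ET

# Hand-written, Uniprot-like fragment (made up for illustration only).
xml_text = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>P00000</accession>
    <name>EXAMPLE_HUMAN</name>
  </entry>
</uniprot>"""

root = ET.fromstring(xml_text)
entry = root[0]        # first child of the root
accession = entry[0]   # first child of the entry

print(root.tag)        # {http://uniprot.org/uniprot}uniprot
print(entry.attrib)    # {'dataset': 'Swiss-Prot'}
print(accession.text)  # P00000
```

Note that entry.text here is only the whitespace between the opening tag and the first child: the accession lives in the child's own .text.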

Monkeypatches

Unfortunately for parsing, the XML of Uniprot has many, many elements, many of them optional. For instance, an entry may have none, one or more comment elements, and several elements are nested. A good example is
Here are some really useful boolean methods that can be added to ElementTree.Element to make life a lot easier:
Specifically, they allow one to quickly and cleanly test things like "if elem.is_tag('accession')" or "if elem.is_tag('name') and elem.has_attr('type', 'primary'):" without too much awkward syntax.
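
As a rough sketch of what such tests could look like, the helpers below mirror the names used above (is_tag, has_attr, plus ns_strip). Their bodies are my assumption, inferred only from how they are called in the text. They are written as plain functions that take the element as the first argument, because CPython's C-accelerated Element class may refuse to have new attributes set on it.

```python
import xml.etree.ElementTree as ET


def ns_strip(elem):
    """Return the tag name without the leading '{namespace}' part."""
    return elem.tag.rsplit('}', 1)[-1]


def is_tag(elem, name):
    """True if the element's namespace-free tag name equals `name`."""
    return ns_strip(elem) == name


def has_attr(elem, key, value=None):
    """True if attribute `key` exists and, when `value` is given, equals it."""
    if key not in elem.attrib:
        return False
    return value is None or elem.attrib[key] == value


# Made-up element for demonstration (namespace URI is an assumption).
name = ET.fromstring(
    '<name xmlns="http://uniprot.org/uniprot" type="primary">ABC1</name>'
)
if is_tag(name, 'name') and has_attr(name, 'type', 'primary'):
    print('primary name:', name.text)
```

Where the Element class does accept new attributes, the same functions could be assigned to it as methods, since each already takes the element as its first argument.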

Large files

A caveat is required for large files. .parse loads everything into memory, which could be a bad thing if you are using the XML from the FTP site of Uniprot. In that case you have to use the iterparse function, asking it to yield each element on the "end" event, that is, once its closing tag has been read. As the first element (the root) keeps hold of all the elements parsed so far, it is essential to clear each element once you are done with it.

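import xml.etree.ElementTree as ET

def iter_human(filename):
    """
    Iterates across a LARGE Uniprot XML file and yields *only* the human entries.
    ns_strip and is_human are not part of ElementTree: they are assumed to be
    among the methods monkeypatched onto ET.Element.
    :return: ET.Element()
    """
    for event, elem in ET.iterparse(filename, events=('end',)):
        if elem is not None and isinstance(elem, ET.Element):
            # only act on complete entry elements, not on their children
            if elem.ns_strip() == 'entry':
                if elem.is_human():
                    yield elem
                # free the entry's contents once the caller has moved on
                elem.clear()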
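
For readers who want to try the idea without any patched methods, here is a self-contained sketch of the same loop. The namespace constant, the is_human test (an NCBI Taxonomy cross-reference with id 9606 under organism) and the two-entry demo data are my assumptions for illustration, so check them against your own file.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://uniprot.org/uniprot}'  # assumed namespace; check your root tag


def is_human(entry):
    """True if <organism> carries an NCBI Taxonomy reference with id 9606."""
    for ref in entry.iterfind(f'{NS}organism/{NS}dbReference'):
        if ref.get('type') == 'NCBI Taxonomy' and ref.get('id') == '9606':
            return True
    return False


def iter_human(source):
    """Yield human <entry> elements one by one from a file name or object."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == NS + 'entry':
            if is_human(elem):
                yield elem
            elem.clear()  # runs when the caller asks for the next entry


# Tiny made-up file with one human and one mouse entry.
demo = io.BytesIO(b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><accession>X00001</accession>
    <organism><dbReference type="NCBI Taxonomy" id="9606"/></organism></entry>
  <entry><accession>X00002</accession>
    <organism><dbReference type="NCBI Taxonomy" id="10090"/></organism></entry>
</uniprot>""")

accessions = [entry.findtext(NS + 'accession') for entry in iter_human(demo)]
print(accessions)  # ['X00001']
```

Because the entry is cleared as soon as the generator resumes, anything needed from it has to be extracted inside the loop body, as the list comprehension does here. The emptied entries stay attached to the root as hollow shells, which is usually small enough to ignore.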
