
Thursday, 24 August 2023

Reading compressed molecular files on NFS


There are some tasks that make one feel like a failed door-to-door evangelist, and one of these is proselytising about using compressed files on networked file systems. Namely, an NFS mount is slower than a local SSD, so it is most often quicker to read a compressed file and decompress it in memory than to decompress it to disk first. Here are two Python snippets for dealing with small molecule files.

Zip files

These exist for Windows users and are annoying [intentional ambiguous subject].
A zip is a multifile archive, a tarball-equivalent. Below it's assumed there's a single file in the archive, but if there are more one could iterate across the list given by the .infolist() method of the compressed archive filehandle object (zipfile.ZipFile).
In RDKit, an SDF can be read from a filehandle, not via Chem.SDMolSupplier but via Chem.ForwardSDMolSupplier. The filehandle needs to be a binary stream, not decoded text, which is fine here as a member of a zip archive opens as binary. Had it been text, things would get complicated, as the obvious workaround, binary = text.encode('utf8'), would fill memory up quickly. Another thing to note is that a text and a binary stream can be type-hinted with typing.TextIO and typing.BinaryIO respectively, while the in-memory io classes are io.StringIO and io.BytesIO.

import zipfile
from rdkit import Chem
from typing import BinaryIO

with zipfile.ZipFile('👾👾👾_sdf.zip', mode="r") as zah:
    # assumes a single file in the archive, opened as a binary stream
    zfh: BinaryIO = zah.open(zah.infolist().pop())
    with Chem.ForwardSDMolSupplier(zfh) as sdfh:
        mol: Chem.Mol
        for mol in sdfh:
            ...
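For an archive holding more than one file, the iteration over .infolist() mentioned above could look something like the sketch below. The function name iter_sdf_members and the filter on the .sdf extension are my own assumptions for illustration, not part of zipfile or RDKit. It uses only the standard library, so each yielded stream still has to be handed to Chem.ForwardSDMolSupplier by the caller.

```python
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_sdf_members(zip_path: str) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (member name, open binary stream) for each .sdf file in a zip.

    Hypothetical helper: the name and the extension filter are assumptions.
    """
    with zipfile.ZipFile(zip_path, mode="r") as zah:
        for info in zah.infolist():
            # skip directory entries and anything that is not an SDF
            if info.is_dir() or not info.filename.lower().endswith(".sdf"):
                continue
            with zah.open(info) as zfh:
                # zfh is a binary stream, valid only until the next iteration
                yield info.filename, zfh
```

Each stream is closed as soon as the loop moves on to the next member, so the molecules have to be consumed inside the loop body rather than stored for later.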

GNU Zip files

Gunzip is a comically named command, but let's put that down.

import gzip
from rdkit import Chem
from typing import BinaryIO

# for gzip.open, mode="r" means binary ("rb"), which is what the supplier needs
with gzip.open('👾👾👾.sdf.gz', mode="r") as zfh:
    with Chem.ForwardSDMolSupplier(zfh) as sdfh:
        mol: Chem.Mol
        for mol in sdfh:
            ...

Writing

In RDKit, Chem.SDWriter can accept text streams, so the file would be opened with gzip.open('👾👾👾.sdf.gz', mode="wt").
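To show what the "wt" mode gives, here is a minimal standard-library sketch, without RDKit. The helper names write_text_gz and read_bytes_gz are made up for illustration, and the chunks are placeholder strings rather than real mol blocks. The handle opened in text mode accepts str, which is what a text-stream writer emits, while the data is compressed on its way to disk.

```python
import gzip
from typing import Iterable


def write_text_gz(path: str, chunks: Iterable[str]) -> None:
    """Write str chunks through a gzip text stream (hypothetical helper)."""
    # "wt" puts a text layer over the compressor, so str is accepted
    with gzip.open(path, mode="wt", encoding="utf-8") as gzfh:
        for chunk in chunks:
            gzfh.write(chunk)


def read_bytes_gz(path: str) -> bytes:
    """Read the decompressed content back as bytes (hypothetical helper)."""
    # for gzip.open, "r" is the same as "rb": the stream yields bytes
    with gzip.open(path, mode="r") as gzfh:
        return gzfh.read()
```

With RDKit, the text handle would be passed to Chem.SDWriter instead of calling .write() on it directly.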

Bash

One limitation is that bashfu gets a bit more complicated: zcat replaces cat, but for head one has to pipe the decompressed stream, gunzip -c 👾👾👾.gz | head -n 🤖.
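If one is already in Python, the same peek at the top of a file can be sketched with the standard library. The function name gz_head is an assumption of mine, not an existing command or API.

```python
import gzip
from itertools import islice
from typing import List


def gz_head(path: str, n: int = 10) -> List[str]:
    """Return the first n lines of a gzipped text file (hypothetical helper)."""
    with gzip.open(path, mode="rt", encoding="utf-8") as gzfh:
        # islice stops after n lines, so the whole file is not decompressed
        return [line.rstrip("\n") for line in islice(gzfh, n)]
```

As with head, asking for more lines than the file holds simply returns all of them.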
