Some tasks make one feel like a failed door-to-door evangelist, and one of them is proselytising about keeping files compressed on networked file systems. Network file systems (such as NFS) are slower than local SSDs, so it is usually quicker to read a compressed file and decompress it in memory than to decompress it to disk first. Here are two Python snippets for doing this with small-molecule files.
Zip files
These exist for Windows users and are annoying [intentional ambiguous subject].
A zip is a multi-file archive, the equivalent of a tarball. The snippet below assumes there is a single file in the archive; if there are more, one can iterate over the list returned by the .infolist() method of the archive object (zipfile.ZipFile).
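For the multi-file case, a minimal sketch of that iteration follows. It uses only the standard library; the helper name iter_sdf_members, the member names and the file contents are made up for illustration, and in real use each member stream would be handed to a reader rather than just peeked at.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple


def iter_sdf_members(archive: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, first line) for every .sdf member of a zip archive.

    Hypothetical helper: only the first line is read, to show that each
    member is opened as its own binary stream.
    """
    with zipfile.ZipFile(archive, mode="r") as zah:
        for info in zah.infolist():
            # skip directories and anything that is not an SDF
            if info.is_dir() or not info.filename.endswith(".sdf"):
                continue
            with zah.open(info) as zfh:  # binary stream of the decompressed member
                yield info.filename, zfh.readline()


# Build a small in-memory archive with made-up contents for the demo
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zah:
    zah.writestr("first.sdf", "mol_A\n...\n$$$$\n")
    zah.writestr("second.sdf", "mol_B\n...\n$$$$\n")
    zah.writestr("readme.txt", "not a molecule file\n")
buffer.seek(0)

members = list(iter_sdf_members(buffer))
print(members)
```

Filtering on the extension is a design choice of this sketch, not something zipfile requires.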
In RDKit, an SDF can be read from a filehandle, not via Chem.SDMolSupplier but via Chem.ForwardSDMolSupplier. The latter needs a binary stream, not a text stream, which is fine here because ZipFile.open returns one. Otherwise things would get complicated, as the obvious workaround, binary = text.encode('utf8'), makes a second full copy of the data and would fill memory up quickly. Another thing to note is that text and binary streams can be type-hinted with typing.TextIO and typing.BinaryIO, while the corresponding in-memory io classes are io.StringIO and io.BytesIO.
    import zipfile
    from rdkit import Chem
    from typing import BinaryIO

    with zipfile.ZipFile('πΎπΎπΎ_sdf.zip', mode="r") as zah:
        # assumes a single file in the archive
        zfh: BinaryIO = zah.open(zah.infolist().pop())
        with Chem.ForwardSDMolSupplier(zfh) as sdfh:
            mol: Chem.Mol
            for mol in sdfh:
                ...
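On the text-versus-binary point, here is a small standard-library sketch of the two stream kinds and their type hints. The contents are made up. The third stream uses io.TextIOWrapper, which decodes a binary stream piece by piece as it is read; that covers the binary-to-text direction only, not the text-to-binary one mentioned above.

```python
import io
from typing import BinaryIO, TextIO

# The two in-memory stream classes, with their typing hints
binary_stream: BinaryIO = io.BytesIO(b"mol_A\n$$$$\n")
text_stream: TextIO = io.StringIO("mol_A\n$$$$\n")

# A binary stream can be presented as text without decoding it all at once:
# TextIOWrapper decodes in chunks as lines are requested.
wrapped: TextIO = io.TextIOWrapper(io.BytesIO(b"mol_A\n$$$$\n"), encoding="utf8")

first_binary = binary_stream.readline()  # bytes
first_text = text_stream.readline()      # str
first_wrapped = wrapped.readline()       # str, decoded on the fly

print(type(first_binary), type(first_text), type(first_wrapped))
```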
GNU Zip files
Gunzip is a comically named command, but let's put that down.
    import gzip
    from rdkit import Chem
    from typing import BinaryIO

    zfh: BinaryIO
    # "rb" is what gzip.open does by default: a binary stream
    with gzip.open('πΎπΎπΎ.sdf.gz', mode="rb") as zfh:
        with Chem.ForwardSDMolSupplier(zfh) as sdfh:
            mol: Chem.Mol
            for mol in sdfh:
                ...
Writing
In RDKit, Chem.SDWriter can accept text streams, so the filehandle to give it would come from gzip.open('πΎπΎπΎ.sdf.gz', mode="wt").
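To show what mode="wt" gives, here is a standard-library-only sketch that writes text through a gzip handle and reads it back. It makes no RDKit calls: the plain string stands in for whatever a writer would emit, and the file name and record contents are made up.

```python
import gzip
import os
import tempfile

# Made-up SDF-like text standing in for what a writer object would produce
sdf_text = "mol_A\n  made-up record\nM  END\n$$$$\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.sdf.gz")  # hypothetical file name
    # "wt" gives a text stream that is compressed as it is written
    with gzip.open(path, mode="wt", encoding="utf8") as gzh:
        gzh.write(sdf_text)
    # "rt" gives the decompressed content back as text
    with gzip.open(path, mode="rt", encoding="utf8") as gzh:
        round_trip = gzh.read()
    # what is on disk is a genuine gzip file: it starts with the bytes 1f 8b
    with open(path, "rb") as fh:
        magic = fh.read(2)

print(round_trip == sdf_text, magic)
```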
Bash
One limitation is that bash-fu gets a bit more complicated: zcat replaces cat, but for head one has to pipe the decompressed stream, as in gunzip -c πΎπΎπΎ.gz | head -n π€.
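If staying in Python is an option, the same head-like peek can be sketched with itertools.islice, which stops reading (and therefore decompressing) after the requested number of lines. The helper name gz_head and the demo file are hypothetical.

```python
import gzip
import itertools
import os
import tempfile
from typing import List


def gz_head(path: str, n: int = 10) -> List[str]:
    """Return the first n lines of a gzipped text file (hypothetical helper)."""
    with gzip.open(path, mode="rt", encoding="utf8") as fh:
        # islice pulls only n lines from the stream, then stops
        return list(itertools.islice(fh, n))


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt.gz")  # made-up file for the demo
    with gzip.open(path, mode="wt", encoding="utf8") as fh:
        fh.writelines(f"line {i}\n" for i in range(100))
    head = gz_head(path, 3)

print(head)
```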