Rosetta/Pyrosetta on a cluster or in the cloud

Wednesday, 7 October 2020

Due to licensing, Rosetta and PyRosetta cannot be installed via apt-get or pip but have to be downloaded from the Rosetta Commons website. This makes things harder if you are in a Colabs notebook, ssh'ed into a machine or running off a remote Jupyter notebook. Luckily, it is actually straightforward.

Warning: This information is out of date.

TL;DR

Here is a notebook in Colabs with PyRosetta. If you want a snippet of code to use in a Colabs cell, here it is:
#@title Installation
#@markdown The following is not the real password. However, the format is similar.
username = 'boltzmann' #@param {type:"string"}
password = 'constant' #@param {type:"string"}
#@markdown Release to install:
_release = 'release-295' #@param {type:"string"}
#@markdown Use Google Drive for PyRosetta (way faster next time, but takes up space)
use_drive = True #@param {type:"boolean"}


import sys
import platform
import os
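# platform.dist() was removed in Python 3.8, so read /etc/os-release instead
with open('/etc/os-release') as fh:
  assert 'Ubuntu' in fh.read()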
py_version = str(sys.version_info.major) + str(sys.version_info.minor)
if use_drive:
  from google.colab import drive
  drive.mount('/content/drive')
  _path = '/content/drive/MyDrive'
  os.chdir(_path)
else:
  _path = '/content'
if not any(['PyRosetta4.Release' in filename for filename in os.listdir()]):
  os.system(f'curl -u {username}:{password} https://graylab.jhu.edu/download/PyRosetta4/archive/release/PyRosetta4.Release.python{py_version}.ubuntu/PyRosetta4.Release.python{py_version}.ubuntu.{_release}.tar.bz2 -o /content/a.tar.bz2')
  os.system('tar -xf /content/a.tar.bz2')
os.system(f'pip3 install -e {_path}/PyRosetta4.Release.python{py_version}.ubuntu.{_release}/setup/')
import site
site.main()
import pyrosetta

Download

When logged into the target machine (discussed below), there are two main ways of downloading files from websites in the terminal: wget and curl. The former is easy to use (and follows redirects well), while the latter is more powerful. When you go to www.pyrosetta.org/dow, find the version you want and click it, a pop-up appears. This is not a JavaScript modal but an integral part of the browser (HTTP basic authentication): once filled in, the browser requests the page with the username and password in the request header. To achieve this with curl, use the option -u followed by username:password. For example:

curl -u 👾👾👾:👾👾👾 https://graylab.jhu.edu/download/PyRosetta4/archive/release/PyRosetta4.Release.python38.linux/PyRosetta4.Release.python38.linux.release-273.tar.bz2 -o a.tar.bz2
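The same request can be made from Python, which is handy inside a notebook. The following is a minimal sketch using only the standard library; the function names basic_auth_header and download are made up for illustration, and the credentials are whatever you were sent.

```python
import base64
import shutil
import urllib.request


def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Basic auth header that `curl -u user:pass` sends."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def download(url: str, username: str, password: str, dest: str) -> str:
    """Stream `url` into the file `dest`, authenticating like `curl -u`."""
    request = urllib.request.Request(url, headers=basic_auth_header(username, password))
    with urllib.request.urlopen(request) as response, open(dest, "wb") as fh:
        shutil.copyfileobj(response, fh)  # copy in chunks, the archive is large
    return dest
```

Calling download(url, username, password, 'a.tar.bz2') with the URL from the curl command above should leave the same file on disk.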

If you don't remember the password, the email was sent by "comotion" (with one em). However, in addition to searching your inbox you should brush up on your protein folding theory, as the password (common to all academic users) is very memorable. Tut tut!

The same goes for Rosetta, but with a different username, password and URL: https://www.rosettacommons.org/downloads/academic/3.12/rosetta_bin_linux_3.12_bundle.tgz. As of writing, the SSL certificate is expired for the latter (namely, Chrome has hissy fits because it is not https://), so you may also need to add -k if you get an SSL warning. If you have some other issue, searching for the username:password in Google turns up a few GitHub sh files and instructions that people have negligently made public.

Once it is downloaded, extract it, navigate into it and install:

tar -xf a.tar.bz2
cd PyRosetta4.Release.python38.linux.release-273/setup
pip install .

If your download is a .tar.gz use tar -xzf a.tar.gz.
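If the tar on the machine cannot find its decompression helper (for example, there is no bzip2 binary), Python's own tarfile module is an alternative, assuming your Python was built with bz2 support. A minimal sketch, where extract_archive is a hypothetical helper name:

```python
import os
import tarfile


def extract_archive(archive: str, dest: str = ".") -> list:
    """Extract a .tar.bz2 or .tar.gz archive into `dest`.

    Mode "r:*" lets tarfile detect the compression itself, so the same
    call covers both `tar -xf a.tar.bz2` and `tar -xzf a.tar.gz`.
    Returns the sorted names of the top-level entries in the archive.
    """
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest)  # only do this with archives from a trusted source
    return sorted({os.path.normpath(name).split(os.sep)[0] for name in names})
```

The returned list tells you which folder to cd into afterwards.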

If you have not come across the pip install . command, take note: it is very handy for any Python module you may write, as it installs the module present in the current folder, assuming there is a setup.py file. The latter is easy to write and well documented. Running pip install -e . is even cooler, as it means that you can edit your module without having to re-install it.

To use the system pip you need to be admin; to use the conda pip you don't. If you have activated your conda environment (your shell prompt will show it), pip will be the conda pip. To make sure you are using the correct pip, check with which pip.
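The same check can be done from within Python, which helps in a notebook where it is not obvious which environment is active. The sketch below is only a rough heuristic (it compares folders, and pip_matches_interpreter is a made-up name); running python -m pip install . instead of pip install . sidesteps the question, because that pip always belongs to the python you call.

```python
import os
import shutil
import sys


def pip_matches_interpreter():
    """Return True if the `pip` on PATH sits in the same folder as the running Python.

    Returns None when there is no `pip` executable on PATH at all.
    """
    pip_path = shutil.which("pip")
    if pip_path is None:
        return None
    pip_dir = os.path.dirname(os.path.abspath(pip_path))
    python_dir = os.path.dirname(os.path.abspath(sys.executable))
    return pip_dir == python_dir


print("python:", sys.executable)
print("pip   :", shutil.which("pip"))
print("same folder:", pip_matches_interpreter())
```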

Jupyter

(This topic is covered in more detail here)

In a Jupyter notebook there are three ways to run a bash command: in a terminal page, in a whole cell prefixed with the cell magic %%bash, or by prefixing a line of code with !. For example:

!echo 'hello world'

If you are not sure what version of python you are running you can check with:

import sys
print(sys.version)

A special case is Colabs or similar, where a new container is spun up for each session. This is a bit of a pain because you would need to reinstall PyRosetta every time if you did it the normal way. One option is to download it once and put it in your Google Drive, so you can skip the download and extraction steps. To do so:

from google.colab import drive
drive.mount('/content/drive')
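Once the drive is mounted, a later session only needs to check whether an extracted release is already there. A minimal sketch of that check, mirroring the test in the TL;DR snippet; find_cached_release is a hypothetical helper name and the folder is wherever you extracted the archive.

```python
import os


def find_cached_release(folder: str, prefix: str = "PyRosetta4.Release"):
    """Return the path of the first extracted PyRosetta folder in `folder`, or None."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if name.startswith(prefix) and os.path.isdir(path):
            return path
    return None
```

In Colabs this would be called as find_cached_release('/content/drive/MyDrive'); if it returns None, download and extract first.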

Remote machines

Too often I hear of people using their laptops for calculations, because they are unaware of HPC resources available to them. It is not uncommon to have a department cluster, a university cluster or a national cluster. In fact, at the University of Oxford I have access to a departmental, a university (ARC) and a national cluster (Archer), while during my PhD at the University of Otago (NZ) I had access to a departmental and a national one (NeSI). Information online about these is generally patchy, but central or departmental IT people can direct you to the correct place. That is, do not assume there is not one because you don't know about it.

In extremis, one can apply to get Amazon AWS credit (easy and not much paperwork) and run calculations in the cloud.

The command ssh allows one to connect to a remote machine and use the terminal there as if locally. On a Windows machine you can do this either by enabling the Ubuntu developer kernel (the Windows Subsystem for Linux) in a Windows 10 machine (recommended) or by using PuTTY.

One thing to remember is that clusters may have a job scheduler that allocates a script to a given compute node, whereas when you ssh in, you are on a login node. It is all too common, but extremely inconsiderate, to run calculations on a login node. So do familiarise yourself with Slurm, Sun Grid Engine or whatever job scheduler is present.
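As an illustration of what handing a job to the scheduler looks like, here is a sketch that writes a minimal batch script, assuming the scheduler is Slurm. The job name, the relax.py script and the resource numbers are invented, and most clusters require further options (partition, account, memory), so check the local documentation.

```python
def make_slurm_script(job_name: str, command: str, hours: int = 1, cpus: int = 1) -> str:
    """Return the text of a minimal Slurm batch script that runs `command`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={hours:02d}:00:00",    # wall-clock limit, HH:MM:SS
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --output={job_name}_%j.log",  # %j is replaced by the job ID
        "",
        command,
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical job: run a PyRosetta script on 4 cores for up to 12 hours.
    print(make_slurm_script("relax", "python relax.py", hours=12, cpus=4))
```

Saved to a file, say relax.sh, the script is submitted from the login node with sbatch relax.sh and the calculation itself runs on a compute node.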

Another is that, in order to cater for all usages, most clusters are set up to dynamically modify one's user environment via "environment modules":

module avail
module load python/3.8

In our case module load conda is best due to the install step.

Remote notebook

NB. This section applies to a node that does not depend on a scheduler; do not do this on a login node. If your cluster uses a scheduler, ask the person responsible for it about interactive nodes and JupyterHub options.

When you run a Jupyter notebook you can specify the port with --port 9999. If the port is exposed (--ip="*" and no firewall), the notebook is visible to everyone on the same network (in the browser, type the machine's local IP followed by :9999). Generally it will be blocked, but luckily you can do port forwarding via ssh (see explainshell.com/explain?cmd=ssh+-L+-N+-f+-l for more), namely by typing in your own machine's terminal:

ssh -N -L localhost:9999:localhost:9999 👾👾👾@👾👾👾

Here 👾👾👾@👾👾👾 is the user@address of the remote machine, which you are also sshed into normally and where you are running jupyter notebook --no-browser --port=9999. Now when you go to localhost:9999 in your browser you'll get the notebook. "localhost" and "127.0.0.1" both mean your own machine, hence the joke stickers saying "There is no place like 127.0.0.1" ("0.0.0.0", strictly "all addresses on this machine", often behaves the same way).

Even if this works, some changes are a must. First, when you close the original ssh connection where you were running the notebook, it will die. This is no good. Therefore you can run nohup jupyter notebook &, which will run it detached (to stop it later, use fg in the same shell, or find the process with lsof -i :9999 and kill it). Alternatively, and better still, check if tmux is present: this is a very powerful tool, a terminal multiplexer, and worth checking out.
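If the browser shows nothing, it helps to know whether the tunnel is up at all. A small sketch using only the standard library, to be run on your own machine; port_is_open is a made-up helper name and 9999 is simply the port used above.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 9999, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

If port_is_open() is False, the ssh -L command is probably not running. If it is True but the page does not load, the notebook on the remote side is the more likely culprit.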

A potential tripping point is how Jupyter notebook is installed (check the Python version in the notebook with sys.version). In some cases you may need to register the required Python kernel. Actually, you can register multiple kernels, including Julia and NodeJS.
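Registering a Python kernel is normally done with the ipykernel package, run with the Python that should back the kernel. A sketch, assuming ipykernel is installed in that environment; the kernel name pyrosetta and its label are only examples.

```python
import subprocess
import sys


def kernel_install_command(name: str, display_name: str) -> list:
    """Build the command that registers the running Python as a Jupyter kernel."""
    return [
        sys.executable, "-m", "ipykernel", "install",
        "--user",                        # install for this user only, no admin needed
        "--name", name,                  # internal identifier of the kernel
        "--display-name", display_name,  # label shown in the notebook's kernel menu
    ]


if __name__ == "__main__":
    cmd = kernel_install_command("pyrosetta", "Python (PyRosetta)")
    print(" ".join(cmd))
    # Uncomment to actually register the kernel:
    # subprocess.run(cmd, check=True)
```

Run it with the conda environment that holds PyRosetta activated, and the new kernel should appear in the notebook's kernel menu after a page refresh.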

Another thing to do is to set a password, which stops the pain of copying a token:

jupyter notebook password

Also, to avoid typing in the ssh password, look into ssh keys. The 5 minute faff will save you from endlessly typing passwords.

Lastly, having a remote notebook the same colour as your local one gets confusing very, very quickly. Therefore I suggest using Jupyter themes (pip install jupyterthemes). Say you wanted a notebook that is yellow (?!), you'd set the solarize theme. Personally, I use dark mode locally and a blue light mode remotely.

2 comments:

  1. when I try to untar the file, I get this error :
    tar (child): bzip2: Cannot exec: No such file or directory
    tar (child): Error is not recoverable: exiting now
    tar: Child returned status 2

    1. The file paths have changed since — 3.8 is long deprecated
