The art of blowing up protein: Annotate as you go

There's a counter-constructive saying: a project is dead as soon as you add documentation (Aeschylus, I believe).

This could not be more incorrect. While it is true that documentation written for an evolving project will quickly become invalid, it is a planning truth that writing documentation once a project is finishing is impossible because there are a hundred and one more pressing issues. Therefore, adding docstrings to each function, method and class in Python as one goes along is by far more advantageous. Once this is done, however, this information needs to be transmuted into documentation. Here is how one can set up ReadTheDocs without falling into a few traps, as the documentation generator Sphinx is, ironically, weirdly documented. Ideally this should be done early on, so one knows what mistakes one is making.

Note

I run an internal workshop on adding extras to a GitHub repository, and one question in the feedback was whether I could write out the steps to do ReadTheDocs properly. This is it.

Motivation

Motivation is hard when it’s a big task. When the project is complete, the paper becomes the only focus and documentation falls on the sidelines. It does not help that reviewers rarely check code: in my experience, half of reviewers do not even check a web app. However, in the grand scheme of things it does matter. Therefore, one should not leave it to last. Every time I have left it as the last thing I have sorely regretted it. Furthermore, the sentence “I wrote this, but don’t remember what it does” does often arise when documentation is done late.

Invalid top-down documentation

The saying in the lead applies primarily to writing an overview. In an ideal world, the overview is written first and acts as a roadmap of how the module ought to work and, by virtue of the excellent planning resulting from the thoroughly thought-out overview, will still be true at the end of the project. However, projects evolve and more often than not they were not started with the idea of being evolvable: the comparison of an American city vs. a European medieval city is a classic teaching example of the concept of planning and evolvability in CS. Nevertheless, I would still advocate thinking about what the end goals of a project are and sketching them out before starting. But by far the most time-consuming part of writing documentation is the description of the parts, hence my insistence on writing them as one goes along.

Docstrings

In Python, docstrings work really nicely, much more so than Doxygen documentation in C++. Docstrings are generally written in reStructuredText (rst) within triple quotes at the top of a function, method or class.

from typing import Any

def foo(bar: Any) -> int:
    """
    This is a docstring.

    :param bar: This is a parameter.
    :type bar: Any
    :return: This is a return value.
    :rtype: int
    """
    ...
    return bar

All these docstrings can be converted by Sphinx into a nice documentation page. Previously I wrote a blog post about converting docstrings to markdown documentation for GitHub, which is helpful in the case the project is not intended to be pip released, but for a proper project this is a bad idea and instead the correct course of action is to create ReadTheDocs documentation. The preferred format for GitHub is markdown as it's easier and the Sphinx autodoc extension is not applicable there. The preferred format for ReadTheDocs is ReStructuredText (rst).
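Before involving Sphinx at all, it can help to see what it will have to work with. The following is a minimal sketch using only the standard library; `scale` is a hypothetical function made up for illustration, and the snippet only approximates what autodoc reads, it is not how Sphinx is implemented.

```python
import inspect
from typing import get_type_hints


def scale(value: float, factor: float = 2.0) -> float:
    """
    Multiply a value by a factor.

    :param value: the number to scale
    :param factor: the multiplier
    :return: the scaled number
    """
    return value * factor


# inspect.getdoc returns the docstring with the indentation cleaned up
docstring = inspect.getdoc(scale)
# get_type_hints returns the annotations as a dictionary
hints = get_type_hints(scale)
print(docstring)
print(hints)
```

If the printed docstring already looks wrong here (say, missing fields or odd indentation), it will not look any better once rendered.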

The textbook way to generate the conf.py file is with Sphinx's sphinx-quickstart command. Out of the box this does not convert docstrings: that has to be added. The docstrings and module content are "API" documentation, and the command line tools sphinx-apidoc or sphinx-autogen generate it, although some tweaks are often required to get the API documentation one wants.

At the base of the repo, we will create a .readthedocs.yml file for ReadTheDocs, but first let's make a .readthedocs folder (or any other name you want) to hold the documentation.

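# -o: output folder; first '.': package to scan; trailing '.readthedocs': path excluded from the scan
sphinx-apidoc -o .readthedocs . .readthedocs --full -A 'Your name here'
# the language is set via the language variable in conf.py ('-l' here means --follow-links)
cd .readthedocs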

Running make html in that folder will generate the documentation in the html folder of the build directory (_build/html by default), for you to check out. Do this often as stuff breaks easily with Sphinx.

Some tweaks are a must.

In the folder there are two main files of interest: conf.py and index.rst. The former controls how the project is parsed, the latter is the main menu.

Automodule, autoclass, autobahn, automethod, autofunction

The index.rst file is the main menu. It will refer to a file, without the .rst extension, with the name of your module.

This file sits in the same folder, along with one for each submodule, named in the format module.submodule.rst, and it contains the following workhorse:

.. automodule:: module_name.submodule_name
   :members:
   :undoc-members:
   :show-inheritance:

There are a few directives like this that can be used to generate the documentation, such as autoclass, and they are discussed in the autodoc documentation. When you add a new Python file (submodule) to your project, Sphinx will not know about it, so be vigilant and add a new definition for it to the index.rst file.
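Vigilance can be helped along by a small script. The following is a rough sketch under stated assumptions, not part of Sphinx: the helper `undocumented_modules`, the package name `mypackage` and the folder layout are all hypothetical, and it simply assumes the rst files follow the module.submodule.rst naming described above.

```python
import tempfile
from pathlib import Path
from typing import List


def undocumented_modules(package_dir: Path, docs_dir: Path) -> List[str]:
    """
    List the dotted names of the Python files in a package that have no matching rst file.

    :param package_dir: folder of the package
    :param docs_dir: folder with rst files named like ``package.submodule.rst``
    :return: sorted dotted module names without an rst file
    """
    missing = []
    for py_file in package_dir.rglob('*.py'):
        if py_file.name.startswith('_'):
            continue  # files with a leading underscore (incl. __init__.py) are skipped
        relative = py_file.relative_to(package_dir.parent).with_suffix('')
        dotted = '.'.join(relative.parts)
        if not (docs_dir / f'{dotted}.rst').exists():
            missing.append(dotted)
    return sorted(missing)


# demonstration on a throwaway folder with a hypothetical package
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / 'mypackage'
    docs = Path(tmp) / '.readthedocs'
    package.mkdir()
    docs.mkdir()
    for filename in ('__init__.py', 'alpha.py', 'beta.py'):
        (package / filename).touch()
    (docs / 'mypackage.alpha.rst').touch()
    missing = undocumented_modules(package, docs)
print(missing)  # ['mypackage.beta']
```

In a real repository one would call the function with the actual package folder and documentation folder, for example from a test, so that a forgotten rst file gets flagged.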

The following parameters are worth noting:

:members: will include all the members of the module and the order can be changed with :member-order:.
This will not include private (_foo) or magic (or dunder) methods. :private-members: will include the private ones, while :special-members: will include magic methods (called special by hardly anyone outside Sphinx and the Python reference).
:undoc-members: will include all members that are not documented.
:inherited-members: will include all members that are inherited from a parent class, which is rather key.

When a class gets too big, it should be split into multiple files, each with a single class in it that has a functional theme. These classes will form a chain of inheritance, leading up to the main class. Naming the split files with a leading underscore will get them ignored. Consequently, it is an option to document only the main class, which thanks to :inherited-members: will have everything.

But :inherited-members: is not always welcome. For example, when using typehinting (which is optional in theory but a must in practice), one does resort to typing.TypedDict (which allows you to specify the expected names of the dictionary keys and the types of its values) or typing.TypeVar (which is a placeholder for a type, used in generics). The :inherited-members: on these will make a mess of pointlessness. Therefore it often gets easier to manually define how one wants things annotated via multiple autoclass directives rather than the autogenerated blanket automodule.
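To make the inheritance chain concrete, here is a minimal sketch. The classes `_Loader`, `_Analyser` and `Pipeline` are hypothetical and are kept in a single snippet for brevity, whereas in practice each would live in its own file. The two lists only mimic, with the standard library, the difference between documenting the class's own members and also its inherited ones.

```python
import inspect


class _Loader:
    def load(self) -> str:
        """Load the data."""
        return 'loaded'


class _Analyser(_Loader):
    def analyse(self) -> str:
        """Analyse the loaded data."""
        return 'analysed'


class Pipeline(_Analyser):
    """Main class: the only one that would get its own autoclass directive."""

    def run(self) -> str:
        """Run everything."""
        return f'{self.load()} and {self.analyse()}'


# methods defined on the main class itself
own = sorted(name for name, member in vars(Pipeline).items()
             if inspect.isfunction(member))
# methods including those coming from the parent classes
everything = sorted(name for name, member in inspect.getmembers(Pipeline, inspect.isfunction)
                    if not name.startswith('_'))
print(own)         # ['run']
print(everything)  # ['analyse', 'load', 'run']
```

The second list is the one a user actually cares about, which is why documenting only the main class with :inherited-members: can be enough.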

conf.py file

Ignore sys.path.insert

In the conf.py file, there's a commented out line with sys.path.insert. Leave it like so. In the .readthedocs.yml file, there will be

python:
   install:
     - method: pip
       path: .
     - requirements: .readthedocs/requirements.txt
     - requirements: requirements.txt

So the module to be documented will be installed anyway (path: .).

extensions

The conf.py file does not call a function like setup in a setup.py file, but just sets global variables for Sphinx. One is the list extensions which tells Sphinx which extensions to use. E.g.

extensions = [
    'readthedocs_ext.readthedocs',
    'sphinx.ext.viewcode',
    'sphinx.ext.todo',
    #'sphinx_toolbox.more_autodoc',
    'sphinx.ext.autodoc',
]

readthedocs_ext.readthedocs will be added by RTD, but it is nice for testing locally (it needs to be installed). sphinx.ext.viewcode adds links to the highlighted source code in the documentation. sphinx_toolbox.more_autodoc is a nice extension that adds more autodoc directives, but it is hard to set up as it will crash in a million and one corner cases, more so than mypy. It is nevertheless a good idea to check whether it works in the first place, and if something fails, to use the subsets that do work. sphinx_toolbox.more_autodoc.typehints is the key one in my opinion, as vanilla Sphinx does little with typehints. In the sphinx-quickstart command documentation there is a list of vanilla extensions that one can use.
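One way to keep a conf.py working both locally and on RTD is to add the nice-to-have extensions only when their package is installed. This is a sketch of that idea, not an official Sphinx pattern, and the split into required and optional extensions is my own assumption. Note that it only guards against a missing package, not against an extension that is installed but crashes.

```python
from importlib.util import find_spec

# extensions that must be present
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.viewcode',
    'sphinx.ext.todo',
]

# extensions that are nice to have: added only if their package is importable
optional_extensions = [
    'sphinx_toolbox.more_autodoc.typehints',
    'readthedocs_ext.readthedocs',
]

for name in optional_extensions:
    top_level_package = name.split('.')[0]
    # find_spec returns None when a top-level package is not installed
    if find_spec(top_level_package) is not None:
        extensions.append(name)

print(extensions)
```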

It should be noted that the classic way to do typehint-only imports:

import typing
if typing.TYPE_CHECKING:
    from foo import Foo

needs to be altered to:

import sys
import typing
if typing.TYPE_CHECKING or 'sphinx' in sys.modules:
    from foo import Foo

Other variables

There is a variable html_static_path, which is set as follows by default and should be changed to an empty list if there are no static files:

html_static_path = ['_static']

This is because Git cannot commit an empty folder, so the _static folder will be missing from the repository and the build will complain about it (and fail if warnings are treated as errors).
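Since conf.py is plain Python, the choice can also be made automatically. A minimal sketch of the idea: in a real conf.py the folder would be the one holding conf.py, whereas here the current working directory is assumed so that the snippet runs on its own.

```python
from pathlib import Path

# in a real conf.py, Path(__file__).parent would be the folder holding conf.py;
# Path.cwd() is used here so the snippet also runs on its own
docs_folder = Path.cwd()

# only declare the static folder if it actually exists
html_static_path = ['_static'] if (docs_folder / '_static').is_dir() else []
print(html_static_path)
```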

There is also a line html_theme = 'alabaster' which is the default theme for Sphinx. ReadTheDocs uses 'sphinx_rtd_theme'. Therefore to use the sphinx_rtd_theme locally you need to install it. So our installation list is looking like:

pip install sphinx-toolbox readthedocs-sphinx-ext sphinx-rtd-theme

Other variables worth adding for more_autodoc are:

always_document_param_types = True
typehints_defaults = 'braces'   # other styles are available

The root_doc variable is a good way to store the rst files in a folder to declutter. By default it is index, as index.rst is the main page, so one can move that file to source/index.rst and set root_doc='source/index'. Alternatively, one could have the conf.py in that folder too, but not the Makefile.

__init__

Counterintuitively, the docstrings of __init__ methods are skipped, even though that is where one would first expect to find the documentation of how to initialise a class. There are three solutions:

One can add it manually on an autoclass directive via :special-members: __init__ in the rst definition.

One can globally override its skippage by adding the following to the conf.py file:

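def skip(app, what, name, obj, would_skip, options):
    # never skip __init__; defer to the default decision for everything else
    if name == '__init__':
        return False
    return would_skip


def setup(app):
    # register the handler with autodoc's skip-member event
    app.connect('autodoc-skip-member', skip)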

One can document class initialisation in the class docstring, which is often done, but one loses the typehints.
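As an illustration of that third option, here is a minimal sketch in which the parameters of __init__ are described in the class docstring. The class `Protein` and its arguments are hypothetical.

```python
class Protein:
    """
    A hypothetical container for a protein sequence.

    The arguments of ``__init__`` are described here, in the class docstring,
    so they appear in the documentation without any changes to the Sphinx settings.

    :param name: name of the protein
    :param sequence: amino acid sequence in one-letter code
    """

    def __init__(self, name: str, sequence: str):
        self.name = name
        self.sequence = sequence
        self.length = len(sequence)


print(Protein.__doc__)
```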

However, as codeclimate painfully reminds us, there should ideally be four or fewer arguments in a method, and class initialisation often has many arguments, so you may end up using packed keyword arguments annotated as a TypedDict. And to add insult to injury, the init may be overloaded:

from functools import singledispatchmethod  # in the standard library since Python 3.8
from typing import Dict, List
from typing_extensions import Unpack, TypedDict  # Unpack for **kwargs is PEP 692 (native in 3.12)


class FooOptions(TypedDict, total=False):  # total=False: every key is optional
    a: int
    b: str
    c: float
    d: bool
    e: Dict[str, int]


class Foo:
    """
    This class accepts a main argument, either as a dictionary or as a list,
    followed by various options as keyword arguments as specified in the `FooOptions` class.
    """

    @singledispatchmethod
    def __init__(self, data: list, **options: Unpack[FooOptions]):
        """
        This docstring will be skipped. And also, are we talking of this dispatch or all?
        """
        self.data: List[int] = data
        self.a: int = options.get('a', -1)
        self.b: str = options.get('b', 'unknown')
        self.c: float = options.get('c', float('nan'))
        self.d: bool = options.get('d', False)
        self.e: Dict[str, int] = options.get('e', {})

    @__init__.register
    def _(self, data: dict, **kwargs: Unpack[FooOptions]):
        # the dictionary variant forwards its values to the list variant
        self.__init__(list(data.values()), **kwargs)

In this rather extreme case, annotating the class makes much more sense. If this example seemed very alien, don't worry, but do make sure to read up on typehints as they make coding easier and less error-prone and, as a bonus, PyCharm will give better suggestions.

Mock

Often some module is required, but it takes a dark magic ritual to get it running. As a result the Mock class from unittest is of great use. It is used to make a mock of a module, which pretends to be there but does nothing. So in conf.py one can add:

import sys
from unittest.mock import Mock, MagicMock
sys.modules['foo'] = MagicMock()
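To see what the mock actually does, here is a small self-contained sketch. The module name `heavy_dependency` and the function `compute` are hypothetical and stand for a module that is not installed.

```python
import sys
from unittest.mock import MagicMock

# 'heavy_dependency' is a hypothetical module name that is not installed
sys.modules['heavy_dependency'] = MagicMock()

# the import now succeeds because Python finds the entry in sys.modules
import heavy_dependency

# any attribute access or call quietly returns yet another mock
result = heavy_dependency.compute(42)
print(type(result).__name__)  # MagicMock
```

The mock has to be registered before the documented module is imported, hence its place at the top of conf.py. The autodoc extension also has an autodoc_mock_imports setting that serves the same purpose and may be the tidier choice.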

Mixed Markdown

GitHub runs off a README.md, while PyPI runs off the setuptools.setup call in setup.py, specifically whatever text is passed to the description and long_description arguments, flavoured via the long_description_content_type argument. However, most projects simply pass the text of the former to the latter. The same applies to the intro in RTD. Therefore, it is beneficial to mix some markdown within the RST files. To make Sphinx accept both, the module sphinx-mdinclude can be used. In the requirements.txt it is hyphenated, while in the extensions list in the conf.py it is underscored. The conf.py for Sphinx is messy and will populate its folder with RST files, which is why it was kept separate above. This however means that the markdown files at the root of the project will be missed. As a result they need to be copied over to the documentation folder, the contained links fixed and the filenames changed to be more graceful (README.md to Description.md).
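The copying can be scripted, for example from conf.py or a build step. The following is a rough sketch under stated assumptions: the helper `copy_markdown` is hypothetical, the demonstration runs in a throwaway folder, and the fixing of the links inside the files is deliberately left out as it depends on the project.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def copy_markdown(repo_root: Path, docs_folder: Path, renames: Dict[str, str]) -> List[Path]:
    """
    Copy markdown files from the root of the repository into the documentation folder.

    :param repo_root: root of the repository, where README.md lives
    :param docs_folder: folder with the Sphinx files
    :param renames: maps the original filenames to more graceful ones
    :return: the paths of the copies
    """
    copies = []
    for original, nicer in renames.items():
        source = repo_root / original
        if source.exists():
            # shutil.copy returns the path of the new file
            copies.append(Path(shutil.copy(source, docs_folder / nicer)))
    return copies


# demonstration on a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    docs = root / '.readthedocs'
    docs.mkdir()
    (root / 'README.md').write_text('# My project\n')
    copies = copy_markdown(root, docs, {'README.md': 'Description.md'})
    copied_names = [copy.name for copy in copies]
    copied_text = copies[0].read_text()
print(copied_names)  # ['Description.md']
```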

Additional caveats

Stick to ReStructuredText

One can write docstrings directly in markdown, but this is not a great idea as RST is specifically designed for code annotation as we will see in a later section.

Catch formatting errors early

PyCharm autofills docstrings for you if set to do so (search preferences for “Automatic documentation”), but a common mistake is to not add a blank line between the description and the parameters. Without this the first parameter will be interpreted as part of the description and not as a bullet point. Everyone makes this mistake, but if one starts checking early that the documentation is being generated fine, one will avoid it subsequently.
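This particular mistake is simple enough to detect with a few lines of Python. The function `lacks_blank_line` below is a hypothetical helper, a crude sketch rather than a proper RST parser: it only looks at the first line starting with a colon and checks what sits right above it.

```python
import inspect


def lacks_blank_line(docstring: str) -> bool:
    """
    Check whether the first field (a line starting with a colon) directly follows text.

    :param docstring: the raw docstring
    :return: True if a blank line is missing before the first field
    """
    lines = inspect.cleandoc(docstring).splitlines()
    if lines and lines[0].startswith(':'):
        return False  # the docstring has fields only, nothing to separate
    for previous, current in zip(lines, lines[1:]):
        if current.lstrip().startswith(':'):
            # only the first field matters: it must be preceded by a blank line
            return previous.strip() != ''
    return False


bad = """
    Does a thing.
    :param x: a thing
    """
good = """
    Does a thing.

    :param x: a thing
    """
print(lacks_blank_line(bad))   # True
print(lacks_blank_line(good))  # False
```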

Browser hard refresh

In a browser it is critical to do a hard refresh of the pages (Shift+refresh button). Silly but I'd say 90% of issues come from this.

Tests are documentation

Tests are documentation. You should always write tests. I test new features generally in a Jupyter notebook, to see the outputs in full, but the key conclusions can be converted into a test. Future you or a user will likely check out the code in the tests, so do add docstrings to them too.
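As a small illustration, here is a sketch of a test whose docstring states the conclusion it encodes. The function `reverse_complement` and the test class are hypothetical and only stand in for whatever the project actually does.

```python
import unittest


def reverse_complement(sequence: str) -> str:
    """
    Return the reverse complement of a DNA sequence (hypothetical function under test).

    :param sequence: DNA sequence in upper case
    :return: the sequence of the opposite strand, read in the same direction
    """
    pairs = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(pairs[base] for base in reversed(sequence))


class TestReverseComplement(unittest.TestCase):
    def test_palindrome(self):
        """
        A palindromic site reads the same on both strands,
        so its reverse complement is the site itself.
        """
        self.assertEqual(reverse_complement('GAATTC'), 'GAATTC')

    def test_plain(self):
        """
        A non-palindromic sequence comes back reversed and complemented.
        """
        self.assertEqual(reverse_complement('AAC'), 'GTT')
```

Someone reading the test learns both how to call the function and what to expect from it, without opening the notebook it came from.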

Check if possible

Sphinx has many extra formatting features over markdown. If you have a need for something that may be a common requirement, check the documentation and pick up the extra extensions or Sphinx formatting tricks as the need arises. For example, the fields :ivar: and :cvar: are worth adding to the documentation.
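For instance, the two fields just mentioned describe instance and class variables respectively. A minimal sketch, with a hypothetical class `Counter`:

```python
class Counter:
    """
    A hypothetical counter used to illustrate the two fields.

    :cvar step: amount added at each increment, shared by all instances
    :ivar count: current value of this particular instance
    """
    step: int = 1

    def __init__(self):
        self.count: int = 0

    def increment(self) -> int:
        """
        Add ``step`` to ``count``.

        :return: the new count
        """
        self.count += self.step
        return self.count


counter = Counter()
counter.increment()
print(counter.count)  # 1
```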

ReadTheDocs

So far I have gone through Sphinx, which is only half of it. The next step, once we have a working Sphinx, is to use ReadTheDocs.

Yaml

Add a .readthedocs.yml file to the root of your project. For example I like to have:

version: 2

build:
  os: ubuntu-20.04
  tools:
    python: "3.8"

sphinx:
   configuration: .readthedocs/source/conf.py
   builder: html
   fail_on_warning: true

python:
   install:
     - method: pip
       path: .
     - requirements: requirements.txt
     - requirements: .readthedocs/requirements.txt

Namely, we install the module defined in the setup.py in the root with the method pip and the requirements with requirements.txt. But as mentioned there are a few requirements specific to Sphinx, which have nothing to do with the module's operations, hence the additional .readthedocs/requirements.txt file.

The fail_on_warning set to true is rather wishful thinking but at the debug stage this is helpful.
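The same strictness can be had locally before pushing, as Sphinx's -W flag turns warnings into errors. Below is a minimal sketch that only assembles the command; the helper `strict_build_command` and the two folder paths are assumptions that need adapting to the actual layout.

```python
import sys
from typing import List


def strict_build_command(source_dir: str, output_dir: str) -> List[str]:
    """
    Assemble a Sphinx html build command in which warnings are treated as errors.

    :param source_dir: folder holding conf.py (assumed path)
    :param output_dir: folder to write the html to (assumed path)
    :return: the command as a list, ready for subprocess.run
    """
    # 'python -m sphinx' is the module form of the sphinx-build command
    return [sys.executable, '-m', 'sphinx', '-W', '-b', 'html', source_dir, output_dir]


command = strict_build_command('.readthedocs/source', '.readthedocs/_build/html')
print(' '.join(command))
# to actually run it (Sphinx must be installed):
# import subprocess; subprocess.run(command, check=True)
```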

In the case of PyRosetta, we have a problem as it does not install like a normal package. Luckily one can have private environment variables in ReadTheDocs (set within the settings for the project on the ReadTheDocs website). My package, pyrosetta-help, adds a command line tool, install_pyrosetta, which requires the presence of the PYROSETTA_USERNAME and PYROSETTA_PASSWORD environment variables. It can be run by setting the following in the yaml file:

build:
  ...
  jobs:
    post_install:
      - install_pyrosetta

Likewise for other options the jobs directives can be used to better set up the environment.

Runtime

Once the yaml file is complete, head over to the ReadTheDocs website, link your GitHub account and create a new project from the repository of interest.

Once the project build is kicked off, you can see what happens in the 'Builds' tab. Clicking on the top build gives a badge (green hopefully), a printout and, on the right-hand side, two links saying view docs and view raw. The latter is crucial as it gives you the raw output.

Check for errors and warnings. If fail_on_warning is set to false, then if the documentation was partially generated, it would claim to be a success and only view raw would say otherwise.

And to reiterate, do make sure to do a hard refresh of the docs page.

Slack

In the settings on the site one can set up a webhook to a Slack channel to notify of build status.


Saturday, 4 June 2022
