Confabulating a crystal lattice for a protein model

Structure prediction has not made crystallography obsolete, but it has however made it easier — here I explore a potential trick that could be used. Under certain conditions, proteins pack in mathematically precise crystalline lattices because the lattice configuration is energetically preferable over being free in the unfriendly crystallisation solution used. Sometimes the protein will refuse to crystallise because either it is happy in solution or it would rather unfold. As a result researchers mutate surface residues to be more hydrophobic or copy the crystal packing interface ("crystal epitope") of a homologue that crystallises. A different (untested) approach could be to design the interface with ProteinMPNN or FastDesign. The catch is that one needs to make a model of the desired supercell to optimise. Here I will discuss how to make it.

Parenthesis/Apology

In my previous blog post about tweaking protein I mentioned that ProteinMPNN could be used to improve packing of proteins and that symmetry was going to be a topic of a back-to-back post, yet I did not publish one —mea culpa. Honestly I had no motivation due to my busyness and my leaving academia combined with the fact that people are less likely to find answers from blog posts given ChatGTP (cf. article on The Economist about AI and web footfall). I used to get about two emails a week from people asking for help; now it is far fewer than once a month: times have changed. However, someone reached out asking about it, which was all I needed to spur me to write this!

Rotated bounding box

To pack a protein model into a lattice, we need to determine the smallest rectangular prism (bounding box) it can fit in. For simplicity I will talk about a rectangular prism as the mathematics for other shapes is too tricky to do easily. Technically, in crystallography the word is 'orthogonal unit cell', but I'll hold back on going into the endless rabbithole that is the classification of space groups. Instead I will concentrate first on the far less confusing topic of rototranslations.
The bounding box will most likely not be perpendicular to the axes with one vertex on the origin as a result a few things need to be done.
We need to get the principal axes of inertia, or the PCA axes if you must, i.e. the eigenvectors of the covariance of the points —the three perpendicular axes of the box. Like everything to do with rototranslations, a tripping point is that the many operations are calculated centred on the origin —the principal axes of inertia are one such case. Additionally, we need to make sure the determinant is positive otherwise we get mirroring.

Demystifying the mathemagic

A small parenthesis is in order to demystify the mathemagic.

Identity, rotation (proper and improper), translation, reflection, scale (squeezing and stretching) and shear are all affine transforms (linear transforms that are not projection). They can be composed together and in the case of rotation if the determinant is negative it is a improper rotation, composed of a proper rotation and a reflection (into the mirror universe of D-protein). A rototranslation is composed of a translation to the origin, a rotation and a translation back.

The three perpendicular "arrows" are, in mathematical parlance, a matrix of three 3D points relative to the origin, which are perpendicular/orthogonal to each other (called an orthonormal frame). An orthogonal matrix is a proper rotational matrix (proper not in the British English meaning, but does not scale / distort). So given a coordinates of a point / vector relative to this frame (local/rotated frame) by this rotational matrix will yield the coordinates/vector in the world frame / origin centred as the rotated frame is in the world frame, the reverse operation is also possible: matrix-multiplying a world frame vector by the transpose of this rotated frame will yield the coordinates relative to the rotated frame, so in the case of the protein coordinates within the bounding box, the product of this latter multiplication be relative to the box, effectively rotating the protein (think of it like rotating the camera). For greater context, rotational matrices are normally described verbosely with a trig car crash: this is because each rotation angle needs to be converted into the vectors representing the adjacent and opposite sides of the right angle triangle formed by the angle, which in 2D is a single angle, while in 3D is two for a spherical coordinate (like latitude and longitude, ie. pointing in a direction relative to the origin) or three for a full rotation (like a plane, which can yaw, pitch or roll). An important detail is that the determinant of any rotation matrix needs to be positive otherwise you have a reflection and we don't want evil mirror universe D protein. This is actually the simple part: the crystallographic lattice part is the scary part.

A curious challenge is actually getting the vertices of the box but luckily there is numpy.meshgrid that generates the combinations aligned with the system axes, which then need to be rototranslated to the protein's location and axes.

Convex hull

However, protein are rather complicated in shapes, so we actually only care about the outside surface, specifically the convex hull of the atomic coordinates —the convex hull of an object is it's outside border, a but like the Pareto frontier but 360°. Obviously, there is class in scipy that does it.

There are a few steps, but they boil down to the usual operations (shift to-and-fro the origin with a linear transform by doing a matrix multiplication thrown in). Here is a snippet that finds the box, and shift to the axes.

import gemmi
from rdkit import Chem
import numpy as np
import numpy.typing as npt
from scipy.spatial import ConvexHull
from pathlib import Path
from typing import Self


class MinimumRotatedBox:
    """
    ... code::block python:

        structure = gemmi.read_pdb('1UBQ.clean.pdb')
        rotobox = MinimumRotatedBox(structure)
        rotobox.to_mol_of_corners('original_box.mol')
        rotobox.rototranslate()
        rotobox.to_mol_of_corners('final_box.mol')
        rotobox.box_corners.min(axis=0)  # origin!
    """

    def __init__(self, structure: gemmi.Structure):
        self.structure: gemmi.Structure = structure
        self.coords: np.array = np.array([])
        nan_xyz: np.array = np.array([float('nan'), float('nan'), float('nan')])
        self.hull_volume: float = 0.
        self.box_volume: float = 0.
        self.hull_points: np.array = np.array([])
        self.centroid: np.array = nan_xyz
        self.eigenvectors: np.array = np.array([])
        self.box_corner: list[npt.NDArray[float]] = [nan_xyz] * 8
        self._calculate()

    def _calculate(self) -> None:
        """
        For recaluting the values if something changes.
        Requires calling.
        
        :return: 
        """
        self.coords = self.get_coords()
        hull = ConvexHull(self.coords)
        self.hull_volume = hull.volume
        self.hull_points = self.coords[hull.vertices]
        self.centroid = np.mean(self.hull_points, axis=0)
        self.eigenvectors = self._compute_eigenvectors()
        self.box_corners, self.box_volume = self._compute_min_volume_box()
        # box_corners is the volume of the box...

    @classmethod
    def from_file(cls, path: Path | str) -> Self:
        """
        Load a gemmi.Structure from a file.
        And make a MinimumRotatedBox from it.
        
        :param path: 
        :return: 
        """
        structure = gemmi.read_pdb(str(path))
        return cls(structure)

    def get_coords(self, model: gemmi.Model | None = None, only_to_CB: bool = True) -> np.ndarray:
        if model is None:
            model = self.structure[0]
        if only_to_CB:
            valid_names = ('N', 'CA', 'C', 'O', 'CB')
            coords = [atom.pos.tolist() for chain in model for res in chain for atom in res if atom.name in valid_names]
        else:
            coords = [atom.pos.tolist() for chain in model for res in chain for atom in res]
        return np.array(coords)

    def _compute_eigenvectors(self) -> np.array:
        # Compute the PCA of the _shell_ regardless of if it's like a C hollow on one side...
        centered = self.hull_points - self.centroid
        cov = np.cov(centered.T)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # NB. the eigenvectors need sorting!
        order = np.argsort(eigenvalues)[::-1]
        eigenvectors = eigenvectors[:, order]
        # enforce right-handed frame
        if np.linalg.det(eigenvectors) < 0:
            eigenvectors[:, 2] *= -1
        return eigenvectors

    def _compute_min_volume_box(self) -> tuple[np.array, float]:
        # Get the box corners
        # https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html makes the 8 corners :D
        # but the corners need rotating to the axes and the mesh rotating back
        centered = self.hull_points - self.centroid
        rotated = centered @ self.eigenvectors
        min_coords = np.min(rotated, axis=0)
        max_coords = np.max(rotated, axis=0)
        extent = max_coords - min_coords
        volume = np.prod(extent)
        box_corners: list[npt.NDArray[float]] = []
        for corner in np.array(np.meshgrid(*zip(min_coords, max_coords))).T.reshape(-1, 3):
            corner = (corner @ self.eigenvectors.T) + self.centroid
            # near zeros fixed for aestetics
            corner[np.abs(corner) < 1e-6] = 0.0
            box_corners.append(corner)
        return np.array(box_corners), volume

    def to_mol_of_corners(self, filename: str | None = None) -> Chem.Mol:
        mol = Chem.RWMol()
        conf = Chem.Conformer(len(self.box_corners))
        for i, coord in enumerate(self.box_corners):
            mol.AddAtom(Chem.Atom(6))
            conf.SetAtomPosition(i, coord)
        mol.AddConformer(conf)
        edges = [
            (0, 1), (0, 2), (0, 4),
            (1, 3), (1, 5),
            (2, 3), (2, 6),
            (3, 7),
            (4, 5), (4, 6),
            (5, 7),
            (6, 7)
        ]
        for i, j in edges:
            mol.AddBond(i, j, Chem.BondType.SINGLE)
        if filename:
            Chem.MolToMolFile(mol, filename)
        return mol.GetMol()

    @staticmethod
    def rototranslate_target(model: gemmi.Model, coords: np.ndarray, rotation: np.ndarray, centroid: np.ndarray):
        mat = gemmi.Mat33(rotation.T.tolist())

        coords_centered = coords - centroid
        rotated_coords = coords_centered @ rotation
        min_corner = np.min(rotated_coords, axis=0)

        vec_np: npt.NDArray[float] = -min_corner - (rotation.T @ centroid)
        vec = gemmi.Vec3(float(vec_np[0]), float(vec_np[1]), float(vec_np[2]))

        transform = gemmi.Transform(mat, vec)
        model.transform_pos_and_adp(transform)

    def rototranslate(self, copy: bool = False) -> gemmi.Structure:
        if copy:
            structure = self.structure.clone()
            self.rototranslate_target(structure[0], self.coords, self.eigenvectors, self.centroid)
            return structure
        else:
            self.rototranslate_target(self.structure[0], self.coords, self.eigenvectors, self.centroid)
            self._calculate()  # recalculate
            return self.structure

Specifically, everything happens once initialised (rotobox = MinimumRotatedBox.from_file('👾👾👾.pdb')), allowing the corners to be made (rotobox.to_mol_of_corners('original_box.mol') or aligned to the system axes rotobox.rototranslate().write_pdb('👾👾👾.pdb') and ready for the next step, packing and mutating the close residues!

For debugging, making the box faux-molecule part of the PDB is very helpful:

mol: Chem.Mol = rotobox.to_mol_of_corners()
structure:gemmi.Structure = gemmi.read_pdb_string( Chem.MolToPDBBlock(mol) )
structure[0].add_chain( rotobox.structure[0]['A'] )
structure.write_pdb('boxed.pdb', conect_records=True)

In the above code in get_coordinates, only the backbone heavy atoms plus CB are used under the assumption that lysines or similar could be repacked. Another approach to the same effect, given the greek lettering of atom names, would be to skip atoms with latin letters G, D, E, Z. This is especially valid as protein used will likely be an in silico prediction anyway.

The protein used will likely be an AlphaFold structure, which may have a huge spaghetti loop: these could be skipped for the purposes of calculating the bounding box. Any loop skipped for calculation but kept in the design that causes clashes could be energy minimised in the supercell with the rest of the protein fixed.
These could actually be redesigned away with RFdiffusion or simply replaced and may have been the reason why the protein was being troublesome in crystallography. Comically, terminal loops left when mirrored will yield something reminiscent of Michelangelo's Creation of Adam fresco.

Lastly, in MD, the water box is axis aligned and rooted at the origin but it is a lot bigger than the protein (or else it blows up) so generally the protein is simply shifted to place its centroid at half the box sizes. However, the above trick could be used to make the protein nicely aligned until it tumbles around.

Packing

The next problem is figuring out how we want to place the repeating units, as rotations come into play.
The case where the protein are packed aligned in a grid all in the same direction without any rotations or mirror planes is called a triclinic system and has a "space group" of P1 (Hermann-Mauguin notation, hence the '_hm" suffix in the code below). Unfortunately protein very rarely pack that way. Instead the most common lattice group in crystallography is P2₁2₁2₁ namely the rectangular prisms face a symmetry mate that is rotated 180° on each of the axes (2-fold rotation) with an extra translation on the axis (the subscript 1 is a screw axis). P222 without this screw axis is rarer, but a nice sanity compromise. P stands for Primitive Bravais lattice, and there are lots more letters. The Wikipedia crystal system page is rather intense as there are different notations and includes reflections (D-proteins!).

Herein I will concentrate on tessellation of rectangular prisms, but be aware that fancier packing arrangements could be possible.

Gemmi obviously knows the operations needed to rototranslate a protein to make a supercell, so we can get them as follows:

import numpy as np

for op in list(gemmi.SpaceGroup('P 2 2 2').operations()):
    print(op.triplet(), np.array(op.rot) / op.DEN, 'and', np.array(op.tran) / op.DEN)

which will give:

x,y,z [[1. 0. 0.] [0. 1. 0.] [0. 0. 1.]] and [0. 0. 0.]
-x,-y,z [[-1. 0. 0.] [ 0. -1. 0.] [ 0. 0. 1.]] and [0. 0. 0.]
x,-y,-z [[ 1. 0. 0.] [ 0. -1. 0.] [ 0. 0. -1.]] and [0. 0. 0.]
-x,y,-z [[-1. 0. 0.] [ 0. 1. 0.] [ 0. 0. -1.]] and [0. 0. 0.]

While for "P 21 21 21" it will give:

x,y,z [[1. 0. 0.] [0. 1. 0.] [0. 0. 1.]] and [0. 0. 0.]
-x+1/2,-y,z+1/2 [[-1. 0. 0.] [ 0. -1. 0.] [ 0. 0. 1.]] and [0.5 0. 0.5]
x+1/2,-y+1/2,-z [[ 1. 0. 0.] [ 0. -1. 0.] [ 0. 0. -1.]] and [0.5 0.5 0. ]
-x,y+1/2,-z+1/2 [[-1. 0. 0.] [ 0. 1. 0.] [ 0. 0. -1.]] and [0. 0.5 0.5]

While for "P 1" it is just the first line —identity only, no rotations needed to make a supercell, it is its own smallest supercell.

The first part, the triplet specified what the axis of reflection is, there will be two as two reflections make a rotation, which we want (again, remember "say no to protein from the evil mirror universe").

To add a touch of confusion, one can get a classic affine 4x4 transform matrix (op.seitz()), but this in crystallography called a Seitz matrix for some reason.

When these operations are applied, one gets the smallest supercell, which will have P1 spacegroup that can be repeated by translation to form a nice grid.

Parenthetically, to apply a symmetry operation in Gemmi:

op: gemmi.Op = ...  # from above. NB. not a gemmi.Assembly.Operator !
translation = np.array(op.tran) / op.DEN
rotation = np.array(op.rot) / op.DEN
transform = gemmi.Transform(gemmi.Mat33(rotation), gemmi.Vec3(*translation))
model.transform_pos_and_adp(transform)

There are two ways from here. The cheaty way or applying the transforms. The cheat is by setting the symmetry and cell and letting PyMOL generate a supercell, however setting up for PyMOL quickly blends into the route involving maths. This is because the location of the protein relative to the origin is crucial in all bar P1 spacegroups.
So let's start with P1. First, add spacegroup and cell size to the structure:

structure = gemmi.read_pdb('👾👾👾.pdb')
rotobox = MinimumRotatedBox(structure)
rotobox.rototranslate()
rotobox.structure.cell = gemmi.UnitCell(*np.max(rotobox.box_corners, axis=0), 90, 90, 90)
rotobox.structure.spacegroup_hm = 'P 1'
rotobox.to_mol_of_corners('final_box.mol')
rotobox.structure.write_pdb('test.pdb')

In PyMOL, in the right hand side menu, A(ction) > Generate > Symmetry mates will make the supercell (runs `cmd.symexp("test_","test","test",cutoff=4,segi=1)`), giving a nice grid. This is because there is only identity in P1. All other cases generate a mess, because the rotations are around the origin, making the position matter.

PyMOL generates the minimal supercell (i.e. applies the operations), but also tessellates it (P1) and shows the neighbouring cells within a the specified distance.

As a starting point, this may be the most problematic experimentally as it's an uncommon Hermann-Mauguin spacegroup, but it is by far the easiest. Namely, one has three faces to work with, improving the contacts of the head to tail pairing can be done even pretending the two protein sharing a face are different and optimising the sequence in proteinMPNN without symmetry.

An additional thing to note, is that the coordinates of a protein need not be contained within its unit cell. Most likely the cell and origin of a real crystal will make the protein pack more snuggly.

A similar problem to designing a crystal packing system would be designing de novo oligomers, however, the problem is simpler because for the purposes of immobilisation and so forth they would likely have a radial symmetry around an axis (i.e. a specific side of the monomer is on the same plane for all monomers), so a the HM spacegroup would be just P2, P3, P4 etc. and the origin would simply be at that axis of rotation. P1 (the no rotation one) is called triclinic, P2 is monoclinic, P3 is trigonal, P4 is tetragonal, P6 is hexagonal. 5-fold and 7-fold rotational symmetry are problematic as these are meant to tessellate in a triangular, square or hexagonal grid, not a square grid. To make things weirder, the P2 etc. is shorthand for P 1 2 1, not P 2 1 1. P 2 2 2 is an orthorhombic system, which should not be confused with orthogonal (90° box). However, for this specific case, there is no need to worry about crystals, becuase it requires placing a chosen axis of the frame of the protein and apply n basic rotations around an axis with an angle of 360°/n.

P 2 2 2

Back to P 2 2 2 and the quest for the exact location of the origin. If the four P 2 2 2 spacegroup operations are applied to a protein centred at the origin, the four protein will be an overlapping mess, while with the bounding box starting at the origin you will get a checkerboard.

Protein assemblies in P 2 2 2 (example: PDB:2Z51) in their definition in CIF format (which holds this information) have fields like:

_pdbx_struct_assembly_gen.assembly_id       1 
_pdbx_struct_assembly_gen.oper_expression   1,2 
_pdbx_struct_assembly_gen.asym_id_list      A,B,C,D 
# 
loop_
_pdbx_struct_oper_list.id 
_pdbx_struct_oper_list.type 
_pdbx_struct_oper_list.name 
_pdbx_struct_oper_list.symmetry_operation 
_pdbx_struct_oper_list.matrix[1][1] 
_pdbx_struct_oper_list.matrix[1][2] 
_pdbx_struct_oper_list.matrix[1][3] 
_pdbx_struct_oper_list.vector[1] 
_pdbx_struct_oper_list.matrix[2][1] 
_pdbx_struct_oper_list.matrix[2][2] 
_pdbx_struct_oper_list.matrix[2][3] 
_pdbx_struct_oper_list.vector[2] 
_pdbx_struct_oper_list.matrix[3][1] 
_pdbx_struct_oper_list.matrix[3][2] 
_pdbx_struct_oper_list.matrix[3][3] 
_pdbx_struct_oper_list.vector[3] 
1 'identity operation'         1_555 x,y,z     1.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 1.0000000000  
0.0000000000 0.0000000000  0.0000000000 0.0000000000 1.0000000000  0.0000000000 
2 'crystal symmetry operation' 4_565 x,-y+1,-z 1.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 -1.0000000000 
0.0000000000 57.3940000000 0.0000000000 0.0000000000 -1.0000000000 0.0000000000

which is a funny beast, not because it is a 4x4 transformation matrix missing the homogeneous coordinates row, which for protein coordinates is _normal_ly is not needed, but because it is a P222, but has a +1 on the y reflection, which the triplet of the space group definition lacked, which assume you will deal with the translation the origin of the frame of the protein.

When the bounding box vector is scaled by 2 to match the unit cell (which is not the box a protein sits in) one gets variants of checkerboards because the rotation happens at the origin. Hence the second route, not via PyMOL.

To be perfectly honest, I have not sat down with pen and paper to figure out the mathemagic required to find the origin and everyone I have asked is utterly confused as they are solving crystals the other way round —and the software does the mathematics. Parenthetically, when I worked on this in late 2024, ChatGTP was simply awful at generating rotational matrices, so I would not recommend it.

As a result, brute force is the quickest way. Centre the protein on the origin, make the copies and shift them into position by the cell size.

This leaves with doing the maths manually, which is a lot more straightforward, given that:

Of the four operations in the space group, the zeroth is an identity applied to the centre unit cell, the first is on the xy-plane, the second on the xz-plane and the third on the yz-plane.
There is no special unit, the operations can be reapplied to any asymmetric mate.
When more chains are added and two occupy the same unit, they should superpose perfectly
Copying the lattice and shifting it by 3 cells will result in some perfectly superposed reoccupied units

The four operations in P 2 2 2 will make an axis gizmo shaped thing (4 adjacent vertices of a cube): there are 8 vertices in a cube and in P 2 2 2 the missing ones will simply be translations of the diagonally opposite vertices.


def center(model: gemmi.Model) -> None:
    # Yes, American spelling pains me too.
    cell_params = np.array(structure.cell.parameters[:3])
    mat = gemmi.Mat33()
    vec = gemmi.Vec3(*-cell_params/2)
    transform = gemmi.Transform(mat, vec)
    model.transform_pos_and_adp(transform)

def calc_shift_matrix(cell_params: np.array) -> np.array:
    # hack... a 3x4 identity shifted right with the cell params in the diagonal
    # to be added to the translation.
    # 100% lazy
    eye = np.eye(3, 4, k=+1)
    return eye * np.append([0], cell_params)

def make_supercell(structure: gemmi.Structure, add_shift: bool=True) -> tuple[gemmi.Structure, list[list[str]]]:
    """
    Apply the symmetry ops transformations.
    As I am unsure over the origin, ``add_shift`` centres the box to the origin and 
    shifts the axes in order as it _assumes_ they are meant to be identity, 
    x-axis rotation, y-axis, then z-axis.
    This is not true for most non P2x2x2x spacegroups...
    """
    model: gemmi.Model = structure[0]
    supermodel = gemmi.Model(0)
    if add_shift:
        center(model)
    shifts = calc_shift_matrix( np.array(structure.cell.parameters[:3]) )
    unit2chains = [] # keep track of n_units
    for idx, op in enumerate(gemmi.SpaceGroup(structure.spacegroup_hm).operations()):
        copy = model.clone()
        translation = (np.array(op.tran) / op.DEN)
        print(idx, op.triplet(), translation)
        rotation = np.array(op.rot) / op.DEN
        transform = gemmi.Transform(gemmi.Mat33(rotation), gemmi.Vec3(*translation))
        copy.transform_pos_and_adp(transform)
        if add_shift:
            translation = shifts[:,idx]
            transform = gemmi.Transform(gemmi.Mat33(), gemmi.Vec3(*translation))
            copy.transform_pos_and_adp(transform)
        unit2chains.append( [] )
        for chain in copy:
            chain.name += f'_{idx+1}'
            chain = supermodel.add_chain(chain)
            unit2chains[idx].append(chain.name)
    neostructure = gemmi.Structure()
    supermodel = neostructure.add_model(supermodel)
    return neostructure, unit2chains

def expand_by_copying(supermodel: gemmi.Model, unit2chains: list, max_chains: int = 26) -> None:
    """
    Given a structure with the first set of transformations applied, copy them out
    """
    # check it's a normal protein. nucleic acids should be okay too as gemmi.SupSelect.CAP selects CA or P, 
    # but I am keeping this "simple"
    assert supermodel[0].get_polymer().check_polymer_type() == gemmi.PolymerType.PeptideL
    identity_chain_name = unit2chains[1][0]
    chain2position = {chain.name: chain[0][0].pos for chain in supermodel}
    
    for rep in range(3):
        for idx in range(1, len(unit2chains)+1):
            sym_chain_name = unit2chains[idx][0]
            sup: gemmi.SupResult = gemmi.calculate_superposition(supermodel[identity_chain_name].get_polymer(), 
                                                                 supermodel[sym_chain_name].get_polymer(), 
                                                                 gemmi.PolymerType.PeptideL, 
                                                                 gemmi.SupSelect.CaP)
        
            copy = supermodel.clone()
            copy.transform_pos_and_adp(sup.transform)
            new_sym_names = []
            for chain in copy:
                new_pos = chain[0][0].pos
                for original_name, original_pos in chain2position.items():
                    if new_pos.dist(original_pos) ≶= 0.1:
                        break
                else:
                    chain.name += f'-{rep+1}.{idx+1}'
                    chain2position[chain.name] = new_pos
                    chain = supermodel.add_chain(chain)
                    new_sym_names.append(chain.name)
            if new_sym_names:
                unit2chains.append(new_sym_names)
            if max_chains > 0 and len(supermodel) >= max_chains:
                return
        
# -------------------------------

# ## Rotate for an orthogonal box
structure = gemmi.read_pdb('👾👾👾.pdb')
rotobox = MinimumRotatedBox(structure)
rotobox.rototranslate()
structure = rotobox.structure
structure.spacegroup_hm = 'P 2 2 2'
structure.cell = gemmi.UnitCell(*np.max(rotobox.box_corners, axis=0), 90, 90, 90)

# ## Make supercell of specified rotations
superstructure, unit2chains = make_supercell(structure)
# unit2chains is a list of list of each chain name in each unit
# the official way to handle this is laborious
expand_by_copying(superstructure[0], unit2chains)

# save
superstructure.make_mmcif_block().write_file('super.cif')

This is all very crude, but does the job. The grid could be shrunk to optimise the distance. Gemmi has neighbourhood searching functionality, for example in the example used (PDB:1UBQ) I get a clash:


unit_cell = gemmi.UnitCell()

contact_search = gemmi.ContactSearch(5.)
contact_search.ignore = gemmi.ContactSearch.Ignore.SameChain
contact_search.min_occupancy = 0.0  # theoretical models...
neighbor_search = gemmi.NeighborSearch(superstructure[0], unit_cell, 5.).populate(include_h=False)
contacts: list[gemmi.ContactSearch.Result] = contact_search.find_contacts(neighbor_search)
nice_contacts = [contact for contact in contacts if contact.partner1.chain.name == 'A_1' or contact.partner2.chain.name == 'A_1']

for contact in nice_contacts:
    # make sure any clashes are not backbone to backbone
    backbone_names = ('N', 'CA', 'C', 'O', 'H', 'OXT')
    assert not all([contact.partner1.atom.name in backbone_names,
                    contact.partner2.atom.name in backbone_names,
                    contact.dist < 2.88]), f'{contact.partner1.residue}.{contact.partner1.atom.name}:{contact.partner1.chain.name} - '+\
                                           f'{contact.partner2.residue}.{contact.partner2.atom.name}:{contact.partner2.chain.name} = '+\
                                           f'{contact.dist:.1f}Å'

Seen in PyMOL (with default set connect_mode, 0:

Therefore the cell needs tweaking by increasing the unitcell by ~2Å on the Z axis.

Once everything is nice, one can get the residues in common between the query chain (say A_1) and a given mate:

by_other_chain: dict[str, list[gemmi.ContactSearch.Result]] = {}

for contact in nice_contacts:
    cra: gemmi.CRA
    for cra in (contact.partner1, contact.partner2):
        if cra.chain.name != 'A_1':
            other_name = cra.chain.name
            if other_name not in by_other_chain:
                by_other_chain[other_name] = []
            by_other_chain[other_name].append(contact)

{chain_name: {cra.residue.seqid.num for contact in contacts for cra in (contact.partner1, contact.partner2)} for chain_name, contacts in by_other_chain.items()}

Armed with this list and example4 of proteinMPNN (symmetry) one can design the mutations that would best suit the interface, keeping an eye out for the score. Alternative, with the caveat that symmetry is rather tricky to use in (Py)Rosetta, one could use Rosetta FastRelax with design active to design the interface constraining the packer palette to residues found in an shallow MSA.

(Ir)relevance

Here I have spoken of the maths without discussing whether it should be done. If the aim is to crystallise a protein, addressing why it does not crystallise might be more important than engineering it to crystallise. For example, the behaviour of a unstable protein due to lack of disulfides, metal ions, or cofactors could be improved by fixing the absence. Large dynamic behaviour caused by spaghetti loops or flexible hinges could be fixed by removing the former and rigidifying the latter. In fact, the above work was never put to use because all test cases had different issues and I never got to try it...

Debugging

The notes presented here require a fair amount of failures. For example, forgetting to translate before rotating or not checking against improper rotations. As a result I found it very handy to use this PDB (there is a script in PyMOL one can download or write from scratch or ask ChatGTP, but this is lazier for a similar diagnostic):

HEADER    FAKE PDB STRUCTURE
HETATM    1  NA      O   1       0.000   0.000   0.000  1.00  0.00           NA  
HETATM    2  CL      Z   2       0.000   0.000   5.000  1.00  0.00           CL  
HETATM    3  MG      Z   3       0.000   0.000  10.000  1.00  0.00           MG  
HETATM    3  MG      Z   3       0.000   0.000  20.000  1.00  0.00           MG  
HETATM    4  CA      Y   4       0.000   5.000   0.000  1.00  0.00           CA  
HETATM    5  CA      X   5       5.000   0.000   0.000  1.00  0.00           CA  
TER
END

The art of blowing up protein

Pages

Sunday, 17 August 2025