The curious case of fluorescein and implicit waters

I recently was asked a straightforward task: get the energy scores of a given protein by itself, conjugated with fluorescein maleimide and its ring opened form. Despite that, I failed miserably because of implicit waters playing odd tricks.
Given that it is of no real use, I am happy to share so to give a tutorial of how to do that in Rosetta (ligand docking), which I found taxing the first time.

(For files see github.com/matteoferla/rosetta-pymol/../Fluorescein)

Chemistry

But first, some background. A maleimide is a group that reacts with cysteine. Once reacted with a thiol the group is a succinimide. This is not reversible, but a further reaction happens, namely the anhydride ring can be hydrolysed opening it, resulting in a succinamate. In the case of fluorescein two forms are possible, acid and lactone depending on the pH of the system, namely in acidic conditions the lactone form, which is non-fluorescent, is favoured.

The amino acid way

There are two ways in Rosetta of calculating the best pose for a conjugated cysteine. Ligand docking and design/remodel with a non-canonical amino acid.
The former is much easier than the latter. There are fewer rotamers to worry about so letting these be predicted in each run is much easier —the structure after all is mostly aromatic—, while the latter has a lot of angles to spin around and runs the risk that these may differ substantially from the real cysteine ones. Although the trick to make ligand docking work is make an old school cst and use the deprecated ligand_dock application. However, the latter would be handier if one wanted to do fluorescein-cysteine scanning of a protein to determine the best spot to label it —were it not for a catch.

The technical

To use something that is not one of the twenty amino acids with normal protonation, one needs a params file. These either are in the database folder (rosetta/main/database/) under chemical/residue_type_sets/fa_standard/residue_types or have to be made. In both cases the Rosetta command requires the flag -extra_res_fa xxx.params or/and -extra_res_cen xxx.params depending on the application —adding both does no harm.
The key to parameterisation is a script called molfile_to_params.py, which in case you have a broken version of PyRosetta —i.e. me due to 2/3 issues—, you can download from the pyrosetta page. This script converts a molecule file into a parameter file. It accepts both original mol (sdf) and mol2, even if the extra arguments slap of sdf mol files. The latter has atom names, which is mighty important if you want conventional names. Once I parameterised testosterone without worrying about the Carbon numbering despite being important and was driven spare by having to have a conversion dictionary during the analysis.
One thing to note is that the mol2 is conjugated in Kekulé form (i.e. single/double as one of the resonant structures) not aromatic (dotted bonds): the Python script molfile_to_params.py will actually fix this so it is okay.
This script is quite curious. It weirdly does not have a way to recognise the substituent (say calling the atom R) as I'll mention, but it has secret features/hacks to allow it to parameterise simpler atoms (calcium or molecular oxygen), by accepting V atom (=virtual), which are ignored -I think Rosetta as a consequence ignores vanadium.
Firstly, Rosetta used a different energy function than CHARMM, so CHARMM pars files don't work —challenges make life fun—, so the process is slightly different, but it is fairly painless once one gets the hang of it, except for rotamer generation.
This is generally a result of downloading from PubChem the closest molecule, altering it in ChemDraw or ChemSketch and then optimised with Gaussian or if in a rush saved from ChemDraw as a SMILES and made into a mol2 with Python RD Kit —just in case any of these are not familiar I highly recommend checking them out*.

There are a few bits that need changing in the file itself. To parameterise an amino acid one needs to make a mol2 file with the amino acid protected with an N-terminal acetyl group and a C-terminal methylamide. And change a few lines in the mol2 files (see Rosetta's non-canonical amino acid tutorial).
In the case of the docking approach, for the constrain file a reference with Cβ atom is also needed in order to keep angles right. In terms of the ligand, the molecule were made as thiol-reacted ligands, which after parameterisation were edited. Namely The sulfur was altered changing its BOND line to CONNECT and replace its mention with a connector CONN1. Note that CONNECT is spelt with two Ns, whereas in PDB files it's a single N, which is bound to cause confusion.
The docking has to obey one rule: the SG atom need to be connected to the ligand at a reasonable distance and angle, which means a constraint file. To make the constraint file (cst) the distances in pymol were measured in the reference structure with the Cβ atom. Rosetta seems to have two different main constraint file formats, this is the enzyme one. To make a cst file work a REMARK line needs to be added to the top of the PBD file:

REMARK 888 MATCH TEMPLATE A CYS xx MATCH MOTIF A yyy zz 1 1

Where xx is the cysteine residue number, yyy the residue name of the ligand and zz its number. The docked ligand has to be the last residue in the PDB structure. The latter was made by opening in PyMol the relaxed protein and the ligand with a thiol, aligning the SG atoms and deleting the HG atom off the cysteine and the SG atom off the ligand, say

pair_fit resn FLS & name S1, resn cys & name SG, resn FLS & name C1, resn cys & name HG
remove resn FLS & name S1

The hydrogen is optional and makes sense only with the no_optH flag set to false in the docking.

$ROSETTA/ligand_dock.$ROSETTAEXT -database $ROSETTADB -ex1 -ex1aro -ex2 -extrachi_cutoff 1 -extra_res_fa yyy.params -enzdes:cstfile yyy-bond.cst  -docking -minimize_ligand -harmonic_torsions 10 -improve_orientation 1000 -out:pdb 1 -out:nstruct 100 -out:level 300 -s xxx-yyy.pdb;

The catch

The problem is that the predictions made no sense, often with a gigantic hydrophobic moiety sticking up like an iceberg. The problem may lie with the implicit water, namely a shell of dynamic water that is not modelled as individual waters. In the energy score surface exposure factors into it, but a fixed water behaved different. This is known to have some trouble especially in water with is fixed in place by strong bonding networks and can cause trouble in docking in active sites (cf. WaterDocking app). A surface lysines hydrogen bonding to a water and it in turn to glutamate are common in crystal structures, but will be utterly broken up after relaxation -possibly why supercharging is good at making accidentally aggregating protein. So I would take any predictions with large hydrophobic moieties with a pinch of (unhydrated) salt...

*) Namely…

ChemDraw and ChemSketch are Illustrator for structures. 3D structures are horrid to visualise with them, so PyMol can open mol (sdf) and mol2 files. In ChemDraw molecules can be arcanely exported as SMILES by selecting, right-clicking and, under molecule, copy-as.
Gaussian is the most common QM-MD simulator found on most big clusters have -see OpenBabel for formats.
SMILES is a way to write chiral chemical structures as strings.
RD kit is a python package (NB. brew/apt-get not pip). In terms of SMILES to mol2, the NIH server, https://cactus.nci.nih.gov/cgi-bin/translate.tcl, can do it, although it needs at times a refresh or two as it crashes (at least for me).

The art of blowing up protein

Pages

Monday, 19 March 2018