Hacking PDBs for fusion protein and missing loops

Monday 16 October 2017

Hacking PDBs for fusion protein and missing loops

This is a How-to guide to make structures of fusion protein —the shoddy way.
For the proper way see the how to do it in Rosetta or Pyrosetta or by coercing the RosettaCM threader do it.
NB. This assumes medium–advanced PyMol competency.


Representing the structures of fusion protein is a rather annoying task. There are many ways of doing this.
Contrary to expectations, submitting the sequence of a fusion protein to the Phyre server (threading/ab initio predictor) or similar will return spaghetti, just as happens with circular permutations (for which you get a cracked half open domain with spaghetti attached). But give it a go nevertheless —it may save you a lot of time!
Rosetta can be used, but the domains will not be photogenic and it is overkill for some operations.
So the best way is to simply hack the PDB files and some PyMol ninjutsu.

EDIT/WARNING: since writing this I have realised that one could add missing loops by threading as several algorithms have loop adding operations —in which case using PyMod may be the best approach for non-coders as, it is less straighforward than this, but utilises a proper loop modelling algorithm as it's a GUI wrapper for MODELLER. PyMod requires some easy installations (MODELLER and a PyMOL plug in) and the generation of the alignment (load the structure, run print cmd.get_fastastr(), get your full sequence from Uniprot, go to the online Muscle alignment tool and bingo) and does not require any awkward pulling of chains.


As an example I will make phusion, i.e. Pyrococcus furiosus DNA polymerase fused to Sulfolobus solfataricus 7D domain (Sso7), because its name basically says it is a fusion.

Starting structures

First one has to get the structures. If a structure is present in the PDB that is perfect. If not, I strongly recommend the Phyre 2 server over others predictors.
I find Phyre 2 mostly accurate and relatively fast. I have found iTasser to be just a tad less believable (but a solid runner up). SwissModel is only threading (no ab initio) so it is heresy to use in this decade. While Robetta is glacially slow and if you can set up a Rosetta ab initio job I have no idea what you are doing reading this.
However, do not submit too much to Phyre 2 in one go: go modular. Although add a few glycines and serines for linkers are very helpful.

In our case, phusion as a whole is not available —it's a "secret". But we have these two structures:
Just for the record, this is what happens when the whole of phusion is submitted to Phyre 2 (model in gray, Sso 7D in blue and Pfu PolB in green):
Basically Sso 7D is a mess. But Phusion seems kind of okay, but we will need a better representation.
There a few nice scripts in PyMol worth getting, foremost being align_all.py, one of these is colorbyrmsd.py, which shows the differences as B-factors.
That is, the C-terminus where the Sso7D was placed spaghettified the structure as feared.


Assuming you have all you need —I will discuss linkers, TEV sites and epitopes in part 2— it is time to make some fusion protein.
First open one pdb and drag the other into the same instance.
Second remove waters, crystallographic additives and any extra chains, but if you have to remove whole domains leave residues to make life easier as we will see. Namely, 3-button mouse editing is painful to use, so the best way to bring protein A near protein B is to use the align command. Ideally you should have a residue or more in excess that you will trim off after aligning the structures. If not you will have to add one with the sculpting wizard.

So let's get started with our example and load both parts of phusion to make a fusion:
> fetch 2JGU, pfu
> fetch 1bbx, sso7d
Let's look at the sequence. Either via the menu or by typing:
> set seq_view, 1
I see that there are:
  1. Missing loops
  2. manganese atoms and not magnesium —which is a comical crystalisation artefact as manganese decreases polymerase fidelity
  3. lots of waters
  4. a DNA strand

Missing loops

I prefer to add missing loops using Rosetta remodel, because it is somewhat less arbitrary —but its use is not trivial— so I will talk about PyMol.
Anoter option is using Coot, the program actually used during the refinement of the crystal structure. It is a standalone or part of the CCP4 package and is very versatile. However, if there is no loop in a deposited structure it means there likely was no density, so using Coot will be inaccurate, less than the shoddy PyMol way below, but PyMOL is easier to use than Coot (discussed further on the other post).
If the starting crystal structure is publicly available, submitting the sequence to Phyre should produce the missing loop. But do check by opening both in the same instance of PyMol and aligning them:
> align name_of_A, name_of_B

Alternatively you can add it yourself. In fact, adding a missing manually loop is actually really fun in PyMol. The trick is to use build and sculpt.
So let's get started with our example and load both parts of phusion to make a fusion:
> fetch 2JGU, pfu

Now, in the menu under Build there are lots of fun things to play with!
First though let's find the C' atom of the last residue before the disordered loop. To do it by mouse, change the mouse mode to three button editing by clicking on the panel in the bottom left that says three button viewing. Now middle click to zoom into the last non-gray residue in the sequence:

Click the carboxyl carbon (C'). In this case there is only one oxygen, so no need to delete one oxygen. If there were two, the terminal oxygen is named "OXT" and is part of the last residue.
Then, click on Build>Residue and add the residues needed, they will overwrite the gray letters. To make it more fun pressing opt/alt and the single letter code of the amino acid will add the residue.

The chain added is straight and is not closed, which needs to be fixed.
Select the carboxyl carbon of the final residue (W616) and the backbone nitrogen of the next residue (S617) and you will get a distance. One could stop here and just join the two: under the menu, click Build>Create bond or type
> bond pk1, pk2

The dotted line is gone, but the bond is 10 times too long, so this is not really a good idea right now.
Luckily, there is a really fun tool called sculpting to fix it. So undo the last bond (unbond) and activate auto-scupting under Build>sculpting.
Now when you drag selected atoms around using cmd/ctrl+shift left mouse button the chain will spruce up. However, one problem is the nitrogen being repulsed by its new friend. So click on it (single ringed sphere = pk1, double = pk2 etc.) and type
> protect pk1
Now drag away! Do note that an amide bond is planar, so one needs to be careful to make it happy by having the nitrogen sticking up in the opposite side of the carboxyl oxygen. At some point, ~ 3Å, it may pay to bond the atoms, deactivate sculpting, moving the carboxyl carbon to 1.3Å and reactivating the sculpting and the dihedral issue might go away... somewhat. Also, there should be a single hydrogen on the nitrogen.
Great. However, there is another missing loop, but the process is the same.

The protect command is actually really handy so it pays to protect everything except your new residues.

One caveat for missing loops is that if the file loaded is a cif (say you fetched without specifying), when you add the residues they will be added instead of the grey ones as hoped, however, in some causes, generally pdb files, they will be added filling other missing numbers. So do be careful and renumber your new residues if need be with the "alter resi xxx, resi='yyy'" command.

His tags and TEV sites

In the case of hexahistidine tag (HHHHHH) and its fancier HAT tag alternative (KDHLIHNVHKEFHAHAHNK) they are generally unstructured loops in the PDB, so the same building method works.
TEV cut sites (ENLYFQG) are a helix —see PDB:4CYD and PDB:5AUM for example—, so fusing them as done for the main fusion proteins is a good option.
For other tags, to check do in NCBI a protein BLAST search against the pdb database (an option in the dropdown list after the species exclusion input boxes).

Dehydrate and change the ions

To get rid of the waters, either via A(ction) or type
> remove solvents
To change the ions type:
> alter resn MN, resn='MG'
> alter elem MN, elem='MG'
> sort
As we will see the command alter is amazing.
Note that the peculiarity that a single atom HETATM is still part of a "residue".
PyMol command line is very powerful. I strongly recommend read the selection logic of PyMol if you are not familiar with it (for more see official site). But basically

  • resi means residue index
  • resn means residue name
  • chain and segi the chain and the segment
  • name is the atom name. All PDB files have the same atom names for each amino acid. The first letter is the element name —except for the hydrogens, which sometimes have a number first. The next is often a greek letter transliterated into latin alphabet (NZ is the zeta nitrogen of a lysine). The connecting protonated nitrogen and carboxyl, are simply N, H, C and O. Alpha hydrogen is HA.
  • & / AND as in intersection of both selections, | / OR as in union of both selections. 


Now, I actually was hoping to join the two domains with the residues "YQKTRQVGLTSWLNIKKSMH", but for now I will pretend I wanted to cut the loop bit off the end. The section "Missing Loops" has dealt with with adding residues.
Basically Y751 onwards. So let's align this residue with the first residue of the other protein.
To align to proteins the command align is great, but for single atoms in a parwise manner pair_fit is needed.
This latter command can be used via the command line, or via the menu Wizard>Pair_fit —the former is a pain because every atom needs to be typed, the latter is diabolical as one has to click each atom, which is fiddly as often the atoms in the background get picked. In terms of command line it gets a bit long, unless one uses the shorthand form:
> pair_fit pfu & resi 753 & name CA, sso7d & chain C & resi 2 & name CA, pfu & resi 753 & name N, sso7d & chain C & resi 2 & name N
> pair_fit pfu/A//753/CA, sso7d/C//2/CA, pfu/A//753/N, sso7d/C//2/N
I will talk about the lazier option, using align.
> align pfu & resi 751-753 & name CA, sso7d & chain C & resi 1-3 & name CA
So basically the command is align <to_be_moved_protein>,<fixed_protein>
The selection of the first is the structure 2jgu and resi (residue index) 751 and atoms named C-alpha. AND means that it selects only elements present in both.  The second term has an extra bit, chain C as there are multiple chains.

Beautiful! If it did not yield the correct result, using pair_fit atom per atom is needed. So let's stick with a lazy hackish use of align.
But wait, I see you have noticed a problem as you are zealously screaming "what about dihedral angles, you idiot". By aligning the alpha carbons I basically have ignored the psi, phi and omega angles. I was hoping I'd get away with it. Oh well. In that case.
> align pfu & resi 752-753 & (name CA | name N | name O),
         sso7d & chain C & resi 2-3 & (name CA | name N | name O)
Note how the numbering is shifted: I played around to make it align okay, this means though I will have to use the mutagenesis wizard to fix it.
However, in some cases there is a clash and sculpting can help fix.


We need to make both structures part of the same chain. To do so we do the following.
Let's save the session as there is no undo.
Remove residues 752 and upwards from pfu
> remove pfu & resi 752-999

Remove the first residue off chain C of sso7d (we shifted them along).
Use the mutagenesis wizard to mutate Y751 to an alanine or build it there.
Now. Let's fix the numbering. The command alter takes some getting used to. Just remember the command is called alter, which means you can Google "PyMol alter" to be an instant PyMol ninja. resi 1 of chain C of sso7d will need to be resi 751 of chain A. First make chain A (the DNA) become something else or you will have a mess.
> alter sso7d & chain A, chain='D'
> alter sso7d & chain C, chain='A'
> sort

> alter sso7d & chain A, resi=str(int(resi)+750) # 752-2 = 750
> sort
In some cases, there are segment identifiers, which is sequence makes the chain look like "pfu/A/A/". That is <model>/<segi>/<chain>, these can be altered
> alter segi='X', segi='Y'
Now we can join them:
> fuse pfu, sso7d
> remove pfu
> bond resi 751 & name C, resi 752 & name N
We have a bona fide fused phusion!

An alternative way of doing this is by hacking the pdb file directly, which is less direct, but less confusing:
We can select all (type it) and export the selection as a molecule in PDB format (untick all for the bare minimum).
In some sequences there is line called TER in the PDB file, this needs to be removed at the site of fusion obviously. To do this, open the PDB file in TextEdit/Notepad and find it.
If you reopen this file you will have a fusion protein.


1 comment:

  1. I just couldn't leave your website before telling you that I truly enjoyed the top quality info you present to your visitors? Will be back again frequently to check up on new posts.