Pooled ORFome library by multiplex linear PCR

Some time back I dreamed up a method of making a pooled orfome library by multiplex linear amplification. I never did submit it for a grant as it is way too ambitious and, well, expensive for the end result. I really like the idea so I feel bad it never went anywhere, nor will it, but that is science for you and only tenured professors can do what they like.

Here is a sketch of the idea. The prices might be slightly out of date.

In essence, the science is sound, albeit risky and expensive, but the generation of a pooled orfome library of a given species is not a technology that would have much demand, because few people care about pathway holes anymore and a transcriptome or fragmented genome library does the same thing without the price tag, just with orders of magnitude lower efficiency.

Summary

Aim. The creation of a pooled ASKA-like (orfome) library by doing a multiplex linear amplification reaction against genomic DNA with a primer pool generated by gene synthesis methods (custom GeneArt Strings job).

Use. The generated pooled library of amplicons can be used to spot unknown non-homologous genes that fill a pathway hole or that carry promiscuous activities. Not only is identifying non-homologous genes a big challenge, but there is great novelty in the method itself.

Cost. In this proposal I have not got a cost due to the caginess of the Invitrogen rep, but I estimate it to be around $5k.

Basics

One use of the ASKA library and other orfome libraries is to pool them to find rescuers of a knockout (Patrick et al., 2007). This reveals the whole set of promiscuous enzymes for that activity (Dan Anderson’s definition of promiscuome means all of the promiscuous activities in a genome). This works as well (or better) with main activities, except that with E. coli there are no easy pathway holes left to probe. With species that are not E. coli this would be useful, but the available orfome libraries are few, incomplete or problematic, such as that of the T. maritima structural biology consortium (split over 3 different plasmids).

Options

PCR marathon. The ASKA and other orfomes are made by separate PCRs in a multiwell plate. If one is only going to pool them, the effort seems wasted as it is a Herculean task.

Fragmented genome. Shelley Copley is using a fragmented genome library as a proxy for an orfome library. The problem is that most of the library will be composed of partial genes, out-of-frame genes and genes under their own promoter.

Depleted transcriptome. Nobody (that I know) has used a transcriptome library. The major problem is that the genes are not represented one of each (as in an orfome), but are present at very different concentrations, where the major transcripts are rRNA and ribosomal peptides. Depleting the former is doable, albeit not perfect, but to get good coverage the library would have to be titanic.

Multiplex PCR. The PCR approach is best as it is targeted. If one wanted a pool, one might be able to pool the primers and do one or more multiplex PCRs. Three problems need to be addressed:

Making of primers
Maximum n-plex PCR
Avoiding biases

In light of the extremely high-plex nature, a linear amplification is a better solution as will be discussed.

Primers

One way to have a pool of primers is to make them with a GeneArt gene strings job. At QMB I cornered the rep, who said that making up to ten thousand different strings was easily doable (4300 genes in E. coli times two, assuming no redundancy), but would not discuss pricing. For them it is easy as they skip a step and simply deliver the oligonucleotide synthesis product. However, everything is secret (or the rep didn’t know). He could not confirm that they make the oligos by lithography. Nor did he disclose the length distribution. However, Affymetrix ten years ago was making 25-base oligos (probes) for its GeneChip, so the achievable length is higher than 25. Illumina’s BeadArray uses 23+50-base probes, but they might be made in a weird way. Consequently, the primers can be longer than 25 bases, which is perfect. The primers should have high melting temperatures, preferably at 72 °C, so should be 30 bases long.

The real problem I assume is that the completeness decreases exponentially with product length.

Let’s assume they obey the IDT’s graph on the right and GeneArt actually varies the length.

Consequently, a PAGE purification step is required no matter what.
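The exponential drop can be sketched with the usual stepwise-coupling approximation, where an n-base oligo needs n − 1 successful couplings. The 99% coupling efficiency used here is an assumed, illustrative figure of mine, not a number from GeneArt or IDT.

```python
def full_length_fraction(length_nt: int, coupling_efficiency: float = 0.99) -> float:
    """Fraction of synthesised oligos that reach full length.

    Assumes every coupling step succeeds independently with the same
    probability (coupling_efficiency is a hypothetical value); an n-mer
    needs n - 1 couplings because the first base is already on the support.
    """
    return coupling_efficiency ** (length_nt - 1)


# Compare the old 25-mer probes with the 30-mer primers and longer oligos.
for n in (25, 30, 40, 60):
    print(f"{n}-mer: {full_length_fraction(n):.1%} full length")
```

Under that assumption roughly a quarter of a 30-mer pool is truncated, which is why the purification step is hard to skip.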

But the yield would decrease. >200 ng is the guaranteed yield of gene strings.

30*8.6 k = 258,000 bases. The size limit of the sequence is 3,000 bases, but I think that that is due to the overlap, as an Affymetrix GeneChip has a genome repeated multiple times. Consequently, if the overlap is a shifting of one base pair it should cost the same as three 3 kb gene string reactions (assuming no discount for the skipped ligase step). If it is a 5-base window, it would be the same as 15 3 kb gene string reactions. If the window is 10, about 30 3 kb reactions. The latter is nonsensical as it would mean only 300 unique sequences per reaction, which is in stark contradiction to the diversity offered in their library options. Regardless, even at $10k it is ten times cheaper than ordering the sequences one by one. The first option seems the right one, so let’s say $3k for 600 ng. 400 ng after PAGE (HPLC preferable).
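The reaction counts can be sketched as follows, assuming my reading of the "window" is right: each 3,000-base gene string yields one 30-base oligo per shift of the window. The function name and that reading are mine, not anything GeneArt confirmed.

```python
import math


def strings_needed(n_oligos: int, oligo_len: int, string_len: int, step: int) -> int:
    """Gene-string reactions needed to cover n_oligos distinct oligos.

    Assumes (hypothetically) that one reaction yields one oligo for every
    `step`-base shift of an `oligo_len` window along a `string_len` sequence.
    """
    oligos_per_string = (string_len - oligo_len) // step + 1
    return math.ceil(n_oligos / oligos_per_string)


N_PRIMERS = 8600  # 4300 genes times two primers
print("total bases:", 30 * N_PRIMERS)
for step in (1, 5, 10):
    print(f"shift of {step}: {strings_needed(N_PRIMERS, 30, 3000, step)} reactions")
```

This gives 3, 15 and 29 reactions for shifts of 1, 5 and 10 bases, in line with the round numbers above.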

The yield is also okay. 400 ng divided by (327 g/mol per base times 30 bases per oligo) is 40 pmol. A 200 nM reaction concentration means 200 µl of PCRs. The maximum yield would be 1 µg amplicon.
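The same arithmetic as a small sketch, using the 327 g/mol per base from the text; the function names are just for illustration.

```python
AVG_NT_MASS = 327.0  # g/mol per base, the average used in the text


def oligo_pmol(mass_ng: float, length_nt: int, nt_mass: float = AVG_NT_MASS) -> float:
    """Convert a mass of single-stranded oligo (ng) into an amount (pmol)."""
    grams = mass_ng * 1e-9
    mol = grams / (nt_mass * length_nt)
    return mol * 1e12


def reaction_volume_ul(amount_pmol: float, conc_nM: float) -> float:
    """Total volume (µl) in which amount_pmol gives a final concentration of conc_nM."""
    # pmol / nM = 1e-12 mol / (1e-9 mol/L) = 1e-3 L, hence the factor of 1000 for µl
    return amount_pmol / conc_nM * 1e3


primers = oligo_pmol(400, 30)
print(f"{primers:.1f} pmol of primer pool")
print(f"{reaction_volume_ul(primers, 200):.0f} µl of PCR at 200 nM")
```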

Multiplex

The other major issue is the limit of a multiplex PCR. The Kapa multiplex PCR website talks about 4-plex, 8-plex and 12-plex, but it mentions in the FAQ that 30 is the highest reported: high-plex, 30 amplicons, ≤1000 bp. 30-plex is nowhere close to 4300-plex. However, this does not mean it is impossible.

Multiplex PCR is often used in conjunction with detection methods (TaqMan etc.), and not for subsequent cloning, so some issues may not apply. Two issues, however, do apply: artefacts and biases.

On chip multiplex PCR. Parenthetically one paper makes a 100-plex PCR on chip to boost sensitivity (PMID: 21909519). This is a related system and a very plausible approach, however, it would require in situ DNA synthesis capabilities.

Primer dimers. Primer dimers happen in normal PCR. The way to avoid them is to use a higher annealing temperature (longer primers) and to avoid repetitive sequences. The oddity is that the Kapa manual gives an example annealing step at 60 °C. If the PCR were a two-step reaction, the reaction would be more efficient with less chance of noise. That means that AT-rich organisms, like P. ubique, are out. DMSO and betaine allow better PCR specificity, especially with GC-rich sequences, so it might be good to go even more overkill with annealing length.

3’ trail primer binding. Here a new issue arises: the first copy of a sequence might have a primer binding site on its 3’ end for the next gene. This truncated gene would not amplify with the other primer, but would act as an inhibitor, soaking up the first primer. Amplicons can act as megaprimers, so it might not be as bad. Nevertheless, it is worth worrying about. Furthermore, Taq polymerase does not have strand displacement activity but does have 5’ exonucleolytic activity. One way to overcome this is to have 5’ protected primers (IDT options), which may interfere with the plasmid construction step and may be problematic for overlapping genes. I would shy away from strand-displacing enzymes as they may be non-optimal.

The amplification step. PCR amplification efficiencies differ between primers. I could model what may happen (simulate, BRENDA has all the constants I need). However, PCR reactions with low primer concentration might avoid sequences lagging behind. This combined with the previous problem raises the suspicion linear amplification may be best.
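A toy version of such a model, far cruder than anything built on real kinetic constants, already shows the difference between the two regimes. The per-cycle efficiencies (0.9 and 0.7) and the 25 cycles are made-up illustrative values, and the model ignores primer depletion altogether.

```python
def exponential_copies(efficiency: float, cycles: int) -> float:
    """Copies per template after PCR, where products are themselves templates."""
    return (1.0 + efficiency) ** cycles


def linear_copies(efficiency: float, cycles: int) -> float:
    """Copies per template after single-primer extension, original template only."""
    return 1.0 + efficiency * cycles


CYCLES = 25              # hypothetical cycle number
GOOD, POOR = 0.9, 0.7    # hypothetical per-cycle efficiencies of two primers

exp_bias = exponential_copies(GOOD, CYCLES) / exponential_copies(POOR, CYCLES)
lin_bias = linear_copies(GOOD, CYCLES) / linear_copies(POOR, CYCLES)
print(f"exponential: good amplicon over-represented {exp_bias:.1f}-fold")
print(f"linear:      good amplicon over-represented {lin_bias:.2f}-fold")
```

In the exponential case the efficiency gap compounds every cycle, whereas in the linear case the bias never exceeds the ratio of the two efficiencies.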

Linear amplification. One thing to avoid is exponentially amplifying biases. One option is to do a single-primer linear amplification. At first I would say no, as it is really wasteful in terms of primers (which are limited) and the yield would not be enough. Plus it would need a second step in the opposite direction to produce correctly sized sequences. Plus the major issue with high-plex reactions is the low yield of product, so this would be four-thousand-fold worse when linearly amplified. However, what is not being considered is the fact that a 4300-plex PCR isn’t amplifying exponentially. In a 25 µl reaction with 100 ng template and a 200 nM primer pool, there are about 2.5 pM of genome strands (for 4.6 Mbp) and 50 pM of each primer (of 4300), which only allows for a twenty-fold amplification.
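These concentrations can be checked with a short sketch. The 650 g/mol per base pair is the usual average for double-stranded DNA and is my assumption; the rest are the numbers from the paragraph above.

```python
BP_MASS = 650.0  # g/mol per base pair of dsDNA (assumed average)


def genome_strand_pM(template_ng: float, genome_bp: float, volume_ul: float) -> float:
    """Concentration (pM) of single genome strands in the reaction."""
    mol_genomes = template_ng * 1e-9 / (genome_bp * BP_MASS)
    molar = mol_genomes / (volume_ul * 1e-6)
    return 2 * molar * 1e12  # two strands per double-stranded genome


def per_primer_pM(pool_nM: float, n_primers: int) -> float:
    """Concentration (pM) of each primer when the pool is split evenly."""
    return pool_nM * 1e3 / n_primers


strands = genome_strand_pM(100, 4.6e6, 25)
primer = per_primer_pM(200, 4300)
print(f"genome strands: {strands:.1f} pM")
print(f"each primer:    {primer:.1f} pM")
print(f"maximum fold amplification: {primer / strands:.0f}")
```

It comes out at roughly 2.7 pM of strands, 47 pM per primer and a ceiling of about 17-fold, so the rounded figures hold.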

Inspection

How does one go about testing it? There will be smears both if it is a nonspecific mess and if it worked. For the final product, NGS may be an option to see amplification bias, but not for routine troubleshooting. Transforming specific rescue strains is too laborious. The best option is RT-PCR (real-time PCR) of some genes (spread of sizes).

Plasmid

The next step once a pool has been made is to get them in plasmid form.

RE cloning, blunt. Messy products, low yield. Non-directional. Not worth it.

RE cloning, added 5’ site. It would require subsets to avoid cutting the amplicon, which would be a pain. Also it means adding to the primer.

Gibson assembly. It requires adding to the primer. This means longer primers, which is okay-ish, but I have an unscientific hunch it may mess up the multiplex primer.

TOPO clone. Directional TOPO is a good option, but has a drawback: there is no good plasmid. Also I am ignoring the 5’ primer protection possibility. The pET is not good as it requires T7 (cotransformation of the pTARA plasmid and a rescue screen has not been done). The pBAD and the pET have loads of crap at the end. The ASKA collection is 10% non-functional due to the GFP tag. The thioredoxin and the hexahistidine tags may be cool, but they are in the way of a clean experiment. Adding stuff to the primers to override the crap is a no-go due to the 5’ tag in multiplex concern and is seriously inelegant for downstream processes. So the answer is TOPO ligation into a pENTR plasmid followed by Gateway into a custom plasmid. It would allow a brief amplification step. A custom TOPO would be ideal, but it costs more than the GeneStrings step itself. A good homebrew TOPO-plasmid has never been achieved. So, disappointingly, a two-step TOPO is the TOPO way (I’ll keep an eye out for new TOPO kits).

Test subject

P. ubique is not an option as it is too AT-rich and its genes cannot be expressed even with pRARE, even though this whole idea arose to tackle the pathway holes issue in P. ubique.
I want to try it for T. maritima to find if MetC is the sole racemase, but as there is the structural genomics consortium’s library, it makes it seem a tad redundant.

If I were to do this independently of my project, it would be easy to find a good candidate that:

has a solvable pathway hole for the field test of the library,
has a small genome,
has proteins that are expressible in E. coli and
has a 50% GC-content.

EDIT. After analysing the transcriptome of Geobacillus thermoglucosidasius and knowing how popular it is getting I would say a Geobacillus spp. is a perfect candidate.

Parenthesis: Name

For now, I’ll call it ORFome library by multiplex linear amplification. I have been calling it ASKA-like library, although I’d like a better one. Yes, a name is a semantic handle, but there is nothing worse than a bad name. Delitto Perfetto is a cool method name, CAGE isn’t and I’d rather keep away from backronyms. Portmanteaux are where it’s at, especially for European grants. Geneplex is already taken, but -plex means -fold anyway. Genomeplex? Mass ORF conjuration spell? I would love to coin a name, but I should hold off so as not to take the fun away from later on.


Saturday, 14 November 2015