Some time back I dreamed up a method of making a pooled orfome library by multiplex linear amplification. I never did submit it for a grant as it is way too ambitious and, well, expensive for the end result. I really like the idea, so I feel bad it never went anywhere, nor will it, but that is science for you and only tenured professors can do what they like.
Here is a sketch of the idea. The prices might be slightly out of date.
In essence, the science is sound, albeit risky and expensive, but the generation of a pooled orfome library of a given species is not a technology that would have much demand: few people care about pathway holes anymore, and a transcriptome or fragmented genome library does the same thing without the price tag, just with orders of magnitude lower efficiency.
Summary
Aim. The creation of a pooled ASKA-like (orfome) library by doing a multiplex linear amplification reaction against genomic DNA with a primer pool generated by gene synthesis methods (a custom GeneArt Strings job).
Use. The generated pooled library of amplicons can be used to spot unknown non-homologous genes that fill a pathway hole, or to spot promiscuous activities. Not only is identifying non-homologous genes a big challenge, but there is great novelty in the method itself.
Cost. In this proposal I do not have a firm cost due to the caginess of the Invitrogen rep, but I estimate it to be around $5k.
Basics
One use of the Aska library and other orfome libraries is to pool them to find rescuers of a knockout (Patrick et al., 2007), thereby revealing the whole set of promiscuous enzymes for that activity (Dan Anderson’s definition of promiscuome: all of the promiscuous activities in a genome). This works as well (or better) with main activities, except that with E. coli there are no easy pathway holes left to probe. With species that are not E. coli this would be useful, but the number of available orfome libraries is small, and those that exist are incomplete or problematic, such as that of the T. maritima structural biology consortium (split across 3 different plasmids).
Options
PCR marathon. The Aska and other orfomes are made by separate PCRs in a multiwell plate. If one is only going to pool them, that effort seems wasted, as making each ORF individually is a Herculean task.
Fragmented genome. Shelley Copley is using a fragmented genome library as a proxy for an orfome library. The problem is that most of the library will be composed of partial genes, out-of-frame genes and genes under their own promoter.
Depleted transcriptome. Nobody (that I know) has used a transcriptome library. The major problem is that in an orfome each gene is represented once, whereas in a transcriptome genes are present at their expression levels, and the major transcripts are rRNA and those of the ribosomal proteins. Depleting the former is doable, albeit not perfect, but to get good coverage the library would have to be titanic.
Multiplex PCR. The
PCR approach is best as it is targeted. If one wanted a pool, one might be able
to pool the primers and do one or more multiplex PCRs. Three problems need to be
addressed:
- Making of primers
- Maximum n-plex PCR
- Avoiding biases
In light of the extremely high-plex nature, linear amplification is a better solution, as will be discussed below.
Primers
One way to have a pool of primers is to make them with GeneArt gene strings. At QMB I cornered the rep, who said making up to ten thousand different strings is easily doable (4300 genes in E. coli times two, assuming no redundancy), but he would not discuss pricing. For them it is easy, as they skip a step and simply deliver the oligonucleotide synthesis product. However, everything is secret (or the rep didn’t know). He could not confirm that they make the oligos by lithography, nor did he disclose the length distribution. However, Affymetrix ten years ago was making 25-base oligos (probes) for its GeneChip, so the length is presumably at least 25 bases. Illumina’s BeadArray uses 23+50-base probes, but it might be made in a weird way.
Consequently, the primers will be longer than 25 bases, which is perfect. The primers should have high melting temperatures, preferably around 72 °C, so they should be about 30 bases long.
The real problem, I assume, is that the fraction of full-length oligos decreases exponentially with product length. Let’s assume they obey IDT’s graph on the right and that GeneArt actually varies the length. Consequently, a PAGE purification step is required no matter what, but the yield would decrease: >200 ng is the guaranteed yield of a gene strings order.
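As a rough illustration of that exponential falloff, here is a minimal sketch assuming a constant per-base coupling efficiency; the efficiency values are my guesses, not anything GeneArt has disclosed:

```python
# Fraction of truly full-length oligos after solid-phase synthesis, assuming a
# constant per-base coupling efficiency (the values below are guesses, not GeneArt figures).
def full_length_fraction(length_nt: int, coupling_efficiency: float) -> float:
    """Probability that every coupling step succeeded, i.e. the oligo is full length."""
    return coupling_efficiency ** (length_nt - 1)

for eff in (0.985, 0.990, 0.995):
    print(f"30-mer at {eff:.1%} coupling: {full_length_fraction(30, eff):.0%} full length")
```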
30 bases × 8,600 primers = 258,000 bases in total.
The size limit of a gene string is 3,000 bases, but I think that is due to the overlap, as an Affymetrix GeneChip has a genome repeated multiple times. Consequently, if the overlap shifts by one base pair (roughly 3,000 unique oligos per 3 kb string), the 8,600 primers should cost the same as three 3 kb gene string reactions (assuming no discount for the skipped ligase step). If the window is 5 bases, it would be the same as fifteen 3 kb gene string reactions; if the window is 10, thirty 3 kb reactions. The latter is nonsensical as it would mean only 300 unique sequences per string, which is in stark contradiction to the diversity offered in their library options. Regardless, even at $10k it is ten times cheaper than ordering the sequences one by one. The first option seems the right one, so let’s say $3k for 600 ng, or 400 ng after PAGE (HPLC preferable).
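The same arithmetic as a throwaway sketch, under the unconfirmed assumption that the 3 kb limit reflects how many unique oligos a single reaction can draw on; it roughly reproduces the three / fifteen / thirty estimates above:

```python
# Back-of-the-envelope: how many 3 kb GeneArt Strings reactions would ~8,600 primers need,
# assuming (unconfirmed) that the 3 kb limit reflects the number of unique oligos one
# reaction can draw on and that the oligos tile every `window` bases?
N_PRIMERS = 4300 * 2          # one forward and one reverse primer per E. coli gene
STRING_LIMIT_BP = 3000        # advertised size limit of a single gene string

for window in (1, 5, 10):
    oligos_per_string = STRING_LIMIT_BP // window
    reactions = -(-N_PRIMERS // oligos_per_string)      # ceiling division
    print(f"tiling window {window:>2} nt -> {oligos_per_string:>4} oligos per string, "
          f"~{reactions} reactions")
```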
The yield is also okay: 400 ng divided by (327 g/mol per base times 30 bases per oligo) is about 40 pmol. A 200 nM total primer concentration therefore means 200 µl worth of PCR. The maximum yield would be 1 µg of amplicon.
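The pooled-primer arithmetic, spelled out with the same figures:

```python
# Converting the post-PAGE oligo yield into pooled-primer moles and reaction volume.
OLIGO_YIELD_NG = 400          # yield left after PAGE purification
AVG_BASE_MW = 327             # g/mol per base, the figure used above
PRIMER_LEN = 30               # bases per primer
POOL_CONC_NM = 200            # desired total primer concentration in the reaction (nM)

pool_pmol = OLIGO_YIELD_NG / (AVG_BASE_MW * PRIMER_LEN) * 1000   # ng/(g/mol) = nmol -> pmol
reaction_ul = pool_pmol / POOL_CONC_NM * 1000                    # pmol/nM = mL -> µl
print(f"pooled primer: {pool_pmol:.0f} pmol")                            # ~41 pmol
print(f"reaction volume at {POOL_CONC_NM} nM total primer: {reaction_ul:.0f} µl")  # ~200 µl
```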
Multiplex
The other major issue is the plex limit of a multiplex PCR. The Kapa multiplex PCR website talks about 4-plex, 8-plex and 12-plex, but the FAQ mentions 30-plex as the highest (“high-plex”: 30 amplicons, ≤1000 bp). 30-plex is nowhere near 4300-plex. However, this does not mean it is impossible. Multiplex PCR is often used in conjunction with detection methods (TaqMan etc.), and not for subsequent cloning, so some issues may not apply. Two issues, however, do apply: artefacts and biases.
On-chip multiplex PCR. Parenthetically, one paper performs a 100-plex PCR on a chip to boost sensitivity (PMID: 21909519). This is a related system and a very plausible approach; however, it would require in situ DNA synthesis capabilities.
Primer dimers. Primer dimers happen in normal PCR. The way to avoid them is to use a higher annealing temperature (longer primers) and to avoid repetitive sequences. The oddity is that the Kapa manual gives an example annealing step at 60 °C. If the PCR were a two-step reaction, it would be more efficient with less chance of noise. That means that AT-rich organisms, like P. ubique, are off the table. DMSO and betaine allow better PCR specificity, especially with GC-rich sequences, so it might be good to go even more overkill with the annealing (i.e. primer) length.
3’ trail primer binding. Here a new issue arises: the first copy of a sequence might carry, on its 3’ end, the primer binding site for the next gene. This truncated gene would not amplify with the other primer, but would act as an inhibitor soaking up the first primer. Amplicons can act as megaprimers, so it might not be as bad as it sounds; nevertheless, it is worth worrying about. Furthermore, Taq polymerase does not have strand displacement activity but 5’ exonucleolytic activity. One way to overcome this is to have 5’-protected primers (IDT options), which may interfere with the plasmid construction step and may be problematic for overlapping genes. I would shy away from strand-displacing enzymes as they may be non-optimal.
The amplification step. PCR amplification efficiencies differ between primers. I could model what may happen (simulate it; BRENDA has all the constants I need). However, PCR reactions with low primer concentration might avoid sequences lagging behind. This, combined with the previous problem, raises the suspicion that linear amplification may be best.
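If I did simulate it, even a toy model shows the problem; this sketch uses arbitrary efficiencies and an arbitrary reagent budget, nothing like a proper kinetic model built from BRENDA constants:

```python
# A toy simulation (a sketch, not a calibrated kinetic model) of how differences in
# per-amplicon efficiency compound exponentially in a multiplex PCR once a shared
# primer/dNTP budget starts to run out.
import random

random.seed(1)
N_AMPLICONS = 100                      # toy number rather than 4300, to keep it quick
copies = [1.0] * N_AMPLICONS           # genomic template: one copy of each target
eff = [random.uniform(0.7, 0.95) for _ in range(N_AMPLICONS)]   # per-amplicon efficiency
BUDGET_PER_CYCLE = 50.0                # arbitrary cap standing in for limited reagents

for _ in range(25):                    # 25 cycles
    demand = [c * e for c, e in zip(copies, eff)]        # copies each target could add
    scale = min(1.0, BUDGET_PER_CYCLE / sum(demand))     # throttle when reagents run short
    copies = [c + d * scale for c, d in zip(copies, demand)]

copies.sort()
print(f"least represented target: {copies[0]:.1f} copies")
print(f"most represented target:  {copies[-1]:.1f} copies")
print(f"spread after 25 cycles:   {copies[-1] / copies[0]:.0f}-fold")
```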
Linear amplification. One thing to avoid is exponentially amplifying biases. One option is to do a single-primer linear amplification. At first I would say no, as it is really wasteful in terms of primers (which are limited) and the yield would not be enough. Plus it would need a second step in the opposite direction to produce correctly sized sequences. Plus, the major issue with high-plex reactions is the low yield of product, so this would be four-thousand-fold worse when linearly amplified. However, what is not being considered is the fact that a 4300-plex PCR isn’t amplifying exponentially anyway. In a 25 µl reaction with 100 ng of template and a 200 nM primer pool, there are about 2.5 pM genome strands (for a 4.6 Mbp genome) and 50 pM of each primer (of 4300), which only allows for a twenty-fold amplification.
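A quick check of those numbers (assuming ~650 g/mol per bp of dsDNA); it lands in the same ballpark as the figures quoted above:

```python
# Checking the back-of-the-envelope numbers for the 4300-plex reaction
# (assumes ~650 g/mol per base pair of double-stranded DNA).
TEMPLATE_NG = 100
GENOME_BP = 4.6e6
REACTION_UL = 25
PRIMER_POOL_NM = 200
N_PRIMER_SPECIES = 4300

genome_mol = TEMPLATE_NG * 1e-9 / (GENOME_BP * 650)             # moles of genome copies
strand_pM = genome_mol * 2 / (REACTION_UL * 1e-6) * 1e12        # two strands per copy
per_primer_pM = PRIMER_POOL_NM * 1e3 / N_PRIMER_SPECIES         # nM -> pM, split per species
print(f"genome strands: {strand_pM:.1f} pM")                    # ~2.7 pM
print(f"each primer:    {per_primer_pM:.1f} pM")                # ~46.5 pM
print(f"primer-limited amplification: ~{per_primer_pM / strand_pM:.0f}-fold")
```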
Inspection
How does one go about testing it? A gel will show a smear whether it is a nonspecific mess or whether it worked, so gels are uninformative. For the final product, NGS may be an option to see bias amplification, but not for routine troubleshooting. Transforming specific rescue strains is too laborious. The best option is RT-PCR of a few genes (spanning a spread of sizes).
Plasmid
The next step, once a pool has been made, is to get the amplicons into plasmid form.
RE cloning, blunt.
Messy products, low yield. Non-directional. Not worth it.
RE cloning, added 5’ site. It would require splitting the pool into subsets to avoid cutting within the amplicons, which would be a pain. It also means adding bases to the primers.
Gibson assembly. It requires adding to the primers. This means longer primers, which is okay-ish, but I have an unscientific hunch it may mess up the multiplex reaction.
TOPO cloning. Directional TOPO is a good option, but has a drawback: there is no good plasmid. (Also, I am ignoring the 5’ primer protection possibility here.) The pET is not good as it requires T7 (cotransformation of the pTARA plasmid and a rescue screen has not been done). The pBAD and the pET have loads of crap at the end. The Aska collection is 10% non-functional due to the GFP tag. The thioredoxin and hexahistidine tags may be cool, but they are in the way of a clean experiment. Adding stuff to the primers to override the crap is a no-go due to the concern about 5’ tags in multiplex reactions, and it is seriously inelegant for downstream processes. So the plan is TOPO ligation into a pENTR plasmid followed by a Gateway reaction into a custom plasmid. It would allow a brief amplification step. A custom TOPO vector would be ideal, but it costs more than the GeneStrings step itself, and a good homebrew TOPO plasmid has never been achieved. So, disappointingly, a two-step TOPO is the TOPO way (I’ll keep an eye out for new TOPO kits).
Test subject
P. ubique is not an option as it is too AT-rich and its genes cannot be expressed even with pRARE, even though this whole idea arose to tackle the pathway-hole issue in P. ubique.
I want to try it for T. maritima to find out if MetC is the sole racemase, but as the structural genomics consortium’s library already exists, it seems a tad redundant.
If I were to do this independently of my project, it would be easy to find a good candidate that:
- has a solvable pathway hole for a field test of the library,
- has a small genome,
- has proteins expressible in E. coli, and
- has a GC-content around 50%.
EDIT. After analysing the transcriptome of Geobacillus thermoglucosidasius, and knowing how popular it is getting, I would say a Geobacillus sp. is a perfect candidate.
Parenthesis: Name
For now, I’ll call it “ORFome library by multiplex linear amplification”. I have been calling it an Aska-like library, although I’d like a better name. Yes, a name is a semantic handle, but there is nothing worse than a bad name. Delitto Perfetto is a cool method name, CAGE isn’t, and I’d rather keep away from backronyms. Portmanteaux are where it’s at, especially for European grants. Geneplex is already taken, but -plex means -fold anyway. Genomeplex? Mass ORF conjuration spell? I would love to coin a name, but I should hold off so as not to take the fun away from later on.