Saturday 28 November 2015

ABS biosynthesis

Lego, rumour has it, wants to biosynthesise acrylonitrile butadiene styrene (ABS), the resin that gives their blocks their firm hold and transgenerational lifespan. This is cool for three reasons:
  1. metabolic engineering is cool by definition,
  2. Lego is cool by definition and
  3. one or two steps link back to a cool gene I found in Geobacillus


So what might they do to biosynthesise their resin? The processes are rather straightforward and one has to go out of one's way to dream up a cool route. In fact, there is a lot of repetition.
The three monomers for the polymerisation are styrene, acrylonitrile and butadiene. These would be made separately, but there are several commonalities, such as the terminal ene group.
There are a few ways to get a terminal ene group:
  1. Have a 2,3-ene and tautomerise it
  2. Have a 2,3-ene and terminal carboxyl and eliminate the carboxyl
  3. Reversible dehydration
  4. Irreversible dehydration via a phosphorylated intermediate
  5. Oxidative decarboxylation (oleT encoded p450-dependent fatty acid decarboxylase from Jeotgalicoccus sp.)

My guess is that their major challenge is that they will have to extensively modify a few enzymes and will be plagued with detection and screening. Nevertheless, I am still going to talk about the chemistry as it is a good excuse to sneak in a cool set of genes from Geobacillus.


There are two ways to biosynthesise styrene. The simplest is decarboxylating cinnamic acid, while the more interesting one is dehydrating phenylethanol.

The tourist route

Phenylethanol (also uglily called phenylethyl alcohol) is in turn made from phenylacetate, which is made from phenylpyruvate.
Recently, while analysing a transcriptomic dataset for Prof. D. Leak, which resulted in an awesome website, I stumbled across a really cool enzyme encoded among phenylalanine degradation genes that I speculate is a phenylpyruvate dehydrogenase. This is a homologue of pyruvate dehydrogenase and follows the same mechanism, namely a decarboxylative oxidation followed by CoA attack.

There are other ways to make phenylacetate, but none allow such a shameless plug for my site —in fact, I should have talked about the 2-phenylethylamine biosynthetic route instead.
In nature the phenylacetate will go down the phenylacetate degradation pathway (paa genes), but it could be forced to go backwards and twice reduce the carboxyl group. Phenylacetaldehyde dehydrogenase is a common enzyme, which even E. coli has (feaB), but the phenylethanol dehydrogenase is not. I found no evidence that anyone has characterised one, but I am fairly certain that Gthg02251 in Geobacillus thermoglucosidasius is one, as it is an alcohol dehydrogenase guiltily encoded next to feaB, which in turn is not encoded with phenylethylamine deaminase (tynA).
So, that is how one makes phenylethanol. The dehydration part is problematic. A dehydratase would be reversible, but offers the cool advantage that it can be evolved by selecting for better variants that allow a bug with the paa genes and all these genes to survive on styrene as a carbon source. The alternative is phosphorylation and then dehydration as happens with several irreversible metabolic steps.

The actual route

That is the interesting way of doing it, whereas the simple way is rather stereotypical. In plants there are really few secondary metabolites that are not derived from polyketides, isoprenoids, cinnamate/coumarate or a combination of these. Cinnamic acid is deaminated phenylalanine, made via a curious elimination reaction (catalysed by PAL). In the post metabolic engineering breaking bad I discuss how nature makes ephedrine, which is really complex and ungainly, and then suggest a quicker way. Here the cinnamic acid route is actually way quicker as a simple decarboxylation does the trick. To defend itself from cinnamic acid, S. cerevisiae has an enzyme, PAD1p, that decarboxylates it. Therefore, all that is needed is PAL and PAD1.


Previously I listed the possible routes to a terminal alkene, which were:
  1. Tautomerise a 2,3-ene
  2. Decarboxylate a 2,3-ene with terminal carboxyl
  3. Dehydrate reversibly
  4. Dehydrate irreversibly via a phosphorylated intermediate
  5. Decarboxylate oxidatively
In the case of butadiene, it is a 4 carbon molecule already, which forces one's hand in route choice. Aminoadipate is used to make lysine when diaminopimelate and dihydropicolinate are not needed. That means that a similar trick to the styrene biosynthetic route could be taken, namely the amine of aminoadipate is eliminated by a PAL mutant, and the product is decarboxylated by a PAD1 mutant and then oxidatively decarboxylated by a mutant OleT. But that requires changing the substrate a lot for three steps, and the cells went to a lot of effort to make aminoadipate, so it is a rather wasteful route.
Another way is to co-opt the butanol biosynthetic pathway to make butenol and dehydrate that.
A better way is to twice dehydrate butanediol.

As mentioned for styrene, a reversible dehydration means that selection could be done backwards. However, pushing the reaction in the forward direction would require product clearance, otherwise there will be as much alcohol as alkene. With butanediol and butanol there is a production and a degradation pathway, which would mean that selection could be done with the degradation route, while the actual production would use the production route.


Acrylonitrile is a curious molecule to biosynthesise. There are nitrile degrading bacteria and some pathways make nitriles, so it is not wholly alien. preQ0 in queuosine biosynthesis is the first I encountered: QueC performs an ATP-powered reaction where a carboxyl is converted to a nitrile. I am not sure why, but a cyano group seems (=Google) less susceptible to hydrolysis than a ketimine (methylcyanoacrylate, i.e. superglue, follows a different reaction). Beta-alanine could be the starting compound, but it would require so many steps that it is a bad idea.
Substituting carboxyl for nitrile (nitrilating?) on acrylic acid with a QueC-like enzyme would be better. Acrylic acid is small so it can be made by dehydration of lactic acid, oxidative decarboxylation of succinate or decarboxylation of fumarate. The latter sounds like the easiest solution as there are many decarboxylases that use similar molecules, such as malate or tartrate decarboxylase.


Basically, even if it seems like a crazy idea at first, the processes are rather straightforward (one or two engineered enzymes for each pathway), but the chemistry is pretty hardcore, so the few engineered enzymes will have to be substantially altered. Given that the compounds are small, quantifying yields will be their main challenge. How one goes about designing a selection system for these is an even bigger challenge, as evolving repressors to respond to small and solely hydrophobic compounds would be nearly impossible... So they will most likely have to do this by rational design alone, which makes it seem like a crazy idea after all.

Sunday 22 November 2015

Noah's ark dilemma in phylogeny

When looking at bacterial diversity, be it for a conserved gene or the organism itself, a common problem is the wealth of sister strains and sister species: this is a bother as often one would like a balanced representation. This leads to a Noah's ark dilemma of having to pick one.

Saturday 14 November 2015

Pooled ORFome library by multiplex linear PCR

Some time back I dreamed up a method of making a pooled orfome library by multiplex linear amplification. I never did submit it for a grant as it is way too ambitious and, well, expensive for the end result. I really like the idea so I feel bad it never went anywhere, nor will it, but that is science for you and only tenured professors can do what they like.
Here is a sketch of the idea. The prices might be slightly out of date.
In essence, the science is sound, albeit risky and expensive, but the generation of a pooled orfome library of a given species is not a technology that would have much demand, because few people care about pathway holes anymore and a transcriptome or fragmented genome library does the same thing without the price tag, just with orders of magnitude lower efficiency.


Aim. The creation of a pooled ASKA-like (orfome) library by doing a multiplex linear amplification reaction against genomic DNA with a primer pool generated by gene synthesis methods (custom GeneArt Strings job).
Use. The generated pooled library of amplicons can be used to spot unknown non-homologous genes that fill a pathway hole or promiscuous activities. Not only is identifying non-homologous genes a big challenge, but there is great novelty in the method itself.
Cost. In this proposal I have not got a cost due to the caginess of the Invitrogen rep, but I estimate it to be around $5k.


One use of the Aska library and other orfome libraries is to pool them to find rescuers of a knockout (Patrick et al., 2007). This reveals the whole set of promiscuous enzymes for that activity (Dan Anderson’s definition of promiscuome means all of the promiscuous activities in a genome). This works as well (or better) with main activities, except that with E. coli there are no easy pathway holes left to probe. With species that are not E. coli this would be useful, but the available orfome libraries are few, incomplete or problematic, such as that of the T. maritima structural biology consortium (split on 3 different plasmids).


PCR marathon. The Aska and other orfomes are made by separate PCRs in a multiwell plate. If one is to pool them the effort seems wasted as it is a Herculean task.
Fragmented genome. Shelley Copley is using a fragmented genome library as a proxy for an orfome library. The problem is that most of the library will be composed of partial genes, out of frame genes and genes under their own promoter.
Depleted transcriptome. Nobody (that I know) has used a transcriptome library. The major problem is that the genes are not represented one of each (as in an orfome), but are present at varying concentrations, where the major transcripts are rRNA and ribosomal protein transcripts. Depleting the former is doable, albeit not perfect, but to get a good coverage the library would have to be titanic.
Multiplex PCR. The PCR approach is best as it is targeted. If one wanted a pool, one might be able to pool the primers and do one or more multiplex PCRs. Three problems need to be addressed:
  • Making of primers
  • Maximum n-plex PCR
  • Avoiding biases
In light of the extremely high-plex nature, a linear amplification is a better solution as will be discussed.


One way to have a pool of primers is to make them with GeneArt gene strings. At QMB I cornered the rep, who said that making up to ten thousand different strings was easily doable (4300 genes in E. coli times two, assuming no redundancy), but would not discuss pricing. For them it is easy as they skip a step and simply deliver the oligonucleotide synthesis product. However, everything is secret (or the rep didn’t know). He could not confirm that they make the oligos by lithography, nor did he disclose the length distribution. However, Affymetrix ten years ago was making 25 base oligos (probes) for its GeneChip, so it is higher than 25. Illumina’s BeadArray uses 23+50 base probes, but they might be made in a weird way. Consequently, the primers will be longer than 25 bases, which is perfect. The primers should have high melting temperatures, preferably at 72 °C, so should be 30 bases long.
The real problem I assume is that the completeness decreases exponentially with product length.

Let’s assume they obey the IDT’s graph on the right and GeneArt actually varies the length.
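As a rough sketch of why completeness drops exponentially with length: if every coupling step succeeds with a fixed probability, the full-length fraction of an n-base oligo is that probability raised to the power n - 1. The 99% coupling efficiency below is an assumed round number for illustration, not a figure from GeneArt or IDT, and the function name is made up for this example.

```python
def full_length_fraction(length, coupling_efficiency=0.99):
    """Fraction of oligos that are full length after length - 1 couplings.

    coupling_efficiency is an assumed per-step value, not a vendor figure.
    """
    return coupling_efficiency ** (length - 1)


# Compare a GeneChip-sized probe, the proposed primer and longer products.
for length in (25, 30, 50, 100):
    fraction = full_length_fraction(length)
    print(f"{length:>3} bases: {fraction:.0%} full length")
```

Whatever the real efficiency is, the truncated fraction is never zero, which is the argument for a purification step.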

Consequently, a PAGE purification step is required no matter what.
But the yield would decrease. >200 ng is the guaranteed yield of gene strings.
30 bases × 8.6k primers = 258,000 bases. The size limit of the sequence is 3,000 bases, but I think that that is due to the overlap, as an Affymetrix GeneChip has a genome repeated multiple times. Consequently, if the overlap is a shift of one base pair it should cost the same as three 3 kb gene string reactions (assuming no discount for the skipped ligase step). If it is a 5 base window, it would be the same as 15 3 kb gene string reactions. If the window is 10, 30 3 kb reactions. The latter is nonsensical as it would mean only 300 unique sequences, which is in stark contradiction to the diversity offered in their library options. Regardless, even at $10k it is ten times cheaper than one by one sequences. The first option seems the right one, so let’s say $3k for 600 ng, or 400 ng after PAGE (HPLC preferable).
The yield is also okay. 400 ng divided by (327 g/mol per base times 30 bases per oligo) is 40 pmol. 200 nM reaction concentration means 200 µl PCRs. The maximum yield would be 1 µg amplicon.
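The primer arithmetic above can be checked with a few lines. This is only a sketch using the numbers quoted in the text (327 g/mol per base, 30 bases, 400 ng, 200 nM); the function and constant names are invented for the example.

```python
AVG_NT_MASS = 327.0     # g/mol per base, the figure used in the text
OLIGO_LENGTH = 30       # bases per primer
POOL_MASS_NG = 400.0    # ng of primer pool left after PAGE
PRIMER_POOL_NM = 200.0  # total primer concentration in the reaction


def pool_pmol(mass_ng, length=OLIGO_LENGTH, nt_mass=AVG_NT_MASS):
    """Convert a mass of oligo pool (ng) into an amount (pmol)."""
    return mass_ng * 1e-9 / (length * nt_mass) * 1e12


def reaction_volume_ul(pmol, conc_nm):
    """Total reaction volume (ul) that pmol of primers supports at conc_nm."""
    return pmol * 1e-12 / (conc_nm * 1e-9) * 1e6


amount = pool_pmol(POOL_MASS_NG)
volume = reaction_volume_ul(amount, PRIMER_POOL_NM)
print(f"{amount:.1f} pmol of primers, enough for {volume:.0f} ul of PCR")
```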


The other major issue is the limit of a multiplex PCR. The Kapa multiplex PCR website talks about 4-plex, 8-plex and 12-plex, but it mentions in the FAQ that 30 is the highest reported (high-plex: 30 amplicons, ≤1000 bp). 30-plex is nowhere close to 4300-plex. However, this does not mean it is impossible.
Multiplex PCR is often used in conjunction with detection methods (TaqMan etc.), and not for subsequent cloning. So some issues may not apply. Two issues however do apply, artefacts and biases.
On chip multiplex PCR. Parenthetically one paper makes a 100-plex PCR on chip to boost sensitivity (PMID: 21909519). This is a related system and a very plausible approach, however, it would require in situ DNA synthesis capabilities.
Primer dimers. Primer dimers happen in normal PCR. The way to avoid them is using a higher annealing temperature (longer primers) and avoiding repetitive sequences. The oddity is that the Kapa manual gives an example annealing step at 60 °C. If the PCR were a two-step reaction, the reaction would be more efficient with less chance of noise. That means that AT-rich organisms, like P. ubique, are off the table. DMSO and betaine allow better PCR specificity, especially with GC-rich sequences, so it might be good to go even more overkill with annealing length.
3’ trail primer binding. Here a new issue arises: the first copy of a sequence might have a primer binding site on its 3’ end for the next gene. This truncated gene would not amplify with the other primer, but would act as an inhibitor soaking up the first primer. Amplicons can act as megaprimers, so it might not be as bad. Nevertheless, it is worth worrying about. Furthermore, Taq polymerase does not have strand displacement activity but 5’ exonucleolytic activity. One way to overcome this is to have 5’ protected primers (IDT options), which may interfere with the plasmid construction set and may be problematic for overlapping genes. I would shy away from strand displacing enzymes as they may be non-optimal.
The amplification step. PCR amplification efficiencies differ between primers. I could model what may happen (simulate, BRENDA has all the constants I need). However, PCR reactions with low primer concentration might avoid sequences lagging behind. This combined with the previous problem raises the suspicion linear amplification may be best.
Linear amplification. One thing to avoid is exponentially amplifying biases. One option is to do a single primer linear amplification. At first I would say no as it is really wasteful in terms of primers (which are limited) and the yield would not be enough. Plus it would need a second step in the opposite direction to produce correctly sized sequences. Plus the major issue with high-plex reactions is the low yield of product, so this would be four-thousand fold worse when linearly amplified. However, what is not being considered is the fact that a 4300-plex PCR isn’t amplifying exponentially. In a 25 µl reaction with 100 ng template and a 200 nM primer pool, there are about 2.5 pM of genome strands (for 4.6 Mbp) and 50 pM of each primer (of 4300), which only allows for a twenty-fold amplification.
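The concentrations in that last step can be reproduced as follows. This is a back-of-the-envelope sketch: the 650 g/mol per base pair is an assumed average mass for double-stranded DNA, and the function names are invented here. With these inputs it lands close to the rounded figures quoted above.

```python
BP_MASS = 650.0  # g/mol per base pair, an assumed average for dsDNA


def genome_strand_pm(template_ng, genome_bp, volume_ul):
    """Concentration (pM) of single genome strands in the reaction."""
    mol_genomes = template_ng * 1e-9 / (genome_bp * BP_MASS)
    molar = mol_genomes / (volume_ul * 1e-6)
    return 2 * molar * 1e12  # two strands per genome copy


def per_primer_pm(pool_nm, n_primers):
    """Concentration (pM) of each primer when the pool is split evenly."""
    return pool_nm * 1e3 / n_primers


strands = genome_strand_pm(template_ng=100, genome_bp=4.6e6, volume_ul=25)
primer = per_primer_pm(pool_nm=200, n_primers=4300)
print(f"genome strands: {strands:.1f} pM")
print(f"each primer:    {primer:.1f} pM")
print(f"maximum fold amplification: {primer / strands:.0f}")
```

Each primer can be extended only once, so the primer to template ratio caps the amplification no matter how many cycles are run.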


How does one go about testing it? There will be smears both if it is a nonspecific mess and if it worked. For the final product, NGS may be an option to see amplification bias, but not for routine troubleshooting. Transforming specific rescue strains is too laborious. The best option is RT-PCR (real-time PCR) of some genes (spread of sizes).


The next step once a pool has been made is to get them in plasmid form.
RE cloning, blunt. Messy products, low yield. Non-directional. Not worth it.
RE cloning, added 5’ site. It would require subsets to avoid cutting the amplicon, which would be a pain. Also it means adding to the primer.
Gibson assembly. It requires adding to the primer. This means longer primers, which is okay-ish, but I have an unscientific hunch it may mess up the multiplex primer.
TOPO clone. Directional TOPO is a good option, but has a drawback: there is no good plasmid. Also I am ignoring the 5’ primer protection possibility. The pET is not good as it requires T7 (cotransformation of the pTARA plasmid with a rescue screen has not been done). The pBAD and the pET have loads of crap at the end. The Aska collection is 10% non-functional due to the GFP tag. The thioredoxin and the hexahistidine tags may be cool, but they are in the way of a clean experiment. Adding stuff to the primers to override the crap is a no-go due to the 5’ tag in multiplex concern and is seriously inelegant for downstream processes. So the answer is TOPO ligation into a pENTR plasmid followed by Gateway into a custom plasmid, which would allow a brief amplification step. A custom TOPO would be ideal, but it costs more than the GeneStrings step itself. A good homebrew TOPO-plasmid has never been achieved. So, disappointingly, a two-step TOPO is the TOPO way (I’ll keep an eye out for new TOPO kits).

Test subject

P. ubique is not an option as it is too AT-rich and its genes cannot be expressed even with pRARE, even though this whole idea arose to tackle the pathway holes issue in P. ubique.
I want to try it for T. maritima to find if MetC is the sole racemase, but as there is the structural genomics consortium’s library, it makes it seem a tad redundant.
If I were to do this independently of my project, it would be easy to find a good candidate that:
  • has a solvable pathway hole for the field test of the library,
  • has a small genome,
  • has proteins that are expressible in E. coli and
  • has a 50% GC-content.
EDIT. After analysing the transcriptome of Geobacillus thermoglucosidasius and knowing how popular it is getting I would say a Geobacillus spp. is a perfect candidate.

Parenthesis: Name

For now, I’ll call it ORFome library by multiplex linear amplification. I have been calling it Aska-like library, although I’d like a better name. Yes, a name is a semantic handle, but there is nothing worse than a bad name. Delitto Perfetto is a cool method name, CAGE isn’t and I would rather keep away from backronyms. Portmanteaux are where it’s at, especially for European grants. Geneplex is already taken, but -plex means -fold anyway. Genomeplex? Mass ORF conjuration spell? I would love to coin a name, but I should hold off so as not to take the fun away from later on.

Sunday 8 November 2015

Diaminopurine in cyanophage S-2L DNA

There is a Nature paper from 1977 reporting a cyanobacterial phage, S-2L, that uses diaminopurine in its DNA instead of adenine. Diaminopurine has two amine groups as opposed to one, so it can form a third hydrogen bond and bind more tightly to thymine, thus changing the behaviour of its DNA and tricking its host.

The strange thing is that this cool discovery went nowhere even though it would have some really interesting applications. In fact, this paper was followed by two others, then a hiatus and then two papers in the 90s about melting temperature. No cloning, no nothing.

The phage was lost, but there is a patent that gives its sequence. The sequence is unhelpfully unannotated, but the patent, among the really boring bits, has claim 270, which says:
"On the other hand, it seems very likely that the D-base is formed by semi-replicative modification. Between the two biosynthesis routes of dDTP formation described above, the identification of a succinyladenylate synthetase gene homologue called ddbA (deoxyribodiaminopurine biosynthetic gene A) leads to the conclusion that it is the second route which is probably taken during phage infection (FIG. 2).
"Several tests have been carried out in order to determine the activity of the corresponding protein. The results suggest that the expression of ddbA allows restoration of the growth of a strain of E. coli expressing the yaaG gene of Bacillus subtilis [yaaG (now dgk) encodes deoxyguanosine kinase] in the presence of a high concentration of dG (10 mM). On the other hand, 2,6-diaminopurine becomes toxic (10 mM) to E. coli when it is in phosphorylated form (which has been tested in the same strain of E. coli expressing the yaaG gene of Bacillus subtilis i.e. MG1655 pSU yaaG) which makes it possible to have a screen in order to identify in vivo the complete biosynthesis route of the D-base."
Small refresher on the last steps of purine biosynthesis: the purine pathway makes inosine monophosphate (IMP, whose nucleobase is hypoxanthine, a purine with keto group on carbon 6), which is

  • either aminated to adenine in two steps via a N-succinyl intermediate by the enzymes encoded by purA and purB
  • or oxidised at position 2 (resulting in a keto substituted compound, xanthosine monophosphate) and transaminated to guanine.
Consequently, to make diaminopurine monophosphate the cyanophage would need to aminate position 6 of guanosine monophosphate (guanine has a ketone at 6 and an amine at 2).

The authors claim that there is a homologue of purA, called ddbA, whose enzyme can act on GMP, unlike the PurA enzyme.

However, they do not report a purB homologue or an adk homologue (encoding a kinase to make the triphosphate). This means that:

  • either Synechococcus (the host) purB and adk promiscuously do the reactions, which seems odd given that the phage would give them a negative selection,
  • or that an analogue eluded them while manually annotating the genome 1970s style.
E. coli AMP kinase is unable to accept diaminopurine monophosphate as the ligand is held via backbone atoms. One could play devil's advocate and say that diaminopurine might be a strategy that was tried a few times along the evolutionary tree, hence why E. coli may have evolved to prevent it...

Parenthetically, the people that filed the patent were not the people who found the virus and no paper was published about it even though the first author of the patent studies odd nucleobases.
One interesting thing from the genome is the presence of many polymerases and of DNA gyrase, which is due to the odd DNA.

So why did nothing come out of it? Is it real?

There are four facts worth noting:

  • Recently a paper came out in PNAS where they describe modified uridines in some phages. The concentration of modified uridine is a small percentage, which may well be the case with this cyanophage too.
  • Sanger or Illumina sequencing of a virus with diaminopurine will work just fine, unlike Nanopore sequencing, where the more charged base will have a shorter dwell time and result in a lot of failed reads.
  • There were many papers about oxygen-insensitive nitrogenases in the 90s and 00s from a single team. It took a consortium of authors to disprove it thoroughly a decade later. In that case, there was a lot of secrecy about the strains and the genome of the organism, whereas with cyanophage S-2L the sequence is on NCBI.
  • That the S-2L strain is lost is suspicious, but not implausible, especially given how hard it is to grow cyanobacteria.

In conclusion, it is a real shame research into this curious pathway was minimal and has not progressed as it would be a perfect toolkit for synthetic biology.

Saturday 7 November 2015

How shall I name my variables?

Python and naming conventions

Clarity and simplicity are part of the Zen underlying Python (PEP 20). Simplicity in a large system requires consistency and as a result there are various rules and guidelines. Overall, that and the large number of well documented libraries is what makes Python fun. Although, the idea that Python is good is reinforced by the quasi-cultist positivity of the community, especially by Python-only coders that are unaware of some really nice things other languages can do.

In fact, there are frustrating Pythonic things that pop up, some that are extremely granny-state, such as

  • lack of an auto-increment operator (++) because supposedly it is confusing (which it is not) and redundant (yet there are three different string format options)
  • chaining is not really possible with lists (filter, map, join aren't list methods) because it makes spaghetti code
  • pointers and referencing aren't a thing because they are confusing and dangerous
  • parallelisation is implemented poorly, relative to Matlab or Julia
  • JavaScript asynchronicity is confusing, but the Python asyncio excels
  • And several more
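To make two of those gripes concrete, here is a small sketch. The variable names are made up, and picking %-formatting, str.format() and f-strings as the three formatting options is my assumption about which ones are meant.

```python
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

# filter and map are built-ins and join is a str method, so the calls
# nest inside out instead of chaining left to right on the list.
nested = ", ".join(map(str, filter(lambda n: n % 2, numbers)))

# The idiomatic alternative is a generator expression.
comprehended = ", ".join(str(n) for n in numbers if n % 2)

# Three formatting styles that all build the same string.
name, count = "odd", 5
percent_style = "%d %s numbers" % (count, name)
format_style = "{} {} numbers".format(count, name)
f_string = f"{count} {name} numbers"

print(nested)
print(percent_style == format_style == f_string)
```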

But overall, it is very clean. One thing that is annoying is that the name styles are not consistent. There is a PEP that names the naming styles (PEP8), but it does not make good suggestions of when to use which. In my opinion this is a terrible shame as this is where stuff starts crumbling.
In brief the problem arises with joining words and three main solutions are seen:
  • lowercase
  • lowercase_with_underscore
  • CamelCase (or CapWords in PEP8)
There are many more, but there are memes out there listing them with way more pizzazz than I have.
The first is the nice case, but that never happens. I mean, if a piece of Python code more than ten lines long does not have a variable that holds something that cannot possibly be described in a word, it most likely should be rewritten in a more fun way with nested list comprehensions, some arcane trick from itertools and a lambda. So officially, joined_words_case is for all the variables and the CamelCase is for classes. Except... PEP8 states: "mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility", aka. they gave up.

Discrepancies and trends

That a class and a method are different seems obvious except in some cases where it becomes insane.
In the collections library defaultdict and namedtuple are in lowercase as they are factory methods and the standard types are lowercase, while OrderedDict is in CamelCase. Single word datatypes are equally inconsistent: Counter is in camel case, while deque is in lowercase. All main library datatypes are in lowercase, so it is odd that such a mix would arise, but the documentation blames how they were implemented in the C code. In the os library the function os.path.isdir() checks if a filepath (string) matches a directory, while in the iterator returned by os.scandir() the entries have is_dir() as a method, which is most likely a sloppy workaround to avoid masking. Outside of the standard library, the messiness continues. I constantly use biopython, but I never remember what is underscored and what is not and keep having to check cheatsheets to the detriment of simplicity.
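The mix is easy to see from the interpreter. A minimal sketch using only the standard library; the loop and the variable names are mine, the functions and classes are real.

```python
import collections
import os

# Five names from one module, two naming styles.
names = ("defaultdict", "namedtuple", "deque", "OrderedDict", "Counter")
for name in names:
    style = "lowercase" if name.islower() else "CapWords"
    print(f"collections.{name}: {style}")

# The same question spelled two ways: os.path.isdir() takes a path,
# while the entries yielded by os.scandir() have an is_dir() method.
here = os.getcwd()
print(os.path.isdir(here))
with os.scandir(here) as entries:
    subdirs = [entry.name for entry in entries if entry.is_dir()]
print(len(subdirs), "subdirectories")
```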
There are some trends in the conventions nevertheless. CamelCase is a C thing, while underscores are a Ruby thing: this probably makes me feel safer using someone's library or script that uses CamelCase. Someone wrote a paper and found CamelCase to be more reliable in terms of errors. Personally, I like lowercase all the way, no camels or underscores, and the standard Python library seems to be that way and it is really pleasant.
FULL_UPPERCASE variables are often global variables used as settings or from code written by capslock angry people (actually, if Whitespace language is a thing, why is there no capslock language?). Visual Basic is case insensitive and it makes my skin crawl when I look at its "If" and "For" statements.
Single letter variables are either math related or written by an amateur or someone who gave up towards the end —such as myself all the time— because no word came to mind to answer the question "How shall I name my variable?".

My two pence: inane word newfangling

The built-in methods of the mainspace and datatypes are all lowercase without underscores (e.g. open("file.txt").readline()), so there is consistency at the heart of it. Except that lowercase without underscores is not often recommended as it is the hardest to read of the three main ways, although it is the easiest to type and possibly remember, with the exception of when a word is a verb and it could have been in the present, past or present participle forms. Plus open("file.txt").read_line() is ugly and I feel really anti_underscoring.
German and many other languages are highly constructive and words and affixes can be added together. I have never encountered German code, but I would guess the author would have had no qualms in using underscoreless lowercase. The problem is that English is not overly constructive with words of English origin as most affixes are from Latin. The microbiology rule of -o- linker for Greek and -i- for Latin and nothing for English does not really work as Anglo-Latin hybrids look horrendous. Also using Greek or Latin words for certain modern concepts is a mission and, albeit fun, lacks clarity. The Anglish moot has some interesting ideas if someone wanted a word fully stemming from Old English and free of Latin. Nevertheless, I like the idea of solely lowercase and coining new words is so fun, except that it quickly becomes hard to read. Whereas traditionally, getting the Graeco-Latin equivalents and joining them was the chosen way, nowadays portmanteaux are really trendy. In the collections module, deque is a portmanteau and I personally like it more as a name than defaultdict. How about defaultionary?
As a Hungarian notation for those variables that are just insane, I have taken to adding "bag" as a suffix for lists and sets (e.g. genebag) and "dex" for dictionaries (e.g. genedex), which I have found rather satisfying and actually has helped (until I have to type reduced_metagenedexbagdex).
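A tiny illustration of the suffixes in use. The gene names are just sample data borrowed from the purine discussion elsewhere on this blog, and the variable names are invented for the example.

```python
# "bag" marks a list or set, "dex" marks a dictionary.
genebag = ["purA", "purB", "adk"]
genedex = {
    "purA": "adenylosuccinate synthetase",
    "purB": "adenylosuccinate lyase",
}

# The suffix tells at a glance which name can be looked up by key.
annotatedbag = [gene for gene in genebag if gene in genedex]
print(annotatedbag)
```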

Hungarian tangent

That leads me to a tangent, the Hungarian notation. I wrote in Perl for years, so the sigil notations for an object's type left a mark. Writing st_ for string and other forms of Hungarian notation would just be painful and wasteful in Python, but minor things can be done, such as lists as plural nouns and functions as verbs. Except it seems to go awry so quickly!
Lists, sets and dictionaries. Obviously, a collection should not be named in the singular as that leads to painful results, but I must admit I have done so myself too many times. Collective nouns are a curious case as they solve that problem and read poetically (for sheep in flock), but there are not that many cases where that happens.
Methods. An obvious solution for methods is to have a verb. However, this clearly turns out to be a minefield. If you take the base form, many will also be nouns. If you take the present participle (-ing) the code will be horrendous. If you take the agent noun (-er, -ant), you end up with the silliest names that sound like an American submarine (e.g. the USS listmaker).
Metal notation. The true reason why I have opened this tangent is to mention metal notation. If one has deadkeys configured (default on a Mac) typing accents is easy. This made me think of the most brutal form of notation: the mëtäl notation. Namely, use as many umlauts as possible. I hope the Ikea servers use this. Although I am not overly sure why anyone would opt for the mëtäl notation. In matlab there are a ridiculous number of functions with very sensible names that may be masked, so the mëtäl notation would be perfect, except for the detail that matlab does not like unicode in its variables. One day I will figure out a use…
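For what it is worth, Python 3 does accept non-ASCII letters in identifiers (PEP 3131), so the mëtäl notation runs as is. A minimal sketch with made-up names:

```python
# Umlauted identifiers are valid Python 3 names.
mëtäl = "brutal"


def sörtëd(items):
    """A metal-notation wrapper that cannot mask the built-in sorted()."""
    return sorted(items)


print(mëtäl, sörtëd([3, 1, 2]))
```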
Nevertheless, even though Hungarian notation is somewhat useful, Python seems to survive without it: I personally think that most of the time issues happen with instances of some weirdo class and not with a standard datatype anyway. So there is no need to go crazy with these, it is just fun.


Nevertheless, even if there were a few exceptions, it is my opinion that a centralised Pythonic ruleset would have been better. The system that I would favo(u)r is compulsory lowercase, as is seen for the built-in names (parenthetically, American spelling is a given, it did not take me long to spell colour "color" and grey "gray"). The reason why lowercase is disfavoured is because it is hard to read when the words are long. In my opinion variable names should not be long in the first place. One way around this is making a sensible portmanteau or a properly coined word and just refraining from overly descriptive variables. At the end of the day, arguments of legibility at the cost of consistent and therefore easy usage make no sense. A name like defaultdictionary takes a fraction of a second more to read, but looking up how a word is written can take minutes.