The art of blowing up protein: How shall I name my variables?

Python and naming conventions

Clarity and simplicity are part of the Zen underlying Python (PEP 20). Simplicity in a large system requires consistency and as a result there are various rules and guidelines. Overall, that and the large number of well documented libraries is what makes Python fun. Although, the idea that Python is good is reinforced by the quasi-cultist positivity of the community, especially by Python-only coders that are unaware of some really nice things other languages can do.

In fact, there are frustrating Pythonic things that pop up, some that are extremely granny-state, such as

lack of autoincrementor because supposedly it is confusing (which it is not) and redundant (yet there are three different string format options)
chaining is not that possible with lists (filter, map, join aren't list methods) because it makes spaghetti code
pointers and referencing isn't a thing because it's confusing and dangerous
parallelisation is implemented poorly, relatively to Matlab or Julia
JavaScript asynchronicity is confusing, but the Python asyncio excels
And several more

But overall, it is very clean. One thing that is annoying is that the name styles are not consistent. There is a PEP, that names the naming styles (PEP8), but does not make good suggestions of when to use which. In my opinion this is a terrible shame as this is where stuff starts crumbling.
In brief the problem arises with joining words and three main solutions are seen:

lowercase
lowercase_with_underscore
CamelCase (or CapWords in PEP8)

There are many more, but meme are out there listing with way more pizzazz than I have.

The first is the nice case, but that never happens. I mean, if a piece of python code more than ten lines long does not have a variable that holds something that cannot possibly be described in a word, it most likely should be rewritten in a more fun way with nested list comprehensions, some arcane trick from itertools and a lambda. So officially, joined_words_case is for all the variables and the CamelCase is for classes. Except... PEP8 states: "mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility", aka. they gave up.

Discrepancies and trends

That a class and a method are different seems obvious except in some cases where it becomes insane.
In the collections library defaultdictionary and namedtuple are in lowercase as they are factory methods and the standard types are lowercase, while OrderedDictionary, is in CamelCase. Single word datatypes are equally inconsistent: Counter is in camel case, while deque is in lowercase. All main library datatypes are in lowercase, so it is odd that such a mix would arise, but the documentation blames how the were implemented in the C code. In the os library the method isdir() checks if a filepath (string) matches a directory, while in the generator returned by scandir() the entries have is_dir() as a method, which is most likely a sloppy workaround to avoid masking. Outside of the standard library, the messiness continues. I constantly use biopython, but I never remember what is underscored and what is not and keep having to check cheatsheets to the detriment of simplicity.
There are some trends in the conventions nevertheless. CamelCase is a C thing, while underscores is a Ruby thing: this probably makes me feel more safe using someone's library or script that uses CamelCase. Someone wrote a paper and found CamelCase to be more reliable in terms errors. Personally, I like lowercase all the way, no camels or underscores and the standard Python library seems to be that way and it is really pleasant.
FULL_UPPERCASE variables are often global variables used as settings or from code written by capslock angry people —actually, if Whitespace language is a thing, why is there no capslocks language? Visual Basic is case insensitive and it makes my skill crawl when I look at its "If" and "For" statements.
Single letter variables are either math related or written by an amateur or someone who gave up towards the end —such as myself all the time— because no word came to mind to answer the question "How shall I name my variable?".

My two pence: inane word newfangling

The built-in methods of the mainspace and datatypes all are lowercase without underscores (e.g. open("file.txt").readline()), so there is consistency at the heart of it. Except that lowercase without underscores is not often recommended as it is the hardest to read of the three main ways —it is the easiest to type and possibly remember. With the except of when a word is a verb and it could have been in the present, past or present participle forms. Plus open("file.txt").read_line() is ugly and I feel really anti_underscoring.
German and many other languages are highly constructive and words and affixes can be added together. I have never encountered German code, but I would guess the author would have had no qualms in using underscorless lowercase. The problem is that English in not overly constructive with words of English origin as most affixes are from Latin. The microbiology rule of -o- linker for Greek and -i- for Latin and nothing for English does not really work as Anglo-Latin hybrids look horrendous. Also using Greek or Latin words for certain modern concepts is a mission and, albeit fun lacks clarity. The Anglish moot has some interesting ideas if someone wanted a word fully stemming from Old English and free of Latin. Nevertheless, I like the idea of solely lowercase and coining new words is so fun —except that it quickly becomes hard to read. Whereas traditionally, getting the Graeco-Latin equivalents and joining them was the chosen way, nowadays portmanteaux are really trendy. In the collections module, deque is a portmanteau and I personally like it more as a name than defaultdictionary —How about defaultionary?
As a Hungarian notation for those variables that are just insane, I have taken to adding "bag" as a suffix for lists and sets (e.g. genebag) and "dex" for dictionaries (e.g. genedex), which I have found rather satisfying and actually has helped (until I have to type reduced_metagenedexbagdex).

Hungarian tangent

That leads me to a tangent, the hungarian notation. I wrote in Perl for years, so the sigil notations for an object's type left a mark. Writing st_ for string and other forms Hungarian notation would just be painful and wasteful in Python, but minor things can be done, such as lists as plural nouns and functions as verbs. Except it seems to go awry so quickly!
Lists, sets and dictionaries. Obviously, the elements should not be the singulars as that results in painful results, but I must admit I have done so myself too many times. Collective nouns are a curious case as it solves that problem and reads poetically (for sheep in flock), but there are not that many cases that happens.
Methods. An obvious solution for methods is to have a verb. However, this clearly turns out to be a minefield. If you take the base form, many will also be nouns. If you take the present participle (-ing) the code will be horrendous. If you take the agent noun (-er, -ant), you end up with the silliest names that sound like an American submarine (e.g. the USS listmaker).
Metal notation. The true reason why I have opened this tangent is to mention metal notation. If one has deadkeys configured (default on a Mac) typing accents is easy. This made me think of the most brutal form of notation: the mëtäl notation. Namely, use as many umlauts as possible. I hope the Ikea servers use this. Although I am not overly sure why anyone would opt for the mëtäl notation. In matlab there are a ridiculous number of functions with very sensible names that may be masked, so the mëtäl notation would be perfect, except for the detail that matlab does not like unicode in its variables. One day I will figure out a use…
Nevertheless, even though Hungarian notation is somewhat useful, Python seems to survive without it: I personally think that most of the time when issues happen is with instances of some weirdo class and not a standard datatype anyway. So there is no need to go crazy with these, it is just fun.

Conclusion

Nevertheless, even if there were a few exceptions, it is my opinion that a centralised Pythonic ruleset would have been better. The system that I would favo(u)r is compulsory lowercase, as is seen for the built-in names — parenthetically, American spelling is a given, it did not take me long to spell colour "color" and grey "gray". The reason why lowercase is disfavoured is because it is hard to read when the words are long. In my opinion variables names should not be long in the first place. One way around this is making a sensible portmanteau or a properly coined word and just restraining from overly descriptive variables. At the end of the day, arguments of legibility at the cost of consistent and therefore easy usage makes no sense. defaultdictionary takes a fractions of a second more to read, but looking up how a word is written takes even minutes.

The art of blowing up protein

Pages

Saturday, 7 November 2015

How shall I name my variables?

Python and naming conventions

Discrepancies and trends

My two pence: inane word newfangling

Hungarian tangent

Conclusion

2 comments: