Thursday 9 July 2015

Thesis wordle

EDIT: The correct term is word cloud; Wordle is a website that runs on Flash and no longer works in most browsers. I would recommend Tagul instead. Here is the Wordle for my thesis:


I studied the enzyme MetC from Thermotoga maritima, which had alanine racemising activity in addition to a β-eliminating activity. I also studied the enzymes from Wolbachia, Pelagibacter ubique and E. coli. So far so good, all those words appear.
Unfortunately, Wordle splits words at non-breaking spaces, underscores and interpuncts, but not at hyphens. It removes common words, but it does not collapse grammatical number, hence the enzymes/enzyme, genes/gene and activities/activity. In this version, hyphens were added and common plurals collapsed.
My appendix features a large Perl script, which affects the Wordle: else, elsif, foreach, print, file, sub and the name of a function (input). I assume "if" and "for" are weeded out by the filter against common words. YP and NP also feature, as there are many GenBank accession identifiers.
If I remove the filter of common words, "the" is so prevalent that it squashes everything.

Now I want the raw data. So I pasted the thesis into an online word counter, saved the output, imported it into MATLAB and plotted the power-law distribution.
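The online word counter could in principle be replaced by a few lines of MATLAB. The following is only a minimal sketch, not what was actually used for the plot: the sample text is made up, and the variable names x (counts sorted by rank) and words are chosen to line up with the scripts in the appendix, which is an assumption about how those variables were built.

```matlab
% Hypothetical sample text standing in for the thesis (an assumption).
txt = 'the enzyme MetC and the enzymes from the genes of the gene';

% Split on anything that is not a letter and ignore case.
tokens = lower(regexp(txt, '[A-Za-z]+', 'match'));

% Count the occurrences of each unique word.
[uniqueWords, ~, idx] = unique(tokens);
counts = accumarray(idx(:), 1);

% Sort by decreasing count, so that the position in x is the rank.
[x, order] = sort(counts, 'descend');
words = uniqueWords(order);
```

Like Wordle, this does not collapse plurals, so enzyme and enzymes are counted as separate words.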




Not much else can be done with the data because many of the single-appearance words are actually sequences and there is no way to cluster the words by meaning. All the possible analyses will give generic results (e.g. more frequent words are shorter). If one were to go overboard, one option would be to track the progress of certain key words across the text: a Twitter trending equivalent. The problem is that I already know which words will appear more frequently where. So there isn't much point, and it is probably best to stop at the Wordle step, which shows what one already knows, but is pretty.

Appendix, Scripts:
%first graph
loglog(x,'.');
ylabel('Counts');
xlabel('Rank of the unique word');
title('Word frequency in Matteo''s thesis');
offset= 300;
text(7,x(7)+offset,'MetC');
hold on;
plot(7,x(7),'r.','Markersize',20)
text(17,x(17)+offset,'enzyme')
plot(17,x(17),'r.','Markersize',20)
hold off;

%second graph (not shown, frequent words are shorter).
x(isnan(x))=1;
xprime=unique(x);
z=cellfun(@length,words);
zprime=[];
yprime=[];
for i=1:numel(xprime)
    zprime=[zprime mean(z(x==xprime(i)))];
    yprime=[yprime numel(z(x==xprime(i)))];
end
plot(xprime,zprime,'ob');
hold on;
plot(xprime(yprime>1),zprime(yprime>1),'or');
hold off;
xlabel('number of counts of word');

ylabel('average number of letters for words of that frequency')

Sunday 5 July 2015

Acknowledgement infographic

To break from tradition and avoid an "Oscars acceptance speech", in my thesis I made an acknowledgement infographic. It was for fun, to acknowledge the folk who helped me keep my sanity during my PhD by procrastinating, and to give weighted credit to people, including Wayne's frowning.




Saturday 4 July 2015

Getting the corresponding nucleotide sequences of protein sequences

Getting stuck on a really simple task is not a nice feeling.
One challenge I have faced a few times so far is getting the nucleotide sequence corresponding to a protein sequence: this seems really straightforward, yet it is not.