Reproducible research

A supplementary plot which is not part of the main manuscript

The statistical analyses for our recent paper on the bacterial communities of biological soil crusts in the Kalahari can be fully examined and reproduced via code and data I have deposited on GitHub.

The analysis relies heavily on the phyloseq package for R. The phyloseq paper and associated documentation were really an inspiration, and provided example code which helped make this possible, because the authors also did the same and demonstrated how it can be done. There are many advantages to this approach which I think are quite obvious, but there are a few difficult steps to make it happen. 

As this paper was not going to be open access as I had hoped, I was unsure of including the code as supplementary materials in case the journal copyright of the paper might apply to the code too. I am not sure if this was a justified concern or not but in the end I decided to host the code in a git repository, which has the benefit of being a living record which can be added to after the paper is published. The disadvantage is that there is no link from the paper to the code, and that is partly why I’m writing this article to explain. This was the first difficult step – making decisions about something you normally wouldn’t bother with at all when writing a paper.

The next difficulty then was to learn how to use git which I found not entirely obvious, however right from the outset of not understanding it I never lost any work. The major breakthrough enabling me to use git effectively was to find a suitable client – SourceTree. After that it gradually became clear and now I am doing all my development with version control via git. I hope to publish code for most of my future research in this way, and maybe for some older projects too.

Is it worth the added hassle?

From my point of view it is definitely worth it because my research contributes much more if it saves others from having to develop similar code again. I have benefited greatly from the code put out by others so I’m happy to contribute my own code even if I am a bit worried about it’s programmatic elegance, as McMurdie and Holmes (2013) put it . Furthermore the complete exposure of methods allows checking by peers (ideally in the review process admittedly), and it therefore also encourages working to a higher standard.

Until journals start demanding this kind of openness though, there is not very much incentive for people to adopt this approach. Since the advent of online journals and supplementary materials I don’t see why it isn’t a standard expectation like submitting your DNA sequences to a public database. At the moment publishing your code is a lot of extra work that potentially exposes errors – good from the scientific point of view but on a personal productivity level not all that attractive.


Comments are closed.