In Defense of Open Source Software
Over at Small Pond Science, Terry McGlynn, Amy Parachnowitsch, and Catherine Scott regularly post informative and useful blog entries about “how scientists research, teach and mentor in all kinds of academic institutions, including teaching-centered universities”. I’m a fan, and for the most part a lurker, which means that I mostly read and rarely contribute. One of the main reasons I lurk is that I want to keep myself out of trouble. I find it hard to convey tone on less formal, more conversational online platforms like Twitter and comments sections. Sometimes what I intend as good-natured chiding can cause offense 1. I also forget that there is an asymmetry of experience between someone who engages actively and a lurker: lurkers can feel like they “know” someone because they have been reading public posts from an author or a site for a long time. From a site author’s perspective, however, a comment from an infrequent poster carries none of that history and familiarity.
This all serves as a preamble for an expansion of my thoughts on a Small Pond Science post. In a recent comments section there, I stopped lurking and made a snarky and ill-considered analogy criticizing the title of the post: Open source software doesn’t necessarily mean we’ll have better stats. I’ll spare myself the embarrassment of repeating the analogy here (you can find the comment section here), but I was encouraged to take all the space I needed to explain my position. I’ve thought about it for a bit and decided that, in addition to apologizing for the tone of my comment, I’ll explain my position in this blog entry.
I simply disagree with the premise that any open source statistical package promises better stats. Rather, open source statistical packages promise the same stats, just more accessible to users regardless of the size (or existence) of their bank accounts. This Wikipedia page on statistical packages places software into four categories: open source, public domain, freeware, and proprietary. For the most part, you’d be hard-pressed to find a specific analysis that you can conduct in one package but could not recreate in another. The main story of the Small Pond Science post compares the experience of running a statistical analysis on a dataset with the open source package “R” and with the proprietary package “JMP”. In recent years and in some biological subfields, R has become the de facto standard for statistical analysis. This has been difficult for me because, for the most part, I received my statistical training on an older, proprietary package (in fact, I still have the tattered lab instructions for my first zoology lab in the late ’90s, in which we used JMP to compare body lengths of soldier beetles!). The R user experience can still improve, but on the whole, this change from the previous status quo is a marked improvement. Here’s a short list of reasons why:
That’s the end of my list. The cost of JMP is prohibitive and, as a consequence, shuts a huge section of the scientific community out of participating. I work at a small regional university that caters mainly to commuter, first-generation students. Our students sometimes have trouble purchasing our textbooks, which cost an order of magnitude less than a single JMP license. In a sense, then, open source is “better” because it allows for greater inclusion and more participation.
At its core, though, the post Open source software doesn’t necessarily mean we’ll have better stats wasn’t about the actual statistical analyses that R and JMP produce, or about the user base that each program serves because of its cost. Rather, the post was a valid critique of the difference in user experience between JMP and R. I fully agree that the learning curve for R is steep and can lead to user error. In grad school, I ran into issues with collaborators using R that were eventually fixed by ensuring that we had both downloaded the most up-to-date version of the package (i.e., a “clean” install). An overly customizable product can present real problems for collaboration, troubleshooting, and teaching. In my opinion, user-experience issues are endemic to neither open-source software nor statistical software (in fact, I have had greater frustrations with bioinformatics software on the whole than with statistical software). My good friend Katie Hinde has made a similar critique of open access with one of my absolute favorite phrases: it’s not open access to everyone if it’s hidden behind a paywall of jargon. While this comment was made specifically about publications, I think it also applies to software.
To recap, I think the R vs JMP themed post on Small Pond Science was, like nearly all the posts there, informative and useful. I disagreed with the choice of title and the framing of the critique as an open access issue. In my opinion, user experience problems affect many software packages, both open source and proprietary. Fortunately, as R has become more and more common, a terrific community of folks who are willing to help out has grown up around it. Many use the #rstats tag on Twitter.
- It bears repeating that it is the offense that matters, not a speaker’s intent.
With a .edu address, a six-month JMP license costs $30, which is less than the cost of the alcohol that R users will be purchasing after trying to get bug-free code to work.