Why on Earth are copyleft software licenses bad for scientific software?
My quest for this blog post started with a discussion about the choice of license for the software under the Open Differential Privacy (OpenDP) Initiative. I wanted to understand why copyleft licenses are taboo in scientific circles. OpenDP's choice of the MIT license prompted me to ask why something like the AGPL wasn't used. I got this as an answer:
This answer prompted a feeling I have been harboring for a while — no one explains what exactly is wrong with the copyleft licenses and why they restrict adoption and participation.
I, like countless others, have worked in scientific fields as a software programmer. We, the research software engineers, now even have our own organizations! In a lot of ways, the scientific programming community is a bit behind the commercial software world. This is certainly true for best practices, tools, etc., but it is also true for software licensing. In this post, you will get a glimpse of the most prominent (or most prominently searchable) software licensing discussions in the scientific community.
For me, being in the sciences brings the benefit of learning something new while helping support scientific discovery along the way. I believe that copyleft licenses are truly aligned with academic and scientific pursuits. I understand that commercial companies and the folks who work there are vocal about not using copyleft licenses, but I believe that for scientific knowledge to thrive, we need to use more copyleft licenses.
Also, this blog post focuses on non-military software: the kind of software that is not under an embargo or behind security curtains.
Types of scientific software
My experience mostly comes from genomics and adjacent areas. It is important to note that in this area, the privacy of individuals is critical and so is their data.
Software accompanying research paper(s)
This is the most common kind of software associated with scientific work and is generally published somewhere. It can be a library, a framework, or even a hosted service with its source code hosted somewhere else. Most of the time, these are repositories sitting on GitHub that have not been touched since the paper was published. Some even have open pull requests and issues with no response from the owner of the repository. But this rant is for another day.
You may think of this as a basic proof-of-concept that is not to be used in any production level work.
Software accompanied by research paper(s)
These are the tools and frameworks that help run scientific research projects. Software published by NCBI or NASA that helps researchers in their respective fields run computations or simulations could be seen as this type of software. Or you can think of tools like NumPy, SciPy, etc. My current work project, CanDIG, is of a similar kind.
Commentary on licensing in the scientific community
The most common papers and articles that I could find are from Prof. Stodden, who has published more about software licensing than anyone else in academia that I could find. Some quotes from the slides of a talk by Prof. Stodden:
The scientific ethos precludes directing another scientist’s creative contribution.
Copyleft licenses make demands on downstream code, namely that they use the upstream license on the entire library of new code. Two codes under two different copyleft licenses, therefore, cannot be mixed, as code cannot carry two licenses.
Science creates knowledge, our best estimate of the truth - a public good. People are free to use and build on public goods however they see fit.
Copyleft creates a barrier the transmission of knowledge and scientific progress, that is not compensated by other benefits.
Copyleft inhibits or prevents collaboration with industrial partners.
Doesn't the scientific ethos also encourage "communism" and "disinterestedness"? I fail to understand how forcing companies to disclose their source code goes against the scientific ethos. Commercial companies have been using, and continue to use, software with copyleft licenses. The following statement by Prof. Stodden actually strengthens my belief that copyleft licenses matter even more in scientific software.
Society’s right to knowledge is an unconditional right.
The knowledge, in part, lives in the software we write to facilitate the studies. A couple of questions before we move forward:
- How is it progress if someone takes your work, builds upon it, but does not give back? That is the argument being made by the commercial side.
- How is it that I am reading the same sentence and coming to the opposite conclusion? :)
It has been hard to get answers apart from just statements like "copyleft licenses are not permissive...". So, I searched for more...
A 2012 paper in PLOS Computational Biology by Morin et al. is a good read; the authors suggest what you should choose and why. A blog post on Astronomy Computing Today with the same title, referring to this paper, summarizes it nicely:
If you want…
…the widest possible distribution and adoption, fewest restrictions on users, open and transparent source code, peer review, community contributions to the codebase, and easy incorporation of your code by others… then a permissive FOSS license such as the BSD/MIT, Apache, or ECL licenses may work well
…to assure the benefits and openness of FOSS in all future derivatives of your work, open and transparent source code, peer review, community contributions to the codebase, and the potential incorporation of your code into other copyleft-licensed works… then you should consider a copyleft FOSS license like the GPL, LGPL, or MPL.
…the ability to separately pursue proprietary models while leveraging the wide distribution, adoption, community contributions, and other benefits of open source software… then a hybrid or multi-license scheme may be appropriate.
…protect the confidentiality of your source code, reserve maximum control over the distribution and use of your software, and derive licensing revenue… then you should consider a proprietary license.
Now we have a few pointers on when to choose which license. According to the summary above, a copyleft license makes sure that future derivatives are open as well. As I noted above, in my experience the vast majority of repositories are never touched again after the research paper is published (this is purely anecdotal). There is evidence that wide adoption can still happen, even with copyleft licenses, if the published software has value.
A blog post from Jake VanderPlas on "The Whys and Hows of Licensing Scientific Code" is also a good starting point for scientists to figure out what they want to do. The summary in the post is:
- Always license your code. Unlicensed code is closed code, so any open license is better than none (but see #2).
- Always use a GPL-compatible license. GPL-compatible licenses ensure broad compatibility for your code, and include GPL, new BSD, MIT, and others (but see #3).
- Always use a permissive, BSD-style license. A permissive license such as new BSD or MIT is preferable to a copyleft license such as GPL or LGPL.
This aligns with Prof. Stodden's views. One of the amazing things about the post is the incredibly insightful discussion in its comments. I will reproduce some of those comments here:
Jeremy Sanders writes:
If I’m publicly releasing code, I’m not writing it so that it can be taken by a for-profit company and essentially close-sourced. The GPL is philosophically much closer to academic freedom, where any modified versions distributed, also have the source distributed.
To which, Jake responds:
If I’m publicly releasing code, I hope it’s useful enough that a library like scipy, scikit-learn, or matplotlib will want to incorporate it. In that case, GPL is a barrier to its usefulness. Remember that in academia, it’s not the text of a license that protects your ideas, but the norms of the academic community.
Jeremy replies:
Jake: that’s all very well for code that is purely academic and not interesting to commercial users. If you code is at all commercially viable, the GPL is a much better prospect. If your code is useful, people will use it regardless of the choice of free license. People even use black boxes like sm and galfit.
Jake says that the success of projects like Matplotlib, NumPy, SciPy, pandas, scikit-learn, and IPython is "much more than a simple coincidence". I think this alludes to the fact that they use permissive licenses. This is where it gets even more interesting, because Matthew Turk, who moved his yt library to BSD and wrote about it in "Relicensing yt from GPLv3 to BSD", chimes in:
Hi Jeremy, I agree. I was really quite torn about re-licensing yt, as I am personally an enormous fan of the GPL and the copyleft. I am, in fact, still rather torn about this for the reasons you wrote here — I am committed to ensuring that the code remains free, and that future versions of it are inspectable at the source level. BSD licensing does not enforce that freedom, but we as a community do — yt itself, even if a nefarious corporation takes it and improves it and sells it, will be free software, respecting all four freedoms. ...
I guess my main point is, I think you’re right, and I think the issue is a subtle one, but I also think that in the end the choice that we made as the yt community was the right one for us at the time. But that doesn’t keep me from being a little sad that the GPL is viewed as such a pariah, as in essence I firmly believe that the de facto standards of FLOSS in science will be GPL-like, with an active curation and preservation of the four freedoms, even if BSD in actual license.
This comment is critical to understanding that there are scientists who do believe that a GPL-like license is aligned with the scientific endeavor.
Moving along, if you scroll through the comments, there is a gem by Jeremy Sanders:
...as Linux shows that it’s certainly possible for commercial companies to thrive in a GPL ecosystem. I’ve had commercial companies contributing towards my GPL code with no licensing issues. If a company is scared off by the GPL, it’s likely they need better lawyers. The code being GPL protects their contributions from being swallowed up by competitors in rival closed source apps.
And a detailed reply by Mohammad Akhlaghi to the earlier comments about "bundling":
This is a follow up to Jake’s comment at March 14, 2014 at 10:09 am. Sorry for continuing this discussion after about 2 years. I just wanted to know exactly what part of GNU LGPL restricted the bundling? ... So if I have understood GNU LGPL correctly, you can easily bundle an GNU LGPL library into a BSD-licenced program without much effort. But then again, I am not a lawyer or have too much experience in the different software licenses yet, so I would be grateful if you could let me know if I have incorrectly understood the GNU LGPL.
And finally, the first paragraph of Don Barry's comment:
I must disagree strongly with your endorsement of permissive licenses and your arguments for them. What is nowhere here discussed is the enormous amount of corporate lobbying which has influenced the issue. Google, with the exception of work with the Linux kernel, a “too big to avoid” project, has actively sought to expunge the GPL from other efforts of theirs: Android contains only the kernel as GPL software. The fragmented Android ecosystem is one of many examples where this choice is hostile to the end user.
So far, the most common theme I see among scientists (and many programmers) is a leaning towards MIT- or BSD-style licenses: licenses that allow users to freely modify and distribute the code, with no obligation to share the source of their modifications or to give back. This is true even though some of them still love the GPL and its derivatives and are sad about not being able to use copyleft licenses. It would be worth looking into how much of the sentiment against copyleft licenses is due to industry pressure.
Chatter outside the scientific community
There is no dearth of commercial programmers hating on copyleft licenses. But some voices clarify why we should stick to a license that helps gain the modifications back. Look at what MongoDB writes in their post about licensing and their choice of AGPL:
To say this another way: if you modify the core database source code, the goal is that you have to contribute those modifications back to the community.
This is the second "famous" piece of software (after the Linux kernel) that adopted a copyleft license. I still fail to see why copyleft licenses carry a stigma, hence the title of this post. Copyleft has worked fine for some commercially successful software.
Moving on, David Wheeler has an essay on why GPL compatibility is important and how to achieve it. It also notes which licenses to avoid and lists software projects that consider compatibility with the GPL important. I think it is worth a read even though it is an old post.
In an old blog post (now unavailable, but archived on the Wayback Machine), Zed Shaw talked about why he was using the (A/L)GPL. However, in a later Reddit comment he says he was wrong and gives his reasons:
This article is old but basically everyone is right and I was wrong: I didn't make any money on Lamson and nobody contributed to it, no matter what license I used, and no matter where I hosted it. Pretty much, Lamson proved to me that all the reasons you're given for open source success (license, github, BDFL, etc.) are a load of horse shit. It's marketing dollars, propaganda, and random chance just like everything else. So now my stuff is BSD because I write books at http://learncodethehardway.org/ and there's not really an alternative license that works for code in books. That's all. No major revelations or drama, I just moved on to something better.
So, if Zed Shaw is right that the commercial success of a project is mostly marketing, money, and chance, then to me it makes sense to stick to a copyleft license if sharing and openness are something you even remotely care about. It seems to me that copyleft licenses are, in general, beneficial to scientific openness. I am not sure whether the emphasis on the "permissivity" of a license is mostly a cry from those who find it cumbersome to give back while raking in profits, or whether there is more to it. I would love to know. However, in science, we cannot let this dictate the ethos of sharing and openness.
Open and accessible is the final word
In recent years, the term "Open Science" has become popular in our community. One of the ways Wikipedia defines it is:
Open science is transparent and accessible knowledge that is shared and developed through collaborative networks.
It is a noble effort, and I am glad that there is awareness within the community about keeping information open and accessible. However, the question remains: why shouldn't science be "open" by default? Almost every researcher I know understands the power certain paid journals hold, since many research publications sit behind paywalls and convoluted logins. Many agree that these places hinder science in the end. You should be able to use previous research and build upon it, and the new work can in turn inform the previous work so that it can be improved even further. If this model works for other parts of scientific work, then why is scientific software any different?
Many thanks to the following people who reviewed this post:
Stodden, Victoria. “Why Copyleft Isn’t Right for Scientific Code.” https://web.stanford.edu/~vcs/talks/VictoriaStoddenIPSC2010.pdf.
VanderPlas, Jake. “The Whys and Hows of Licensing Scientific Code.” https://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/.
Shaw, Zed. Reddit comment. https://www.reddit.com/r/programming/comments/llw0s/zed_shaw_on_gpl/c2tt5e3/.
astrocompute. “A Quick Guide To Software Licensing for the Scientist Programmer.” Astronomy Computing Today (blog), February 25, 2014. https://astrocompute.wordpress.com/2014/02/25/a-quick-guide-to-software-licensing-for-the-scientist-programmer/.
Morin, Andrew, Jennifer Urban, and Piotr Sliz. “A Quick Guide to Software Licensing for the Scientist-Programmer.” PLOS Computational Biology 8, no. 7 (July 26, 2012): e1002598. https://doi.org/10.1371/journal.pcbi.1002598.
“Make Your Open Source Software GPL-Compatible. Or Else.” Accessed April 20, 2020. https://dwheeler.com/essays/gpl-compatible.html.
MongoDB. “The AGPL | MongoDB Blog.” Accessed May 10, 2020. https://www.mongodb.com/blog/post/the-agpl.
“Conditions on Distributing Ghostscript in a Commercial Context.” Accessed May 10, 2020. https://www.ghostscript.com/doc/current/Commprod.htm.
“Licenses & Standards | Open Source Initiative.” Accessed April 20, 2020. https://opensource.org/licenses.
astrocompute. “Licensing Your Code: GPL, BSD and Edvard Munch’s ‘The Scream.’” Astronomy Computing Today (blog), January 17, 2014. https://astrocompute.wordpress.com/2014/01/17/licensing-your-code-gpl-bsd-and-edvard-munchs-the-scream/.
“Neuroimaging in Python — NIPY Documentation.” Accessed April 20, 2020. http://nipy.sourceforge.net/nipy/stable/faq/johns_bsd_pitch.html.
“The GPL and License Compatibility.” Accessed April 20, 2020. https://producingoss.com/en/license-compatibility.html.
The yt Project Blog. “The Yt Project Blog » Post: Relicensing Yt from GPLv3 to BSD,” September 12, 2013. http://blog.yt-project.org/post/Relicensing/.
“Why I Don’t Use the GPL | Linux Journal.” Accessed April 20, 2020. https://www.linuxjournal.com/article/5935.
“Open Science.” In Wikipedia, May 12, 2020. https://en.wikipedia.org/w/index.php?title=Open_science&oldid=956239136.
“OpenDP.” Accessed May 20, 2020. https://privacytools.seas.harvard.edu/opendp.
“Society of Research Software Engineering.” Accessed May 20, 2020. https://society-rse.org/.
“CanDIG.” Accessed May 20, 2020. https://www.distributedgenomics.ca/.
“Mertonian Norms.” In Wikipedia, May 9, 2020. https://en.wikipedia.org/w/index.php?title=Mertonian_norms&oldid=955747690.
Shaw, Zed. “Why I (A/L)GPL.” July 13, 2009. Archived on the Wayback Machine. https://web.archive.org/web/20090831045815/http://zedshaw.com/blog/2009-07-13.html.