Author: Carlos Alberto Gómez Grajales
Any given day seems fit to change history. For instance, on May 5, 1976, Argentine former football player Juan Pablo Sorín was born. He played for many teams, such as River Plate, Cruzeiro, Paris Saint-Germain and Barcelona [1]. On that very same Wednesday in 1976, David Bowie held a concert at the Empire Pool in London, just a few years before it changed its name to Wembley Arena [2]. And, as a matter of fact, another event took place on that same day, May 5, 1976. It was an event that would trigger one of the most notable revolutions in statistics, one that we are still witnessing to this day.
It was an informal meeting of just five people, who gathered at Murray Hill, New Jersey [3]. They all worked at Bell Labs and gathered to discuss ideas for designing a system for statistical computing, or perhaps adopting an existing system for their research work. Ideally the system would be used by around 20 statisticians who worked at Bell Labs, or so the team thought. Back in the 1970s, computers were, of course, far different from what we have now and statistical systems were also still in a rudimentary phase. At that time, the statistical analysis group of Bell Labs used mostly a library of Fortran-based software, designed to produce simulations, data analysis and graphs, though these tools were too restricted for their research purposes. They needed something new, but it wasn’t until after a month of meetings that the team decided to go ahead and create a new system.
The ideas that flowed in those first meetings would become foundations for what statistical software would be. Many years before the open-source software movement took shape (the Open Source Initiative would not exist until the late 1990s), the team at Bell Labs considered that extensibility needed to be built into the software at a fundamental level, something quite unusual for statistical software at the time. It was the collaborative research environment the team embraced that allowed them to identify the many advantages of extensible, adaptable software.
Two of the original five people in the room became the leaders of the project: Rick Becker and John Chambers, who worked on developing the system for many months, with occasional input from other colleagues. By the end of that year, “The system”, as it was initially called, was ready in a preliminary state. Everyone noticed that “The system” wasn’t a particularly inspiring name, so many other suggestions were made, including Interactive SCS (ISCS), Statistical Computing System, and Statistical Analysis System, though that last one was already taken by the software team at SAS. Because none of the names was widely accepted, and because most of them had an S in them, the new software was simply called “S”. Besides, that also made the name consistent with that of another language developed at Bell Labs around that time: the “C” programming language.
At first, S was developed on a local computer system, to be used only by the initial research team, but interest in the system grew quickly, so S was later developed under a portable version of UNIX that could be used on other computers. That development would become S version 2, which was the first version ever to be commercially licensed. This also marked the first time the system left Bell Labs, reaching a number of researchers at other companies, as well as some universities. This was important because the system was now being tested by more professionals, who provided feedback and ideas to the core team. It was also notable because the list of new users included two remarkable statisticians at the University of Auckland in New Zealand.
Ross Ihaka and Robert Gentleman were among the first avid users of S outside of Bell Labs, during their time at the University of Auckland, where they gave lectures on statistics. They noticed, and often discussed, the fact that the then-current implementations of the S language, and many other systems at the time, were still hard for their college students to grasp [5]. Just like S, other analysis software available 25 years ago had been designed by researchers, so it was indeed hard to use. It was during a hallway conversation that the pair decided they wanted a technology better suited to their statistics students. At first, developing the language was a mere hobby for the professors, considering that neither of them had any deep computer science training. But starting in 1991, both Ihaka and Gentleman began working full time on developing their new software.
R owes a huge debt to its grandfather, the S computer language. There’s no better example than the name itself. As you can imagine, the name is partly a play on the name of the Bell Labs language, but it is also based on the first names of the two R authors: Robert Gentleman and Ross Ihaka. Besides that, R can actually be considered a different implementation of S [6]. In fact, the official R project website is not shy about discussing its roots in the S language [7]. The website itself describes R as simply another implementation, or “engine”, of the S language:
“We can regard S as a language with three current implementations or “engines”, the “old S engine” (S version 3; S-PLUS 3.x and 4.x), the “new S engine” (S version 4; S-PLUS 5.x and above), and R. Given this understanding, asking for “the differences between R and S” really amounts to asking for the specifics of the R implementation of the S language, i.e., the difference between the R and S engines.”[7]
The R system was originally designed using a language largely compatible with S. There were some important differences though, such as the use of a different evaluation model (one in which nested function definitions are lexically scoped) and some additional features borrowed from the Lisp/Scheme family of languages. Still, many commands written for S ran unaltered under R: drawing histograms, defining data.frame objects and many other basic operations were exactly the same.
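To make the scoping difference concrete, here is a minimal sketch of lexically scoped nested functions. It is written in Python rather than R, purely as an illustration: Python also resolves free variables in nested functions lexically, and the function names are borrowed from the well-known `cube` example in the R FAQ. It is not code from the original S or R sources.

```python
n = 3  # a global n, present only to show that it is NOT the one sq() picks up


def cube(n):
    """Return n cubed, using a nested helper that reads n from its enclosing scope."""

    def sq():
        # n is a free variable here. Under lexical scoping it resolves to the
        # parameter of the enclosing cube() call, not to the global n above.
        return n * n

    return n * sq()


print(cube(2))  # prints 8
```

Under lexical scoping the call yields 8, because `sq` sees the `n` of the enclosing `cube` call. According to the R FAQ, the S engines of the time would instead have looked for `n` among the global variables, so the same definition would either fail or silently use the global value.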
R was first introduced to the world in a 1996 paper, where Ihaka and Gentleman presented it as the combination of two existing computer languages: S and Scheme [8]. By combining the strong points of each of these languages, they created a language that looks like S on the surface but whose underlying implementation and semantics derive from Scheme. The first version of R was described as “rough” by some statisticians who saw the system back then but, despite its flaws, R grabbed the attention of academics and students alike because it followed a popular trend that developed alongside it during the 1990s: the open-source movement.
Open-source software is a term used to describe computer software whose source code is made openly available to its users. Its license grants the rights to study, change and distribute the software for any purpose. This allows a collaborative environment where many people around the globe can tweak, modify and improve the software in question. As a result, open-source software is often upgraded at a faster pace than many commercial alternatives. Open-source software is not at all limited to statistical systems; many other fields have also witnessed the rise of open-source programs: you can quite easily find graphics editors, e-mail clients, instant-messaging clients, speech recognition systems or tools for quantum chemistry among open-source software lists [9].
The open-source nature of R was a key component of its success. Any user can contribute “packages” that extend R’s capabilities. Right now you can find community-created packages for R that let you apply many modern statistical techniques, some of which are not available in commercial software. The extensibility of R has grown even beyond that. There are packages that allow you to create interactive visualizations, maps, or web-based applications, all within R. Ihaka himself attributed the accomplishments of R to its collaborative nature:
“R is a real demonstration of the power of collaboration, and I don’t think you could construct something like this any other way. We could have chosen to be commercial, and we would have sold five copies of the software.” [5]
After its initial development, Ihaka and Gentleman were joined by many other professionals, all volunteers, who contributed to the development of the system. The “R Development Core Team” is the group responsible for the development of the language: those who can modify the official R source code archive [7]. The core team handles the latest updates and developments of the software, although it has the aid of thousands of users who report bugs or glitches. The group includes figures such as Doug Bates, Thomas Lumley, Brian Ripley and John Chambers, one of the original creators of S, among other statisticians and data analysts. The team founded the R Foundation in 2003, a non-profit organization based in Austria, as a means to provide support for the R project and other developments in statistical computing [7].
The fact that Chambers changed “teams” by joining the R core development group does not mean that S is gone. Actually, the S language is still used to this day, with its last major redesign published in 1998. S was purchased in 2004 by the Insightful Corporation, which commercializes it in the S-Plus system. But S has long since been eclipsed by the rising popularity of its younger brother. Chambers himself declared that “R has grown and spread beyond anything the original authors of S are likely to have imagined” [3]. It is hard to estimate the number of R users in the world, but Revolution Analytics, creators of an enhanced distribution of R, claim the number surpasses 2 million [10].
And despite the fact that the software is free and open-source, it has gathered some prominent advocates. Among the avid users of R, we can count companies like Google, which uses the software to analyse trends in ad pricing, or Pfizer, where customized packages have been developed to aid in the study of non-clinical trials [5]. Microsoft has recently embraced R as well, using the software to fit forecasting models, tune fraud detection algorithms and improve Azure’s server reliability [11]. Even the Xbox brand is using R to improve matchmaking in online gaming. Microsoft views R not only as a business tool, but also as an investment. It acquired Revolution Analytics at the beginning of 2015, thus opening a window for a future distribution of Revolution R that uses both Microsoft’s cloud infrastructure and its database management suites. And why not? Maybe in a few years, Microsoft can tackle some of the most notorious disadvantages of R.
As you well know, even almost 20 years after R was introduced to the world, it is still far from perfect. R is notoriously hard to learn and rather complicated to use for some tasks, due to its intricate syntax. It really is hard to believe the system was initially devised as “easy-to-use” software. Although R can produce complex analyses that few other software packages can match, sometimes simple procedures, like tabulations, can become a cumbersome task. And even after you finally get an output, obtaining print-quality results or formatting them requires additional work. As such, some challengers to R have appeared, many of them aiming to fill the gap that R still has: simplicity. One such challenger is Python, a more sophisticated programming language with a more “natural” syntax [12]. Recently, some Python libraries have extended the analytic capabilities of the system, turning the language into a powerful statistical toolbox. But even for those who aren’t too thrilled with programming, there are some open-source, point-and-click options available [13]. Open-source software has become more prominent and reliable and, at least in our field, R is partially responsible for that. A few years back, “free” software would usually be associated with “bad” software, but R has dramatically changed that notion. Many of the free alternatives just discussed have a fighting chance of being noticed thanks to R.
R has, in some way, changed how we do statistics, by igniting a movement that has resulted in dozens of easily available analytical tools for researchers everywhere. More interesting, perhaps, is that the history of R parallels the collaborative nature of a statistician’s work. Collaboration was always a part of the statistician’s mindset, and both S and R drew from that practice to become some of the most important statistical tools worldwide. Created by statisticians for statisticians, the software can be regarded as an ambassador of the statistician’s collaborative working environment.
[1] Juan Pablo Sorin. Wikipedia – The Free Encyclopedia.
https://en.wikipedia.org/wiki/Juan_Pablo_Sorín
[2] Isolar – 1976 Tour. Wikipedia – The Free Encyclopedia.
https://en.wikipedia.org/wiki/Isolar_–_1976_Tour
[3] Chambers, John. Software for Data Analysis – Programming with R. Springer, 2008.
[4] S – Programming Language. Wikipedia – The Free Encyclopedia
https://en.wikipedia.org/wiki/S_(programming_language)
[5] Data Analysts Captivated by R’s Power. The New York Times (January, 2009)
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
[6] What is R? – The R project official website.
https://www.r-project.org/about.html
[7] R FAQ – The R project official website.
https://cran.r-project.org/doc/FAQ/R-FAQ.html
[8] Ihaka, Ross & Gentleman, Robert. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics, Volume 5, Number 3, Pages 299-314 (1996).
[9] List of free and open-source software packages. Wikipedia – The Free Encyclopedia.
https://en.wikipedia.org/wiki/List_of_free_and_open-source_software_packages
[10] Community – Revolution Analytics Website
http://www.revolutionanalytics.com/community
[11] R at Microsoft – Revolution Analytics Blog
http://blog.revolutionanalytics.com/2015/06/r-at-microsoft.html
[12] The Python programming language. Python Official Website
https://www.python.org
[13] PSPP. GNU PSPP Official Website
https://www.gnu.org/software/pspp/