Terran: Should you learn data science in R or Python?

Introduction

The two most popular open-source languages for data science are R and Python. I teach the same course in both languages, so I know their strengths and weaknesses, and I've laid them out for you in this document so you can decide which advantages are the most important to you.

Advantages to Python

Python is Deployable

Python is a complete programming language and environment, with libraries for common services such as network communication. It is feasible to run a server in Python that executes your model and returns results, although the performance may not be acceptable for some uses. R is a weaker general-purpose programming environment: its error handling (tryCatch and the condition system) is unfamiliar to most software engineers, and its library support for general-purpose computing tasks is much thinner. R models are often deployed indirectly: by exporting the specification in a format such as PMML and executing it with a different implementation, by using proprietary third-party tools such as those from Revolution Analytics/Microsoft, or by exporting the coefficients into a hand-coded production implementation.
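To make the deployment point concrete, here is a minimal sketch of a prediction server built only from Python's standard library. The coefficients, the feature names x1 and x2, and the helper names predict and serve are all hypothetical stand-ins for a real trained model; a production service would more likely use a web framework and load a fitted model from disk.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical coefficients standing in for a trained linear model.
COEFFICIENTS = {"intercept": 1.0, "x1": 2.0, "x2": -0.5}


def predict(features):
    """Return a linear prediction for a dict of feature values.

    Raises KeyError if a feature name is not in COEFFICIENTS.
    """
    return COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"x1": 3, "x2": 4}.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(features)}).encode("utf-8")
        # Send the prediction back as a JSON response.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve predictions on localhost until interrupted (blocking call)."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling serve() starts the server; a client can then POST a JSON object of feature values to it and receive the prediction as JSON.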

Python Supports Larger Data

The de facto standard numerical library for Python, NumPy, supports a variety of integer and floating-point data types, so you can have exactly what you need to fit your data; R, by contrast, supports only 32-bit integers and double-precision floats, which can be inflexible and wasteful.
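As a rough illustration of what the choice of data type buys you, the sketch below allocates the same one million values with different NumPy dtypes and compares their memory footprint. The array names are made up for the example.

```python
import numpy as np

n = 1_000_000

# Small non-negative integers (such as ages) fit in one byte each.
ages_uint8 = np.zeros(n, dtype=np.uint8)
ages_int64 = np.zeros(n, dtype=np.int64)

# Single precision halves the storage of double precision.
readings_f32 = np.zeros(n, dtype=np.float32)
readings_f64 = np.zeros(n, dtype=np.float64)

for name, arr in [
    ("uint8", ages_uint8),
    ("int64", ages_int64),
    ("float32", readings_f32),
    ("float64", readings_f64),
]:
    print(f"{name:8s} {arr.nbytes:>10,d} bytes")
```

The uint8 array takes one eighth of the memory of the int64 array, and one quarter of what a 32-bit integer vector of the same length would take.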

NumPy-based libraries have broad support for sparse matrices to save memory, and also integrate better with Spark, a common tool for distributed data processing in a cluster.
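A small sketch of the memory saving, assuming SciPy is installed (scipy.sparse is one of the NumPy-based sparse implementations). The matrix contents are arbitrary illustration values.

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix of doubles with only three non-zero entries.
dense = np.zeros((1000, 1000), dtype=np.float64)
dense[0, 0] = 1.0
dense[500, 250] = 2.0
dense[999, 999] = 3.0

# Compressed sparse row format stores only the non-zero values
# plus the index arrays needed to locate them.
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense:  {dense_bytes:,d} bytes")
print(f"sparse: {sparse_bytes:,d} bytes")
```

The saving depends entirely on how sparse the data is; for a mostly non-zero matrix the index arrays make the sparse form larger than the dense one.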

Both languages integrate equally well with SQL databases.

Python is the Standard for Neural Networks

The deep neural network community has unambiguously standardized on Python as its front end for invoking NN libraries (the libraries themselves are written in lower-level languages such as C++ for performance and often run on GPUs). Although R interfaces to some libraries such as Keras do exist, they should be viewed as second-class; the documentation is incomplete, and even the data themselves need to be converted between column-major and row-major ordering to move between R and Keras. Python is the only credible choice for deep neural networks.
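The ordering issue is easy to see in NumPy, which defaults to row-major (C) order, while R stores matrices in column-major (Fortran) order. The sketch below is only an illustration of the two layouts, not of any particular R-to-Keras bridge.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# The same matrix laid out in memory in the two orders.
row_major = m.ravel(order="C")  # rows are contiguous
col_major = m.ravel(order="F")  # columns are contiguous

# Interpreting column-major data as if it were row-major scrambles it.
wrong = col_major.reshape(2, 3)

# Declaring the correct order recovers the original matrix.
right = col_major.reshape(2, 3, order="F")

print(row_major)
print(col_major)
print(wrong)
print(right)
```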

Advantages to R

R has Better Graphics Tools

For data analysis graphics, you do not want to give commands at the level of "put a line here, with solid line type, color black. Now put a line here, in red." You want to give the plotting system a dataset, and say "map this column to the horizontal axis and this column to the vertical axis. Use color to indicate this attribute, and shape to indicate this other attribute." The difference in productivity is 10x, and this in turn supports the process of exploration and understanding which creates value.

R is a decade ahead of Python in this graphical sophistication. Sarkar published a book on his Lattice library in 2008, which was a watershed moment; Wickham followed shortly afterwards with ggplot, which has now become the de facto standard. Python users seem to have stuck with 1990s-era Matlab-inspired graphics for most of this time, and only recently discovered that the state of the art had moved on. In 2018, Python finally has two high level libraries which are approaching the level of functionality available in R; these are Altair (which still has significant bugs and limitations) and Plotnine (a clone of R's ggplot). Neither of them is a leading standard in the community, and neither has accompanying books.

R has Superior Books for Learning

R for Data Science (really about Data Analysis) by Wickham and Grolemund is pedagogically excellent; it starts with graphing some small but real datasets in the first chapter, and uses a desire to understand the data to motivate subsequent learning about new transformations. Everyone I know who has used it has found it to be accessible and engaging. Many other top-tier texts, such as those by Fox, Gelman and Hill, and Kruschke, also give their examples in R.

The two leading Python books, the Python Data Science Handbook by VanderPlas and Python for Data Analysis by McKinney, both start with hundreds of pages on data manipulation, mostly on contrived examples of integers and random numbers. They feel like a hard slog in API memorization. For graphics, they both show Matplotlib, which has an outmoded low-level API and should not be recommended to someone currently learning.

For statistical modeling, the leading beginner book is An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani. This book is so much better than the alternatives that the best choice for Python appears to be to use this same book, and work with community-contributed Python ports of the code labs.

R has More API Innovation

Many key Python data analysis concepts are inspired by or directly copied from R:

Pandas is based on R data frames.
One of the better graphics libraries in Python, plotnine, is a clone of ggplot.
Statsmodels copies R's formula-based modeling API.

Hadley Wickham, an author of several R libraries, popularized the “split/apply/combine” paradigm for more efficient and expressive data analysis, which has now become a standard way of approaching many analysis tasks.
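As a brief sketch of what split/apply/combine looks like in practice, the example below computes a per-group mean twice with pandas: once with the three steps spelled out, and once with the groupby idiom that performs them in a single expression. The sales table is invented for illustration.

```python
import pandas as pd

# Made-up illustration data.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [10.0, 20.0, 5.0, 15.0, 40.0],
})

# Split: one piece of the data per region.
pieces = {region: group["amount"] for region, group in sales.groupby("region")}

# Apply: compute a summary for each piece.
# Combine: collect the results into one structure.
manual = {region: amounts.mean() for region, amounts in pieces.items()}

# The same three steps as one expression.
mean_by_region = sales.groupby("region")["amount"].mean()

print(manual)
print(mean_by_region)
```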

The R (and previously S) community has been a major source of innovation in how modeling, statistics, and data analysis can be made more expressive and intuitive, with the Python community instead prioritizing more efficient implementations and support for larger data and distributed computation.

R has Higher Quality Mathematical Algorithms

The de facto standard library for predictive modeling and non-NN machine learning models in Python is scikit-learn. Many sklearn users report that it fails to converge or converges to an incorrect solution with a frequency that presents problems in practical use. R libraries rarely have mathematical correctness problems (they more often have brittleness problems with unexpected inputs).

Neutral Factors

These factors sometimes come up in discussion, so I mention them here. I do not consider them to favor either language.

IPython Notebooks vs R Markdown

Both IPython (now Jupyter) notebooks and R Markdown in RStudio accomplish the same objective, giving you an IDE for creating documents with integrated code and graphs. Some individuals will have personal preferences for one of them, but the two systems are equivalent in capabilities for most purposes.

Library Fragmentation

Both languages offer multiple implementations of algorithms. In R, libraries are written by numerous authors, but usually conform to common practices established long ago in S for the API to train and predict. In Python, there are fewer major libraries (sklearn and statsmodels being two big open-source ones), but they each have completely different interfaces. Some of my clients are reporting that sklearn models are inadequate for their needs due to correctness problems, and they are moving to other library vendors such as H2O, introducing further fragmentation.

Python is More Familiar

Software engineers may already be familiar with Python, but few have used R. However, for data science work, most of the use of Python is essentially as a domain-specific language for invoking modeling and analysis tools; even the data types and basic arithmetic are different in NumPy. Knowing the core language is occasionally helpful, but other times it encourages the user to do things "the way they already know" instead of learning a new, more appropriate idiom.
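A short sketch of how the same operators behave differently on core Python objects and on NumPy arrays; the variable names are only for illustration.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# "*" repeats a list but multiplies an array elementwise.
repeated = py_list * 2
doubled = np_array * 2

# "+" concatenates lists but adds arrays elementwise.
joined = py_list + [10, 20, 30]
summed = np_array + np.array([10, 20, 30])

# Python integers grow as needed; fixed-width NumPy integers wrap around.
big = 2 ** 64
small = np.array([200], dtype=np.uint8)
wrapped = small + small  # 400 does not fit in 8 bits

print(repeated, doubled)
print(joined, summed)
print(big, wrapped)
```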

Summary and Conclusions

To generalize broadly, Python is better at things that engineers care about, and R is better at things that scientists care about.

It's also important to consider what other people in your group or at your company use; alignment with your local environment may be a more important consideration than which tool would be ideal for your specific task.

If you are learning for the purpose of expanding your mental possibilities, you may get the most benefit by learning whichever is less comfortable to you: R for software engineers and Python for analysts. On the other hand, if you need to just learn the tools as quickly as possible, do the opposite.

If you ultimately intend to learn both R and Python, learn R first, because the tools are mature and are well integrated with the introductory books.
