Terran: Recommended Data Science Books

After you complete the basic learning track, you should be able to read advanced books on most topics and understand them. Here are some books that I recommend, grouped by what you would like to accomplish. If you have no idea what to do next, read either Applied Predictive Modeling or Introduction to Probability with R.

Books with an asterisk require a greater level of mathematical fluency than the others and may require you to review linear algebra and matrix calculus.

To Get Better at Exploration

Now You See it, Stephen Few

In this book, Stephen Few gives explicit guidance on what type of graphs to use to visualize different types of information. He starts by explaining the limits and capabilities of the visual system and memory, then applies this to the information you want to see. He also has another book specifically about making dashboards if that’s what you need; there’s quite a bit of overlap, so I suggest one or the other.

Correspondence Analysis, Greenacre

Correspondence Analysis (CA) is a technique for visual interpretation of the row and column associations in count data. If you’ve done a two-way Chi-squared test, gotten the final result, and thought to yourself “there must be a better way of interpreting this beyond just significant/not”, then correspondence analysis may be for you. There are also extensions for mixed count and continuous data.

To Get Better at Statistics

Introductory Statistics with R, Peter Dalgaard

This book doesn’t teach much about statistics, but it does show how to execute statistical tests in R, assuming you already know what the statistical test does. It is the most comprehensive book I have found about what functionality is available in R for doing the statistics you might want to do.

Introduction to Probability with R, Baclawski

If you didn’t already read this as part of the main program, now would be an excellent time to do so. See the notes above for details. This book does have some code, but you won’t learn anything new about R from it; you’re reading it to learn the theory. This book is very accessible (aside from the errata, which you’ll definitely want to look up in advance) and I read it straight through and loved it.

Statistical Inference, Casella and Berger

This is a graduate-level stats book, which means it’s basically hundreds of pages of theorems and calculus. I’ll be completely honest — even I couldn’t stomach reading this one straight through. I use it as a reference for specific concepts when I need more detail. You can get the Indian paperback edition for a reasonable price.

To Get Better at Inference

Data Analysis using Regression and Multilevel/Hierarchical Models, Gelman and Hill

Actually about much more than just multilevel models; the first two parts are a review of regression and classification, good guidance on transformation, and a great discussion of causal inference from models and experiments which is not covered anywhere else on this list. You may want to read the first part of the book for this even if you’re not interested in multilevel models at all. If you do read all of this book, note that the windows-only BUGS tool they teach is now superseded by a tool called JAGS, with similar syntax; use JAGS when you do the examples.

Applied Regression… , John Fox

This is a full book solely on linear regression and related techniques. It goes into considerably more detail about aspects of the math, measures of influence, the hat-values, details of ANOVA, and so forth, which were mostly glossed over by the books that you read as part of the class. The 1st edition can be had used quite cheaply. Fox is the author of the “car” (Companion to Applied Regression) R package that you’ve used before, derived from this book.

Analysis of Categorical Data with R, Bilder and Loughlin

This book starts by explaining the problematic nature of confidence intervals on proportions, hinted at by Dalgaard, and then goes on to describe topics such as multinomial and Poisson regressions, overdispersion, and multiple observations per row, which are not covered by the books we used in the class.

Doing Bayesian Data Analysis, Kruschke

In Bayesian statistics, you explicitly express your “prior” belief and then update it with your data. These techniques are especially useful for smaller datasets with complex nested or otherwise unusual structures; they can also help with multiple comparisons. This book is the most useful and accessible introduction I have found.

*Bayesian Data Analysis, Gelman et al

This is a rigorous and more theoretical book on Bayesian statistics. It’s tough going at times and you had better be prepared to work through the calculus. I would suggest reading it only after another Bayesian book such as Kruschke or Gelman and Hill.

Mostly Harmless Econometrics, Angrist and Pischke

This is the most useful book I have found for practical causal inference when you cannot do a controlled experiment.

To Get Better at Prediction

Applied Predictive Modeling, Kuhn & Johnson

This book shows you a wider variety of regression and classification models, and also shows you (both conceptually and with a library) how to tune hyperparameters, which is required for most advanced model fitting. It also introduces the “caret” library, written by one of the authors, which provides a unified interface to multiple models.

*Elements of Statistical Learning, Hastie & Tibshirani (free from authors)

This is a more theoretical and mathematical book, with no code examples, from some of the same authors as Introduction to Statistical Learning. Recommended at least as a reference, even if you don’t read the whole thing.

Deep Learning, Goodfellow et al (free from authors)

This is an excellent introduction to the principles of deep neural networks, which is the only class of architecture that anybody cares about right now for problems with millions or more examples in the training set. Prerequisites include fairly strong linear algebra and some Bayesian statistics.

Deep Learning with (Python | R), François Chollet

After you’ve read the Goodfellow book on deep learning, this book will handhold you through applying common techniques using the Keras library. The DL community has standardized on Python, so read the Python version unless you are deeply committed to R.

To Learn More Ways of Framing Problems

Foundations of Statistical Natural Language Processing, Manning and Schütze

Great, readable book introducing basic principles for natural language work.

Statistical Analysis of Network Data with R, Kolaczyk et al

Graphs (the kind with vertices and edges, not the kind with bars and colors) are important.

Introductory Time Series with R, Cowpertwait and Metcalfe

Time series can be modeled with linear regression, but also have their own are important too. The first half of this book is a very accessible introduction, and the second half is a somewhat terse survey.

If you want to do more with time series beyond this book, I recommend Ruey Tsay’s books on Financial Time Series and Multivariate Time Series. Some people like Shumway and Stoffer, but I didn’t.

Data Mining, Aggarwal

A more advanced book which can sometimes be so terse that you can’t understand what he’s saying unless you already know the core ideas, but nonetheless a very useful reference. I especially like his taxonomy of techniques and data transformations.

Outlier Analysis, Aggarwal

If you plan to work with detecting abnormal data or anomalies in any way, this book is great.

To Get Better at Communication

Business Modeling and Data Mining, Dorian Pyle

This is not a technical book, it’s a business book. It talks about how to frame your problem in the business context, whether your data is valuable for the problem at hand, etc. It even has a flowchart at the back with specific steps to try when you have each kind of problem. It is most useful for when you are interfacing directly with businesspeople, as the primary or sole technology representative. Although there is a small section on data mining technology, you should not even read it, as it is both cursory and obsolete; the value in this book is elsewhere.

ggplot2, Wickham (available free but only if you build it from source)

This explains the theory behind how ggplot2 composes layers and aesthetics, and introduces more functionality than you saw in R for Data Science. It should be a pretty quick read and will help you up your game making graphs.

Data Science for Business, Provost and Fawcett

How to think about data science as a business asset; how to evaluate project proposals for correct alignment with objectives. Can help you communicate about the value of a project as opposed to the implementation of a project.

To Get Better at R

Art of R Programming, Matloff

This book explains all of the lower-level R programming which R for Data Science glossed over. Strongly recommended if you intend to continue to work in R. This book used to be a primary book in the class before r4ds was written.

Statistical Models in S, Chambers and Hastie

This book has more to say about what algorithms and statistics R actually implements for many of the functions in the base and stats libraries. Sometimes the online documentation is inadequate and refers you to this book.

Advanced R, Wickham(free from author)

If you want to learn how to write your own libraries in R, with classes or nonstandard evaluation (that’s how dplyr takes column names without quotes), or implement functions in C++ for faster execution, read this book.

Reading for Those who Want to use Python

Not sure if you want to use R or Python to learn? I've now written a complete article on that topic!

Choose R or Python for Data Science

Introduction to Statistical Learning

This is such a good book, with no real Python equivalent, that multiple people have ported the ISLR labs into Python:

Python Data Science Handbook, Jake van der Plas

This book is a good introduction to data manipulation and graphing in Python. I suggest you read parts 1 to 4, but skip the part on modelling, because the descriptions are too terse to learn from.

[something on doing statistics in Python]

I haven’t found any suitable books. You’ll have to just read the docs for scipy.stats; it will help to already understand how to do it in R.

Introduction to Machine Learning with Python, Müller and Guido

This book is not as good as Introduction to Statistical Learning with R, but it is the best book I have found so far which uses Python/Scikit. If you already understand how to fit models, this book will be good enough to let you transfer that understanding to the Python tools.

Even More?

If this list wasn't enough, I have even more book reviews on Goodreads.com, an external site.

See the Complete Book Reviews on Goodreads Want help learning?

Terran ‣ Articles ‣ Recommended Reading