Recommended Reading

After you complete the basic learning track, you should be able to read
advanced books on most topics and understand them. Here are some books that I
recommend, grouped by what you would like to accomplish. If you have no idea
what to do next, read either *Applied Predictive Modeling* or *Introduction
to Probability with R*.

Books with an asterisk require a greater level of mathematical fluency than the others and may require you to review linear algebra and matrix calculus.

In this book, Stephen Few gives explicit guidance on what type of graphs to use to visualize different types of information. He starts by explaining the limits and capabilities of the visual system and memory, then applies this to the information you want to see. He also has another book specifically about making dashboards if that’s what you need; there’s quite a bit of overlap, so I suggest one or the other.

Correspondence Analysis (CA) is a technique for visual interpretation of the row and column associations in count data. If you’ve done a two-way Chi-squared test, gotten the final result, and thought to yourself “there must be a better way of interpreting this beyond just significant/not”, then correspondence analysis may be for you. There are also extensions for mixed count and continuous data.

This book doesn’t teach much about statistics, but it does show how to execute statistical tests in R, assuming you already know what the statistical test does. It is the most comprehensive book I have found about what functionality is available in R for doing the statistics you might want to do.

If you didn’t already read this as part of the main program, now would be an excellent time to do so. See the notes above for details. This book does have some code, but you won’t learn anything new about R from it; you’re reading it to learn the theory. This book is very accessible (aside from the errata, which you’ll definitely want to look up in advance) and I read it straight through and loved it.

This is a graduate-level stats book, which means it’s basically hundreds of pages of theorems and calculus. I’ll be completely honest — even I couldn’t stomach reading this one straight through. I use it as a reference for specific concepts when I need more detail. You can get the Indian paperback edition for a reasonable price.

This book is actually about much more than just multilevel models; the first two parts are a review of regression and classification, good guidance on transformation, and a great discussion of causal inference from models and experiments which is not covered anywhere else on this list. You may want to read the first part of the book for this even if you’re not interested in multilevel models at all. If you do read all of this book, note that the Windows-only BUGS tool they teach is now superseded by a tool called JAGS, with similar syntax; use JAGS when you do the examples.

This is a full book solely on linear regression and related techniques. It goes into considerably more detail about aspects of the math, measures of influence, the hat-values, details of ANOVA, and so forth, which were mostly glossed over by the books that you read as part of the class. The 1st edition can be had used quite cheaply. Fox is the author of the “car” (Companion to Applied Regression) R package that you’ve used before, derived from this book.

This book starts by explaining the problematic nature of confidence intervals on proportions, hinted at by Dalgaard, and then goes on to describe topics such as multinomial and Poisson regressions, overdispersion, and multiple observations per row, which are not covered by the books we used in the class.
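To make the problem with confidence intervals on proportions concrete, here is a minimal Python sketch comparing the textbook normal-approximation (Wald) interval with the Wilson score interval. The Wilson interval is one commonly used alternative; it is chosen here for illustration and is not necessarily the method the book recommends. The function names are hypothetical.

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval; its bounds stay inside [0, 1]."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Zero successes in 20 trials: the Wald interval collapses to zero width,
# claiming certainty that the true proportion is exactly 0.
print(wald_interval(0, 20))
print(wilson_interval(0, 20))
```

With no observed successes, the Wald interval has no width at all, while the Wilson interval still allows for a true proportion noticeably above zero.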

In Bayesian statistics, you explicitly express your “prior” belief and then update it with your data. These techniques are especially useful for smaller datasets with complex nested or otherwise unusual structures; they can also help with multiple comparisons. This book is the most useful and accessible introduction I have found.
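The idea of stating a prior and updating it with data can be sketched in a few lines of Python using the simplest conjugate case: a Beta prior on a proportion, updated with binomial data. The prior values and the counts below are made up for illustration, and the function names are hypothetical; real analyses of the nested structures mentioned above need a proper modeling tool.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (2, 2)  # hypothetical weak prior centred on 0.5
posterior = update_beta(*prior, successes=7, failures=3)

print(beta_mean(*prior))      # prior belief about the proportion
print(beta_mean(*posterior))  # belief after seeing 7 successes in 10 trials
```

The posterior mean lands between the prior mean of 0.5 and the observed rate of 0.7, and it moves closer to the data as the sample grows.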

This is a rigorous and more theoretical book on Bayesian statistics. It’s tough going at times and you had better be prepared to work through the calculus. I would suggest reading it only after another Bayesian book such as Kruschke or Gelman and Hill.

This is the most useful book I have found for practical causal inference when you cannot do a controlled experiment.

This book shows you a wider variety of regression and classification models, and also shows you (both conceptually and with a library) how to tune hyperparameters, which is required for most advanced model fitting. It also introduces the “caret” library, written by one of the authors, which provides a unified interface to multiple models.
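For a feel of what tuning a hyperparameter means, here is a small pure-Python sketch that picks the number of neighbours for a k-nearest-neighbour regression by cross-validation. It is not caret and not the book's code; the data is synthetic, and the candidate values and function names are assumptions chosen for illustration.

```python
import random

def knn_predict(train, x, k):
    """Average the y-values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def cv_error(data, k, folds=5):
    """Mean squared error of k-NN, estimated by k-fold cross-validation."""
    total = 0.0
    for f in range(folds):
        held_out = data[f::folds]  # every folds-th point, starting at f
        train = [pt for i, pt in enumerate(data) if i % folds != f]
        total += sum((knn_predict(train, x, k) - y) ** 2 for x, y in held_out)
    return total / len(data)

random.seed(0)
# Synthetic data: y = x^2 plus Gaussian noise
data = [(x, x * x + random.gauss(0, 0.5))
        for x in (random.uniform(-3, 3) for _ in range(100))]

candidates = [1, 3, 5, 10, 25, 50]  # hypothetical grid of k values
errors = {k: cv_error(data, k) for k in candidates}
best_k = min(errors, key=errors.get)
print(best_k, errors)
```

Each candidate is scored only on data it was not fitted to, which is what keeps the search from simply rewarding the most flexible model.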

This is a more theoretical and mathematical book, with no code examples, from some of the same authors as Introduction to Statistical Learning. Recommended at least as a reference, even if you don’t read the whole thing.

This is an excellent introduction to the principles of deep neural networks, which are the only class of architecture that anybody cares about right now for problems with millions or more examples in the training set. Prerequisites include fairly strong linear algebra and some Bayesian statistics.

After you’ve read the Goodfellow book on deep learning, this book will handhold you through applying common techniques using the Keras library. The DL community has standardized on Python, so read the Python version unless you are deeply committed to R.

Great, readable book introducing basic principles for natural language work.

Graphs (the kind with vertices and edges, not the kind with bars and colors) are important.

Time series can be modeled with linear regression, but they also have their own dedicated techniques, which are important too. The first half of this book is a very accessible introduction, and the second half is a somewhat terse survey.

If you want to do more with time series beyond this book, I recommend Ruey Tsay’s books on Financial Time Series and Multivariate Time Series. Some people like Shumway and Stoffer, but I didn’t.

A more advanced book which can sometimes be so terse that you can’t understand what he’s saying unless you already know the core ideas, but nonetheless a very useful reference. I especially like his taxonomy of techniques and data transformations.

If you plan to work with detecting abnormal data or anomalies in any way, this book is great.

This is not a technical book; it’s a business book. It talks about how to frame your problem in the business context, whether your data is valuable for the problem at hand, etc. It even has a flowchart at the back with specific steps to try when you have each kind of problem. It is most useful when you are interfacing directly with businesspeople as the primary or sole technology representative. Although there is a small section on data mining technology, you should skip it, as it is both cursory and obsolete; the value in this book is elsewhere.

This explains the theory behind how ggplot2 composes layers and aesthetics, and introduces more functionality than you saw in R for Data Science. It should be a pretty quick read and will help you up your game making graphs.

How to think about data science as a business asset; how to evaluate project proposals for correct alignment with objectives. Can help you communicate about the value of a project as opposed to the implementation of a project.

This book explains all of the lower-level R programming which R for Data Science glossed over. Strongly recommended if you intend to continue to work in R. This book used to be a primary book in the class before r4ds was written.

This book has more to say about what algorithms and statistics R actually implements for many of the functions in the base and stats libraries. Sometimes the online documentation is inadequate and refers you to this book.

If you want to learn how to write your own libraries in R, with classes or nonstandard evaluation (that’s how dplyr takes column names without quotes), or implement functions in C++ for faster execution, read this book.

Not sure if you want to use R or Python to learn? I've now written a complete article on that topic!

Choose R or Python for Data Science

This is such a good book, with no real Python equivalent, that multiple people have ported the ISLR labs into Python:

This book is a good introduction to data manipulation and graphing in Python. I suggest you read parts 1 to 4, but skip the part on modelling, because the descriptions are too terse to learn from.

I haven’t found any suitable books. You’ll have to just read the docs for scipy.stats; it will help to already understand how to do it in R.

This book is not as good as Introduction to Statistical Learning with R, but it is the best book I have found so far which uses Python/Scikit. If you already understand how to fit models, this book will be good enough to let you transfer that understanding to the Python tools.

If this list wasn't enough, I have even more book reviews on Goodreads.com, an external site.

See the Complete Book Reviews on Goodreads