R Data Science Curriculum

Introduction

General Advice on How to Learn

When you learn skills that you perform (as distinct from learning facts that you just remember), you need to form in your mind two types of mappings. The first and easier is the mapping from a tool or technique to what it does. This is the mapping that you will learn just by reading. The second and more important is the mapping from the problem you need to solve back to the tools and techniques that are applicable to it. This mapping is developed only by doing problems, and without this mapping, your learning is of no practical use. Therefore it is imperative that you do exercises; do the ones from the book, or even better, make up your own related to a domain which is interesting to you.

Research indicates that you will form better memories by typing commands than by copying and pasting them from examples. If you are using an online book, I suggest that you resist the temptation to copy and paste from the book into the R session: pretend that it’s a paper book and type out the commands.

Timing

For some books and sections, recommended timing information appears. This takes the form of a median and maximum time. The median is from those students who studied the same material and ultimately graduated from the training program. The maximum is a way of evaluating your progress: if you take longer than the listed maximum amount of time without successfully completing the material, it indicates that you may be inadequately prepared for this class and have a low probability of ultimately graduating. These timings are derived from data about software engineers, so if you have a business or analytics background, they may not apply to you.

Your corporate program sponsor can tell you if there are specific policies regarding the maximum time at your company.

R for Data Science, Hadley Wickham and Garrett Grolemund (“r4ds”)

This book is available in print or online from the author at http://r4ds.had.co.nz. Note that the chapters are numbered differently in the online version. Since most people seem to be using the online version, the chapter numbers in this document are the online numbers. If you’re using the print version, do what I did and write the online numbers in your table of contents.

Part I

3 Visualization Very important. Become fluent with this material.
4 Workflow Trivial, just read.
5 Transformation Very important. Become fluent with this material. Alternative: sqldf library.
6 Scripts Trivial, just read.
7 Exploratory... Very important. Become fluent with this material.
8 Projects Trivial, just read.
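
To give a sense of what fluency with the Transformation chapter looks like, here is a minimal sketch of a typical dplyr pipeline. It assumes the dplyr package is installed and uses R's built-in mtcars data; the question being answered is made up for illustration and does not come from the book.

```r
library(dplyr)

# Illustrative question: among cars with more than 100 horsepower, what is
# the mean fuel economy for each cylinder count, highest first?
result <- mtcars %>%
  filter(hp > 100) %>%                          # keep only the matching rows
  group_by(cyl) %>%                             # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n()) %>%  # one summary row per group
  arrange(desc(mean_mpg))                       # sort by mean mpg, descending

print(result)
```

A good exercise is to start from the question and rebuild the pipeline without looking at it, then invent a different question on a dataset you care about.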

Part II

10 Tibbles Important but short and easy to understand.
11 Data Import Important, become fluent. Very few new functions.
12 Tidy Data Important and new concepts. Become fluent with spread/gather.
13 Relational... Mostly trivial for anyone who has used SQL before.
14 Strings Another easy chapter for software engineers. Don't memorize.
15 Factors Important and unique to R. Understand. (note bug in examples)
16 Dates and Times You've done dates before. Don't memorize every function.
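
As a rough illustration of the spread/gather idea from chapter 12, the sketch below reshapes a tiny table from wide to long and back. The data frame and its column names are hypothetical, and the code assumes the tidyr package is installed. Note that tidyr 1.0.0 and later also provide pivot_longer() and pivot_wider() as successors to these two functions; gather() and spread() still work.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 22),
  check.names = FALSE  # keep "2019" and "2020" as column names
)

# gather(): collapse the year columns into key/value pairs (wide to long).
long <- gather(wide, key = "year", value = "cases", `2019`, `2020`)

# spread(): the inverse, one column per distinct value of the key
# (long back to wide).
wide_again <- spread(long, key = year, value = cases)

print(long)
print(wide_again)
```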

Part III

18 Pipes Important and specific to R; learn.
19 Functions Mostly trivial for software engineers.
20 Vectors Important and specific to R; learn.
21 Iteration Actually mostly about functional programming. Useful and interesting.
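
To show why the Iteration chapter is really about functional programming, here is a small sketch that computes the same result with a for loop and with a purrr map function. It assumes the purrr package is installed and uses the built-in mtcars data; the variable names are my own.

```r
library(purrr)

# Imperative version: loop over the columns and fill a preallocated vector.
medians_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  medians_loop[[i]] <- median(mtcars[[i]])
}
names(medians_loop) <- names(mtcars)

# Functional version: map_dbl() applies median() to each column and
# returns a named double vector.
medians_map <- map_dbl(mtcars, median)

print(medians_map)
```

The two results are the same; the functional form simply separates "what to do to each element" from the bookkeeping of the loop.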

Part IV

We will not be using Part IV of this book. It presents linear regression modeling in a very non-rigorous way, and I prefer to go through the required statistics first and then do modeling properly.

Part V

27 & 29 R Markdown Optional. Read if you want to.
28 Graphics for... Really about advanced ggplot. Recommended.

Roughly the first half of your time in this book should go into chapters 3, 5, and 7, which contain substantial material that you need to practice repeatedly, and the second half of your time should go into everything else.

Supplemental material on programming with dplyr and dealing with its non-standard evaluation of function arguments: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
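
The issue that the vignette addresses can be sketched as follows. This is a minimal example, assuming a recent dplyr (1.0.0 or later) where the {{ }} "embrace" operator is available; the helper function mean_by() and its argument names are hypothetical and are not part of dplyr.

```r
library(dplyr)

# Hypothetical helper. group_var and summary_var are passed as bare column
# names, so they must be embraced with {{ }} to be forwarded into dplyr verbs.
mean_by <- function(data, group_var, summary_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean({{ summary_var }}), .groups = "drop")
}

# Called the same way you would call a dplyr verb, with unquoted column names.
by_cyl <- mean_by(mtcars, cyl, mpg)
print(by_cyl)
```

Without the embrace, dplyr would look for columns literally named group_var and summary_var, which is the surprise that trips up most people the first time they wrap dplyr code in a function.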

Timing

Part I: median 2 weeks, maximum 4 weeks
Parts I, II, and III: median 4.5 weeks, maximum 7 weeks (inclusive of Part I time)

OpenIntro Statistics, David Diez et al. (“OpenIntro”), 3rd Edition

This is a verbose book with repetitive examples and exercises, so once you understand a concept, move on to the next one. Do a representative sample of the exercises, but stop doing exercises of a certain type once you understand how to do them.

This curriculum refers to the 3rd edition of this book. There is a 4th edition, but I have not had time to read it and update this curriculum, so you'll have to stick with the 3rd edition for now.

We’ll be using chapters 1 through 6. Read all sections, including those marked “special topic.” You are all special.

This book is conceptual and does not tell you any R commands, although whenever it provides example output, that output corresponds perfectly to what R would produce. There is an online software supplement to the book (https://www.openintro.org/stat/labs.php?stat_lab_software=R). I recommend only the “Probability” through “Confidence Level” sections. I do not recommend the Inference sections, because they teach everything using their own custom functions, which are not standard R functions.

This book is available in print, or as a PDF at openintro.org.

Earlier versions of this document also recommended Introductory Statistics with R by Peter Dalgaard. I have since decided that this book is too long relative to the value of its material, so I have cut it and started directly teaching the required material instead. If you are following this curriculum on your own, you will still need it; see the alternative version with Dalgaard, suitable for self-study.

Detailed Chapter Notes [3rd edition]

1 Basic intro, about ½ of which is redundant with R4DS. Quick and easy.
2 Probability. Nontrivial if this is your first time seeing it. Easy if you did it before.
3 Distributions; very important. The most important distributions are the normal and binomial; do not memorize the others.
4 Confidence intervals and the central limit theorem. Important.
5 Statistical tests for continuous data and power tests. Very important.
6 Statistical tests for proportions (e.g. clicks). Very important. The χ² test is also useful but not critical.

We do not use chapters 7 and 8 of OpenIntro Statistics.

Additional Material in the OI Labs:

Additional Material I Will Teach You:

My recommendation is that you try to do at least some of the exercises in OpenIntro both by hand and also in R, and confirm you got the same answer. If you have trouble understanding, I recommend supplementing this book with the Cartoon Guide to Statistics by Gonick and Smith.
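
Here is a minimal sketch of what "by hand and also in R" can look like, using only base R. The sample values and the binomial parameters are made up for illustration; they are not taken from the OpenIntro exercises.

```r
# Hypothetical sample; any numeric vector works.
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1)
n <- length(x)

# "By hand": 95% confidence interval for the mean, computed as
# mean +/- t* x standard error, with n - 1 degrees of freedom.
se <- sd(x) / sqrt(n)
t_star <- qt(0.975, df = n - 1)
ci_by_hand <- mean(x) + c(-1, 1) * t_star * se

# The same interval from the built-in test function.
ci_builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# "By hand": probability of exactly 3 successes in 10 trials with p = 0.2,
# from the binomial formula, compared against dbinom().
p_by_hand <- choose(10, 3) * 0.2^3 * 0.8^7
p_builtin <- dbinom(3, size = 10, prob = 0.2)

# Each pair should agree up to floating-point rounding.
print(all.equal(ci_by_hand, ci_builtin))
print(all.equal(p_by_hand, p_builtin))
```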

Timing

Ch 1-6: median 3 weeks, maximum 6 weeks

An Introduction to Statistical Learning with Applications in R, by James, Witten, Hastie, and Tibshirani (“ISLR”)

(For a time I referred to this book as “JWHT” because some students had been getting it confused with another similarly named book, but that other book has been removed from the curriculum, so I am now back to the standard “ISLR” abbreviation for this book.)

Chapter Notes

Chapter 1, the introduction, is trivial.


Every other chapter, 2 through 10, is both substantial and important. For chapter 2, the content is important but you don’t need to do the labs or exercises, which just cover basics of R that we already know. For all the subsequent chapters, the labs are important. Chapter 10 on unsupervised learning can be done out of order if you need it sooner for your project.

Supplemental material:

Plotmo: Plotmo README (see also the "Plotting regression surfaces with plotmo" vignette on that page)

Learning Tips

Prior to reading this book, take a couple hours to go back and review the material in the first part of r4ds; some people find that they got rusty while doing statistics.

This book uses base-R graphics and manipulation. I suggest that you do not try to deeply understand these; when you work the examples and exercises, do them using the ggplot and dplyr functions that you already learned from r4ds.
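
As a rough sketch of that translation, the example below shows a base-R subset-and-plot idiom in comments, followed by the same steps in dplyr and ggplot2. It uses the built-in mtcars data rather than any dataset from the book, and assumes dplyr and ggplot2 are installed; the variable names are my own.

```r
library(dplyr)
library(ggplot2)

# Base-R style, roughly as a lab might write it:
#   heavy <- mtcars[mtcars$wt > 3, ]
#   plot(heavy$hp, heavy$mpg)

# The same steps with dplyr and ggplot2.
heavy <- mtcars %>%
  filter(wt > 3)  # same row subset as the bracket indexing above

p <- ggplot(heavy, aes(x = hp, y = mpg)) +
  geom_point()    # scatterplot, equivalent to plot(x, y)

print(p)
```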

Community-created solutions reference: https://github.com/asadoughi/stat-learning/

I have not evaluated the quality of these solutions. They may contain serious errors.

Timing

Full book: median 8 weeks, maximum 11 weeks

Miscellaneous Additional Resources

http://www.mosaic-web.org/go/StatisticalModeling/Chapters/Chapter-17.pdf
Causal interpretation of models; what features to include or exclude to test your causal hypothesis. This will make the most sense after you’ve done some linear regressions.

https://blog.acolyer.org/2017/09/25/a-dirty-dozen-twelve-common-metric-interpretation-pitfalls-in-online-controlled-experiments/
Empirical article about things that are likely to go wrong in practice.

http://www.milefoot.com/math/stat/index.htm
Nice description and concise derivations of probability distributions not covered in depth in the books, such as chi-squared and F. There are dozens of comparable sites to choose from; I thought this one had the best writing of the ones I reviewed. (Do NOT read Wikipedia for math just because it’s the top Google result; its exposition is usually terrible.)

Final Project

All students do a final project to complete the program. At a minimum, the final project must include:

Everyone does a different project, so timing will vary.

Introduction to Probability with R, Baclawski

This book is recommended for everyone who intends to continue learning, because it provides an adequate mathematical foundation in probability for reading advanced works. The level of mathematics provided by OpenIntro Statistics is adequate for ISLR but is not adequate for many of the more advanced books.

Unfortunately this otherwise-excellent book has numerous serious errata which impede understanding. You will want to go through the list and mark them in your copy prior to reading. If you skip this, you will regret it.

The minimum set of chapters I recommend is 1-6 and 9.

Final Prep before Advanced Reading

Review linear algebra - remember at least the LU, QR, and SVD decompositions. If you need a book, I liked Trefethen and Bau’s Numerical Linear Algebra, which offers some nice geometric interpretations.
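
If you want to refresh two of these decompositions hands-on, here is a minimal base-R sketch that factors a small matrix and rebuilds it. The matrix values are arbitrary. LU is left out because base R has no direct LU function.

```r
# Small full-rank example matrix (arbitrary values).
A <- matrix(c(2, 1, 1,
              1, 3, 2,
              1, 0, 0), nrow = 3, byrow = TRUE)

# QR: A = Q R, with Q orthogonal and R upper triangular.
qr_A <- qr(A)
Q <- qr.Q(qr_A)
R <- qr.R(qr_A)
# qr() may reorder columns for rank-deficient input, so compare against
# the pivoted columns of A to be safe.
A_from_qr <- Q %*% R
print(all.equal(A_from_qr, A[, qr_A$pivot]))

# SVD: A = U D V^T, with the singular values returned in d.
svd_A <- svd(A)
A_from_svd <- svd_A$u %*% diag(svd_A$d) %*% t(svd_A$v)
print(all.equal(A_from_svd, A))
```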

Remember change of variables in calculus.

Additional R libraries which can be very useful:

data.table
This is a competing paradigm to dplyr; it is also very flexible but takes a totally different approach. It has different strengths and weaknesses; in particular, it is better optimized for speed with large datasets.
reshape
This is another Wickham library, but he doesn’t cover it in R4DS. This library is harder to understand than spread/gather but also more flexible and can handle additional cases.
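
To make the contrast between the dplyr and data.table approaches concrete, here is a small sketch that computes the same grouped summary both ways. It assumes the dplyr and data.table packages are installed and uses the built-in mtcars data; the summary itself is just an illustration.

```r
library(dplyr)
library(data.table)

# dplyr: a pipeline of verbs, one step per line.
dplyr_result <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop") %>%
  arrange(cyl)

# data.table: everything inside one bracket, in the form DT[i, j, by].
# i filters rows, j computes the summary, and keyby groups and sorts.
dt <- as.data.table(mtcars)
dt_result <- dt[hp > 100, .(mean_mpg = mean(mpg)), keyby = cyl]

print(dplyr_result)
print(dt_result)
```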

See the Advanced Reading List.