Terran: Learn Data Science in Python

Introduction

This document covers the curriculum that I use to teach data science in Python. It is still a work in progress and a few sections are incomplete.

Learning in R or Python

Before you follow this curriculum in detail, I recommend reviewing my article on which language to use.

R vs Python for data science R curriculum

General Advice on How to Learn

When you learn skills that you perform (as distinct from learning facts that you just remember), you need to form in your mind two types of mappings. The first and easier is the mapping from a tool or technique to what it does. This is the mapping that you will learn just by reading. The second and more important is the mapping from the problem you need to solve back to the tools and techniques that are applicable to it. This mapping is developed only by doing problems, and without this mapping, your learning is of no practical use. Therefore it is imperative that you do exercises; do the ones from the book, or even better, make up your own related to a domain which is interesting to you.

Research indicates that you will form better memories by typing commands instead of cutting and pasting from examples. If you are using an online book, I suggest that you resist the temptation to cut and paste from the book into the R session — pretend that it’s a paper book and type out the commands.

Notes on the Python Version

If you're here, it's because you prefer to learn in Python instead of R (or your boss strongly prefers for you to do so). Be aware that the path is steeper and rockier going up the Python side of the mountain: you're going to have to memorize a bunch of boring data manipulation first, before you get to make any pretty graphs, and there are several topics which are not adequately covered by any available books. I've done the best I can to smooth over these gaps, and the overall experience is still going to be a rougher one than you would get with R.

Python for Data Science, Jake VanderPlas (JVDP)

The book is graciously available online from the author, or you can buy it as a paperback from any of the usual sources.

Chapter 1 - Installing Python and Jupyter

As of October 2018, I am having success with Anaconda 5.2.0 based on Python 3.6, but I had problems with Anaconda 5.3.0 based on Python 3.7 - some packages won't build under Python 3.7, including dependencies of plotnine. I recommend Python 3.6.

I use pyenv to manage versions and have had good results with it so far. You can use whatever you prefer.

Also install nbextensions and enable collapsible_headings and toc2.

Chapter 2 - Numpy

This book does not contain exercises, so I've written some.

Chapter 2 Exercise Notebook

Chapter 3 - Pandas

There is quite a lot in chapter 3; I wouldn't be surprised if it takes you more than one week to get through it.

Chapter 3 Exercise Notebook

Python notebooks with additional materials:

Supplemental Material Not In Python for Data Science (recommended)
Supplement on Indexing Heuristics (optional)

Chapter 4 - Matplotlib

Stop here, do not read chapter 4. We will not use matplotlib in this class.

Visualization with Altair

Prior to selecting Altair, I reviewed the available libraries. As of late 2018, the two best are Altair and Plotnine. I favor the former because, although it is not as good today, the fact that it renders using Javascript appeals to the kind of people who want to do data science in Python, and it has backing from bigger names in the Python community.

There is currently no book which covers Altair, so here are some online resources to use instead:

Openintro Statistics, David Diez et al ("OpenIntro"), 3rd Edition

This is a verbose book with repetitive examples and exercises, so once you understand a concept, move on to the next one. Do a representative sample of the exercises, but stop doing exercises of a certain type once you understand how to do them.

This curriculum refers to the 3rd edition of this book. There is a 4th edition but I have not had time to read it and update, so you'll have to stick with the 3rd edition for now.

We'll be using chapters 1 through 6. Read all sections, including those marked “special topic.” You are all special.

This book is available printed, or as a PDF at openintro.org. It offers labs, but only in R or SAS. Here's a Python notebook which demonstrates the key functions you will need:

Statistics Python Notebook

Detailed Chapter Notes

1 Basic intro, about ½ of which is redundant with R4DS. Quick and easy.

2 Probability. Nontrivial if this is your first time seeing it. Easy if you did it before.

3 Distributions; very important. The most important distributions are the normal and binomial; do not memorize the others.

4 Confidence intervals and the central limit theorem. Important.

5 Statistical tests for continuous data and power tests. Very important.

6 Statistical tests for proportions (e.g. clicks) is very important. X² also useful; not critical.

We do not use chapters 7 and 8 in OpenIntro Statistics

My recommendation is that you try to do at least some of the exercises in OpenIntro both by hand and also in Python, and confirm you got the same answer. If you have trouble understanding, I recommend supplementing this book with the Cartoon Guide to Statistics by Gonick and Smith.

Introduction to Statistical Learning, James, Witten, Hastie, and Tibshirani ("ISLR")

This book is so good that although its examples are all given in R, the best option for Python is currently to use this book anyway and substitute in alternative labs. There is a a full set of labs from J. Warmenhoven on Github. I have also developed some of my own demonstration notebooks, below.

Chapter 1-2

Chapter 1 is a trivial introduction. Chapter 2 contains important conceptual information about bias and variance; there is nothing in the labs (which are all about basic R programming) that you should try to do.

Chapter 3-4

These chapters are about basic regression and classification. I strongly recommend using the statsmodels library for this, and not the sklearn library, because the latter is missing all the useful diagnostics for your model.

Statsmodels Demo Notebook

Chapter 5-7

TODO: decide which libraries I'm using for these

Chapter 8

The h2o.ai library has reasonably good support for random forests and boosted trees.

H2O Demo Notebook

Chapter 9

TODO: decide which library I'm using for this

Chapter 10

Multi-Library Clustering Demos Multi-Library PCA Demos

Miscellaneous Additional Resources

http://www.mosaic-web.org/go/StatisticalModeling/Chapters/Chapter-17.pdf: Causal interpretation of models; what features to include or exclude to test your causal hypothesis. This will make the most sense after you’ve done some linear regressions.
https://blog.acolyer.org/2017/09/25/a-dirty-dozen-twelve-common-metric-interpretation-pitfalls-in-online-controlled-experiments/: Empirical article about things that are likely to go wrong in practice.
http://www.milefoot.com/math/stat/index.htm: Nice description and concise derivations of probability distributions not covered in depth in the books, such as chisq and F. There are dozens of comparable sites to choose from; I thought this one had the best writing of the ones I reviewed. (Do NOT read Wikipedia for math just because it’s the top Google result; it’s usually terrible at the exposition)

Final Project

All students do a final project to complete the the program. At a minimum, the final project must include:

Exploration and assessment of data
Modeling
A written report

Terran ‣ Articles ‣ Python Data Science Curriculum