Stats Fundamentals

October 18, 2020 | ☕️ ☕️ 9 min read

Notes on rereading An Introduction to Statistical Learning (ISL)

I. Overview

I like to think of ISL as a survey of toolkits across different modeling approaches for a) fitting a model and b) examining model fit. Below are notes on ISL’s classification, regression, tree, and support-vector machine toolkits.

Why estimate a model, f? The two main reasons are

  1. Inference: How does our response Y change with our predictors X? Is the relationship between Y and X linear or more complex?

  2. Prediction: Can we predict Y with X while minimizing reducible error?

How do we estimate f?

  1. Parametric methods: we assume a functional form and estimate the parameters (also called weights or coefficients) by fitting the model to training data.

  2. Non-parametric methods: sometimes called “distribution-free” because they make fewer assumptions, these methods determine the model structure by looking at how closely we can fit the data. Here we need lots of data and it’s easy to overfit.

Bias-Variance Trade-off

Evaluating model fit often means considering the bias-variance trade-off, where bias is the error from approximating a real-life problem with a simpler model and variance is the amount the model would change if it were fit on a different training set. As you improve one, the other tends to degrade.
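For reference, the trade-off can be written as a decomposition of the expected test error at a point. The symbols here are notation assumed from ISL rather than from these notes: x_0 and y_0 are a test observation, \hat{f} is the fitted model, and \varepsilon is the irreducible noise.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The first two terms are the reducible error; the last term is a floor that no model can get below.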

• • •

II. Unsupervised Learning

When we don’t have responses (i.e. labels, dependent variables) available, we can still look at the structure of the data to learn patterns and subgroups/clusters.

Clustering (i.e. segmentation) Approaches

  1. K-Means: specify number of clusters K, assign observations to a (non-overlapping) cluster, iteratively reduce within-cluster variance by minimizing objective function.

  2. Hierarchical Clustering (bottom-up/agglomerative clustering): Starting at the bottom of a tree, fuse the 2 cluster groups that are most similar and work upwards.
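As a rough illustration of the K-Means loop described above, here is a minimal pure-Python sketch. The function name `kmeans`, the toy points, and the choice to pass initial centroids explicitly (to keep the example deterministic) are all assumptions for illustration; in practice the starting point is random and the algorithm is usually run several times.

```python
def kmeans(points, init_centroids, iters=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)
```

Each pass can only lower (or keep) the within-cluster variance, which is why the loop settles, though possibly at a local optimum.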

Principal Component Analysis (PCA)

PCA can be used to reduce the dimensionality (i.e. number of predictors) of a dataset by keeping only the first few principal components, chosen by looking at the proportion of variance explained (PVE) by each component. The first principal component, a normalized linear combination of the features, is the direction along which the data varies the most, which makes it a good one-dimensional summary of the data. The loading vectors are the weights of the principal components and define those directions in the feature space.
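The PVE of the m-th principal component can be written out explicitly. This follows ISL's notation, which is an assumption here: x_{ij} is the centered value of predictor j for observation i, and \phi_{jm} is the loading of predictor j on component m.

```latex
\mathrm{PVE}_m =
  \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}
       {\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}
```

The PVEs of all components sum to 1, so a common heuristic is to keep components until the cumulative PVE levels off.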

• • •

III. Supervised Learning — Examining Quality of Fit

ISL covers a few ways to determine how “good” the model fit is.

Residual Standard Error (RSE)

Measures lack of fit: it is the error we expect on average. It estimates the standard deviation of the residuals (our model error).


R-Squared

Measures how well the regression line fits the data as the proportion of variance in Y that the model explains (on a scale of 0–100%). R-squared never decreases when adding predictors because each one allows us to better (over)fit the training data. To avoid this, we can omit predictors that yield only relatively small R-squared increases. These fit metrics are built from the residual sum of squares (RSS) and the total sum of squares (TSS), and RSE is the residual standard error.
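To make RSS, TSS, RSE, and R-squared concrete, here is a small sketch that fits a one-predictor least squares line and computes each quantity. The helper name `fit_metrics` and the toy data are assumptions for illustration; the formulas are the standard ones (RSE = sqrt(RSS / (n - p - 1)), R-squared = 1 - RSS / TSS).

```python
import math


def fit_metrics(x, y):
    """Fit y = b0 + b1 * x by least squares and report fit metrics."""
    n, p = len(x), 1  # p = number of predictors
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    preds = [intercept + slope * xi for xi in x]

    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))  # residual sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rse = math.sqrt(rss / (n - p - 1))                     # residual standard error
    r2 = 1 - rss / tss                                     # proportion of variance explained
    return {"rss": rss, "tss": tss, "rse": rse, "r2": r2}


# Hypothetical toy data that is close to linear.
metrics = fit_metrics([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(metrics)
```

RSE is in the units of Y, while R-squared is unitless, which is why the two are usually read together.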

Testing the Null Hypothesis

The null hypothesis assumes no relationship between X and Y. If the model fits the data well, we have evidence to reject it, for example when the p-value for a coefficient is small.

Confidence vs. Prediction Intervals

Confidence intervals measure uncertainty in an estimated parameter, such as the mean response, rather than in a single point. If we drew 100 random samples and built a 95% confidence interval from each, we would expect about 95 of those intervals to contain the true parameter. Prediction intervals tell us in what range a future individual observation will fall.

With CI you can say “I am X% confident that the mean value falls in range (A,B)”. With PI you can say “I am X% confident that the next Y with a given X will be in the range (A,B)”.

Because a PI predicts an individual value, there is greater uncertainty, so at the same confidence level it will always be wider than the CI.
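For simple linear regression the difference shows up as a single extra term. The notation is assumed for illustration: \hat{y}_0 is the fitted value at x_0, s is the RSE, and t is the appropriate t-quantile with n - 2 degrees of freedom.

```latex
\text{CI for the mean response:}\quad
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

\text{PI for a new observation:}\quad
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

The extra 1 under the square root accounts for the irreducible noise in a single observation, which is why the PI stays wide even with a lot of data.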

Debugging Tricks

• • •

IV. Classification

Bayes Classifier

Bayes rule is a way to update our beliefs (B) based on the arrival of new evidence (E).

The Bayes classifier assigns each observation to its most likely class given its predictors. It’s considered an unattainable gold standard because in real life we don’t know the true conditional probabilities. The Bayes error rate is the irreducible error rate.
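Written out, with B the belief and E the evidence as above, Bayes' rule and the resulting classifier look like this. The classifier and error-rate lines use ISL's notation (Y for the class, X for the predictors, x_0 for a test point), which is assumed here.

```latex
P(B \mid E) = \frac{P(E \mid B)\, P(B)}{P(E)}

\hat{y}(x_0) = \arg\max_{j} \; P(Y = j \mid X = x_0)

\text{Bayes error rate} = 1 - E\left[ \max_{j} P(Y = j \mid X) \right]
```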

K-Nearest Neighbors

If we draw a circle around the K nearest neighbors of a point, we put that point in the class that a majority of those neighbors are in. KNN tends to perform worse than a parametric method like linear regression when there are a large number of predictors, because the nearest neighbors end up far away in high dimensions and we have fewer samples per predictor (this is considered “the curse of dimensionality”).
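The majority vote can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance and hypothetical toy data; the function name `knn_predict` is made up for the example.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each squared distance with its label, then sort nearest first.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class "a" near the origin, class "b" further out.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (1.5, 1.5)))  # a
print(knn_predict(train_X, train_y, (6.5, 6.5)))  # b
```

A small K gives a flexible, high-variance boundary; a large K gives a smoother, higher-bias one.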

Logistic Regression

To get the probability that a sample is in a given class, we fit an S-shaped curve (the “sigmoid”) to our data instead of a line, as in linear regression, which keeps predicted values between 0 and 1. The prediction is the sigmoid of the weighted sum of the input features. Because the sigmoid makes the model non-linear in its weights, there is no closed-form least squares solution; instead we minimize the log-loss (i.e. cross-entropy) and use gradient descent to iteratively update the weights. The term “logits” refers to the raw scores that a classification model generates before a function such as the sigmoid transforms them to probabilities.
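Here is a minimal sketch of that training loop in pure Python: sigmoid of a weighted sum, log-loss as the cost, and batch gradient descent for the updates. The function names, learning rate, epoch count, and toy data are all assumptions for illustration, not a tuned implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(xi, w, b):
    # The logit is the raw weighted sum; the sigmoid maps it into (0, 1).
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))


def log_loss(X, y, w, b, eps=1e-12):
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(xi, w, b), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of log-loss w.r.t. the logit
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy data: small x is class 0, large x is class 1.
X = [(0.5,), (1.0,), (1.5,), (3.0,), (3.5,), (4.0,)]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba((0.5,), w, b), predict_proba((4.0,), w, b))
```

With all-zero weights every prediction is 0.5 and the log-loss is log(2); training should push the loss below that.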

Linear Discriminant Analysis (LDA)

This models the distribution of the predictors separately in each of the response classes (the density function), then uses Bayes’ theorem to flip them into estimates of the posterior, P(Y=k|X=x). We’d use LDA over logistic regression when

  1. classes are well-separated,

  2. the distribution of the predictors is approximately normal,

  3. and there are more than 2 response classes.

Quadratic Discriminant Analysis is like LDA but assumes each class has its own covariance matrix. We’d use this when the training set is very large (so classifier variance is not a concern) or when the assumption of a shared covariance matrix is untenable.
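For a single predictor, LDA reduces to estimating a mean and a prior per class plus one pooled variance, then picking the class with the largest discriminant score x * mu_k / var - mu_k^2 / (2 * var) + log(prior_k). The sketch below assumes that one-predictor setting; the function names and toy data are hypothetical.

```python
import math


def fit_lda_1d(x, y):
    """Estimate per-class means and priors, and a variance shared by all classes."""
    n = len(x)
    classes = sorted(set(y))
    means, priors, pooled = {}, {}, 0.0
    for k in classes:
        xs = [xi for xi, yi in zip(x, y) if yi == k]
        means[k] = sum(xs) / len(xs)
        priors[k] = len(xs) / n
        pooled += sum((xi - means[k]) ** 2 for xi in xs)
    var = pooled / (n - len(classes))  # pooled (shared) variance estimate
    return means, priors, var


def predict_lda_1d(x0, means, priors, var):
    """Return the class with the largest linear discriminant score at x0."""
    def score(k):
        return x0 * means[k] / var - means[k] ** 2 / (2 * var) + math.log(priors[k])
    return max(means, key=score)


# Hypothetical toy data: class 0 centered at 1.5, class 1 centered at 5.5.
x = [1.0, 1.5, 2.0, 5.0, 5.5, 6.0]
y = [0, 0, 0, 1, 1, 1]
means, priors, var = fit_lda_1d(x, y)
print(predict_lda_1d(3.4, means, priors, var), predict_lda_1d(3.6, means, priors, var))
```

With equal priors the decision boundary sits at the midpoint of the two class means (3.5 here). QDA would instead estimate a separate variance per class, which makes the boundary quadratic in x.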

Classification-specific Performance Metrics

• • •

V. Regressions

Linear regression assumes a linear relationship between X and Y. The most common approach is to measure closeness of fit with least squares. Alternatives and extensions to least squares are covered in the following sections.

Subset selection

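Fit separate least squares regressions for each combination of the p predictors and pick the best one (best subset selection). When that is too expensive, stepwise selection adds predictors one at a time (forward), removes them one at a time (backward), or mixes both (hybrid).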


Regularization

Where least squares has high variance, regularization improves model performance by shrinking the weights of less important predictors. It should be applied after standardizing the predictors so that they are all on the same scale (each with a standard deviation of 1).

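Ridge regression: minimizes RSS + a shrinkage penalty on the squared coefficients (the l2 norm). A larger tuning parameter means a larger penalty, which shrinks coefficients toward 0 but never exactly to 0. From a Bayesian view, it assumes the coefficients are normally distributed around 0.

Lasso regression: minimizes RSS + a penalty on the absolute values of the coefficients (the l1 norm), which lets us shrink some coefficients exactly to 0 and so performs variable selection. From a Bayesian view, it assumes a distribution sharply peaked at 0, i.e. that many coefficients are 0.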
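The difference between the two penalties is easiest to see in the simplified special case ISL uses: as many observations as predictors, an identity design matrix, and no intercept, so each least squares coefficient equals y_j. Under those assumptions ridge scales every coefficient by 1 / (1 + lambda), while lasso soft-thresholds at lambda / 2. The function names and numbers below are hypothetical.

```python
def ridge_shrink(ols_coefs, lam):
    """Ridge in the identity-design case: proportional shrinkage."""
    return [b / (1.0 + lam) for b in ols_coefs]


def lasso_shrink(ols_coefs, lam):
    """Lasso in the identity-design case: soft-thresholding at lam / 2."""
    out = []
    for b in ols_coefs:
        if b > lam / 2:
            out.append(b - lam / 2)
        elif b < -lam / 2:
            out.append(b + lam / 2)
        else:
            out.append(0.0)  # small coefficients are set exactly to zero
    return out


# Hypothetical least squares coefficients.
ols = [3.0, -0.4, 0.2, -2.0]
ridge = ridge_shrink(ols, lam=1.0)
lasso = lasso_shrink(ols, lam=1.0)
print(ridge)
print(lasso)  # [2.5, 0.0, 0.0, -1.5]
```

Ridge keeps all four coefficients but halves them; lasso drops the two small ones entirely, which is the variable selection behavior.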

Dimension Reduction Methods

ISL explores 2 methods

  1. Principal component regression (PCR): Often a small number of principal components explains most of the variability in the data. PCR regresses Y on only that subset of components, which effectively acts as a regularization procedure and reduces overfitting.

  2. Partial least squares (PLS): a supervised alternative to PCR that sets each weight in the first direction to the coefficient from a simple linear regression of Y on that predictor. Subsequent directions are found the same way after taking residuals. PLS can reduce bias but also has the potential to increase variance, so in practice the benefits of PCR vs. PLS are about the same.


Splines

Splines are functions defined piecewise by polynomials that improve model flexibility. A regression spline fits separate low-degree polynomials over different regions of X, joined at knots. Because splines can have high variance at the outer range of the predictors (when X is very small or very large), we can use a natural spline, which adds the constraint that the function be linear at the boundaries.

Separately, smoothing splines place a knot at every observation and then regularize (smooth) the fit by adding a roughness penalty term.
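The roughness penalty can be written as an objective over candidate functions g. The notation is assumed from ISL: \lambda is a non-negative tuning parameter and g'' is the second derivative of g.

```latex
\min_{g} \; \sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2
  + \lambda \int g''(t)^2 \, dt
```

The first term rewards fitting the data and the second penalizes wiggliness: \lambda = 0 lets g interpolate every point, while a very large \lambda forces g toward a straight line.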

Generalized Additive Models (GAMs)

GAMs apply a non-linear function to each predictor and are fit by backfitting: repeatedly updating the fit for each predictor while holding the others fixed. Since the model is additive, we can examine the effect of each predictor on Y individually.
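In symbols, a GAM for a quantitative response replaces each linear term with its own smooth function. The notation is assumed from ISL: f_j is the function for predictor j and \varepsilon_i is the error term.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i
```

Each f_j could be, for example, a natural spline or a smoothing spline of its predictor.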

• • •

VI. Trees

Trees stratify or segment the predictor space into a number of regions with the goal of minimizing RSS. The splitting rules can be summarized in a tree.

Pros: explainable, mirror human decision-making, can be displayed graphically, easily handle qualitative predictors. Cons: lower prediction accuracy, not robust (a small change in the data can lead to a very different tree).

Regression Trees

Trees are grown by recursive binary splitting, a greedy top-down approach that at each step picks the split giving the lowest RSS. The predicted response is the mean of the training observations in the terminal node. The basic method here is to grow a large tree, apply cost-complexity pruning to get the best subtree for each value of alpha, and then use k-fold CV to choose alpha by average prediction error, returning the subtree for the chosen value of alpha.
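A single greedy step of that splitting procedure can be sketched for one predictor: try every cut point between adjacent values and keep the one with the lowest combined RSS. The function names and toy data are assumptions for illustration; a full tree would apply this recursively to each resulting region and across all predictors.

```python
def rss(values):
    """Residual sum of squares around the mean (the node's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_rss) for the best binary split on one predictor."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data with an obvious jump between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.5, 4.5, 20.0, 21.0, 19.0]
threshold, total_rss = best_split(x, y)
print(threshold, total_rss)  # 6.5 2.5
```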

Classification Trees

These are similar to regression trees, but the predicted response is the most commonly occurring class in the terminal node instead of the mean. Measures of node purity (how many observations in a node belong to its most common class) include the Gini index and entropy, which tend to take similar values.
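Both purity measures are short functions of the class proportions in a node. This sketch uses the definitions from ISL (Gini = sum of p * (1 - p), entropy = -sum of p * log p with the natural log); the function names and example labels are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini index: 0 for a pure node, larger when classes are mixed."""
    return sum(p * (1 - p) for p in class_proportions(labels))


def entropy(labels):
    """Entropy (natural log): 0 for a pure node, larger when classes are mixed."""
    return -sum(p * math.log(p) for p in class_proportions(labels))


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]
print(gini(pure_node), gini(mixed_node))        # 0.0 0.5
print(entropy(pure_node), entropy(mixed_node))
```

Both are smallest when a node contains a single class, so either can be used to score candidate splits.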

Ensemble Methods

Trees benefit from ensemble methods, where multiple learning algorithms are combined.

• • •

VII. Support Vector Machines (SVMs)

SVMs are useful for supervised learning problems where we suspect non-linear relationships. Here we separate the data with a hyperplane and apply a kernel function to map observations to a high-dimensional feature space. The SVM can also be thought of as minimizing the “hinge loss”, where the model penalizes misclassified points and also correctly classified points that fall inside the margin (points the model is not confident in).
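The hinge loss is simple enough to compute by hand. The sketch below assumes labels coded as +1 and -1 and a linear score b + w · x; the weights and points are hypothetical and chosen to show the three cases.

```python
def hinge_losses(w, b, X, y):
    """Per-observation hinge loss max(0, 1 - y * score) for labels in {-1, +1}."""
    losses = []
    for xi, yi in zip(X, y):
        score = b + sum(wj * xj for wj, xj in zip(w, xi))
        losses.append(max(0.0, 1.0 - yi * score))
    return losses


# Hypothetical 1-D example with w = 1, b = 0; all true labels are +1.
X = [(2.0,), (0.5,), (-1.0,)]
y = [1, 1, 1]
losses = hinge_losses(w=(1.0,), b=0.0, X=X, y=y)
print(losses)  # [0.0, 0.5, 2.0]
```

The first point is correct and outside the margin (no loss), the second is correct but inside the margin (small loss), and the third is misclassified (large loss).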

Some vocabulary:

Margin: the smallest perpendicular distance from the training observations to the hyperplane

Support vectors: the training observations on or inside the margin (those closest to the hyperplane); only they affect the location of the hyperplane

Hyperplane: flat affine subspace of dimension p - 1 (p = number of predictors)

Support Vector Classifier. This uses slack variables to let some observations be on the wrong side of the margin or hyperplane. In ISL’s formulation, the tuning parameter C acts as a budget for these violations: a large C tolerates many observations on the wrong side of the margin, while a small C tolerates few.

Support Vector Machine. The ~machine~ here enlarges the feature space in a tractable way through kernel methods. The kernel measures similarity between any pair of observations. There are linear, polynomial, and radial kernels.
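The three kernels named above are each a one-line similarity function between two observations. The forms below follow ISL; the function names, the default `degree` and `gamma` values, and the example vectors are assumptions for illustration.

```python
import math


def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(a, b, degree=2):
    return (1 + linear_kernel(a, b)) ** degree


def radial_kernel(a, b, gamma=1.0):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


a, b = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(a, b))      # 11.0
print(polynomial_kernel(a, b))  # 144.0
print(radial_kernel(a, b))      # close to 0: the points are far apart
print(radial_kernel(a, a))      # 1.0: a point is maximally similar to itself
```

The radial kernel is local: observations far from a test point contribute almost nothing to its prediction.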

• • •

Thanks for reading! Let me know if you think this could benefit from other stats concepts.



© 2022 Ashe Magalhaes.