Stats Fundamentals


Notes on reading An Introduction to Statistical Learning (ISL)

I. Overview

I like to think of ISL as a survey of toolkits across different modeling approaches for a) fitting a model and b) examining model fit. Below are notes on ISL’s classification, regression, tree, and support-vector machine toolkits.

Why estimate a model, f? The two main reasons are

  • Inference: How does our response Y change with our predictors X? Is the relationship between Y and X linear or more complex?
  • Prediction: Can we predict Y with X while minimizing reducible error?

How do we estimate f?

  • Parametric methods: we assume a functional form and estimate its parameters (also called weights or coefficients) by fitting the model to training data

Bias-Variance Trade-off

Evaluating model fit often means considering the bias-variance trade-off, where bias is the error from approximating a real-life problem with a simpler model and variance is the amount the fitted model would change if we estimated it on a different training set. As you reduce one, the other tends to increase.
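
For reference, ISL writes this trade-off as a decomposition of the expected test error at a point x_0. Here \hat{f} is the fitted model and \epsilon is the irreducible noise:

```latex
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\epsilon)
```

The first two terms are the reducible error; the last one is a floor that no model can get under.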

• • •

II. Unsupervised Learning

When we don’t have responses (i.e. labels, dependent variables) available, we can still look at the structure of the data to learn patterns and subgroups/clusters.

Clustering (i.e. segmentation) Approaches

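  • K-Means: specify the number of clusters K, assign each observation to exactly one (non-overlapping) cluster, then iteratively reduce within-cluster variance by alternating between recomputing each cluster’s centroid and reassigning observations to the nearest centroid.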
  • Hierarchical Clustering (bottom-up/agglomerative clustering): Starting at the bottom of a tree, where each observation is its own cluster, fuse the 2 clusters that are most similar and work upwards.
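
A minimal sketch of the K-Means loop, assuming 2-D points, squared Euclidean distance, and starting centroids picked by hand (the function name and the data are made up for illustration):

```python
def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm on 2-D points with caller-supplied starting centroids."""
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

In practice the result depends on the starting centroids, so it is common to run this from several random starts and keep the best one.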

Principal Component Analysis (PCA)

PCA can be used to reduce the dimensionality (i.e. number of predictors) of a dataset by keeping only the first few principal components, chosen by looking at the proportion of variance explained (PVE) by each component. The first principal component, a normalized linear combination of the features, is the single direction that best summarizes the data. The loading vectors are the weights that define each principal component; the first loading vector gives the direction in feature space along which the data varies the most.
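
To make the loading vector and PVE concrete, here is a rough sketch for two features only. It assumes centered (not standardized) data and finds the first loading vector by power iteration on the 2x2 covariance matrix, which is just one simple way to get it. The function name and data are made up.

```python
import math


def first_principal_component(data, n_iter=100):
    """First loading vector and its PVE for a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    centered = [(x - mean_x, y - mean_y) for x, y in data]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    v = (1.0, 0.0)
    for _ in range(n_iter):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)

    # Variance along v, divided by total variance, is the PVE of the first component.
    variance_along_v = v[0] * (sxx * v[0] + sxy * v[1]) + v[1] * (sxy * v[0] + syy * v[1])
    return v, variance_along_v / (sxx + syy)


# Hypothetical data lying close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
loading, pve = first_principal_component(data)
```

Because the two features move together, the loading vector points roughly along the diagonal and the first component explains nearly all of the variance.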

• • •

III. Supervised Learning — Examining Quality of Fit

ISL covers a few ways to determine how “good” the model fit is.

Residual Standard Error (RSE)
Measures lack of fit: roughly, the average amount by which the response deviates from the fitted regression line. It is an estimate of the standard deviation of the residuals (our model error).

R-squared
Measures the proportion of variance in the response that the regression explains (on a scale of 0–100%). R-squared never decreases when predictors are added, because extra predictors let us better (over)fit the training data. To guard against this, we can omit predictors that produce only relatively small R-squared increases. Both metrics are built from the residual sum of squares (RSS) and the total sum of squares (TSS): RSE = sqrt(RSS / (n - p - 1)) and R-squared = 1 - RSS / TSS, where n is the number of observations and p is the number of predictors.
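
As a rough sketch of how RSS, TSS, RSE, and R-squared relate, assuming a single predictor and made-up data (the function names are illustrative, not from ISL):

```python
import math


def simple_linear_fit(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_metrics(x, y):
    n = len(x)
    intercept, slope = simple_linear_fit(x, y)
    y_bar = sum(y) / n
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - 2))  # n - p - 1 with p = 1 predictor
    r_squared = 1 - rss / tss
    return rss, tss, rse, r_squared


# Hypothetical data that is close to linear.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
rss, tss, rse, r_squared = fit_metrics(x, y)
```

RSE is in the units of Y, while R-squared is unitless, which is why the two are usually reported together.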

Testing the Null Hypothesis
The null hypothesis assumes no relationship between X and Y. If the model captures a real relationship, the data will give us evidence to reject it.

  • T-Statistic: Number of standard errors that our estimated coefficient is from 0.

  • P-value: The probability of observing a value at least as extreme as our t-statistic (greater than or equal to |t| in absolute value), assuming the null hypothesis is true. A small p-value means this probability is low, and we can reject the null hypothesis.

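  • F-statistic: with many predictors we want this metric because each individual p-value will fall below .05 about 5% of the time by chance alone, so some predictors will look significant even when none are. The F-statistic tests all coefficients at once. Unlike a p-value, it is a large F-statistic (well above 1) that lets us reject the null hypothesis.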
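
The t-statistic and F-statistic can be worked out by hand for a one-predictor regression. This is a sketch on made-up data; turning the statistics into p-values needs the t and F distributions, which are left out here.

```python
import math

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 1

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - y_bar) ** 2 for yi in y)

# Standard error of the slope: sqrt(RSE^2 / sum((x - x_bar)^2)).
se_slope = math.sqrt(rss / (n - p - 1) / sxx)

# t-statistic: how many standard errors the slope is from 0.
t_stat = slope / se_slope

# F-statistic for the null hypothesis that all p slope coefficients are 0.
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
```

With a single predictor the two tests agree, and the F-statistic equals the square of the t-statistic. The F-statistic only adds something once there are several predictors.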

Confidence vs. Prediction Intervals
Confidence intervals measure uncertainty about a population quantity, such as the average response. If we drew 100 random samples of our data and built a 95% confidence interval from each, we would expect about 95 of those intervals to contain the true parameter. Prediction intervals tell us in what range a future individual observation will fall.

With CI you can say “I am X% confident that the mean value falls in range (A,B)”. With PI you can say “I am X% confident that the next Y with a given X will be in the range (A,B)”.

Because PI predicts an individual value, there is greater uncertainty and so this will always have a wider interval than CI.
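
For simple linear regression, the standard textbook formulas make the difference visible. Here s is the RSE, \hat{y}_0 is the fitted value at x_0, and t is the appropriate quantile of the t distribution with n - 2 degrees of freedom:

```latex
\begin{aligned}
\text{CI (mean response at } x_0\text{):}\quad
  & \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,
    \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i}(x_i - \bar{x})^2}} \\
\text{PI (new observation at } x_0\text{):}\quad
  & \hat{y}_0 \pm t_{n-2,\,1-\alpha/2}\; s\,
    \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i}(x_i - \bar{x})^2}}
\end{aligned}
```

The only difference is the extra 1 under the square root, which accounts for the irreducible noise in a single new observation.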

Debugging Tricks

  • Heteroscedasticity: this is when the error terms have non-constant variance.

  • Plot residual by time: could show a correlation in error terms (occurs in time-series data). In this case, the estimated standard errors will underestimate the true standard errors.

  • Plot residual by fitted values: if there is a discernible pattern (funnel shape in the case of heteroscedasticity), you can transform the response (for example with log or square root) or add non-linear or interaction terms to improve the fit.

  • Plot studentized residuals (each residual divided by its estimated standard error): any points with absolute value greater than 3 are possible outliers

  • Variance Inflation Factor (VIF): Regress a predictor onto all other predictors and take 1 / (1 - R-squared) to estimate how much the variance of its coefficient is “inflated” by a linear dependence with other predictors. ISL’s rule of thumb is that a VIF above 5 or 10 indicates problematic collinearity; stricter cutoffs such as 2.5 are also used.

  • Consider confounding variables when testing for causal relationships. A confounding variable influences Y and other predictors, which can lead to seemingly contradictory coefficients. They can be found with VIF and by building baseline models and examining the effect of adding each predictor.

  • Watch for p > n, in which case classical statistical methods break down and R-squared can’t be trusted.
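
The VIF calculation from the list above can be sketched for the simplest case of two predictors, where regressing one on the other gives an R-squared equal to their squared correlation. The variable names and numbers are made up.

```python
def r_squared_simple(x, y):
    """R-squared from regressing y on a single predictor x (the squared correlation)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF of x1 when x2 is the only other predictor: 1 / (1 - R^2 of x1 on x2)."""
    return 1.0 / (1.0 - r_squared_simple(x2, x1))


# Hypothetical predictors: limit tracks income closely, age does not.
income = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]
limit = [2.1, 3.4, 5.2, 6.3, 8.1, 9.4]
age = [60.0, 25.0, 40.0, 33.0, 52.0, 28.0]

vif_collinear = vif_two_predictors(income, limit)
vif_unrelated = vif_two_predictors(income, age)
```

With more than two predictors the idea is the same, but the R-squared comes from a multiple regression of each predictor on all the others.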

• • •

IV. Classification

Bayes Classifier
Bayes rule is a way to update our beliefs (B) based on the arrival of new evidence (E).

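The Bayes classifier assigns each observation to the class with the highest conditional probability given its predictors. It’s considered an unattainable gold standard because in real life we don’t know the conditional probabilities. The Bayes error rate plays the role of the irreducible error.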
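
Bayes rule itself is short enough to write out: P(B | E) = P(E | B) P(B) / P(E). A small worked example, using hypothetical numbers for a diagnostic test (the function and argument names are mine):

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(B | E) = P(E | B) P(B) / P(E), with P(E) expanded over B and not-B."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% base rate, 95% true positive rate, 5% false positive rate.
p_condition_given_positive = posterior(prior=0.01, likelihood=0.95, likelihood_given_not=0.05)
```

Even with a fairly accurate test, the posterior stays well below 50% here because the prior is so small.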

K-Nearest Neighbors
If we draw a circle around the K nearest neighbors of a point, we put that point in the class that a majority of those neighbors are in. KNN often performs worse than parametric methods such as linear or logistic regression when there are a large number of predictors, because with many dimensions even the nearest neighbors tend to be far away (this is “the curse of dimensionality”).
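
A bare-bones version of the majority vote, assuming squared Euclidean distance and a tiny made-up training set (the function name is illustrative):

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label); returns the majority label of the k nearest."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data with two well-separated classes.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.3), "a"),
    ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b"),
]
near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (4.0, 4.0))
```

An odd K avoids ties when there are two classes.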

Logistic Regression
To get the probability that a sample is in a given class, we fit an S curve (“sigmoid”) to our data (instead of a line, like linear regression) to keep values between 0 and 1. We take the sigmoid of the weighted sum of the input features. Because the sigmoid is non-linear, there is no closed-form least squares solution for the weights, so we instead minimize the log-loss (i.e. cross-entropy, equivalent to maximizing the likelihood) with an iterative method such as gradient descent. The term “logits” refers to the raw predictions that a classification model generates before a function transforms them into probabilities.
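
Here is a minimal sketch of that fitting loop for one predictor, assuming plain batch gradient descent with a fixed learning rate. The data, learning rate, and step count are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Average cross-entropy between labels ys and predicted probabilities."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def fit_logistic(xs, ys, lr=0.1, n_steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the log-loss: (predicted probability - label) times the input.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: larger x mostly means class 1, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

The quantity w * x + b is the logit, and sigmoid(w * x + b) is the predicted probability of class 1.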

Linear Discriminant Analysis (LDA)
This models the distribution of the predictors separately in each of the response classes (a density function per class), then uses Bayes’ theorem to flip them into estimates of the posterior, P(Y=k|X=x). We’d use LDA over logistic regression when

  • classes are well-separated.

  • the distribution of the predictors is approximately normal.

  • there are more than 2 response classes.

Quadratic Discriminant Analysis (QDA) is like LDA but assumes each class has its own covariance matrix. We’d use this when the training set is very large (so classifier variance is not a concern) or the assumption of a shared covariance matrix is untenable.

Classification-specific Performance Metrics

  • ROC: from communications theory, “receiver operating characteristic”; the curve shows the true positive rate against the false positive rate across all classification thresholds

  • True positive rate = TP / (TP + FN) = sensitivity = recall

  • False positive rate = FP / (FP + TN) = 1 - specificity

  • Precision = TP / (TP + FP), i.e. true positives over all predicted positives
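
These rates all come from the same four confusion-matrix counts. A small sketch, assuming binary labels coded as 0/1 and a made-up set of predictions at one threshold:

```python
def classification_rates(y_true, y_pred):
    """Recall, false positive rate, and precision from binary (0/1) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
rates = classification_rates(y_true, y_pred)
```

Sweeping the threshold and plotting recall against the false positive rate at each step traces out the ROC curve.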

• • •

V. Regressions

Linear regression assumes a linear relationship between X and Y. The most common approach is to measure closeness of fit with least squares. The sections below cover alternatives and extensions to least squares.

Subset selection

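Fit a separate least squares regression for each combination of the p predictors and pick the best one (best subset selection). When that is too expensive, stepwise selection adds predictors one at a time (forward), removes them one at a time (backward), or does a hybrid of the two.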


Regularization

Where least squares is high variance, regularization improves model performance by shrinking the weights of less important predictors. This should be applied after standardizing the predictors so that they are all on the same scale (each with a standard deviation of 1).

Ridge regression: minimizes RSS + a shrinkage penalty on the squared l2 norm of the coefficients. A larger tuning parameter means a larger penalty. In the Bayesian view, it assumes the coefficients are normally distributed around 0.

Lasso regression: penalizes the l1 norm of the coefficients, which lets us shrink some of them to exactly 0 (it implicitly assumes many coefficients are truly 0) and so performs variable selection.
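
The difference between the two penalties is easiest to see in the special case of a single centered predictor with no intercept, where both have closed-form solutions. This is a sketch under that assumption, with made-up data and `lam` standing in for the tuning parameter:

```python
def ridge_coef(x, y, lam):
    """Minimizes sum((y - b*x)^2) + lam * b^2 for one predictor, no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimizes sum((y - b*x)^2) + lam * |b|; the solution soft-thresholds sum(x*y)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy > 0 else -1.0
    return sign * shrunk / sxx


# Hypothetical centered data with a least squares slope of about 1.98.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 1.9, 4.0]

ridge_small, ridge_large = ridge_coef(x, y, 10.0), ridge_coef(x, y, 50.0)
lasso_small, lasso_large = lasso_coef(x, y, 10.0), lasso_coef(x, y, 50.0)
```

Ridge keeps shrinking the coefficient toward 0 without ever reaching it, while lasso sets it to exactly 0 once the penalty is large enough.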

Dimension Reduction Methods
ISL explores 2 methods

  • Principal component regression (PCR): Often a small number of principal components explain most of the variability in the data. PCR regresses the response on only that subset of components, which effectively acts as a regularization procedure and reduces overfitting.

  • Partial least squares (PLS): a supervised alternative to PCR. The first direction weights each predictor by the coefficient from the simple linear regression of Y onto that predictor. Subsequent directions are found the same way after taking residuals. PLS can reduce bias, but since it also has the potential to increase variance, in practice PCR and PLS often perform about the same.


Splines

Splines are functions defined piecewise by polynomials that improve model flexibility. A regression spline fits separate low-degree polynomials over different regions of X, joined smoothly at points called knots. Because splines can have high variance at the outer range of the predictors (when X is very small or very large), we can use natural splines, which add the constraint that the function be linear at the boundaries.

Separately, smoothing splines place a knot at every observation and then regularize (smooth the fit) by adding a roughness penalty term.

Generalized Additive Models (GAMs)

GAMs apply a non-linear function to each predictor and are fit by backfitting: repeatedly updating the fit for each predictor while holding the others fixed. Since the model is additive, we can examine the effect of each predictor on Y individually.

• • •

VI. Trees

Trees stratify or segment the predictor space into a number of regions with the goal of minimizing RSS. The splitting rules can be summarized in a tree. Pros: explainable, mirror human decision-making, can be displayed graphically, easily handle qualitative predictors. Cons: lower prediction accuracy, not robust (a small change in the data can lead to a very different tree).

Regression Trees

Recursive, binary, greedy splitting for tree with lowest RSS. The predicted response is the mean of training observations in terminal node. The basic method here is to grow a large tree, apply cost-complexity pruning to get best subtrees by alpha, and then use k-fold CV to choose alpha from average predicted error, returning the subtree with chosen value of alpha.
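
The greedy step at the heart of this is a search for the one split that lowers RSS the most. A sketch for a single predictor, assuming candidate thresholds halfway between consecutive distinct values (the function name and data are made up):

```python
def best_split(xs, ys):
    """Return (threshold, rss) for the binary split on one predictor with the lowest RSS."""

    def rss(values):
        # RSS of a region when it predicts the mean of its training responses.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
threshold, split_rss = best_split(xs, ys)
```

A full tree repeats this search inside each resulting region, across all predictors, until a stopping rule is hit.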

Classification Trees

These are similar to regression trees but the predicted response is the most commonly occurring class instead of the mean. Measures of node purity (how much of a node belongs to its most common class) include the Gini index and entropy, which tend to take similar values.
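
Both purity measures are functions of the class proportions in a node. A quick sketch, using the natural log for entropy and made-up labels:

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: sum over classes of p * (1 - p)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def entropy(labels):
    """Entropy: minus the sum over classes of p * log(p)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]
```

Both are 0 for a pure node and largest when the classes are evenly mixed.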

Ensemble Methods

Trees benefit from ensemble methods, where many models are combined into one prediction.

  • Bagging: takes repeated bootstrap samples of the training data, fits a tree to each, and averages the predictions (or takes a majority vote for classification).

  • Random Forest: decorrelates the bagged trees by considering only a random sample of predictors (normally about the square root of p) at each split, so that one strong predictor does not dominate every tree.

  • Boosting: grows trees sequentially, fitting each new tree to the residuals of the current model. This results in a slow improvement in areas where the model does not do well.
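
Of the three, boosting is the most procedural, so here is a rough sketch of it for regression. It assumes one predictor with at least two distinct values, single-split trees (stumps), and a fixed shrinkage rate; the names, data, and settings are made up.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        rss = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or rss < best[0]:
            best = (rss, t, lm, rm)
    return best[1:]


def boost(xs, ys, n_trees=50, shrinkage=0.1):
    """Fit stumps one after another, each to the residuals left by the previous ones."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        # Add only a shrunken version of the new stump, so the model learns slowly.
        preds = [p + shrinkage * (lm if x < t else rm) for x, p in zip(xs, preds)]
    return stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical step-like data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 6.1, 5.9]
_, preds_5 = boost(xs, ys, n_trees=5)
_, preds_100 = boost(xs, ys, n_trees=100)
```

Training error keeps falling as trees are added, which is why the number of trees is usually chosen by cross-validation rather than by training fit.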

• • •

VII. Support Vector Machines (SVMs)

SVMs are useful for supervised learning problems where we suspect non-linear relationships. Here we separate the classes with a hyperplane, using a kernel function to implicitly map observations into a higher-dimensional feature space where a linear boundary works. The SVM can also be thought of as minimizing the “hinge loss”, where the model penalizes misclassified points AND correctly classified points that fall inside the margin (ones the model is not confident in).
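
The hinge loss is a one-liner, assuming labels coded as +1/-1 and a signed score f(x) whose magnitude reflects distance from the hyperplane:

```python
def hinge_loss(y, score):
    """y is +1 or -1; score is the signed classifier output f(x)."""
    return max(0.0, 1.0 - y * score)


confident_correct = hinge_loss(1, 2.0)    # outside the margin, no penalty
unsure_correct = hinge_loss(1, 0.5)       # right side, but inside the margin
misclassified = hinge_loss(1, -1.0)       # wrong side of the hyperplane
```

Only points with y * f(x) below 1 contribute, which matches the idea that a handful of observations near the boundary drive the fit.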

Some vocabulary:

  • Margin: the minimum perpendicular distance from the training observations to the hyperplane

  • Support vectors: the training observations closest to the hyperplane; they are the ones that determine its location

  • Hyperplane: flat affine subspace of dimension p - 1 (p = number of predictors)

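  • Support Vector Classifier. This allows slack variables for letting observations be on the wrong side of the margin / hyperplane. There is a tuning parameter C that acts as a budget for these violations, where a large C means many observations may be on the wrong side. (This is ISL’s convention; some libraries, such as scikit-learn, define C the other way around, as a penalty on violations.)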

  • Support Vector Machine. The ~machine~ here enlarges the feature space in a tractable way through kernel methods. The kernel measures similarity between any pair of observations. There are linear, polynomial, and radial kernels.
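
The three kernels named above can be sketched in a few lines. The forms follow ISL's definitions; the default `degree` and `gamma` values are arbitrary choices for illustration.

```python
import math


def linear_kernel(a, b):
    """Inner product of the two observations."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2):
    """(1 + inner product) raised to the given degree."""
    return (1 + linear_kernel(a, b)) ** degree


def radial_kernel(a, b, gamma=0.5):
    """exp(-gamma * squared Euclidean distance); near 0 for far-apart points."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


u, v = (1.0, 2.0), (3.0, 4.0)
k_linear = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v)
k_radial_same = radial_kernel(u, u)
```

The radial kernel is the most local of the three, since only training observations close to a test point have much influence on its prediction.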

• • •

Thanks for reading! Let me know if you think this could benefit from other stats concepts.
