Stats Fundamentals

October 18, 2020 | ☕️ ☕️ 9 min read

Notes on rereading An Introduction to Statistical Learning (ISL)

I. Overview

I like to think of ISL as a survey of toolkits across different modeling approaches for a) fitting a model and b) examining model fit. Below are notes on ISL’s classification, regression, tree, and support-vector machine toolkits.

*Why estimate a model, f?* The two main reasons are

**Inference:** How does our response Y change with our predictors X? Is Y~X linear or more complex?

**Prediction:** Can we predict Y with X while minimizing reducible error?

*How do we estimate f?*

**Parametric methods:** we assume a functional form and estimate the parameters (also called weights and coefficients) by fitting the model to training data.

**Non-parametric methods:** sometimes called “distribution-free” because they make fewer assumptions, these methods determine the model structure by looking at how closely we can fit the data. Here we need lots of data and it’s easy to overfit.

*Bias-Variance Trade-off*

Evaluating model fit often means considering the bias-variance trade-off, where **bias** is the error from approximating a real-life problem with a simpler model and **variance** is the amount the fitted model would change if we trained it on a different training set. As you improve one, the other tends to degrade.
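For reference, here is the decomposition as I remember it from ISL, for a test point x_0 with response y_0 and error term ε. The notation is my own choice for these notes:

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\varepsilon)
```

The last term is the irreducible error, so the expected test error can never fall below it no matter how we trade off the first two.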

• • •

II. Unsupervised Learning

When we don’t have responses (i.e. labels, dependent variables) available, we can still look at the structure of the data to learn patterns and subgroups/clusters.

*Clustering (i.e. segmentation) Approaches*

**K-Means:** specify the number of clusters K, assign each observation to exactly one (non-overlapping) cluster, then iteratively reassign observations and recompute cluster centroids to reduce the within-cluster variation.

**Hierarchical Clustering (bottom-up/agglomerative clustering):** Starting at the bottom of a tree with each observation as its own cluster, repeatedly fuse the 2 clusters that are most similar and work upwards.
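To make the K-Means loop concrete, here is a minimal sketch in plain Python. The function name `kmeans`, the toy points, and the choice to start from the first K observations are all my own assumptions for illustration (ISL starts from random cluster assignments instead).

```python
def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on a list of (x, y) tuples."""
    # Assumption: start from the first k observations to keep the sketch deterministic.
    centroids = list(points[:k])
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2 + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
```

K-Means only finds a local optimum, so in practice you would rerun it from several starting points and keep the best result.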

*Principal Component Analysis (PCA)*

PCA can be used to reduce the dimensionality (i.e. number of predictors) in a dataset by keeping only the first few principal components, chosen by looking at the proportion of variance explained (PVE) by each component. The first principal component, a normalized linear combination of the features, captures the most variance and is often a good summary of the data. The loading vectors are the weights that define each principal component; the first loading vector gives the direction in the feature space along which the data varies the most.
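Here is a small sketch of PVE and the first loading vector for the special case of two features, where the eigenvalues of the 2x2 covariance matrix have a closed form. The helper name `pca_2d` and the data are made up for illustration, and I only center the features here (in practice you would usually also scale them).

```python
import math


def pca_2d(data):
    """PCA for two features using the closed-form eigenvalues of a 2x2 covariance matrix."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    # The eigenvalues are the variances along the two principal components.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # First loading vector: unit-length eigenvector for the largest eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    loading = (vx / norm, vy / norm)
    # Proportion of variance explained by each component.
    pve = (lam1 / (lam1 + lam2), lam2 / (lam1 + lam2))
    return loading, pve


data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
loading, pve = pca_2d(data)
```

Because these points lie close to a line, nearly all of the variance lands on the first component, which is the situation where dropping the second component loses very little.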

• • •

III. Supervised Learning — Examining Quality of Fit

ISL covers a few ways to determine how “good” the model fit is.

**Residual Standard Error (RSE)**

Measures lack of fit: roughly, the average amount that the response deviates from the regression line. It is an estimate of the standard deviation of the error term.

**R-squared**

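Measures the proportion of variance in the response that the model explains (on a scale of 0–100%). On the training data, R-squared never decreases when we add predictors, because extra predictors let us (over)fit the data more closely. To guard against this, we can omit predictors that give only a relatively small increase in R-squared. The metrics for measuring model fit are defined as follows, where RSS is residual sum of squares, RSE is residual standard error, TSS is total sum of squares, n is the number of observations, and p is the number of predictors:

RSS = Σ (yᵢ − ŷᵢ)²

TSS = Σ (yᵢ − ȳ)²

RSE = sqrt(RSS / (n − p − 1))

R-squared = 1 − RSS / TSS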
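As a quick check on how these metrics fit together, here is a sketch that fits a simple (one-predictor) least squares line and computes RSS, TSS, RSE, and R-squared. The function names and the toy data are my own; with one predictor, RSE divides by n − 2.

```python
import math


def fit_simple_ols(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def fit_metrics(x, y):
    """RSS, TSS, RSE, and R-squared for the simple regression of y on x."""
    intercept, slope = fit_simple_ols(x, y)
    n = len(x)
    y_bar = sum(y) / n
    fitted = [intercept + slope * xi for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - 2))  # n - 2 because two coefficients were estimated
    r_squared = 1 - rss / tss
    return {"rss": rss, "tss": tss, "rse": rse, "r_squared": r_squared}


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_simple_ols(x, y)
metrics = fit_metrics(x, y)
```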

**Testing the Null Hypothesis**

The null hypothesis assumes no relationship between X and Y. If the model captures a real relationship, the data will give us evidence to reject it.

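**T-Statistic:** Number of standard errors that our estimated coefficient is from 0.

**P-value:** The probability of observing a t-statistic at least as large in absolute value as the one we got, assuming the null hypothesis is true (i.e. the true coefficient is 0). A **small p-value** means this probability is low, and we can reject the null hypothesis.

**F-statistic:** Tests whether at least one predictor is related to the response. With many predictors we want this metric because each individual p-value will be less than .05 about 5% of the time by chance, so some predictors will look significant even when none are. Unlike a p-value, it is a **large F-statistic** (well above 1) that means we can reject the null hypothesis.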
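Here is a rough sketch of both statistics for a simple regression with one predictor. The function name and data are invented for illustration. I stop short of the p-value because it needs a t-distribution table or a stats library.

```python
import math


def simple_regression_stats(x, y):
    """t-statistic for the slope and F-statistic for a one-predictor regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - 2))
    # Standard error of the slope, then the number of standard errors from 0.
    se_slope = rse / math.sqrt(sxx)
    t_stat = slope / se_slope
    # F compares explained variation per predictor to unexplained variation.
    p = 1  # number of predictors
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    return t_stat, f_stat


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
t_stat, f_stat = simple_regression_stats(x, y)
```

With a single predictor the F-statistic equals the squared t-statistic, so the two tests agree. The F-statistic earns its keep once there are several predictors.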

**Confidence vs. Prediction Intervals**

Confidence intervals measure uncertainty about a population parameter, such as the mean response. If we took 100 random samples of our data and built a 95% confidence interval from each, we would expect about 95 of those intervals to contain the true parameter. Prediction intervals tell us in what range a future individual observation is likely to fall.

With CI you can say “I am X% confident that the mean value falls in range (A,B)”. With PI you can say “I am X% confident that the next Y with a given X will be in the range (A,B)”.

Because PI predicts an individual value, there is greater uncertainty and so this will always have a wider interval than CI.
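For simple linear regression under the usual assumptions, the two intervals at a point x_0 can be written as follows. This is the standard textbook form rather than something quoted from ISL, and t* denotes the appropriate quantile of the t-distribution with n − 2 degrees of freedom:

```latex
\text{CI:}\quad \hat{y}_0 \pm t^{*}_{n-2}\,\mathrm{RSE}\,
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}

\text{PI:}\quad \hat{y}_0 \pm t^{*}_{n-2}\,\mathrm{RSE}\,
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
```

The extra 1 under the square root accounts for the irreducible error of a single observation, which is why the prediction interval is the wider of the two.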

**Debugging Tricks**

Heteroscedasticity: this means the error terms have non-constant variance.

Plot residual by time: could show a correlation in error terms (occurs in time-series data). In this case, the estimated standard errors will tend to underestimate the true standard errors.

Plot residual by fitted values: if there is a discernible pattern, you can apply a non-linear transformation of the predictors or add an interaction term to improve the fit. A funnel shape suggests heteroscedasticity, which can often be reduced by transforming the response (e.g. taking its log or square root).

Plot studentized residuals (residuals divided by their estimated standard error): any points with an absolute value greater than 3 are possible outliers.

Variance Inflation Factor (VIF): Regress a predictor onto all other predictors and take 1 /(1-R-squared) to estimate how much the variance of its coefficient is “inflated” by a linear dependence with other predictors. ISL’s rule of thumb is that a VIF above 5 or 10 indicates problematic collinearity; stricter cutoffs such as 2.5 are sometimes used.
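A minimal sketch of the VIF calculation, simplified to the case of only two predictors so that the regression of one onto the other needs no matrix algebra. The function names and data are made up. With more predictors you would regress onto all of the others at once.

```python
def r_squared_simple(x, y):
    """R-squared from regressing y on a single predictor x (with an intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


def vif_two_predictors(target, other):
    """VIF of `target` when `other` is the only additional predictor."""
    return 1.0 / (1.0 - r_squared_simple(other, target))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]    # nearly a copy of x1
x3 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]  # uncorrelated with x1

vif_collinear = vif_two_predictors(x1, x2)
vif_independent = vif_two_predictors(x1, x3)
```

A VIF of 1 means no inflation at all, which is the smallest value it can take.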

Consider confounding variables when testing for causal relationships. A confounding variable influences Y and other predictors, which can lead to seemingly contradictory coefficients. They can be found with VIF and by building baseline models and examining the effect of adding each predictor.

Watch for p > n, in which case classical statistical methods break down and R-squared can’t be trusted.

• • •

IV. Classification

**Bayes Classifier**

Bayes rule is a way to update our beliefs (B) based on the arrival of new evidence (E).

The Bayes classifier assigns each observation to its most likely class given its predictor values. It’s considered an unattainable gold standard because in real life we don’t know the true conditional probabilities. The Bayes error rate is the irreducible error rate.
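Written out with the B (belief) and E (evidence) notation from above, Bayes' rule and the resulting classifier look like this. The classifier and error-rate lines follow the form I recall from ISL, with x_0 as the point being classified:

```latex
P(B \mid E) = \frac{P(E \mid B)\,P(B)}{P(E)}

\hat{y}_0 = \arg\max_{k}\; P(Y = k \mid X = x_0)

\text{Bayes error rate} = 1 - E\left[\max_{k} P(Y = k \mid X)\right]
```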

**K-Nearest Neighbors**

To classify a point, we find its K nearest neighbors in the training data (picture a circle drawn around them) and put the point in the class that a majority of those neighbors are in. This can perform worse than a parametric method like linear regression when there are a large number of predictors, because in high dimensions a point’s nearest neighbors tend to be far away, so we effectively have fewer samples per predictor (this is the “curse of dimensionality”).
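A bare-bones version of the majority vote, assuming Euclidean distance. The function name `knn_predict` and the toy data are mine.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance gives the same neighbor ordering as the distance itself.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


train_X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
train_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_X, train_y, (1.1, 1.0))
pred_near_b = knn_predict(train_X, train_y, (5.0, 5.2))
```

A small K gives a flexible, high-variance boundary and a large K gives a smoother, higher-bias one, so K is usually chosen by cross-validation.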

**Logistic Regression**

To get the probability that a sample is in a given class, we fit an S curve (“sigmoid”) to our data (instead of a line, as in linear regression) to keep values between 0 and 1. Specifically, we take the sigmoid of the weighted sum of the input features. To find our weights we can’t reuse squared error, because the sigmoid makes that cost function non-convex. Instead we use the log-loss (i.e. cross-entropy), which is equivalent to maximizing the likelihood, and gradient descent to iteratively update the weights. The term “logits” refers to the raw predictions (log-odds) that a classification model generates before a function such as the sigmoid transforms them to probabilities.
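Here is a sketch of that training loop using batch gradient descent on the log-loss. The names, learning rate, iteration count, and data are all assumptions picked for this toy example, not tuned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Sigmoid of the weighted sum of features (the weighted sum is the logit)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def log_loss(X, y, w, b):
    """Average cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, w, b)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(n_iter):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of the loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
```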

**Linear Discriminant Analysis (LDA)**

This models the distribution of predictors separately in each of the response classes (density function) then uses Bayes’ theorem to flip them into estimates for the posterior, P(Y=k|X=x). We’d use LDA over logistic regression when

classes are well-separated,

the distribution of the predictors is approximately normal,

and there are more than 2 response classes.

**Quadratic Discriminant Analysis** is like LDA but assumes each class has its own covariance matrix. We’d use this when the training set is very large (so classifier variance is not a concern) or the assumption of a shared covariance matrix is untenable.

**Classification-specific Performance Metrics**

ROC: from communications theory, “receiver operating characteristic”, shows the true positive rate against the false positive rate for all classification thresholds

True positive rate = TP / (TP + FN) = sensitivity = recall

False positive rate = FP / (FP + TN) = 1 − specificity

Precision = TP / (TP + FP), i.e. true positives over all predicted positives
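A short sketch of these rates, plus a threshold sweep that produces the points of an ROC curve. The function names, labels, and scores are invented for illustration. The helper assumes both classes are present and at least one positive prediction is made, otherwise it would divide by zero.

```python
def classification_rates(y_true, y_pred):
    """Recall (TPR), false positive rate, and precision from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


def roc_points(y_true, scores):
    """(FPR, TPR) pairs, one per distinct score used as the threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        rates = classification_rates(y_true, y_pred)
        points.append((rates["fpr"], rates["recall"]))
    return points


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05]

rates_at_half = classification_rates(y_true, [1 if s >= 0.5 else 0 for s in scores])
curve = roc_points(y_true, scores)
```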

• • •

V. Regressions

Linear regression assumes a linear relationship between X and Y. The most common approach is to measure closeness of fit with least squares. The sections below cover alternatives and extensions to plain least squares.

**Subset selection**

Best subset selection fits a separate least squares regression for every combination of the p predictors and keeps the best one. Stepwise selection (forward, backward, or a hybrid) is a cheaper alternative that adds or removes one predictor at a time.

**Shrinkage/Regularization**

Where least squares has high variance, regularization improves model performance by reducing the weights of less important predictors. This should be applied after standardizing predictors so that they all have the same scale (a standard deviation of 1).

**Ridge regression:** minimizes RSS plus a shrinkage penalty on the squared size of the coefficients (l2 norm). A larger tuning parameter means a larger penalty. From a Bayesian view, this assumes the coefficients are normally distributed around 0.

**Lasso regression:** uses the l1 norm as the penalty, which lets us shrink some coefficients exactly to 0 and so performs variable selection (the Bayesian view assumes a prior sharply peaked at 0, so many coefficients are expected to be 0).
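To see the difference between the two penalties, here is a sketch for the simplest possible case: one centered predictor, a centered response, and no intercept, where both estimates have a closed form. The function names and data are mine, and I take the penalties to be lam * beta**2 for ridge and lam * abs(beta) for lasso, added to the RSS.

```python
import math


def ridge_coef(x, y, lam):
    """Minimizes sum((y - x*beta)**2) + lam * beta**2 for one predictor."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimizes sum((y - x*beta)**2) + lam * abs(beta) via soft-thresholding."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)  # pull toward 0, stopping at exactly 0
    return math.copysign(shrunk, sxy) / sxx


# Assumption: x and y are already centered, so no intercept is needed.
x = [-1.5, -0.5, 0.5, 1.5]
y = [-0.4, -0.1, 0.2, 0.3]

ols = ridge_coef(x, y, 0.0)          # no penalty recovers least squares
ridge_big = ridge_coef(x, y, 4.0)    # shrunk, but still non-zero
lasso_small = lasso_coef(x, y, 1.0)  # shrunk
lasso_big = lasso_coef(x, y, 4.0)    # exactly zero
```

Ridge divides the coefficient by a bigger number and so never quite reaches 0, while lasso subtracts a fixed amount and clips at 0. That clipping is where the variable selection comes from.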

**Dimension Reduction Methods**

ISL explores 2 methods

Principal component regression (PCR): Often a small number of principal components explain most of the variability in the data. PCR regresses the response on only that subset of principal components, which effectively acts as a regularization procedure and helps avoid overfitting.

Partial least squares (PLS): a supervised alternative to PCR. The first direction weights each predictor by the coefficient from a simple linear regression of Y onto that predictor. Subsequent directions are found the same way after taking residuals. Using the response can reduce bias, but PLS also has the potential to increase variance, so the benefits of PCR vs. PLS are about the same.

**Splines**

Splines are special functions defined piecewise by polynomials that improve model flexibility. A **regression spline** fits separate low-degree polynomials over different regions of X, joined smoothly at points called knots. Because splines can have high variance at the outer range of the predictors (when X is very small or very large), we can use a **natural spline**, which adds the constraint that the function be linear at the boundaries.

Separately, **smoothing splines** place a knot at every observation and then regularize (smooth) the fit by adding a roughness penalty term.

**Generalized Additive Models (GAMs)**

GAMs apply a non-linear function to each predictor and backfit by repeatedly updating the fit for each predictor while holding the others constant. Since the model is additive, we can examine the effect of each predictor on Y individually.

• • •

VI. Trees

Trees stratify or segment the predictor space into a number of regions with the goal of minimizing RSS. The splitting rules can be summarized in a tree.

Pros: explainable, mirror human decision-making, can be displayed graphically, easily handle qualitative predictors. Cons: lower prediction accuracy, not robust (a small change in the data can lead to a very different tree).

**Regression Trees**

Recursive, binary, greedy splitting: at each step we pick the split that gives the lowest RSS. The predicted response is the mean of the training observations in the terminal node. The basic method here is to grow a large tree, apply cost-complexity pruning to get a sequence of best subtrees indexed by the tuning parameter alpha, and then use k-fold CV to choose alpha from the average prediction error, returning the subtree for the chosen value of alpha.
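One greedy step of that splitting process can be sketched for a single predictor as follows. The function names and data are hypothetical. A full tree would repeat this search over every predictor and recurse into each resulting region.

```python
def rss(values):
    """Residual sum of squares around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Cutpoint on x that minimizes RSS(left region) + RSS(right region)."""
    pairs = sorted(zip(x, y))
    best_cut, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_cut, best_rss = cut, total
    return best_cut, best_rss


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.0, 20.0, 21.0, 19.0]
cut, split_rss = best_split(x, y)
```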

**Classification Trees**

These are similar to regression trees but the predicted response is the most commonly occurring class instead of the mean. Measurements of node purity (how many observations in a node belong to its most common class) include the **Gini index** and **entropy**, which tend to take similar values.
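Both purity measures are short enough to write out. This sketch uses the natural log for entropy, and the function names are my own.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of observations in the node that belong to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini index: sum of p * (1 - p) over classes. 0 means a pure node."""
    return sum(p * (1 - p) for p in class_proportions(labels))


def entropy(labels):
    """Entropy: -sum of p * log(p) over classes. 0 means a pure node."""
    return -sum(p * math.log(p) for p in class_proportions(labels))


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]
```

Both are 0 for a pure node and largest when the classes are evenly mixed, which is why they tend to rank candidate splits similarly.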

**Ensemble Methods**

Trees benefit from ensemble methods, where multiple learning algorithms are combined.

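**Bagging:** draws repeated bootstrap samples from the training set, fits a tree to each, and averages the predictions (regression) or takes a majority vote (classification).

**Random Forest:** decorrelates the bagged trees by considering only a random sample of predictors (normally about the square root of p) at each split, so that one strong predictor does not dominate every tree.

**Boosting:** grows trees sequentially, fitting each new tree to the residuals of the current model. This results in a slow improvement in areas where the model does not do well.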

• • •

VII. Support Vector Machines (SVMs)

SVMs are useful for supervised learning problems where we suspect non-linear relationships. Here we apply a kernel function to implicitly map observations to a high-dimensional feature space and separate the data with a hyperplane in that space. The SVM can also be thought of as minimizing the “hinge loss”, where the model penalizes misclassified points AND correctly classified points that the model is not confident in.
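The hinge loss itself is a one-liner. In this sketch I assume labels coded as -1 or +1 and a raw model score f(x), both of which are my own naming choices.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw model score f(x)."""
    return max(0.0, 1.0 - y * score)


confident_correct = hinge_loss(+1, 2.0)  # outside the margin: no penalty
timid_correct = hinge_loss(+1, 0.5)      # right side, but inside the margin
wrong = hinge_loss(+1, -1.0)             # misclassified: largest penalty
```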

Some vocabulary:

**Margin:** smallest perpendicular distance from the training observations to the hyperplane

**Support vectors:** training observations that lie on or within the margin; they alone determine the location of the hyperplane

**Hyperplane:** flat affine subspace of dimension p − 1 (p=number of predictors)

**Support Vector Classifier.** This allows slack variables for letting observations be on the wrong side of the margin / hyperplane. There is a tuning parameter C that acts as a budget for the total amount of violation; in ISL’s formulation, a large C means many observations are allowed on the wrong side of the margin.

**Support Vector Machine**. The ~machine~ here enlarges the feature space in a tractable way through kernel methods. The kernel measures similarity between any pair of observations. There are linear, polynomial, and radial kernels.
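For a feel of what these kernels compute, here is a sketch of the three in the forms I recall from ISL. The function names and the default values for `degree` and `gamma` are arbitrary choices of mine.

```python
import math


def linear_kernel(x, z):
    """Plain inner product; gives back the support vector classifier."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2):
    """(1 + <x, z>) ** degree, which allows a curved decision boundary."""
    return (1 + linear_kernel(x, z)) ** degree


def radial_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared distance): near 1 for close points, near 0 for far ones."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


x = [1.0, 2.0]
z = [3.0, 4.0]
```

The radial kernel is very local: only nearby training observations have much say in how a new point is classified.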

• • •

Thanks for reading! Let me know if you think this could benefit from other stats concepts at ashe.magalhaes@gmail.com.
