Stats Fundamentals
Notes on reading An Introduction to Statistical Learning (ISL)
I. Overview
I like to think of ISL as a survey of toolkits across different modeling approaches for a) fitting a model and b) examining model fit. Below are notes on ISL’s classification, regression, tree, and support vector machine toolkits.
Why estimate a model, f? The two main reasons are
 Inference: How does our response Y change with our predictors X? Is the relationship between Y and X linear or more complex?
 Prediction: Can we predict Y with X while minimizing reducible error?
How do we estimate f?
 Parametric methods: we assume a functional form for f and estimate its parameters (also called weights or coefficients) by fitting the model to training data
Bias-Variance Tradeoff
Evaluating model fit often means considering the bias-variance tradeoff. Bias is the error from approximating a real-life problem with a simpler model; variance is the amount the fitted model would change if we estimated it on a different training set. As you reduce one, the other tends to increase.
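For reference, ISL writes the expected test error at a point as the sum of three terms. The symbols here follow the book's convention rather than anything defined in these notes: x_0 is a test point, \hat{f} is the fitted model, and \epsilon is the irreducible noise.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}(\epsilon)
```

The first two terms are the reducible error; the last term is the floor that no model can get under.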
II. Unsupervised Learning
When we don’t have responses (i.e. labels, dependent variables) available, we can still look at the structure of the data to learn patterns and subgroups/clusters.
Clustering (i.e. segmentation) Approaches
 K-Means: specify the number of clusters K, assign each observation to exactly one (non-overlapping) cluster, and iteratively reduce the within-cluster variance, which is the objective function being minimized.
 Hierarchical Clustering (bottom-up/agglomerative clustering): Starting at the bottom of a tree, where each observation is its own cluster, fuse the two clusters that are most similar and work upwards.
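As a rough illustration of the K-Means loop described above, here is a minimal sketch in plain Python. The function name `kmeans` and the choice to seed centroids with the first K points are assumptions made for this example; practical implementations normally use random or smarter initialization and several restarts.

```python
import math

def kmeans(points, k, max_iters=100):
    """Minimal K-Means (Lloyd's algorithm) on a list of numeric tuples."""
    # Assumption: seed the centroids with the first k points.
    # Practical implementations use random or k-means++ initialization.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Each pass can only lower (or keep) the within-cluster variance, which is why the loop settles, though possibly at a local rather than global optimum.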
Principal Component Analysis (PCA)
PCA can be used to reduce the dimensionality (i.e. number of predictors) of a dataset by keeping only the principal components that account for most of the proportion of variance explained (PVE). The first principal component is the normalized linear combination of the features that has the largest variance, which often makes it a good low-dimensional summary of the data. A loading vector holds the weights of a principal component; the first loading vector defines the direction in the feature space along which the data varies the most.
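To make the loading vector and PVE concrete, here is a small sketch that finds the first principal component with power iteration on the sample covariance matrix. The function name, the starting vector, and the toy data are assumptions for illustration; the sketch only centers the features, whereas in practice they are usually also scaled to unit standard deviation, and a library eigen-solver or SVD would replace the hand-rolled iteration.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (loading_vector, pve) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]

    def cov_times(v):
        return [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]

    # Assumption: this start vector is not orthogonal to the top eigenvector.
    v = [float(j + 1) for j in range(p)]
    for _ in range(iters):
        w = cov_times(v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # keep the loading vector at unit length

    eigenvalue = sum(vi * wi for vi, wi in zip(v, cov_times(v)))
    total_variance = sum(cov[a][a] for a in range(p))
    return v, eigenvalue / total_variance

rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
loading, pve = first_principal_component(rows)
```

On this toy data the two features move together, so a single component captures nearly all of the variance.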
III. Supervised Learning — Examining Quality of Fit
ISL covers a few ways to determine how “good” the model fit is.
Residual Standard Error (RSE)
Measures lack of fit. It estimates the standard deviation of the
residuals (our model error), so it is roughly the average amount by
which the response deviates from the fitted line.
R-squared
Measures the proportion of variance in the response that the regression
explains (on a scale of 0–100%). R-squared on the training data never
decreases when we add predictors, because each one lets us better
(over)fit the data. To guard against this, we can omit predictors that
produce only a relatively small increase in R-squared. The fit metrics
in this section are built from RSS, the residual sum of squares, and
TSS, the total sum of squares; RSE is the residual standard error.
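Written out, the standard definitions ISL uses for these quantities are as follows. The symbols are chosen here for illustration: n is the number of observations, p the number of predictors, \hat{y}_i the fitted value, and \bar{y} the sample mean of the response.

```latex
\begin{gather*}
\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{TSS} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \\
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}, \qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
\end{gather*}
```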
Testing the Null Hypothesis
The null hypothesis assumes no relationship between X and Y. If the
data give strong enough evidence against it, we reject the null.

t-statistic: Number of standard errors that our estimated coefficient is from 0.

p-value: The probability of observing a t-statistic at least as large in absolute value as the one we got, assuming the null hypothesis is true. A small p-value means this probability is low, and we can reject the null hypothesis.

F-statistic: With many predictors we want this metric, because each individual p-value will fall below .05 about 5% of the time by chance even when no predictor matters. Unlike a p-value, it is a large F-statistic (well above 1) that lets us reject the null hypothesis that all coefficients are zero.
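The statistics above can be computed by hand for a one-predictor regression. This is a sketch under that assumption (the function name and toy data are made up for the example); turning the t- or F-statistic into a p-value needs the t or F distribution, which is left out here to stay within the standard library.

```python
import math

def simple_ols_stats(x, y):
    """Fit y = b0 + b1 * x by least squares and return fit statistics."""
    n, p = len(x), 1  # p = number of predictors
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - p - 1))

    se_slope = rse / math.sqrt(sxx)   # standard error of the slope
    t_stat = slope / se_slope         # standard errors away from 0
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    return {"slope": slope, "rse": rse, "r2": 1 - rss / tss,
            "t": t_stat, "f": f_stat}

stats = simple_ols_stats([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

With a single predictor the F-statistic equals the square of the t-statistic, so the two tests agree; they only diverge in usefulness once there are several predictors.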
Confidence vs. Prediction Intervals
Confidence intervals measure uncertainty in an estimated parameter, such
as an average response. If we draw 100 random samples of our data and
build a 95% interval from each, we expect about 95 of those intervals
to contain the true parameter. Prediction intervals tell us
in what range a future individual observation will fall.
With CI you can say “I am X% confident that the mean value falls in range (A,B)”. With PI you can say “I am X% confident that the next Y with a given X will be in the range (A,B)”.
Because a PI predicts an individual value, it must also account for the irreducible error, so at the same confidence level it will always be wider than the CI.
Debugging Tricks

Heteroscedasticity: non-constant variance in the error terms.

Plot residuals by time: this could show correlation in the error terms (common in time-series data). In this case, the estimated standard errors will underestimate the true standard errors.

Plot residuals by fitted values: a discernible pattern suggests a poor fit. A curved pattern points to nonlinearity, which a nonlinear transformation of the predictors or an interaction term can address; a funnel shape points to heteroscedasticity, which a concave transformation of the response (such as log Y) can reduce.

Plot studentized residuals (each residual divided by its estimated standard error): any points with absolute value greater than 3 are possible outliers.

Variance Inflation Factor (VIF): Regress a predictor onto all other predictors and take 1 / (1 - R-squared) to estimate how much the variance of its coefficient is “inflated” by linear dependence on the other predictors. ISL's rule of thumb is that a VIF above 5 or 10 signals problematic collinearity; a stricter cutoff such as 2.5 is sometimes used.

Consider confounding variables when testing for causal relationships. A confounding variable influences Y and other predictors, which can lead to seemingly contradictory coefficients. They can be found with VIF and by building baseline models and examining the effect of adding each predictor.

Watch for p > n (more predictors than observations), in which case classical least squares methods break down and R-squared can’t be trusted.
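The VIF calculation from the list above can be sketched for the simplest case, where there is only one other predictor to regress on. That restriction, the function names, and the toy data are assumptions for this example; with more predictors the inner regression becomes a multiple regression.

```python
def r_squared(x, y):
    """R-squared from a simple linear regression of y onto x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss

def vif_two_predictors(x1, x2):
    """VIF of x1 when x2 is the only other predictor in the model."""
    # Regress x1 onto x2, then apply 1 / (1 - R^2).
    return 1.0 / (1.0 - r_squared(x2, x1))

x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)  # correlation 0.8, so R^2 = 0.64
```

A VIF of 1 means the predictor is uncorrelated with the others; the value grows without bound as the predictors approach perfect collinearity.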
IV. Classification
Bayes Classifier
Bayes’ rule is a way to update our beliefs (B) based on the arrival of
new evidence (E).
The Bayes classifier assigns each observation to its most probable class given the predictors. It’s considered an unattainable gold standard because in real life we don’t know the true conditional probabilities. The Bayes error rate is the irreducible error rate.
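In the belief/evidence notation used above, Bayes' rule reads as follows, where P(B) is the prior belief, P(E | B) is the likelihood of the evidence under that belief, and P(B | E) is the updated (posterior) belief:

```latex
P(B \mid E) = \frac{P(E \mid B)\, P(B)}{P(E)}
```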
K-Nearest Neighbors
If we draw a circle around the K nearest neighbors of a point, we put
that point in the class that a majority of those neighbors are in. This
tends to perform worse than a parametric method such as linear
regression when there are a large number of predictors, because in high
dimensions a point has few truly nearby neighbors (this is
considered “the curse of dimensionality”).
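The majority-vote rule is short enough to sketch directly. The function name, Euclidean distance, and the toy data are assumptions for this example; real use would also scale the features and choose K by cross-validation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tie, Counter keeps the class that was seen first (the closer one).
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
```

Note that there is no training step: all the work happens at prediction time, which is part of why the method struggles as the data grows in size or dimension.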
Logistic Regression
To get the probability that a sample is in a given class, we fit an S
curve (“sigmoid”) to our data (instead of a line, as in linear
regression) to keep values between 0 and 1. We take the sigmoid of the
weighted sum of the input features. Because of the sigmoid, there is no
closed-form solution for the weights, so we minimize the log loss
(i.e. the cross-entropy function) with gradient descent, which
iteratively updates the weights. The term “logits” refers to the raw
predictions that a classification model generates before a function
transforms them to probabilities.
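Here is a minimal sketch of that training loop: batch gradient descent on the average log loss. The function names, learning rate, epoch count, and toy data are assumptions for illustration, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and intercept b by gradient descent on the log loss."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            logit = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = sigmoid(logit) - yi  # gradient of the log loss w.r.t. the logit
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        # Step against the average gradient.
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
```

The quantity `b + w . x` inside `predict_proba` is the logit; applying the sigmoid turns it into a probability.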
Linear Discriminant Analysis (LDA)
This models the distribution of the predictors separately in each of the
response classes (a density function), then uses Bayes’ theorem to flip
them into estimates of the posterior, P(Y = k | X = x). We’d use LDA over
logistic regression when

the classes are well-separated, which makes logistic regression estimates unstable.

the distribution of the predictors is approximately normal in each class and the sample is small.

there are more than two response classes.
Quadratic Discriminant Analysis (QDA) is like LDA but assumes each class has its own covariance matrix. We’d use this when the training set is very large (so classifier variance is not a concern) or the assumption of a shared covariance matrix is untenable.
Classification-specific Performance Metrics

ROC: from communications theory, “receiver operating characteristic”; the curve shows the true positive rate against the false positive rate across all classification thresholds

True positive rate = sensitivity = recall = TP / (TP + FN)

False positive rate = 1 - specificity = FP / (FP + TN)

Precision = TP / (TP + FP), i.e. true positives divided by all predicted positives
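These rates all come from the four cells of the confusion matrix. A small sketch, assuming labels are encoded as 1 (positive) and 0 (negative) and that each denominator is non-zero; the function name and example labels are made up for illustration.

```python
def classification_rates(y_true, y_pred):
    """Compute recall, false positive rate, and precision for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "recall": tp / (tp + fn),     # true positive rate / sensitivity
        "fpr": fp / (fp + tn),        # false positive rate
        "precision": tp / (tp + fp),  # share of predicted positives that are right
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
rates = classification_rates(y_true, y_pred)
```

An ROC curve is traced by recomputing `recall` and `fpr` at every threshold used to turn predicted probabilities into 0/1 labels.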
V. Regressions
Linear regression assumes a linear relationship between X and Y. The most common approach is to measure closeness of fit with least squares. Alternatives and extensions to least squares are covered in the following sections.
Subset selection
Best subset selection fits a separate least squares regression for each possible combination of the p predictors. Stepwise selection (forward, backward, or a hybrid) instead adds or removes one predictor at a time, which is far cheaper when p is large.
Shrinkage/Regularization
Where least squares is high variance, regularization improves model performance by shrinking the coefficients of less important predictors toward zero. This should be applied after standardizing the predictors so that they all have the same scale (a standard deviation of 1).
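Ridge regression: minimizes RSS plus a shrinkage penalty on the squared l2 norm of the coefficients. A larger tuning parameter means a larger penalty. From a Bayesian view, it assumes the coefficients are normally distributed around 0.
Lasso regression: uses an l1 norm penalty, which lets us shrink some coefficients exactly to 0 and so performs variable selection. From a Bayesian view, it assumes a prior sharply peaked at 0, so many coefficients are expected to be (near) 0.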
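The difference between the two penalties is easiest to see with a single predictor, where both have closed-form solutions. This sketch assumes one centered predictor with an unpenalized intercept, penalties written as lam * beta**2 (ridge) and lam * |beta| (lasso) added to RSS, and no standardization step; the function name and data are hypothetical.

```python
import math

def shrunken_slopes(x, y, lam):
    """Return (ols, ridge, lasso) slope estimates for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    ols = sxy / sxx
    # Ridge: the l2 penalty inflates the denominator, shrinking proportionally.
    ridge = sxy / (sxx + lam)
    # Lasso: the l1 penalty soft-thresholds, which can hit exactly zero.
    lasso = math.copysign(max(abs(sxy) - lam / 2.0, 0.0), sxy) / sxx
    return ols, ridge, lasso

x = [-2, -1, 0, 1, 2]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]
ols, ridge, lasso = shrunken_slopes(x, y, lam=10)
```

Raising `lam` far enough drives the lasso estimate to exactly 0, while the ridge estimate only approaches 0 without reaching it.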
Dimension Reduction Methods
ISL explores two methods:

Principal component regression (PCR): Often a small number of principal components explains most of the variability in the data. PCR regresses the response on only that subset of components, which effectively acts as a regularization procedure and reduces overfitting.

Partial least squares (PLS): a supervised alternative to PCR. For the first direction, each predictor's weight is the coefficient from a simple linear regression of Y onto that predictor. Subsequent directions are found the same way after replacing each predictor with its residuals from a regression on the previous direction. PLS can reduce bias but also has the potential to increase variance, so in practice its overall benefit is often about the same as PCR's.
Splines
Splines are functions defined piecewise by polynomials that improve model flexibility. A regression spline fits separate low-degree polynomials over different regions of X, joined smoothly at points called knots. Because splines can have high variance at the outer range of the predictors (when X is very small or very large), we can use a natural spline, which adds the constraint that the function be linear beyond the boundary knots.
Separately, smoothing splines place a knot at every observation and then regularize (smooth) the fit by adding a roughness penalty term.
Generalized Additive Models (GAMs)
GAMs apply a nonlinear function to each predictor and add the results together. They can be fit by backfitting, which repeatedly updates the fit for each predictor while holding the others constant. Since the model is additive, we can examine the effect of each predictor on Y individually.
VI. Trees
Trees stratify or segment the predictor space into a number of regions, chosen (for regression) with the goal of minimizing RSS. The splitting rules can be summarized in a tree. Pros: explainable, mirror human decision-making, can be displayed graphically, easily handle qualitative predictors. Cons: lower prediction accuracy, not robust (a small change in the data can lead to a very different tree).
Regression Trees
Recursive binary splitting grows the tree greedily: at each step it picks the split that gives the lowest RSS. The predicted response is the mean of the training observations in the terminal node. The basic method here is to grow a large tree, apply cost-complexity pruning to get a sequence of best subtrees indexed by alpha, use k-fold CV to choose alpha by average prediction error, and return the subtree for the chosen value of alpha.
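A single greedy step of that procedure can be sketched for one predictor: try every candidate cut point and keep the one with the lowest total RSS. The function name, the "x < s goes left" convention, and the toy data are assumptions for this example; a full tree would apply the same search recursively within each region and across all predictors.

```python
def best_split(x, y):
    """Return (threshold, rss) for the single split of x that minimizes RSS."""
    def rss(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)  # a region predicts its mean response
        return sum((v - mean) ** 2 for v in values)

    best = None
    # Candidate thresholds: every distinct x value except the smallest,
    # so that both sides of the split are non-empty.
    for s in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (s, total)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, split_rss = best_split(x, y)
```

The search is greedy because it commits to the best split now without checking whether a worse split here would enable better splits further down.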
Classification Trees
These are similar to regression trees, but the predicted response is the most commonly occurring class in the terminal node instead of the mean. Measures of node purity (how much a node is dominated by a single class) include the Gini index and entropy, which tend to take similar values.
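Both purity measures are functions of the class proportions within a node. A short sketch, with function names chosen for this example and the natural log used for entropy (ISL's convention; base 2 is also common and only rescales the values):

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: sum over classes of p_k * (1 - p_k) = 1 - sum of p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy: negative sum over classes of p_k * log(p_k)."""
    n = len(labels)
    return -sum((count / n) * math.log(count / n)
                for count in Counter(labels).values())

pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed, which is why they lead to similar splits.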
Ensemble Methods
Trees benefit from ensemble methods, where multiple learning algorithms are combined.

Bagging: takes repeated bootstrap samples of the training set, fits a tree to each, and averages the predictions (regression) or takes a majority vote (classification).

Random Forest: decorrelates the bagged trees by considering only a random sample of predictors (normally about the square root of p) at each split, so that one strong predictor does not dominate every tree.

Boosting: grows trees sequentially, fitting each new tree to the residuals of the current model. This slowly improves the fit in areas where the model does not do well.
VII. Support Vector Machines (SVMs)
SVMs are useful for supervised learning problems where we suspect nonlinear relationships. Here we separate the data with a hyperplane and apply a kernel function to implicitly map observations to a high-dimensional feature space. The SVM can also be thought of as minimizing the “hinge loss”, where the model penalizes misclassified points AND correct points that the model is not confident in.
Some vocabulary:

Margin: smallest perpendicular distance from the training observations to the hyperplane

Support vectors: the training observations closest to the hyperplane; they alone determine its location

Hyperplane: flat affine subspace of dimension p - 1 (p = number of predictors)

Support Vector Classifier. This uses slack variables to let some observations be on the wrong side of the margin or hyperplane. In ISL's formulation, the tuning parameter C is a budget for these violations, so a large C tolerates more observations on the wrong side. Many software libraries define C the other way around, as a penalty on violations.

Support Vector Machine. The “machine” here enlarges the feature space in a tractable way through kernel methods. The kernel measures similarity between any pair of observations. There are linear, polynomial, and radial kernels.
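To make the hinge loss and the kernels concrete, here is a small sketch. The function names and default parameter values are assumptions for illustration; labels are taken to be +1 or -1, and `score` stands for the signed output of the classifier for one observation.

```python
import math

def hinge_loss(y, score):
    """Zero only when the point is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * score)

def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(a, b, degree=2):
    return (1 + linear_kernel(a, b)) ** degree

def radial_kernel(a, b, gamma=1.0):
    # Similarity decays with squared Euclidean distance between the points.
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

confident_correct = hinge_loss(1, 2.0)   # outside the margin: no penalty
unsure_correct = hinge_loss(1, 0.5)      # right side but inside the margin
misclassified = hinge_loss(-1, 0.5)      # wrong side: larger penalty
```

The middle case is the one the text highlights: the point is classified correctly but still penalized because it sits inside the margin.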
Thanks for reading! Let me know if you think this could benefit from other stats concepts at ashe.magalhaes@gmail.com.