class: center, middle, inverse, title-slide

# DATA1001
## Classification
### Peter Straka
### Week 10, 2018-10-01

---

## Statistical Learning

Supervised learning is a branch of statistical learning, where some variables are considered **"inputs"** `\(\mathbf x\)` (or predictors, or regressors) and some other variable is considered an **"output"** (or response, or dependent variable) `\(y\)`.

The aim is to **learn** a function `\(f\)` which maps `\(\mathbf x\)` to a good approximation `\(f(\mathbf x)\)` of `\(y\)`.

![](index_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## Example: Simple linear regression

$$
y = f(x) + u, \text{ where } f(x) = \beta_0 + \beta_1 x
$$

and where `\(u\)` is the residual of the approximation.

```r
library(ISLR); head(Carseats)
```

```
  Sales CompPrice Income Advertising Population Price ShelveLoc Age
1  9.50       138     73          11        276   120       Bad  42
2 11.22       111     48          16        260    83      Good  65
3 10.06       113     35          10        269    80    Medium  59
4  7.40       117    100           4        466    97    Medium  55
5  4.15       141     64           3        340   128       Bad  38
6 10.81       124    113          13        501    72       Bad  78
  Education Urban  US
1        17   Yes Yes
2        10   Yes Yes
3        12   Yes Yes
4        14   Yes Yes
5        13   Yes  No
6        16    No Yes
```

---

# Classification

Above, the output `\(y\)` was a **continuous** random variable. For classification problems, the output `\(y\)` is a **categorical** random variable.

Consider the `Default` dataset:

```r
head(Default)
```

```
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559
```

---

## Linear regression, categorical output

Let's try to predict the `default` variable. Why don't we just code `No` with 0 and `Yes` with 1, and fit a linear regression?

--

![](index_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Logistic regression, categorical output

Logistic regression models the _probability_ that a binary response variable `\(Y\)` with possible values `\(0\)` and `\(1\)` takes the value `\(Y=1\)`.

--

![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## The Exponential function

`\(x \mapsto \exp(x) = e^x\)` where `\(e \approx 2.71\ldots\)` is Euler's number.

![](index_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

Properties:

--

`\(\exp(x) > 0\)`; `\(\frac{d}{dx} \exp(x) = \exp(x)\)`; grows faster than `\(x^k\)` for any `\(k\)`; `\(\exp(x+y) = \exp(x) \exp(y)\)`

---

## The Natural Logarithm

The natural logarithm is the inverse function of the exponential function:

$$
\log(\exp(x)) = x = \exp(\log(x))
$$

![](index_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

Properties:

--

Defined only for `\(x>0\)`; negative if `\(0<x<1\)`; `\(\log(ab) = \log(a) + \log(b)\)`; `\(\log(1/a) = -\log(a)\)`; grows slower than any `\(x^k\)` where `\(k>0\)`

---

## The Logistic Model

First: as in linear regression, compute the _score_ `\(Z\)` for a given predictor `\(\mathbf x\)`:

$$
Z = \beta_0 + \beta_1 x_1 + \ldots + \beta_m x_m
$$

Second: map the score to the probability `\(\mathbf P(Y=1)\)` via the **logistic function** (or "sigmoid" function):

$$
\mathbf P(Y=1) = \sigma(Z) = \frac{\exp(Z)}{1 + \exp(Z)}
$$

![](index_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
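As a quick sanity check, here is a minimal sketch (the helper name `sigmoid` is our own) that maps a few scores to probabilities:

```r
# The logistic (sigmoid) function maps any score to a probability in (0, 1)
sigmoid <- function(z) exp(z) / (1 + exp(z))
sigmoid(c(-4, 0, 4))
```

```
[1] 0.01798621 0.50000000 0.98201379
```

Large negative scores give probabilities near 0, large positive scores give probabilities near 1, and a score of 0 gives probability 0.5.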
---

## Predicting with probabilities

Suppose that for each of the next three days, you predict rain with probability 0.5, whereas BOM predicts rain with probability 0.83. It turns out that two of the three days were rainy. Whose was the better prediction?

--

Recall how we judged the quality of predicting outcomes `\(y_n\)` with model outputs `\(f(\mathbf x_n)\)` for linear regression:

--

Mean squared error:

$$
{\rm MSE} = \frac{1}{N} \sum_{n=1}^N (y_n - f(\mathbf x_n))^2
$$

--

In machine learning, _model error_ is commonly called "loss", and MSE is the most common loss function for continuous output variables.

> We need a loss function for categorical output variables.

---

## Logistic loss

Let's write `\(p := \sigma(Z)\)` for the probability computed by the logistic model. The logistic loss (also called "binary cross entropy" in deep learning) for a single outcome is defined as follows:

* If `\(y=1\)`, then `\(L(y,p) = -\log p\)`
* If `\(y=0\)`, then `\(L(y,p) = -\log (1-p)\)`.

--

A shorter way to write this is

$$
L(y,p) = -y\log p - (1-y) \log(1-p)
$$

--

Convince yourself why this is a reasonable loss function!

* What loss do you get if `\(Y=1\)` and you predicted `\(Y=1\)` with high probability?

--

* What loss do you get if `\(Y=0\)` and you predicted `\(Y=1\)` with low probability?

--

* What loss do you get if `\(Y=1\)` and you predicted `\(Y=1\)` with low probability?

--

* What loss do you get if `\(Y=0\)` and you predicted `\(Y=1\)` with high probability?
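--

To check your answers, here is a minimal sketch (with `p = 0.9` standing in for "high probability" and `p = 0.1` for "low probability"):

```r
# Logistic loss for a single outcome y with predicted probability p
L <- function(y, p) -y * log(p) - (1 - y) * log(1 - p)
L(1, 0.9)  # y = 1, predicted with high probability: small loss
L(0, 0.1)  # y = 0, predicted with low probability:  small loss
L(1, 0.1)  # y = 1, predicted with low probability:  large loss
L(0, 0.9)  # y = 0, predicted with high probability: large loss
```

```
[1] 0.1053605
[1] 0.1053605
[1] 2.302585
[1] 2.302585
```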
---

## Back to weather prediction accuracy

We calculate the mean logistic loss for the two predictions:

```r
outcome <- c(1,0,1)
log_loss <- function(pred, outcome)
  -sum(log(pred) * outcome + log(1-pred) * (1-outcome)) / length(pred)
sapply(list(rep(0.5, 3), rep(0.83, 3)), function(x) log_loss(x, outcome))
```

```
[1] 0.6931472 0.7148720
```

Measured by logistic loss, with 2 out of 3 rainy days, the three 50% predictions are better than the three 83% predictions.

---

## What would have been the "best" prediction?

We plot the loss against the prediction probabilities `\((p, p, p)\)`, where `\(p\)` ranges from 0.4 to 0.99:

--

```r
library(dplyr)    # for %>%
library(ggplot2)
xx <- seq(from=0.4, to=0.99, length.out = 500)
yy <- sapply(xx, function(x) log_loss(rep(x, 3), outcome))
x_min <- xx[which.min(yy)]
data.frame(pred=xx, loss=yy) %>%
  ggplot(aes(x=pred, y=loss)) + geom_line() + geom_vline(xintercept = x_min)
```

![](index_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

```r
x_min
```

```
[1] 0.6672144
```

The loss is minimized at `\(p \approx 2/3\)`: the observed frequency of rainy days.

---

## Fitting a logistic regression model

Going back to the credit card default dataset: We consider the logistic model

$$
P({\rm default}=\text{Yes}) = \sigma(Z) = \sigma(\beta_0 + \beta_1 X_1)
$$

where `\(X_1\)` is the `balance` variable. To "fit" a logistic regression model is the same as "estimating" the parameters `\(\beta_0\)` and `\(\beta_1\)`:

--

It means to find the values `\(\hat \beta_0\)` and `\(\hat \beta_1\)` that minimize the logistic loss

$$
\sum_{n=1}^N L(y_n, \sigma(\beta_0 + \beta_1 x_{n1}))
$$

where `\(x_{n1}\)` is the balance of the `\(n\)`-th customer.

--

By the way: The negative logistic loss is known as the log-likelihood (logarithm of the likelihood) in statistics. Minimizing loss means maximizing likelihood. Maximum likelihood is the bread and butter of statisticians.

---

## Let's try minimizing loss by hand

```r
sigmoid <- function(z) {exp(z) / (1+exp(z))}
score <- function(x, beta0, beta1) {beta0 + beta1 * x}
logistic_model <- function(x, beta0, beta1) {sigmoid(score(x, beta0, beta1))}
beta0 <- seq(from=-15, to=-5, length.out = 100)
loss <- sapply(beta0, function(beta0) {
  pred <- logistic_model(Default$balance, beta0, 0.005)
  log_loss(pred, Default$default == "Yes")})
data.frame(beta0=beta0, loss=loss) %>% ggplot(aes(x=beta0, y=loss)) + geom_line()
```

![](index_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---

```r
beta1 <- seq(from=0.004, to=0.006, length.out = 100)
loss <- sapply(beta1, function(beta1) {
  pred <- logistic_model(Default$balance, -10, beta1)
  log_loss(pred, Default$default == "Yes")
})
data.frame(beta1=beta1, loss=loss) %>% ggplot(aes(x=beta1, y=loss)) + geom_line()
```

![](index_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## Optimizing loss

In this course, we won't go into the details of how to find the parameters that minimize loss. But as an aside, we mention:

* The loss has a unique minimum, attained at the coefficient estimates `\(\hat \beta_k\)`
* A generalization of [Newton's Method](https://en.wikipedia.org/wiki/Newton%27s_method) to more than one variable is used to find this minimum.
* This is implemented in most statistical packages, e.g. in R's `glm` function.

---

## The function `glm`

```r
log_reg <- glm(formula = default~balance, family = "binomial", data = Default)
log_reg
```

```
Call:  glm(formula = default ~ balance, family = "binomial", data = Default)

Coefficients:
(Intercept)      balance
 -10.651331     0.005499

Degrees of Freedom: 9999 Total (i.e. Null);  9998 Residual
Null Deviance:      2921
Residual Deviance: 1596        AIC: 1600
```

(Aside: _Deviance_ is used in statistics for `\(2 \times\)` loss. Divided by `\(2\)` and by `\(N = 10000\)`, the residual deviance equals the minimal loss we "computed" above.)

---

## Odds and log-odds

* The probability of the event `\(Y=1\)` is `\(\mathbf P(Y=1)\)`.
* The **odds** of `\(Y=1\)` are
`$$\frac{\mathbf P(Y=1)}{\mathbf P(Y\neq 1)}$$`
* The **log-odds** of `\(Y=1\)` are defined as the logarithm of the odds.

Example: Say `\(P(Y=1) = 0.9\)`. Then `\(P(Y\neq 1) = 0.1\)`, and the odds of `\(Y=1\)` are `\(0.9/0.1=9\)`. The log-odds of `\(Y=1\)` are `\(\log 9 \approx 2.197\)`.

Think of odds and log-odds just as different scales for probability: as `\(\mathbf P(Y=1)\)` moves from `\(0\)` to `\(1\)`, the odds range from `\(0\)` to `\(\infty\)`, and the log-odds from `\(-\infty\)` to `\(\infty\)`.

---

## The inverse sigmoid function

The sigmoid function `\(\sigma(z) = \exp(z) / (1+\exp(z))\)`

* maps the _score_ `\(z\)` to the probability `\(\mathbf P(Y=1)\)`
* is increasing, so it has an inverse function, called the **logit** function:
`$${\rm logit}(\sigma(z)) = z = \sigma({\rm logit}(z))$$`

![](index_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

---

## The inverse sigmoid function

Let's calculate the logit function:

--

`$${\rm logit}(p) = \log \frac{p}{1-p}, \quad \text{where } 0 < p < 1$$`

To summarize:

* score `\(\stackrel{\sigma(\,)}{\longrightarrow}\)` probability
* probability `\(\stackrel{\rm logit}{\longrightarrow}\)` log-odds

> So the score and the log-odds are the same thing.
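--

We can verify the inverse relationship numerically; a minimal sketch, reusing the `sigmoid` helper defined earlier:

```r
# logit maps a probability back to its score (the log-odds)
logit <- function(p) log(p / (1 - p))
logit(sigmoid(2))  # recovers the original score
```

```
[1] 2
```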
---

## Interpreting the coefficients

Let's use all the variables in the Default dataset:

```r
log_reg <- glm(formula = default ~ student + balance + income,
               family = "binomial", data = Default)
log_reg$coefficients
```

```
  (Intercept)    studentYes       balance        income
-1.086905e+01 -6.467758e-01  5.736505e-03  3.033450e-06
```

These are our coefficient estimates `\(\hat\beta_0\)`, `\(\hat\beta_1\)`, `\(\hat\beta_2\)`, `\(\hat\beta_3\)`, where `\(X_1\)` is `studentYes`, `\(X_2\)` is `balance` and `\(X_3\)` is `income`.

--

* Think of the intercept as the "baseline" value for the log-odds. It applies when the remaining variables are all 0 (although this may be unrealistic). Here, the intercept is negative, which hints that most cases are non-default.

--

* Being a student decreases the log-odds of default by 0.65.

--

* For every thousand dollars on the balance, the log-odds of default increase by 5.74.

--

* For every million dollars of income, the log-odds of default increase by about 3. (We will revoke this statement later.)

---

## Significance

The sample was random; and since the coefficients were calculated from the sample, they are also random! If we sampled and fitted the model again, we would get different coefficients.

--

The coefficient for `income` is really close to 0. Would it be possible that in (hypothetical) repeated samples it would be sometimes negative, sometimes positive? If so, what would this mean?

--

> If a coefficient fluctuates around 0, we would not think that its variable has any real influence in the model. We say it is not **significant**.

---

## Significance

The key selling point of a statistics degree over machine learning is the focus on _quantifying uncertainty_. In a "proper" statistics class, you will learn how to estimate the distribution of the coefficients `\(\hat \beta_k\)`. (It is bell-shaped; for logistic regression, a normal approximation is used, hence the `z value` column in the output below.)

--

> If `\(\hat \beta_k = 0\)` is a very unlikely value in this distribution, then we will say that the variable `\(X_k\)` is **significant**.

--

> How likely the value 0 is under the distribution of `\(\hat \beta_k\)` is measured by the **p-value**.

Below is the output of

```r
summary(log_reg)
```

---

```
Call:
glm(formula = default ~ student + balance + income, family = "binomial",
    data = Default)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4691  -0.1418  -0.0557  -0.0203   3.7383

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 **
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8
```

---

## Choosing a model

The last column contains the p-value for each coefficient. Indeed, the `income` variable is not significant, so we omit it from the model.

As a model for the probability of default, we choose

`$$\log \frac{\mathbf P(Y=1)}{\mathbf P(Y \neq 1)} = -10.9 -0.65 \times {\tt student} + 0.0057 \times {\tt balance}$$`

---

# Classifying future data

Often, the main point of having a model is to make predictions about the future:

* What will be the probability that a certain customer signing up for a credit card will default on their loan?
  - Use the model output to decide whether or not to approve the loan.
* What will be the probability that the visitor of a web page with certain characteristics clicks a certain link?
  - Use this to decide which customers to target with email advertisements.

In machine learning, the quality of a model is measured by the loss of the model when predicting outcomes based on new data.
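For example, here is a sketch of the first prediction, plugging a hypothetical applicant into our chosen model (a non-student carrying a balance of $1500; the numbers are made up for illustration):

```r
# Hypothetical applicant: compute their score with the chosen coefficients
student <- 0     # not a student
balance <- 1500  # credit card balance in dollars
z <- -10.9 - 0.65 * student + 0.0057 * balance  # score, i.e. log-odds
sigmoid(z)  # predicted probability of default: about 0.087
```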
---

## Train and test data

The performance on new data can be measured by using "test data", or "holdout data":

* Randomly sample a fraction of training data, e.g. 80% of all data. Use these to fit a model.
* The remaining fraction of 20% is used as test data, to evaluate the predictive performance of the model.

```r
N <- dim(Default)[1]
train_part <- 0.8
set.seed(123)
train_idx <- sample(x = 1:N, size = round(train_part * N), replace = FALSE)
log_reg <- glm(formula = default ~ student + balance, family = "binomial",
               data = Default[train_idx, ])
```

---

## Train and test data

Predictions on the test data, and their prediction loss, are easily calculated in R:

```r
preds_test <- predict(object = log_reg, newdata = Default[-train_idx, ],
                      type = "response")
actuals_test <- (Default$default == "Yes")[-train_idx]
log_loss(preds_test, actuals_test)
```

```
[1] 0.0831293
```

Note that the loss for predicting new data is typically higher than the loss on the training set:

```r
preds_train <- predict(object = log_reg, newdata = Default[train_idx, ],
                       type = "response")
actuals_train <- (Default$default == "Yes")[train_idx]
log_loss(preds_train, actuals_train)
```

```
[1] 0.0775589
```

---

class: middle

# Overfitting

The more parameters a model has, the more complex it is. For example, for each variable that enters a linear model, you have one additional parameter.

--

If a model becomes too complex, it "memorizes" the training output, and does a bad job "generalizing" to new data; this is called **overfitting**.

--

If loss is *much* higher on the test set than it is on the training set, then you are likely **overfitting** your data.

--

We will see examples next week with classification and regression trees.

---

## Threshold

The vector `preds_test` contains the probabilities for each test case to lead to default.

> Suppose that these cases were applications for new loans. How would you use the predicted probabilities of default to approve their loans?

--

Let's pick a threshold, and record whether the probability is above or below it:

```r
threshold <- 0.5
high_risk <- preds_test > threshold
high_risk[1:20]
```

```
    2     4     9    14    15    16    20    23    34    37    40    48
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   53    63    65    74    87    94   102   105
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

---

## Confusion matrix

Let's tabulate the cases we deem high risk, and the actual defaults:

```r
table(actuals_test, high_risk)
```

```
            high_risk
actuals_test FALSE TRUE
       FALSE  1928   14
       TRUE     43   15
```

Interpret each of the four fields. What is the cost for the bank in each case?

---

## False positives & false negatives

The predictions of our model contain positives (`high_risk`) and negatives (not `high_risk`).

--

The misclassified cases are called:

* "false positives" (FP) if the test result was positive (`high_risk`) but the `default` variable was "No"
* "false negatives" (FN) if the test result was negative (not `high_risk`) but the `default` variable was "Yes"

Correspondingly, the correctly classified cases are "true positives" (TP) and "true negatives" (TN).

--

**Sensitivity** is defined as TP / (TP + FN); also called the "true positive rate"

**Specificity** is defined as TN / (TN + FP); also called the "true negative rate"

--

> What happens to Sensitivity as you vary the threshold?

--

> What happens to Specificity as you vary the threshold?
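--

One way to explore these questions is to compute both quantities for a few thresholds; a minimal sketch, reusing `preds_test` and `actuals_test` from above (the helper name `sens_spec` and the thresholds are our own choices):

```r
# Sensitivity and specificity of the high-risk classification at a threshold
sens_spec <- function(threshold) {
  pred_pos <- preds_test > threshold
  c(sensitivity = sum(pred_pos & actuals_test) / sum(actuals_test),
    specificity = sum(!pred_pos & !actuals_test) / sum(!actuals_test))
}
sapply(c(0.1, 0.5, 0.9), sens_spec)
# Lowering the threshold flags more cases as high risk:
# sensitivity goes up, specificity goes down (and vice versa).
```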
---

## ROC Curve

As you vary the threshold from low values near 0, to high values near 1, the pair ("Sensitivity", "Specificity") traces a curve in the plane:

```r
library(pROC)  # provides roc()
roc_test <- roc(actuals_test, preds_test)
plot(roc_test)
```

![](index_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

The higher the area under the curve (AUC), the better. For our test data, we have an AUC of

```r
roc_test$auc
```

```
Area under the curve: 0.9314
```

What AUC would we get with completely random guesses?

---

```r
shuffled <- sample(x = preds_test)
shuffled_roc <- roc(actuals_test, shuffled)
plot(shuffled_roc)
```

![](index_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

```r
shuffled_roc$auc
```

```
Area under the curve: 0.5528
```

---

## How to read AUC values

* AUC `\(\approx 0.5\)`: useless model
* AUC `\(\approx 0.7\)`: fair classification performance
* AUC `\(\approx 0.9\)`: good classification performance

---

## Recap

* You have learnt about classification, which is at the heart of many interesting data science problems

--

* You know the basics of logistic regression, the standard tool for classification. (As you become a data scientist, you'll get to know many more tools, e.g. classification trees, gradient boosted trees and neural networks.)

--

* You have seen how to use the function `glm` to fit a model, and how to use the fitted model for predictions on new data. Try to reproduce this yourself.

--

* Predictive performance of your fitted model is best measured on test data, i.e. on data that were not used to fit your model.

--

* You have learnt how to threshold a predicted probability to make actual predictions, and about the trade-off between false positives and false negatives.

---

# References

Chapter 4 in: James, Gareth, et al. *An Introduction to Statistical Learning*. Vol. 112. New York: Springer, 2013.

Probability of Default Estimation: https://en.wikipedia.org/wiki/Probability_of_default#PD_Estimation

Sensitivity and Specificity: https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Statistical significance: Section 3.1.2 in: James, Gareth, et al. *An Introduction to Statistical Learning*. Vol. 112. New York: Springer, 2013.