# Information Systems

Copyright © 2014 EMC Corporation. All rights reserved.

Advanced Analytics – Theory and Methods

Module 4: Analytics Theory/Methods

During this lesson the following topics are covered:

• Technical description of a logistic regression model

• Common use cases for the logistic regression model

• Interpretation and scoring with the logistic regression model

• Diagnostics for validating the logistic regression model

• Reasons to Choose (+) and Cautions (-) of the logistic regression model

Lesson 4b: Logistic Regression

The topics covered in this lesson are listed.

Logistic Regression

• Used to estimate the probability that an event will occur as a function of other variables
  Example: the probability that a borrower will default as a function of credit score, income, the size of the loan, and existing debts

• Can be considered a classifier as well
  Assign the class label with the highest probability

• Input variables can be continuous or discrete

• Output: a set of coefficients that indicate the relative impact of each driver
  A linear expression for predicting the log-odds of the outcome as a function of the drivers (binary classification case)
  The log-odds are easily converted to the probability of the outcome

We use logistic regression to estimate the probability that an event will occur as a function of other variables. An example is the probability that a borrower will default as a function of credit score, income, loan size, and current debts.

Logistic regression can also be considered a classifier. We will discuss classifiers in the next lesson; recall also the discussion in lesson 1 of this module (Clustering). Classifiers assign a class label (default or no_default), here by choosing the class with the highest probability.

In logistic regression input variables can be continuous or discrete. The output is a set of coefficients that indicate the relative impact of each of the input variables.

In the binary classification case (true/false) the output also provides a linear expression for predicting the log-odds of the outcome as a function of the drivers. The log-odds can be converted to the probability of the outcome, and many packages do this conversion in their outputs automatically.
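
The conversion between log-odds and probability described above can be sketched in a few lines of Python. This is a minimal illustration, not the output of any particular package; the function names are made up for this example.

```python
import math

def log_odds_to_probability(log_odds):
    """Invert the logit: map a log-odds value to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large magnitudes.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def probability_to_log_odds(p):
    """The logit: the log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for z in (-2.0, 0.0, 2.0):
    print(z, round(log_odds_to_probability(z), 3))
```

A log-odds of 0 corresponds to a probability of 0.5; positive log-odds give probabilities above 0.5 and negative log-odds give probabilities below it.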

Logistic Regression Use Cases

• The preferred method for many binary classification problems
  Especially if you are interested in the probability of an event, not just predicting the “yes or no”
  Try this first; if it fails, then try something more complicated

• Binary classification examples:
  The probability that a borrower will default
  The probability that a customer will churn

• Multi-class example:
  The probability that a politician will vote yes/vote no/not show up to vote on a given bill

Logistic regression is the preferred method for many binary classification problems.

Two examples of a binary classification problem are shown in the slide above. Other examples:

• true/false

• approve/deny

• respond to medical treatment/no response

• will purchase from a website/no purchase

• likelihood Spain will win the next World Cup

The third example on the slide, “The probability that a politician will vote yes/vote no/not show up to vote on a given bill”, is a multi-class problem. For simplicity, this lesson discusses only binary problems (such as loan default).

Logistic regression is especially useful if you are interested in the probability of an event, not just in predicting the class label. For a binary classification problem, logistic regression is usually the first model to try. Only if it does not work well are models such as GAMs (generalized additive models), Support Vector Machines, and ensemble methods tried (these models are out of scope for this course).

Logistic Regression Model – Example

• Training data: default is 0/1
  default = 1 if the loan defaulted

• The model will return the probability that a loan with given characteristics will default

• If you only want a “yes/no” answer, you need a threshold
  The standard threshold is 0.5

The slide shows an example: “Probability of Default”.

Default (the output of this model) is defined as a function of credit score, income, loan amount, and existing debt.

The training data represents default as either 0 or 1, where default = 1 if the loan defaulted.

Fitting and scoring the logistic regression model returns the probability that a loan with given values of the input variables will default.

If only a yes/no answer is desired, a threshold must be set on the probability to return the class label. The standard threshold is 0.5.
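
Scoring a loan and turning the probability into a class label might look like the sketch below. The coefficient values, the variable names, and the units (thousands of dollars) are hypothetical and chosen only for illustration; they are not taken from the lesson's model.

```python
import math

# Hypothetical coefficients, for illustration only (not from the lesson).
COEFFICIENTS = {
    "credit_score": -0.01,
    "income_thousands": -0.02,
    "loan_thousands": 0.03,
    "existing_debt_thousands": 0.04,
}
INTERCEPT = 2.0

def default_probability(loan):
    """Score one loan: linear expression -> log-odds -> probability."""
    log_odds = INTERCEPT + sum(COEFFICIENTS[name] * loan[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(probability, threshold=0.5):
    """Return the class label 1 (default) when the score reaches the threshold."""
    return 1 if probability >= threshold else 0

loan_a = {"credit_score": 700, "income_thousands": 60,
          "loan_thousands": 20, "existing_debt_thousands": 10}
loan_b = {"credit_score": 450, "income_thousands": 20,
          "loan_thousands": 50, "existing_debt_thousands": 40}

for loan in (loan_a, loan_b):
    p = default_probability(loan)
    print(round(p, 4), classify(p))
```

Keeping the threshold as a parameter makes it easy to move away from 0.5 when false positives and false negatives have different costs.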

Logistic Regression – Visualizing the Model

Overall fraction of default: ~20%

Logistic regression returns a score that estimates the probability that a borrower will default

The graph compares the distribution of defaulters and non-defaulters as a function of the model’s predicted probability, for borrowers scoring higher than 0.1

Blue=defaulters

This is an example of how one might visualize the model. Logistic regression returns a score that estimates the probability that a borrower will default. The graph compares the distributions of defaulters and non-defaulters as a function of the model’s predicted probability, for borrowers scoring higher than 0.1 and lower than 0.98.

The graph is overlaid: think of the blue graph (defaulters) as being transparent and “in front of” the red graph (non-defaulters).

The takeaway from the graph is that the higher a borrower scores, the more likely, empirically, that the borrower will default.

The graph only considers borrowers who score > 0.1 and < 0.98 because the full graph has large spikes near 0 and 1 that make it hard to read. We can see, however, that a fraction of low-scoring borrowers do actually default (the overlap).

Technical Description (Binary Case)

• y=1 is the case of interest: ‘TRUE’

• LHS is called logit(P(y=1))
  Hence, “logistic regression”

• logit(P(y=1)) is inverted by the sigmoid function
  Standard packages can return the probability for you

• Categorical variables are expanded as with linear regression

• Iterative solution to obtain coefficient estimates, denoted b_j
  “Iteratively re-weighted least squares”

$$\ln\frac{P(y=1)}{1-P(y=1)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{p-1} x_{p-1}$$

The quantity on the LHS (left-hand side) is the log-odds. We first compute the odds, that is, the ratio of the probability that y equals 1 to the probability that y does not equal 1, and then take the log of this ratio. In logistic regression the log-odds are equal to a linear, additive combination of the drivers. The LHS is called logit(P(y=1)), hence the name logistic regression. The inverse of the logit is the sigmoid function, whose output is the actual probability. Standard packages give this inverse as a standard output.

Categorical variables are expanded exactly as in linear regression. The coefficient estimates, denoted b_j, are computed by a least-squares-like procedure implemented as iteratively re-weighted least squares, which refines the estimates with every iteration until they converge.

Logistic regression shares many of the problems of the OLS method, and the computational cost increases with more input variables and with categorical variables that have many levels.
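
To make the iterative fit concrete, here is a minimal sketch of iteratively re-weighted least squares for the simplest case: an intercept plus one input variable, so each weighted least-squares step reduces to a 2x2 linear system. The toy data and function names are assumptions made for this example; standard packages handle many variables, convergence checks, and edge cases far more carefully.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_irls(xs, ys, max_iter=25, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1 * x by iteratively re-weighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x           # current log-odds
            p = sigmoid(eta)            # current probability
            w = p * (1.0 - p)           # weight for this row
            wz = w * eta + (y - p)      # weight times the working response
            s_w += w
            s_wx += w * x
            s_wxx += w * x * x
            s_wz += wz
            s_wxz += wz * x
        # Solve the 2x2 weighted least-squares normal equations.
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1

# Toy data (made up): the two classes overlap, so finite estimates exist.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_irls(xs, ys)
print(b0, b1)
```

If the classes were perfectly separated by x, the estimates would grow without bound; that is the "infinite magnitude coefficients" situation discussed under the diagnostics.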

Interpreting the Estimated Coefficients, b_j

• Invert the logit expression:

$$\frac{P(y=1)}{1-P(y=1)} = \exp(b_0)\,\exp(b_1 x_1)\,\exp(b_2 x_2)\cdots\exp(b_{p-1} x_{p-1})$$

• exp(b_j) tells us how the odds of y=1 change for every unit change in x_j

• Example: b_creditScore = -0.69

• exp(b_creditScore) = 0.5 = 1/2

• For the same income, loan, and existing debt, the odds of default are halved for every point increase in credit score

• Standard packages return the significance of the coefficients in the same way as in linear regression

If we invert the logit expression shown in the slide, we obtain the odds as a product of exponentials: exp(b_0) times one factor exp(b_j x_j) for each driver.

The exponent of the first coefficient, exp(b_0), represents the odds of the outcome in the “reference situation”: the situation in which all the continuous variables are set to zero and the categorical variables are at their reference level.

It follows that exp(b_j) tells us how the odds of y=1 change for every unit change in x_j: the odds are multiplied by exp(b_j).

Suppose b_creditScore = -0.69. Then exp(-0.69) ≈ 0.5 = 1/2.

This means that, for the same income, loan amount, and existing debt, the odds of default are cut in half for every one-point increase in credit score. The negative sign of the coefficient indicates a negative relationship between credit score and the probability of default: a higher credit score implies a lower probability of default.

Significance of the credit score is returned in the same way as in linear regression. So you should look for very low “p” values.
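
The arithmetic behind the credit score example can be checked directly. The sketch below uses the coefficient from the slide; the starting probability of 0.20 is an assumed value, picked only to show that it is the odds, not the probability itself, that are halved.

```python
import math

b_credit_score = -0.69                       # coefficient from the slide's example
odds_multiplier = math.exp(b_credit_score)   # about 0.50

def odds_from_probability(p):
    return p / (1.0 - p)

def probability_from_odds(odds):
    return odds / (1.0 + odds)

# Assumed starting point: a borrower with a 20% probability of default.
p_before = 0.20
odds_before = odds_from_probability(p_before)   # 0.25
odds_after = odds_before * odds_multiplier      # about 0.125
p_after = probability_from_odds(odds_after)     # about 0.111

print(round(odds_multiplier, 3), round(odds_after, 3), round(p_after, 3))
```

A one-point increase in credit score halves the odds from 0.25 to about 0.125, which moves the probability from 0.20 to roughly 0.11 rather than to 0.10.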

An Interesting Fact About Logistic Regression

“The probability mass equals the counts”

• If 20% of our loan risk training set defaults:
  The sum of all the training set scores will be 20% of the number of training examples

• If 40% of applicants with income < $50,000 default:
  The sum of all the training set scores of people in this income category will be 40% of the number of examples in this income category

“Logistic regression preserves summary statistics of the training data.” In other words, logistic regression is a very good way of concisely describing the probabilities of all the different possible combinations of features in the training data.

Two examples of this property are shown in the slide. If you sum up everybody’s score after putting the training set through the model, the total equals the number of positive examples (here, defaults) in the training set. The same holds within a category that is an input to the model.

What this means is that the model acts almost like a continuous probability look-up table. Assume that all the variables are categorical and that you have a table of the probability of every possible combination of variables; logistic regression is a concise version of that table. This is what is meant by a “well calibrated” model.
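
A short derivation shows where this property comes from. It assumes the coefficients are fit by unregularized maximum likelihood and that the model includes an intercept; the symbols p_i (the score of training example i) and x_{ij} (the value of variable j for example i) are notation introduced here for the sketch.

```latex
% Log-likelihood of the training data, with \eta_i the log-odds of example i
\ell(b) = \sum_i \left[ y_i \eta_i - \ln\left(1 + e^{\eta_i}\right) \right],
\qquad \eta_i = b_0 + \sum_j b_j x_{ij},
\qquad p_i = \frac{1}{1 + e^{-\eta_i}}

% At the maximum, the derivative with respect to every coefficient is zero
\frac{\partial \ell}{\partial b_j} = \sum_i (y_i - p_i)\, x_{ij} = 0

% Intercept (x_{i0} = 1 for every example): scores sum to the number of positives
\sum_i p_i = \sum_i y_i

% 0/1 indicator variable for a category C (x_{ij} = 1 only for examples in C)
\sum_{i \in C} p_i = \sum_{i \in C} y_i
```

The last line is the income-category example from the slide: it applies when membership in the category is itself one of the model's 0/1 input variables.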

Reference: http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/

Diagnostics

• Hold-out data: Does the model predict well on data it hasn’t seen?

• N-fold cross-validation: Formal estimate of generalization error

• “Pseudo-R²”: 1 – (deviance/null deviance)
  Deviance and null deviance are both reported by most standard packages
  The fraction of “variance” that is explained by the model
  Used the way R² is used

This is all very similar to linear regression. We use the hold-out data method, and N-fold cross validation on the fitted model. This is exactly what we did with linear regression to determine if the model predicts well.
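
The bookkeeping behind N-fold cross-validation can be sketched as below. The function name is made up for this example, and the sketch assumes the rows are already in random order; the model fitting and scoring that would happen inside the loop are left out.

```python
def n_fold_splits(n_rows, n_folds):
    """Yield (train_indices, holdout_indices) pairs, one pair per fold.

    Every row lands in the hold-out set of exactly one fold.
    Assumes the rows are already shuffled.
    """
    indices = list(range(n_rows))
    for fold in range(n_folds):
        holdout = indices[fold::n_folds]      # every n_folds-th row, offset by fold
        holdout_set = set(holdout)
        train = [i for i in indices if i not in holdout_set]
        yield train, holdout

for train, holdout in n_fold_splits(10, 3):
    # Fit on the `train` rows and evaluate on the `holdout` rows here.
    print(len(train), holdout)
```

Averaging the hold-out error over the folds gives the estimate of generalization error mentioned on the slide.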

Pseudo-R² is the term used in logistic regression, and we use it the same way we use R² in linear regression: it is, roughly, the fraction of the “variance” explained by the model.

Deviance, for the purposes of this discussion, is analogous to “variance” in linear regression.

The null deviance is the deviance (or “error”) that you would get if you always assumed that the probability of true was simply the global probability. The model should explain more than this simple guess.

1 – (deviance/null deviance) is the “fraction” that defines pseudo-R², a measure of how well the model explains the data.
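
As a sketch of the calculation, the snippet below computes the deviance as -2 times the log-likelihood of the 0/1 outcomes under the predicted probabilities, and the null deviance by predicting the global rate for everyone. The labels and scores are made-up values, and the code assumes every predicted probability is strictly between 0 and 1.

```python
import math

def deviance(ys, ps):
    """-2 * log-likelihood of the 0/1 outcomes under the predicted probabilities."""
    return -2.0 * sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p) for y, p in zip(ys, ps)
    )

def pseudo_r2(ys, ps):
    """1 - (deviance / null deviance)."""
    global_rate = sum(ys) / len(ys)
    null_deviance = deviance(ys, [global_rate] * len(ys))
    return 1.0 - deviance(ys, ps) / null_deviance

# Made-up outcomes and model scores, for illustration only.
ys = [1, 0, 0, 1, 0]
ps = [0.8, 0.2, 0.1, 0.7, 0.3]
print(round(pseudo_r2(ys, ps), 3))
```

A model that always predicts the global rate scores 0; the closer the value is to 1, the more of the null deviance the model explains.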

Diagnostics (Cont.)

• Sanity check the coefficients
  Do the signs make sense?
  Are the coefficients excessively large?

Wrong sign is an indication of correlated inputs, but doesn’t necessarily affect predictive power.

Excessively large coefficient magnitudes may indicate strongly correlated inputs; you may want to consider eliminating some variables, or using regularized regression techniques.

Infinite magnitude coefficients could indicate a variable that strongly predicts a subset of the output (and doesn’t predict well on the rest).

▪ Try a Decision Tree on that variable, to see if you should segment the data before regressing.

The sanity checks are exactly the same as what we discussed in linear regression.

Once we determine the fit is good we need to perform the sanity checks. Logistic regression is an explanatory model and the coefficients provide the required details.

First check the signs of the coefficients. Do the signs make sense? For example, if income is expected to increase with age or years of education, the corresponding coefficients should be positive. If not, there might be something wrong; a wrong sign is often an indicator that the variables are correlated with each other. Regression works best if all the drivers are independent. Correlated inputs do not necessarily affect the predictive power, but the explanatory capability is compromised.

We also need to check whether the magnitudes of the coefficients make sense. They can sometimes become excessively large, which is also an indication of strongly correlated inputs. In this case consider eliminating some variables or using a regularized regression technique; many standard packages offer regularized (penalized) logistic regression.

Sometimes you may get infinite magnitude coefficients, which could indicate that a variable strongly predicts a certain subset of the output and does not predict well on the rest. For example, there may be a range of age for which the output is perfectly predicted. In such conditions, plot the output against the input and determine the segment at which the prediction goes wrong; we should then segment the data before fitting the model. A Decision Tree can be used on that variable to see if you should segment the data before regressing.

Diagnostics: ROC Curve

Area under the curve (AUC) tells you how well the model predicts. (Ideal AUC = 1)

For logistic regression, the ROC curve can help set the classifier threshold

Logistic models do very well at predicting class probabilities; but if you want to use them as a classifier you have to set a threshold. For a given threshold, the classifier will give false positives and false negatives. False positive rate (fpr) is the fraction of negative instances that were misclassified.

False negative rate (fnr) is the fraction of positive instances that were misclassified. True positive rate (tpr) = 1 – fnr

The ROC (Receiver Operating Characteristic) curve plots (fpr, tpr) as the threshold is varied from 0 (the upper right-hand corner) to 1 (the lower left-hand corner).

As the threshold is raised, the false positive rate decreases, but the true positive rate decreases, too.

The ideal classifier (only true instances have probability near 1) would trace the upper left triangle of the unit square: as the threshold increases, fpr decreases without lowering tpr.

Usually, ROC curves are only used to evaluate prediction quality – how close the AUC is to 1. But they can also be used to set thresholds; if you have upper bounds on your desired fpr and fnr, you can use the ROC curve (or more accurately, the software that you use to plot the ROC curve) to give you the range of thresholds that meet those constraints.

For logistic regression, the ROC curve can help set the classifier threshold.
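
The construction of the ROC curve and its AUC can be sketched as follows. The labels and scores are made up, the function names are hypothetical, and the sketch assumes both classes are present; it recounts positives at every threshold for clarity rather than speed.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points as the threshold is lowered from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up labels (1 = true instance) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
points = roc_points(labels, scores)
print(points)
print(auc(points))
```

Each point corresponds to one threshold, so reading the list alongside the sorted scores shows which thresholds satisfy given upper bounds on fpr and fnr (recall that fnr = 1 – tpr).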

An excellent primer on ROC is available in the following reference:

http://home.comcast.net/~tom.fawcett/public_html/papers/ROC101.pdf

Diagnostics: Plot the Histograms of Scores

good separation

The next diagnostic method is plotting the histograms of the scores. The graph in the top half is what we saw earlier in the lesson. The graph tells us how well the model discriminates true instances from false instances. Ideally, true instances score high and false instances score low. If so, most of the mass of the two histograms is separated. That is what you see in the graph at the top.

The graph shown at the bottom shows substantial overlap. The model did not predict well. This means the input variables are not strong predictors of the output.
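
A text-only version of this diagnostic is sketched below: it bins the scores of each class separately so the two histograms can be compared side by side. The scores are made-up values and the function name is hypothetical; a plotting library would normally be used instead.

```python
def score_histogram(scores, n_bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        index = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[index] += 1
    return counts

# Made-up scores, split by the true outcome.
defaulter_scores = [0.35, 0.62, 0.71, 0.85, 0.93]
non_defaulter_scores = [0.05, 0.12, 0.18, 0.27, 0.41, 0.66]

true_counts = score_histogram(defaulter_scores, n_bins=5)
false_counts = score_histogram(non_defaulter_scores, n_bins=5)

for i, (t, f) in enumerate(zip(true_counts, false_counts)):
    low, high = i / 5, (i + 1) / 5
    print(f"{low:.1f}-{high:.1f}  defaulters {'#' * t:<5}  non-defaulters {'#' * f}")
```

Good separation shows up as the defaulter counts concentrated in the high bins and the non-defaulter counts in the low bins, with few bins where both are large.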

Reasons to Choose (+)

• Explanatory value: relative impact of each variable on the outcome, in a more complicated way than linear regression

• Robust with redundant variables, correlated variables
  Lose some explanatory value

• Concise representation with the coefficients

• Easy to score data

• Returns good probability estimates of an event

• Preserves the summary statistics of the training data
  “The probabilities equal the counts”

Cautions (-)

• Does not handle missing values well

• Assumes that each variable affects the log-odds of the outcome linearly and additively
  Variable transformations and modeling variable interactions can alleviate this
  A good idea to take the log of monetary amounts or any variable with a wide dynamic range

• Cannot handle variables that affect the outcome in a discontinuous way
  Step functions

• Doesn’t work well with discrete drivers that have a lot of distinct values
  For example, ZIP code

Logistic Regression – Reasons to Choose (+) and Cautions (-)

Logistic regression has explanatory value: we can readily determine how each variable affects the outcome, although the interpretation is a little more complicated than in linear regression. It is robust with redundant and correlated variables; in that case prediction is not impacted, but the fitted model loses some explanatory value. Logistic regression provides a concise representation of the outcome through the coefficients, and it is easy to score data with the model. It returns good probability estimates of an event. It is also a calibrated model: it preserves the summary statistics of the training data.

The cautions (-) are that logistic regression does not handle missing values well. It assumes that each variable affects the log-odds of the outcome linearly and additively, so if some variables affect the outcome non-linearly, or the relationships are not actually additive, the model does not fit well.

Variable transformations and modeling variable interactions can address this to some extent. It is recommended to take the log of monetary amounts or of any variable with a wide dynamic range. Logistic regression cannot handle variables that affect the outcome in a discontinuous way; we discussed the issue of infinite magnitude coefficients earlier, where the prediction is inconsistent across ranges. Also, when discrete drivers have a large number of distinct values, the model becomes complex and computationally inefficient.
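
The transformations mentioned above could be prepared along these lines before fitting. The feature names and the choice of interaction term are hypothetical, intended only to show the shape of the preprocessing; which transformations help depends on the data.

```python
import math

def prepare_features(income, loan_amount, existing_debt):
    """Log-transform monetary amounts and add one interaction term.

    Assumes income and loan_amount are positive; existing_debt may be zero.
    """
    log_income = math.log(income)
    log_loan = math.log(loan_amount)
    log_debt = math.log1p(existing_debt)  # log(1 + debt), so zero debt maps to 0
    return {
        "log_income": log_income,
        "log_loan": log_loan,
        "log_debt": log_debt,
        # Hypothetical interaction: lets the effect of loan size depend on debt.
        "log_loan_x_log_debt": log_loan * log_debt,
    }

print(prepare_features(income=50000, loan_amount=20000, existing_debt=0))
```

The transformed columns then replace the raw monetary amounts as inputs, so the model is linear and additive in the logs rather than in the raw dollar values.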

Check Your Knowledge

1. What is a logit and how do we compute class probabilities from the logit?

2. How is the ROC curve used to diagnose the effectiveness of the logistic regression model?

3. What is pseudo-R² and what does it measure in a logistic regression model?

4. How do you describe a binary class problem?

5. Compare and contrast linear and logistic regression methods.

Your Thoughts?

Record your answers here.

During this lesson the following topics were covered:

• Technical description of a logistic regression model

• Common use cases for the logistic regression model

• Interpretation and scoring with the logistic regression model

• Diagnostics for validating the logistic regression model

• Reasons to Choose (+) and Cautions (-) of the logistic regression model

Lesson 4b: Logistic Regression – Summary

This lesson covered these topics. Please take a moment to review them.
