Using Logistic Regression to Predict Loan Defaults

- Pentaho

I. Overview

Logistic regression can be used to predict the likelihood of an outcome based on a set of input variables. It is generally used where the dependent variable is binary (dichotomous), meaning it can take only two possible values, such as “Yes or No”, “Default or No Default”, “Living or Dead”, or “Responder or Non-Responder”. The independent variables can be categorical or numerical.

Note that although logistic regression is most often used for binary dependent variables (2 classes), it can also be applied to categorical dependent variables with more than 2 classes, in which case it is called Multinomial Logistic Regression. Logistic regression is based on the logistic function f(y), given by the following equation:

$$f(y) = \frac{e^{y}}{1 + e^{y}}, \quad \text{for } -\infty < y < \infty$$

Note that as y → ∞, f(y) → 1, and as y → -∞, f(y) → 0. The value of the logistic function f(y) therefore rises from 0 toward 1 as y increases, as illustrated in the figure below.
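As a quick numerical check of these limits, the logistic function can be evaluated at a few points. This is a minimal sketch in plain Python; the function name `logistic` and the sample values of y are chosen here for illustration and are not part of the original article.

```python
import math

def logistic(y: float) -> float:
    """Return e^y / (1 + e^y), computed in a numerically stable way."""
    if y >= 0:
        # Equivalent form 1 / (1 + e^-y) avoids overflow for large positive y.
        return 1.0 / (1.0 + math.exp(-y))
    # For negative y, e^y is small, so the original form is safe.
    ey = math.exp(y)
    return ey / (1.0 + ey)

# Illustrative sample points: f(y) approaches 0 on the left and 1 on the right.
for y in (-10, -2, 0, 2, 10):
    print(f"y = {y:>3}  f(y) = {logistic(y):.5f}")
```

At y = 0 the function returns exactly 0.5, the midpoint of its range.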

In logistic regression, y is expressed as a linear function of the input variables:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{p-1} X_{p-1}$$
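To make the linear function concrete, the sketch below computes y for a single observation and passes it through the logistic function. The coefficient values and the two input variables (a debt-to-income ratio and years of credit history) are hypothetical, invented purely for illustration; they are not estimates from any real data.

```python
import math

def linear_predictor(betas, xs):
    """y = beta_0 + beta_1*x_1 + ... ; betas[0] is the intercept."""
    if len(betas) != len(xs) + 1:
        raise ValueError("expected one more coefficient than inputs")
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

def logistic(y):
    """Numerically stable e^y / (1 + e^y)."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    ey = math.exp(y)
    return ey / (1.0 + ey)

# Hypothetical coefficients: intercept, debt-to-income ratio, years of credit history
betas = [-1.0, 4.0, -0.2]
# Hypothetical applicant: debt-to-income ratio 0.5, 5 years of credit history
applicant = [0.5, 5.0]

y = linear_predictor(betas, applicant)  # -1.0 + 4.0*0.5 - 0.2*5.0
p = logistic(y)
print(f"y = {y:.3f}, f(y) = {p:.3f}")
```

A positive coefficient pushes f(y) toward 1 as its input grows, and a negative coefficient pushes it toward 0.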

The equation for y resembles the linear regression model; one difference, however, is that the values of y are not directly observed. Only the outcome, recorded as success or failure, is observed, and f(y) is the probability of success. Using p to denote f(y) (not to be confused with the p in the subscript p-1, which counts the model parameters), the equation can be rewritten as follows:

$$\ln\left(\frac{p}{1-p}\right) = y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{p-1} X_{p-1}$$
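The step from the logistic function to this form is short and can be spelled out as follows. It is a standard algebraic rearrangement, shown here only as a worked aid, starting from p = f(y):

```latex
\begin{align*}
p &= \frac{e^{y}}{1 + e^{y}} \\
1 - p &= \frac{1}{1 + e^{y}} \\
\frac{p}{1 - p} &= e^{y} \\
\ln\left(\frac{p}{1 - p}\right) &= y
\end{align*}
```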

The quantity ln(p/(1-p)) is known as the log odds, or the logit of p. Techniques such as Maximum Likelihood Estimation (MLE) are used to estimate the model parameters.
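One way to picture maximum likelihood estimation is as a search for the coefficients that make the observed outcomes most probable. The sketch below uses plain gradient ascent on the log-likelihood, which is only one of several possible optimizers and is not necessarily what any particular tool uses internally. The data set, learning rate, and step count are made up for illustration.

```python
import math

def logistic(y):
    """Numerically stable e^y / (1 + e^y)."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    ey = math.exp(y)
    return ey / (1.0 + ey)

def linear_predictor(betas, xs):
    """y = beta_0 + beta_1*x_1 + ... ; betas[0] is the intercept."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

def log_likelihood(betas, rows, labels):
    """Sum over observations of label*y - ln(1 + e^y)."""
    total = 0.0
    for xs, label in zip(rows, labels):
        y = linear_predictor(betas, xs)
        # Stable evaluation of ln(1 + e^y)
        softplus = max(y, 0.0) + math.log1p(math.exp(-abs(y)))
        total += label * y - softplus
    return total

def fit(rows, labels, learning_rate=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood (illustrative settings)."""
    betas = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        gradient = [0.0] * len(betas)
        for xs, label in zip(rows, labels):
            # The gradient of the log-likelihood is sum((label - p) * x_j)
            error = label - logistic(linear_predictor(betas, xs))
            gradient[0] += error
            for j, x in enumerate(xs, start=1):
                gradient[j] += error * x
        betas = [b + learning_rate * g / n for b, g in zip(betas, gradient)]
    return betas

# Made-up data: one input per observation, label 1 = event occurred
rows = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

start = [0.0, 0.0]
fitted = fit(rows, labels)
print("log-likelihood at start: ", log_likelihood(start, rows, labels))
print("log-likelihood after fit:", log_likelihood(fitted, rows, labels))
print("estimated coefficients:  ", fitted)
```

The fitted coefficients give a higher log-likelihood than the all-zero starting point, which is the sense in which they are "more likely" given the data.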

II. Example Use Cases

The logistic regression model is applied to a variety of situations in both the public and the private sector. Some common ways that the logistic regression model is used include the following:

  • Medical : Develop a model to determine the likelihood of a patient’s successful response to a specific medical treatment or procedure. Input variables could include age, weight, blood pressure, and cholesterol levels.
  • Finance : Risk Assessment – Using a loan applicant’s credit history and the details of the loan, determine the probability that the applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.
  • Marketing : Churn Prediction – Determine a wireless customer’s probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts. With such insight, target the high-probability customers with appropriate offers to prevent churn.
  • Engineering : Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.
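The finance use case above could be turned into a decision rule along these lines. This is a minimal sketch: the function name `loan_decision` and the two cut-off values are hypothetical, chosen only for illustration, and a real lender would set them from its own risk policy.

```python
def loan_decision(default_probability, approve_below=0.10, deny_above=0.40):
    """Map a predicted default probability to one of three actions.

    The cut-offs are hypothetical placeholders, not recommended values.
    """
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if default_probability < approve_below:
        return "approve"
    if default_probability > deny_above:
        return "deny"
    # In between: keep the applicant but adjust the terms of the loan
    return "modify terms"

for p in (0.03, 0.25, 0.70):
    print(f"predicted default probability {p:.2f} -> {loan_decision(p)}")
```

Using two cut-offs rather than one reflects the three outcomes named in the finance example: approve, deny, or modify the terms.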

III. Conclusion

To summarize logistic regression:

  • The aim of logistic regression is to model the probability that an event occurs, depending on the values of the independent variables.
  • A Logistic Regression classifies observations by estimating the probability that an observation is in a particular category.
  • Logistic regression estimates the odds of an event, that is, the probability that the event occurs for a randomly selected observation versus the probability that it does not occur.
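The last two points can be illustrated with a short sketch that converts a predicted probability into odds and into a class label. The 0.5 threshold and the label names are assumptions made for this example; the appropriate threshold depends on the application.

```python
def odds(p):
    """Odds of the event: probability it occurs versus probability it does not."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return p / (1.0 - p)

def classify(p, threshold=0.5, positive="Default", negative="No Default"):
    """Assign the category whose side of the (assumed) threshold p falls on."""
    return positive if p >= threshold else negative

p = 0.75
print(odds(p))      # 3.0, i.e. odds of 3 to 1
print(classify(p))  # Default
```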