
Analyzing Loan Data with Binomial and Poisson Distributions in Python

 Credit Risk and Statistical Distributions

Scenario

Imagine you’re a data scientist at a lending institution, and you’ve been asked to understand and predict certain events, like the likelihood of loan defaults or the frequency of inquiries a borrower makes in a given period.

This is where statistical distributions, like the Binomial and Poisson distributions, come into play.

Steps:

  1. Load and Explore the Loan Dataset
  2. Understanding the Binomial Distribution
  3. Implementing the Binomial Distribution in Python
  4. Understanding the Poisson Distribution
  5. Implementing the Poisson Distribution in Python

Step 1: Load and Explore the Loan Dataset

Start by loading the dataset and taking a quick exploratory glance.

import pandas as pd

# Load the dataset
loans_data = pd.read_csv('loansdata.csv')

# Check the first few rows of the dataset
loans_data.head()

Output:

Understand the Data

The original data used in this exercise comes from publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet.

There are 14 variables used in the data, and a brief data dictionary is provided below:

Data Dictionary

For this article, we’ll be focusing on some of these variables to explain the Binomial and Poisson distributions.

Step 2: Understanding the Binomial Distribution

The Binomial distribution represents the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (like success or failure).
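To make the definition concrete, here is a small simulation sketch that uses only Python's standard library. The values `n_trials = 10` and `p_success = 0.16` are illustrative assumptions, and `simulate_successes` is a hypothetical helper name, not part of any library.

```python
import random

def simulate_successes(n_trials, p_success, rng):
    """Count successes in n_trials independent yes/no trials."""
    return sum(rng.random() < p_success for _ in range(n_trials))

rng = random.Random(42)   # fixed seed so the run is repeatable
n_trials = 10             # assumed number of trials per experiment
p_success = 0.16          # assumed success probability per trial
n_experiments = 100_000

counts = [simulate_successes(n_trials, p_success, rng) for _ in range(n_experiments)]

# Share of experiments that ended with exactly 3 successes
share_exactly_3 = counts.count(3) / n_experiments
print(f"Simulated share with exactly 3 successes: {share_exactly_3:.3f}")
```

Each experiment is one draw from a Binomial distribution, so the simulated share should settle near the exact Binomial probability as the number of experiments grows.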

Mathematical Equation

The probability mass function (PMF) for a Binomial distribution is:

P(X = k) = C(n, k) · p^k · (1−p)^(n−k)

In the above equation,

  • The first term, the binomial coefficient C(n, k) = n! / (k! (n−k)!), is the number of ways to choose `k` successes out of `n` trials.
  • The second term (p^k) is the probability that `k` given trials all succeed.
  • The third term ((1−p)^(n−k)) is the probability that the remaining (n−k) trials all fail.
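The three terms described above can be written out directly in code. This is a minimal sketch using only the standard library; `binomial_pmf` is a hypothetical helper name, and the inputs `k = 3`, `n = 10`, `p = 0.16` are illustrative values.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), built term by term."""
    ways = comb(n, k)                 # number of ways to choose k successes
    p_successes = p ** k              # probability that k given trials succeed
    p_failures = (1 - p) ** (n - k)   # probability that the other n - k fail
    return ways * p_successes * p_failures

print(f"{binomial_pmf(3, 10, 0.16):.4f}")
```

Summing `binomial_pmf(k, n, p)` over every `k` from 0 to `n` gives 1, which is a quick sanity check that the function is a valid probability distribution.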

Step 3: Implementing the Binomial Distribution in Python

With the theory behind us, let’s take a sample case and compute a Binomial probability in Python.

For example, in our loan data, we can model the probability of a borrower defaulting on a loan.

To do this, let’s first estimate the probability of default. Recall that the `not.fully.paid` column flags whether a loan defaulted.

loans_data['not.fully.paid'].value_counts()

Output:

Out of 9578 records, 1533 are default cases, so the estimated probability of default is 1533 / 9578 ≈ 0.16, or 16%.
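The same estimate can be computed in code instead of by hand. The sketch below does not load the real file; it rebuilds a hypothetical stand-in column from the two counts quoted above and assumes a 1 = default, 0 = fully paid encoding.

```python
import pandas as pd

# Hypothetical stand-in for loans_data['not.fully.paid'],
# rebuilt from the quoted counts (assumed encoding: 1 = default, 0 = fully paid)
not_fully_paid = pd.Series([1] * 1533 + [0] * (9578 - 1533))

# For a 0/1 column, the mean is the share of 1s
p_default = not_fully_paid.mean()
print(f"Estimated probability of default: {p_default:.4f}")
```

If the real column uses the same 0/1 encoding, calling `.mean()` on it would give the default rate without counting by hand.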

Let’s say we want to model the probability of exactly 3 out of 10 borrowers defaulting on their loans.

So, n = 10, p = 0.16 (calculated above), and k = 3. The Python code to implement this is below.

from scipy.stats import binom

# Define the parameters

# number of trials (borrowers)
n = 10

# probability of default
p = 0.16

# Calculate the probability of exactly 3 defaults
k = 3

binom_prob = binom.pmf(k, n, p)
print(f"Probability of exactly {k} out of {n} borrowers defaulting: {binom_prob:.4f}")

Output:

So the answer is about 0.145. In the code above, the first line imports the `binom` distribution object from `scipy.stats`.

Then we set the parameters, and finally `binom.pmf` evaluates the probability mass function (PMF) of the Binomial distribution at `k`.

This tells us how likely it is to see exactly 3 borrowers default out of 10.
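In risk work, a range such as "at most 3" or "at least 3" defaults is often more useful than an exact count. As a hedged sketch, the same `binom` object exposes `cdf` (cumulative probability) and `sf` (survival function, 1 minus the CDF) for this; the parameters are the same illustrative values used above.

```python
from scipy.stats import binom

n = 10     # number of borrowers
p = 0.16   # probability of default, as estimated earlier

# P(X <= 3): at most 3 defaults
p_at_most_3 = binom.cdf(3, n, p)

# P(X >= 3) = P(X > 2): at least 3 defaults, via the survival function
p_at_least_3 = binom.sf(2, n, p)

print(f"At most 3 defaults:  {p_at_most_3:.4f}")
print(f"At least 3 defaults: {p_at_least_3:.4f}")
```

Note that `sf(2, n, p)` is used for "at least 3" because the survival function is strictly greater-than: it returns P(X > 2).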

Step 4: Understanding the Poisson Distribution

The Poisson distribution models the number of times an event occurs within a fixed interval of time or space.

It’s especially useful when you’re dealing with rare events, like the number of car accidents at a particular intersection in a month.

Unlike the Binomial distribution, which has a set number of attempts (like flipping a coin 10 times), the Poisson distribution doesn’t require a fixed number of trials.

Instead, it focuses on the rate of occurrence. The probability of the event happening in any tiny interval is small, but the distribution describes how many times the event is likely to happen over a larger interval (like a day or a week).

Mathematical Equation

The probability mass function (PMF) for a Poisson distribution is:

P(X = k) = (λ^k · e^(−λ)) / k!

In the above equation,

  • `λ` is the average rate of occurrence.
  • `e` is the base of the natural logarithm (approximately 2.71828).
  • `k` is the number of occurrences.
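The quantities listed above can be combined by hand, and the result compared with a Binomial that splits the interval into many tiny slots. This is a sketch under stated assumptions: `lam = 1.5` is a made-up rate for illustration, and `poisson_pmf` and `binomial_pmf` are hypothetical helper names.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): lam^k * e^(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

lam = 1.5   # assumed average rate, for illustration only
k = 4

# Split the interval into n tiny slots, each with success probability lam / n.
# As n grows, the Binomial probability approaches the Poisson probability.
for n in (10, 100, 10_000):
    print(f"n = {n:>6}: binomial = {binomial_pmf(k, n, lam / n):.5f}")
print(f"Poisson limit: {poisson_pmf(k, lam):.5f}")
```

The shrinking gap between the two columns is the "many small chances" picture of the Poisson distribution: a fixed average rate spread over more and more, smaller and smaller, trials.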

Step 5: Implementing the Poisson Distribution in Python

Suppose we want to model the number of inquiries a borrower has made in the last 6 months. If the average number of inquiries is known, we can use the Poisson distribution to compute the probability of a borrower making exactly, say, 4 inquiries.

The average rate (mean) of inquiries is calculated for the data with the following code:

loans_data['inq.last.6mths'].mean()

With the value of `λ` calculated, let’s use the Poisson distribution to calculate the probability of exactly 4 inquiries.

from scipy.stats import poisson

# Define the parameter
lambda_ = loans_data['inq.last.6mths'].mean()

# Calculate the probability of exactly 4 inquiries
k = 4
poisson_prob = poisson.pmf(k, lambda_)
print(f"Probability of exactly {k} inquiries in the last 6 months: {poisson_prob:.4f}")

Output:

In the code above, `poisson.pmf(k, lambda_)` calculates the probability mass function (PMF) for exactly `k` inquiries when the average rate of inquiries is `lambda_`.
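Beyond the exact count, the `poisson` object also offers `cdf` and `sf` for ranges of counts. The sketch below uses a hypothetical `lambda_ = 1.5` so that it runs without the dataset; with the real data, the mean computed above would be used instead.

```python
from scipy.stats import poisson

lambda_ = 1.5   # hypothetical average; replace with the mean computed from the data
k = 4

p_exactly_k = poisson.pmf(k, lambda_)     # P(X = 4)
p_at_most_k = poisson.cdf(k, lambda_)     # P(X <= 4)
p_more_than_k = poisson.sf(k, lambda_)    # P(X > 4)

print(f"Exactly {k} inquiries:   {p_exactly_k:.4f}")
print(f"At most {k} inquiries:   {p_at_most_k:.4f}")
print(f"More than {k} inquiries: {p_more_than_k:.4f}")
```

The last two probabilities always sum to 1, since a borrower either makes at most `k` inquiries or more than `k`.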

Conclusion

By using real-world loan data, we’ve explored how the Binomial and Poisson distributions can be applied in the financial sector.

These distributions are not just theoretical concepts to boast about. They are practical tools that help organizations manage risk and make informed decisions.

