
Beginners Guide to Linear Regression in Python

 


Hello everyone! Welcome to this step-by-step tutorial where we’ll dive into the world of linear regression using the mpg dataset in the seaborn library. Note that this is a beginner level tutorial suitable for students and aspiring data science professionals. 

I also started my data science journey with multiple linear regression back in 2008, not in Python but in Excel. It's a different matter that we were not aware of the term 'data science' back then. Anyway, let's get into the task at hand.

Understanding the Dataset

The mpg dataset contains information about various car models and their fuel efficiency. It’s often used for tasks such as regression analysis to predict fuel efficiency based on the other attributes. The dataset contains the following attributes or columns:

1. mpg: Miles per gallon. This is the target variable that indicates the fuel efficiency of the car.

2. cylinders: The number of cylinders in the car’s engine. It can be thought of as an indicator of the car’s engine size and power.

3. displacement: The volume swept by all the pistons inside the cylinders of an internal combustion engine. It’s a measure of the engine’s capacity.

4. horsepower: The power generated by the engine. It’s a measure of the car’s performance.

5. weight: The weight of the car.

6. acceleration: The time it takes for the car to accelerate from 0 to 60 miles per hour.

7. model_year: The year when the car model was manufactured.

8. origin: The country of origin or manufacture.

9. name: The name of the car model.

Let’s get into the coding steps!

Step 1: Loading and Exploring the Dataset

In this step, we load the dataset and display the first five rows. This gives us an overview of the data we’re working with.

import seaborn as sns
import pandas as pd

# Load the "mpg" dataset
df = sns.load_dataset("mpg")

# Display the first few rows of the loaded dataset
df.head()

Step 2: Preliminary data exploration

It’s always good to look at the summary statistics of the numerical variables. The describe() function provides us with these details.

df.describe()

We can see from the count row of the output above that every numerical variable has 398 records, with the exception of horsepower, which has 392 because six of its values are missing.

In addition, the variables are on very different scales (weight is in the thousands, while acceleration is in the tens), which means scaling the independent variables would be a good idea.

Let's check for missing values in each column:

df.isnull().sum()

Additionally, there are a few variables that hold categorical or discrete labels, and we'll count the unique labels in each:

print(df['name'].nunique())
print(df['model_year'].nunique())
print(df['origin'].nunique())

You can see that the name variable has a very large number of distinct labels, so it is of little use as a predictor and can be removed from the data. One can argue in favour of keeping model_year, but we'll leave it out of our final data here.

Step 3: Preparing the data for modelling

First, we replace the missing values in the horsepower variable with the column mean, and then drop the model_year and name variables.

# Assign the result back: fillna(inplace=True) on a selected column
# may not update df in recent pandas versions
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
df = df.drop(columns=['model_year', 'name'])
df.head()
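To see what mean imputation does in isolation, here is a minimal sketch on a made-up toy frame. The values are hypothetical and are not taken from the mpg data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
toy = pd.DataFrame({"horsepower": [100.0, np.nan, 140.0, 120.0]})

# mean() skips NaN by default, so this is the mean of the observed values
hp_mean = toy["horsepower"].mean()

# Fill the gap and assign the result back to the column
toy["horsepower"] = toy["horsepower"].fillna(hp_mean)

print(toy["horsepower"].tolist())
```

The missing entry is replaced by the average of the three observed values, so the column mean itself does not change.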

Next, we convert the origin variable into a numeric format suitable for our model.

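# dtype=int keeps the dummy columns numeric (recent pandas versions create booleans by default)
df = pd.get_dummies(df, columns=['origin'], drop_first=True, prefix='origin', dtype=int)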
df.head()
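The effect of drop_first=True is easier to see on a tiny example. The sketch below uses a hypothetical three-row frame; the labels mirror the kind of values found in the origin column, but the rows are made up.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column
toy = pd.DataFrame({
    "mpg": [18.0, 27.0, 24.0],
    "origin": ["usa", "japan", "europe"],
})

# One column per label, minus the first label in sorted order ("europe")
encoded = pd.get_dummies(
    toy, columns=["origin"], drop_first=True, prefix="origin", dtype=int
)

print(encoded)
```

The dropped label becomes the baseline: a row with zeros in both origin_japan and origin_usa is a European car. Dropping one column avoids perfectly collinear dummy variables in the regression.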

Step 4: Creating Arrays for the Features and the Response Variable

The code below defines the dependent and independent variables, respectively. It also performs feature scaling by dividing each predictor by its maximum value, so that every predictor has a maximum of 1. This puts the numerical features on a comparable scale, so that differing feature magnitudes do not distort the comparison between them.

target_column = ['mpg']
# Keep the original column order so that coefficients can be matched to features later
predictors = [col for col in df.columns if col not in target_column]
df[predictors] = df[predictors]/df[predictors].max()
df.describe()
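As a quick illustration of the max-scaling used above, here is a sketch on a hypothetical two-column frame (the numbers are made up for easy arithmetic).

```python
import pandas as pd

# Hypothetical toy predictors on very different scales
toy = pd.DataFrame({
    "weight": [2000.0, 3000.0, 4000.0],
    "cylinders": [4, 6, 8],
})

# Divide each column by its own maximum; pandas aligns on column names
scaled = toy / toy.max()

print(scaled)
```

After scaling, both columns top out at 1, and the relative spacing of the values within each column is preserved.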

Simple Linear Regression

In Simple Linear Regression, we’re working with a single predictor variable (also known as the independent variable) to predict a target variable (dependent variable).

Step 5: Create the train and test datasets

The code below:

  1. imports the necessary libraries and modules, including NumPy, which is used later to compute the RMSE
  2. defines the independent variable (X) and the dependent variable (y). In this case, we use displacement as the independent variable. There is no specific reason for this choice; you can pick any other variable as well.
  3. creates the train-test split: 70% of the data is used to train the model, and the remaining 30% is used to test it.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = df[['displacement']]
y = df['mpg']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
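To check what a 70/30 split looks like in practice, the sketch below runs train_test_split on a synthetic stand-in with the same number of rows as the mpg data. The arrays X_demo and y_demo are hypothetical placeholders, not the real features. With 398 rows this should give 278 training rows and 120 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 398 rows, one feature (row index used as the value)
X_demo = np.arange(398).reshape(-1, 1)
y_demo = np.arange(398)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)
print(np.array_equal(X_te, X_te2))
```

Fixing random_state is what makes the reported metrics reproducible from one run to the next.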

Step 6: Build the simple linear regression model

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the target variable on train and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
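With a single predictor, the least-squares line has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below works this out on made-up numbers and compares it with NumPy's own least-squares fit; it illustrates the idea and is not a description of scikit-learn's internals.

```python
import numpy as np

# Hypothetical toy data, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form ordinary least squares for one predictor
x_centered = x - x.mean()
y_centered = y - y.mean()
slope = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
intercept = y.mean() - slope * x.mean()

# Cross-check against a degree-1 polynomial fit
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

print("slope:", slope, "intercept:", intercept)
```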

Step 7: Simple Regression Model Evaluation

We will evaluate the performance of the model using two metrics: the R-squared value and the Root Mean Squared Error (RMSE).

R-squared is a statistical measure that represents the proportion of the variance in the target variable that is explained by the independent variables. It is commonly stated as a percentage. On the training data of a linear regression with an intercept, it lies between 0 and 1; on unseen data, it can turn negative when the model predicts worse than simply using the mean of the target.

The other commonly used metric for regression problems is RMSE, which is the square root of the average squared residual. It measures the typical size of the prediction error in the units of the target variable (here, miles per gallon).

Ideally, lower RMSE and higher R-squared values are indicative of a good model.
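Both metrics are short formulas, and computing them by hand once makes the library output easier to read. The sketch below uses made-up actual and predicted values; the second set of predictions is deliberately poor to show that R-squared can drop below zero.

```python
import numpy as np

# Hypothetical actual values and two sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.0, 5.0, 8.0, 9.0])
y_bad = np.zeros(4)  # deliberately poor: always predicts 0

def rmse(actual, predicted):
    # Square root of the mean squared residual
    return np.sqrt(np.mean((actual - predicted) ** 2))

def r_squared(actual, predicted):
    # 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

print("good:", rmse(y_true, y_good), r_squared(y_true, y_good))
print("bad: ", rmse(y_true, y_bad), r_squared(y_true, y_bad))
```

These are the same quantities that mean_squared_error (after taking the square root) and r2_score report.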

The code below evaluates and prints the evaluation metrics for the simple linear regression model.

# Calculate RMSE on train and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Calculate R-squared on train and test sets
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Print the coefficients (slope and intercept)
print("Coefficient (Slope):", model.coef_[0])
print("Intercept:", model.intercept_)

# Print RMSE and R-squared scores on train and test sets
print("RMSE on Train Set:", rmse_train)
print("RMSE on Test Set:", rmse_test)
print("R-squared on Train Set:", r2_train)
print("R-squared on Test Set:", r2_test)

The train and test set R-squared values come out to be 65% and 64%, respectively. This is not bad given that we have considered only one variable. Let’s see if we can improve the results with multiple linear regression.

Multiple Linear Regression

Multiple Linear Regression uses several predictor variables together to predict the target variable. This reflects a more realistic scenario where multiple factors can influence the outcome.

The code below repeats the steps we did above; the only difference is that this time we use all the predictor variables instead of only displacement.

X = df[predictors].values
y = df[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=10)
print(X_train.shape)
print(X_test.shape)

The code below instantiates, trains, and evaluates the multiple linear regression model to predict mpg.

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict the target variable on train and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate RMSE on train and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Calculate R-squared on train and test sets
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

# Print the coefficients (one per predictor, in the order of the predictors list) and the intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Print RMSE and R-squared scores on train and test sets
print("RMSE on Train Set:", rmse_train)
print("RMSE on Test Set:", rmse_test)
print("R-squared on Train Set:", r2_train)
print("R-squared on Test Set:", r2_test)

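The train and test set R-squared values come out to be 72% and 70%, respectively. This is a clear improvement over the single-variable linear regression model. An improvement on the training set is expected, because adding predictors can never lower the training R-squared of a least-squares fit; the higher test set score is the more meaningful sign that the additional features help to model the relationship.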
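The point about training R-squared can be checked on synthetic data. The sketch below is only an illustration under made-up assumptions: x1, x2 and the noise level are hypothetical, and the fit uses NumPy's least-squares solver rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: the target depends on two features plus noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.5, size=n)

def train_r2(features, target):
    # Ordinary least squares with an intercept column
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - residuals @ residuals / ss_tot

r2_one = train_r2(x1.reshape(-1, 1), y)           # one predictor
r2_two = train_r2(np.column_stack([x1, x2]), y)   # both predictors

print("one feature:", r2_one, "two features:", r2_two)
```

On the training data the second score cannot be lower than the first, which is why the test set score is the one to watch when comparing models.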

Conclusion

In this guide, you have learned about building Linear Regression models using the powerful Python library, scikit-learn.

The next steps would be to try out regularization techniques and see how they affect the model performance.
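As a starting point for that next step, here is a minimal sketch of ridge regression, one common regularization technique, using scikit-learn's Ridge. The data is synthetic and the value alpha=10.0 is an arbitrary assumption chosen for illustration, not a recommended setting for the mpg data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Hypothetical data: 100 rows, 5 features, known coefficients plus noise
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

# Plain least squares versus ridge with an L2 penalty on the coefficients
ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
```

The penalty shrinks the coefficients towards zero; whether that helps on the test set depends on the data, so alpha is usually tuned rather than fixed.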

Remember, your journey in the data science and machine learning domain is unique, and it’s essential to find what works best for you. Don’t hesitate to share your experiences, questions, and insights in the comments section below.

Let’s connect and learn from each other as we dive deeper into the fascinating world of data science and machine learning!

If you’re interested in statistics, data science and machine learning, please follow this blog, as I will be posting many more topics based on my experience and learning.

You can also connect with me on LinkedIn.

Happy Learning!
