
Complete Guide to Sampling Methods: Random, Stratified, Systematic, and Cluster Sampling with Python Examples


In the world of data science, statistics, and analytics, it’s often impossible to collect data from the entire population.

That’s where sampling methods come to the rescue — helping us pick a smaller group that still represents the whole.

In this blog, we’ll break down four major sampling techniques:

  • Random Sampling
  • Stratified Sampling
  • Systematic Sampling
  • Cluster Sampling

We’ll also show you how to implement them step-by-step in Python using the famous Titanic dataset!

What is Sampling?

Sampling is simply selecting a subset of individuals from a larger population, so we can study and make conclusions about the entire group without examining every individual.

A good sampling method aims for a sample that is representative of the population and free of selection bias, so that estimates computed from it are close to the true population values.

1. Simple Random Sampling

Every individual has an equal chance of being selected. It’s like drawing names out of a hat — pure luck, no bias.

Simple Random Sampling is used when you want a straightforward and fair representation of the entire population.

Formula

There is no complex formula. Every item has the same probability of being selected:

$$P(\text{selected}) = \frac{n}{N}$$

Where:

  • n = sample size
  • N = total population size

Simple Example

Suppose there are 100 passengers on the Titanic and you randomly pick 10 of them. Each passenger has a 10/100 = 10% chance of being selected.
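To make the 10% figure concrete, here is a minimal simulation sketch using only Python's standard library. The passenger IDs, the seed, and the number of trials are assumptions chosen for illustration; they are not part of the Titanic dataset. It repeatedly draws 10 of 100 IDs and checks how often one particular passenger is picked.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
population = list(range(100))   # hypothetical passenger IDs 0..99
sample_size = 10
trials = 20_000

# Count how often passenger 0 ends up in the sample.
hits = sum(0 in rng.sample(population, sample_size) for _ in range(trials))
estimated_probability = hits / trials

print(f"theoretical: {sample_size / len(population):.3f}")
print(f"estimated:   {estimated_probability:.3f}")
```

The estimated value should land close to n/N = 0.10, which is what "equal chance" means in practice.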

Python Implementation (Titanic Dataset)

import seaborn as sns
import pandas as pd

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Drop missing age values for simplicity
titanic = titanic.dropna(subset=['age'])

# Random sampling: Select 10% of the dataset randomly
random_sample = titanic.sample(frac=0.10, random_state=30)
print(random_sample.head())

Output

A random selection of about 10% of the passengers.

2. Stratified Sampling

Concept

In stratified sampling, the population is divided into groups (strata) based on an important characteristic (like gender), and then samples are drawn proportionally from each group.

This ensures your sample respects the structure of the population.

Formula

Sampling from each stratum:

$$n_i = \frac{N_i}{N} \times n$$

Where:

  • n_i = sample size drawn from stratum i
  • N_i = size of stratum i
  • N = total population size
  • n = total sample size

Simple Example

Suppose 70% of the Titanic’s passengers are male and 30% are female.
For a sample of 100 passengers, proportional allocation gives 0.70 × 100 = 70 males and 0.30 × 100 = 30 females.
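The allocation arithmetic in this example can be sketched as a small helper. The function name `proportional_allocation` and the 700/300 stratum counts are hypothetical, chosen only to reproduce a 70/30 split. When the shares are not whole numbers, this sketch rounds down and hands the leftover units to the strata with the largest fractional parts, which is one reasonable convention among several.

```python
def proportional_allocation(strata_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    exact = {name: size * n / total for name, size in strata_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    # Hand out any units lost to rounding down, largest fractional part first.
    leftover = n - sum(allocation.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


strata = {"male": 700, "female": 300}   # hypothetical counts giving a 70/30 split
print(proportional_allocation(strata, 100))   # {'male': 70, 'female': 30}
```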

Python Implementation (Stratified by Gender)

from sklearn.model_selection import train_test_split

# Drop missing 'sex' values
titanic = titanic.dropna(subset=['sex'])

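# Stratified sampling: keep 20% of the rows while preserving the male/female ratio
stratified_sample, _ = train_test_split(
    titanic,
    stratify=titanic['sex'],
    test_size=0.8,  # 80% is set aside, so the first returned part is our 20% sample
    random_state=42,
)

# Compare gender proportions in the sample and in the full dataset
print(stratified_sample['sex'].value_counts(normalize=True))
print(titanic['sex'].value_counts(normalize=True))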

Output

The gender proportion (male vs female) in the sample closely matches that of the original dataset.

3. Systematic Sampling

Concept

In systematic sampling, you select every k-th item from an ordered list, starting from a random position.
It’s simple and fast, and it works well when the list order has no repeating pattern that lines up with the interval k.

Formula

First, calculate the sampling interval k (rounded down to a whole number when N is not an exact multiple of n):

$$k = \frac{N}{n}$$

Where:

  • N = total population size
  • n = desired sample size

Simple Example

If there are 500 Titanic passengers and you want 50 samples, you pick every 10th passenger after a random starting point.
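As a quick check of this example, the sketch below computes the interval and the selected positions for N = 500 and n = 50 using only the standard library. The helper name `systematic_indices` and the seed are assumptions for illustration.

```python
import random


def systematic_indices(N, n, rng):
    """Return n positions spaced k = N // n apart, starting at a random offset."""
    k = N // n                    # sampling interval
    start = rng.randrange(k)      # random start in 0..k-1
    return [start + i * k for i in range(n)]


rng = random.Random(1)            # fixed seed so the run is repeatable
indices = systematic_indices(500, 50, rng)
print(indices[:5], "...", indices[-1])
```

Because the last position is at most (k - 1) + (n - 1) × k = n × k - 1, the positions never run past the end of the list.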

Python Implementation (Selecting Every k-th Passenger)

import numpy as np

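# Set sample size and compute the sampling interval
n = 50
N = titanic.shape[0]
k = N // n

# Random starting point in [0, k); seeded so the result is reproducible
rng = np.random.default_rng(42)
start = int(rng.integers(0, k))

# Systematic sampling: every k-th row from the start, capped at n rows
# (without the cap, start::k can return n + 1 rows when N is not a multiple of n)
systematic_sample = titanic.iloc[start::k].head(n)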
print(systematic_sample.head())

Output

A sample of passengers selected at regular intervals.

4. Cluster Sampling

Concept

In cluster sampling, instead of picking individuals, you randomly select entire groups (clusters) and survey everyone within the selected groups.

Formula

No complex formula needed:

  • Divide the population into clusters.
  • Randomly pick clusters.
  • Survey everyone from selected clusters.

Simple Example

Suppose the Titanic passengers are divided into cabins (A, B, C, D…), and you randomly pick cabins B and C and survey everyone from those cabins.
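A minimal standard-library sketch of this idea follows. The cabin groups and passenger labels are made up for illustration and do not come from the Titanic dataset. Two whole cabin groups are drawn at random, and every passenger in them becomes part of the sample.

```python
import random

# Hypothetical cabin groups; names and members are made up for illustration.
cabins = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
    "D": ["d1", "d2"],
}

rng = random.Random(7)                    # fixed seed so the run is repeatable
chosen = rng.sample(sorted(cabins), 2)    # pick 2 whole clusters at random
cluster_sample = [p for cabin in chosen for p in cabins[cabin]]

print("chosen clusters:", chosen)
print("sample:", cluster_sample)
```

Note that the sample size is not fixed in advance: it depends on how many passengers the chosen clusters happen to contain.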

Python Implementation (Clusters by Passenger Class)

Let’s treat ‘class’ (First, Second, Third) as clusters.

# Check unique clusters
print(titanic['class'].unique())

# Randomly select 1 passenger class
selected_clusters = np.random.choice(titanic['class'].unique(), size=1, replace=False)

# Select all passengers from selected cluster
cluster_sample = titanic[titanic['class'].isin(selected_clusters)]

print(f"Selected Cluster: {selected_clusters[0]}")
print(cluster_sample.head())

Output

All passengers from, say, Second class, selected as your sample.

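Quick Summary Table

Method | How it works | When to use
Simple Random Sampling | Every individual has an equal chance of being selected | You want a straightforward, fair representation of the whole population
Stratified Sampling | Divide the population into strata and sample proportionally from each | Subgroup representation matters
Systematic Sampling | Select every k-th item from an ordered list after a random start | You want a simple, structured approach
Cluster Sampling | Randomly select entire groups and survey everyone in them | You need quick access to large groups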

Final Words

Choosing the right sampling method is crucial for reliable results:

  • Need fairness? Choose Random Sampling.
  • Need subgroup representation? Pick Stratified Sampling.
  • Want a simple structured approach? Go with Systematic Sampling.
  • Need quick access to large groups? Cluster Sampling is a practical choice.

By mastering these techniques — and knowing how to code them in Python — you’ll be better prepared for real-world data science and analytics projects.

Happy Learning!
