In this project, I used Python to identify customer segments in an Online Retail dataset, applying cohort analysis, RFM analysis, and k-means clustering.
Identify the customer segments in the dataset and prescribe a course of business action for each segment.
For example, one segment might be the customers who generate the most profit and purchase frequently.
Source: The UCI Machine Learning Repository
This dataset contains all transactions that occurred between 01/12/2010 and 09/12/2011 (DD/MM/YYYY) for a non-store online retailer.
- 🧹 Removed Null Values
- ✂️ Removed Duplicate Values
- 📍 Most transactions are from the UK
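
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the data is in a pandas DataFrame with the standard UCI column names, that "null values" refers to rows with a missing `CustomerID`, and it uses a small made-up `raw` frame.

```python
import pandas as pd

# Toy stand-in for the Online Retail data (hypothetical values).
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "536367", "536368"],
    "CustomerID": [17850.0, 17850.0, None, 13047.0, 12583.0],
    "Quantity": [6, 6, 2, 3, 24],
    "UnitPrice": [2.55, 2.55, 1.85, 4.25, 3.75],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "United Kingdom", "France"],
})

# Assumption: drop rows with no customer ID (they cannot be assigned to a
# segment), then drop exact duplicate rows.
clean = raw.dropna(subset=["CustomerID"]).drop_duplicates()

# Share of remaining transactions per country, largest first.
country_share = clean["Country"].value_counts(normalize=True)
print(country_share)
```

On the real data, the same `value_counts` call is one way to check which country accounts for most transactions.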
A cohort is a set of users who share similar characteristics over time. Cohort analysis groups users into mutually exclusive groups, and their behavior is measured over time.
There are three types of cohort analysis:
- 📅 Time Cohorts: Groups customers by their purchase behavior over time.
- 📦 Behavior Cohorts: Groups customers by the product or service they signed up for.
- 📏 Size Cohorts: Groups customers by their spending within a period.
For this project, I chose time cohorts. The steps are as follows:
- 🗓️ Identified the cohort month for each customer (the month when the customer first transacted).

      # First transaction month (cohort month) for each customer
      df3['Cohort Month'] = df3.groupby('CustomerID')['InvoiceFormat'].transform('min')

- 🔢 Identified the cohort index (difference between transaction month and cohort month) for each transaction.

      # Cohort index: months between the invoice month (column x1) and the
      # cohort month (column y1), counting the cohort month itself as 1.
      # Vectorized, so it does not depend on the DataFrame's index labels.
      def diff(d, x1, y1):
          years = d[x1].dt.year - d[y1].dt.year
          months = d[x1].dt.month - d[y1].dt.month
          return years * 12 + months + 1

- 📊 Grouped data by cohort month and cohort index.
- 📋 Developed a pivot table.
- 🔥 Developed a time cohort heatmap.
- 💔 Only about 10% of new customers are still purchasing after a year, so retention is poor.
- 🎯 About 250 new customers join each month, which suggests that acquisition marketing is performing adequately.
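
The grouping, pivot, and retention steps described above might look something like the sketch below. The toy `tx` frame and the variable names are illustrative assumptions; only the `Cohort Month` and `CustomerID` column names come from the project code, and `Cohort Index` is a hypothetical name for the output of `diff`.

```python
import pandas as pd

# Toy transactions (hypothetical values): one row per purchase.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3],
    "Cohort Month": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-01-01",
         "2011-01-01", "2011-02-01", "2011-02-01"]),
    "Cohort Index": [1, 2, 1, 3, 1, 2],
})

# Count distinct active customers per (cohort month, cohort index).
counts = (
    tx.groupby(["Cohort Month", "Cohort Index"])["CustomerID"]
      .nunique()
      .reset_index()
)

# Pivot: one row per cohort, one column per month since first purchase.
cohort_counts = counts.pivot(
    index="Cohort Month", columns="Cohort Index", values="CustomerID"
)

# Retention = active customers divided by the cohort's size (index 1).
retention = cohort_counts.divide(cohort_counts[1], axis=0)
print(retention.round(2))
```

A table like `retention` can then be passed to a heatmap function, for example seaborn's `heatmap(retention, annot=True, fmt=".0%")`, to produce a cohort chart.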
RFM stands for Recency, Frequency, Monetary. It evaluates:
- 📅 Recency: How recently a customer transacted.
- 🔄 Frequency: How often they transacted.
- 💰 Monetary: How much they spent.
These scores help group customers for further analysis.
- The last transaction in the dataset was on 2011-12-09. Thus, recency was calculated using 2012-01-01 as the snapshot date.
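
      # Frequency: number of transaction rows per customer
      freq = df6.groupby(["CustomerID"])[["InvoiceNo"]].count()
      # Monetary: total spend (quantity x unit price) per customer
      df6["total"] = df6["Quantity"] * df6["UnitPrice"]
      money = df6.groupby(["CustomerID"])[["total"]].sum()
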
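
The recency calculation is not shown above. One way to build all three RFM values in a single step is sketched below; it assumes recency is measured in days from each customer's last `InvoiceDate` to the 2012-01-01 snapshot date, and the toy `df` frame and the `Recency`/`Frequency`/`Monetary` column names are hypothetical.

```python
import pandas as pd

# Toy transactions (hypothetical values).
df = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceNo": ["A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-11-01", "2011-12-09", "2011-06-15"]),
    "Quantity": [2, 1, 10],
    "UnitPrice": [5.0, 20.0, 1.5],
})
df["total"] = df["Quantity"] * df["UnitPrice"]

snapshot = pd.Timestamp("2012-01-01")  # snapshot date stated in the text

rfm = df.groupby("CustomerID").agg(
    # Assumption: recency = days since the customer's last transaction.
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "count"),
    Monetary=("total", "sum"),
)
print(rfm)
```

Here customer 1 last purchased on 2011-12-09, which gives a recency of 23 days against the snapshot date.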
Before applying K-means clustering, I addressed data skewness.
- The data was right-skewed, so I applied a log transformation.
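
A minimal sketch of a log transformation followed by standardization is shown here. The project's exact preprocessing is not given, so the use of `np.log1p` (rather than a plain log) and the manual standardization are assumptions, and the toy `rfm` values are made up. The name `scaled` is chosen to match the variable used in the clustering code.

```python
import numpy as np
import pandas as pd

# Toy RFM table (hypothetical values) with a long right tail.
rfm = pd.DataFrame({
    "Recency": [5, 30, 200, 365],
    "Frequency": [1, 3, 10, 250],
    "Monetary": [12.0, 80.0, 400.0, 9000.0],
})

# log1p = log(1 + x); it compresses large values and is defined at zero.
logged = np.log1p(rfm)

# Standardize each column to zero mean and unit variance so that no single
# feature dominates the Euclidean distances used by K-means.
scaled = (logged - logged.mean()) / logged.std(ddof=0)

# Compare skewness before and after the transformation.
print(rfm.skew().round(2))
print(logged.skew().round(2))
```

Standardizing after the log step matters because K-means is distance-based, so features on larger scales would otherwise carry more weight.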
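
      from sklearn.cluster import KMeans

      # Elbow method: fit K-means for k = 1..10 and record each model's inertia
      inertia = []
      for i in range(1, 11):
          kmeans = KMeans(n_clusters=i)
          kmeans.fit(scaled)
          inertia.append(kmeans.inertia_)
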
From the elbow plot of inertia against the number of clusters, I chose 3 clusters.
- 🌟 Best Customers: High-frequency, high-monetary value, recent transactions.
- ⚠️ At-Risk Customers: Long time since the last transaction, low spending.
- 💼 Average Customers: Regular transactions, moderate spending.
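
To show how segment descriptions like these can be derived, the sketch below fits a 3-cluster K-means model and averages the raw RFM values per cluster. It is an illustration under assumptions: the toy `rfm` data, the `n_init` and `random_state` settings, and the `Cluster` and `summary` names are not from the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy RFM table (hypothetical values): three visibly different groups.
rfm = pd.DataFrame({
    "Recency":   [3, 5, 4, 60, 70, 65, 300, 320, 310],
    "Frequency": [90, 100, 95, 20, 25, 22, 2, 1, 3],
    "Monetary":  [5000.0, 5500.0, 5200.0, 600.0, 700.0, 650.0,
                  40.0, 30.0, 50.0],
})

# Same preprocessing idea as in the text: log transform, then standardize.
logged = np.log1p(rfm)
scaled = (logged - logged.mean()) / logged.std(ddof=0)

# Fit the final model with the chosen number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["Cluster"] = kmeans.fit_predict(scaled)

# Average raw RFM values and size per cluster, used to name the segments.
summary = rfm.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    Monetary=("Monetary", "mean"),
    Customers=("Recency", "size"),
)
print(summary.round(1))
```

Cluster numbers are arbitrary, so the segment names have to be assigned by reading the averages: the cluster with the lowest recency and highest frequency and spend corresponds to the best customers.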
- ⚠️ At-Risk Customers:
  - Suggestion: Analyze why they left; offer sales or discounts to win them back.
- 💼 Average Customers:
  - Suggestion: Convert to best customers through discounts, excellent support, and targeted promotions.
- 🌟 Best Customers:
  - Suggestion: Focus advertising and product launches on this group. Heavy discounts aren’t needed.
- Expand Analysis: Include new customer features, like demographics and lifetime value.
- Dynamic Segmentation: Automate real-time segmentation for evolving behaviors.
- Advanced Models: Explore hierarchical clustering or DBSCAN for complex relationships.
- Predictive Insights: Use predictive analytics to forecast customer behavior and recommend proactive strategies.
This project sets a strong foundation for tailored customer engagement, paving the way for smarter, data-driven business decisions! 😊