Customer segmentation is the practice of dividing a company's customers into groups that reflect similarity among customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business.
Objective Statement:
- Get business insight into how many products are sold every month.
- Get business insight into how much customers spend every month.
- Get business insight into how many customers make transactions each month.
- Get business insight into how transaction frequency varies by month, day, and hour.
- Get business insight into the most popular products.
- Get business insight into which countries have the most customers.
- Reduce risk in deciding where, when, how, and to whom a product, service, or brand will be marketed.
- Increase marketing efficiency by directing effort specifically toward the designated segment in a manner consistent with that segment’s characteristics.
Challenges:
- The data is too large to maintain in an Excel spreadsheet.
- The work requires coordination across several departments.
- The demographic data has many missing values.
Business Benefit:
- Helps the Business Development team create product differentiation based on the characteristics of each customer.
- Shows how to treat customers who meet specific criteria.
Expected Outcome:
- Know how many products are sold every month.
- Know how much customers spend every month.
- Know how many customers make transactions each month.
- Know how transaction frequency varies by month, day, and hour.
- Know the most popular products.
- Know which countries have the most customers.
- Customer segmentation analysis.
- Recommendation based on customer segmentation.
The data is a real online retail transaction data set covering two years.
The data consists of 2 datasets:
- Dataset 1:
  - Online Retail Dataset from 01/12/2009 to 09/12/2010.
  - Consists of 525461 rows and 8 columns.
- Dataset 2:
  - Online Retail Dataset from 01/12/2010 to 09/12/2011.
  - Consists of 541910 rows and 8 columns.
This Online Retail II data set contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers.
Data Dictionary:
- Invoice: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- Invoice Date: Invoice date and time. Numeric. The day and time when a transaction was generated.
- Price: Unit price. Numeric. Product price per unit in sterling (£).
- Customer ID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.
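A minimal sketch of how the columns above might be prepared for analysis, assuming pandas and a tiny made-up sample in place of the real files. The column spelling `InvoiceDate`, the `Revenue` column, and the `clean_transactions` helper are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny hypothetical sample that mimics the columns in the data dictionary.
raw = pd.DataFrame({
    "Invoice": ["489434", "489434", "C489449", "489435"],
    "StockCode": ["85048", "79323P", "22087", "22350"],
    "Description": ["GLASS LIGHTS", "PINK CHERRY LIGHTS", "PAPER BUNTING", "CAT BOWL"],
    "Quantity": [12, 12, -12, 6],
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 07:45:00",
                    "2009-12-01 10:33:00", "2009-12-01 07:46:00"],
    "Price": [6.95, 6.75, 2.95, 2.55],
    "Customer ID": [13085.0, 13085.0, 16321.0, None],
    "Country": ["United Kingdom", "United Kingdom", "Australia", "United Kingdom"],
})


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop cancellations and rows without a customer, then add a revenue column."""
    out = df.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    # Invoices starting with 'C' (any case) mark cancellations.
    is_cancelled = out["Invoice"].astype(str).str.upper().str.startswith("C")
    out = out[~is_cancelled & out["Customer ID"].notna()].copy()
    # Revenue per line item = quantity * unit price (in sterling).
    out["Revenue"] = out["Quantity"] * out["Price"]
    return out


clean = clean_transactions(raw)
print(clean[["Invoice", "Quantity", "Price", "Revenue"]])
```

Whether rows without a Customer ID should be dropped depends on the question being asked; they are removed here only because customer-level analysis needs an identifier.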
Retail is the process of selling consumer goods or services to customers through multiple channels of distribution to earn a profit.
This case poses the following business questions:
- How many products are sold every month?
- How much do customers spend every month?
- How many customers make transactions each month?
- How does transaction frequency vary by month, day, and hour?
- Which products are the most popular?
- Which countries have the most customers?
- How can the customers be segmented?
- What recommendations follow from the customer segmentation?
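The monthly questions above can be answered with one grouped aggregation. The following is a minimal sketch, assuming pandas and a small made-up table; the `Revenue` column and the `*_pct` column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cleaned transactions; column names follow the data dictionary.
tx = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-10-05 10:00", "2010-11-02 12:00", "2010-11-15 12:30", "2010-11-20 09:00",
    ]),
    "Customer ID": [1, 1, 2, 3],
    "Quantity": [10, 20, 30, 40],
    "Price": [1.0, 2.0, 1.5, 0.5],
})
tx["Revenue"] = tx["Quantity"] * tx["Price"]

# One row per calendar month: units sold, revenue, and distinct customers.
monthly = (
    tx.groupby(tx["InvoiceDate"].dt.to_period("M"))
      .agg(quantity=("Quantity", "sum"),
           revenue=("Revenue", "sum"),
           customers=("Customer ID", "nunique"))
)

# Share of each month in the period total, in percent.
monthly["quantity_pct"] = 100 * monthly["quantity"] / monthly["quantity"].sum()
monthly["revenue_pct"] = 100 * monthly["revenue"] / monthly["revenue"].sum()
print(monthly)
```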
From December 2009 to November 2010, November had the highest quantity sold, at about 13.97% of all products sold over the year.
From December 2010 to November 2011, November again had the highest quantity sold, at about 15.42% of all products sold over the year.
The business team can increase sales in this month, for example by promoting new products to customers.
From December 2009 to November 2010, November had the highest revenue, at about 14.11% of total revenue over the year.
From December 2010 to November 2011, November again had the highest revenue, at about 15.6% of total revenue over the year.
The business team can replicate the sales strategies that succeeded in November in other months.
The number of customers from December 2009 to November 2010 fluctuated. In general, however, the number of customers increased in almost every month; it decreased only in January, April, July, and August. The business team can offer special discounts in January, April, July, and August to increase the number of customers and sales in these months.
The number of customers from December 2010 to November 2011 fluctuated. In general, however, the number of customers increased in almost every month; it decreased only in January, February, and April. The business team can offer special discounts in January, February, and April to increase the number of customers and sales in these months.
December 2009 to November 2010:
- November has the highest number of customers, at about 15.3% of the total customers over the year. The business team can increase sales by promoting new products to customers in November.
- Most consumers make transactions on Thursday, which accounts for about 19.8% of the total daily transactions. The business team can increase sales by promoting new products to customers on Thursday.
- Most consumers order products at 12 PM (noon), which accounts for 17.8% of the total daily transactions. The business team can increase sales by promoting new products to customers at 12 PM.
December 2010 to November 2011:
- November has the highest number of customers, at about 17.3% of the total customers over the year. The business team can increase sales by promoting new products to customers in November.
- Most consumers make transactions on Thursday, which accounts for about 19.5% of the total daily transactions. The business team can increase sales by promoting new products to customers on Thursday.
- Most consumers order products at 12 PM (noon), which accounts for 18.2% of the total daily transactions. The business team can increase sales by promoting new products to customers at 12 PM.
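The day and hour shares reported above can be computed by counting invoices per weekday and per hour. This is a minimal sketch, assuming pandas and a handful of made-up invoices, with one row per invoice rather than per line item.

```python
import pandas as pd

# Hypothetical invoices, one row per invoice.
invoices = pd.DataFrame({
    "Invoice": ["1", "2", "3", "4", "5"],
    "InvoiceDate": pd.to_datetime([
        "2010-11-04 12:10",  # Thursday
        "2010-11-04 12:45",  # Thursday
        "2010-11-05 09:30",  # Friday
        "2010-11-11 12:05",  # Thursday
        "2010-11-08 15:00",  # Monday
    ]),
})

# Percentage of invoices per weekday and per hour of day (0-23).
by_day = invoices["InvoiceDate"].dt.day_name().value_counts(normalize=True) * 100
by_hour = invoices["InvoiceDate"].dt.hour.value_counts(normalize=True) * 100

print(by_day.round(1))
print(by_hour.round(1))
```

Note that `dt.hour` uses a 24-hour clock, so a value of 12 means noon.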
White Hanging Heart T-Light Holder was the product most in demand by consumers in 2010. Purchases of White Hanging Heart T-Light Holder reached 2369 units in 2010. The business team can offer special discounts on this product to attract more customers.
White Hanging Heart T-Light Holder was the product most in demand by consumers in 2011. Purchases of White Hanging Heart T-Light Holder reached 1625 units in 2011. The business team can offer special discounts on this product to attract more customers.
The United Kingdom was the country with the highest number of customers in 2010. The total number of customers in the United Kingdom reached 302776 (91.71%) in 2010. The business team can focus on promotions in the United Kingdom to increase sales.
The United Kingdom was the country with the highest number of customers in 2011. The total number of customers in the United Kingdom reached 286683 (90%) in 2011. The business team can focus on promotions in the United Kingdom to increase sales.
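Product and country rankings like these can be sketched with simple counts. The example below assumes pandas and a made-up four-row table, and it shows two possible ways of counting by country, since the text does not say which one was used.

```python
import pandas as pd

# Hypothetical transaction lines; column names follow the data dictionary.
tx = pd.DataFrame({
    "Description": ["WHITE HANGING HEART T-LIGHT HOLDER", "REGENCY CAKESTAND 3 TIER",
                    "WHITE HANGING HEART T-LIGHT HOLDER", "JUMBO BAG RED RETROSPOT"],
    "Customer ID": [1, 1, 2, 3],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
})

# Most popular products by number of transaction lines.
top_products = tx["Description"].value_counts().head(10)

# Transaction lines per country, with each country's share in percent.
rows_by_country = tx["Country"].value_counts()
share_by_country = (100 * rows_by_country / rows_by_country.sum()).round(2)

# Distinct customers per country (a different question from the row count).
customers_by_country = (
    tx.groupby("Country")["Customer ID"].nunique().sort_values(ascending=False)
)

print(top_products)
print(share_by_country)
print(customers_by_country)
```

Counting rows and counting distinct customers answer different questions, so it is worth checking which one a reported figure reflects.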
Recency, Frequency, Monetary Value (RFM) analysis is a method of customer analysis and segmentation based on customer purchasing habits. The variables used to perform RFM analysis are:
- Recency: How recently the customer made a transaction.
- Frequency: How often the customer makes transactions.
- Monetary: How much the customer has spent.
In this case, the dataset contains transaction data from 01/12/2009 to 01/12/2011, so the RFM values are defined as follows:
- Recency: The number of days between the customer's last transaction and the day of analysis. In this case, the day of analysis is the date of the last transaction in the data.
- Frequency: The number of transactions made by the customer from 01/12/2009 to 01/12/2011.
- Monetary: The total order amount spent by the customer from 01/12/2009 to 01/12/2011.
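The three definitions above translate into one aggregation per customer. This is a minimal sketch, assuming pandas and a small made-up table; counting distinct invoices as the number of transactions and the `Revenue` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical transaction lines; invoice "1" has two line items.
tx = pd.DataFrame({
    "Invoice": ["1", "1", "2", "3", "4"],
    "Customer ID": [1, 1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-11-01", "2011-11-30", "2011-06-15", "2010-01-10"]),
    "Quantity": [2, 1, 5, 10, 3],
    "Price": [3.0, 4.0, 2.0, 1.5, 10.0],
})
tx["Revenue"] = tx["Quantity"] * tx["Price"]

# Day of analysis: the date of the last transaction in the data.
snapshot = tx["InvoiceDate"].max()

rfm = tx.groupby("Customer ID").agg(
    last_purchase=("InvoiceDate", "max"),
    Frequency=("Invoice", "nunique"),   # distinct invoices per customer
    Monetary=("Revenue", "sum"),        # total amount spent
)
# Recency in whole days between the last purchase and the day of analysis.
rfm["Recency"] = (snapshot - rfm["last_purchase"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]
print(rfm)
```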
Here are the steps in RFM analysis:
The individual RFM scores can be calculated using quartiles. The steps are:
- Split each metric into four segments using quantiles.
- Assign a score from 1 to 4 to Recency, Frequency, and Monetary.
- Four is the best/highest value, and one is the worst/lowest value. For Recency, fewer days since the last transaction earns a higher score.
A total RFM score is calculated simply by combining the individual R, F, and M scores.
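The quartile scoring steps can be sketched with `pandas.qcut`. The values below are made up, and because the text does not say how the scores are combined, the example shows two common readings (concatenated digits and a simple sum) as assumptions.

```python
import pandas as pd

# Hypothetical RFM table for eight customers.
rfm = pd.DataFrame({
    "Recency":   [5, 40, 90, 200, 15, 365, 60, 120],
    "Frequency": [20, 5, 3, 1, 12, 2, 8, 4],
    "Monetary":  [900.0, 150.0, 80.0, 20.0, 500.0, 35.0, 300.0, 110.0],
}, index=pd.Index(range(1, 9), name="Customer ID"))

# Recency: fewer days since the last purchase is better, so labels are reversed.
rfm["R"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)
# rank(method="first") breaks ties so that qcut always finds four distinct bin edges.
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), q=4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["Monetary"].rank(method="first"), q=4, labels=[1, 2, 3, 4]).astype(int)

# Two ways of "combining" the scores: a three-digit segment code and a sum.
rfm["RFM_Segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
rfm["RFM_Score"] = rfm[["R", "F", "M"]].sum(axis=1)
print(rfm)
```

Ranking before binning is a design choice for heavily tied columns such as Frequency, where many customers share the same small count and plain quantile edges would otherwise collide.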
K-Means is an unsupervised machine learning algorithm that iteratively partitions unlabeled data points into K clusters, so that each data point belongs to exactly one cluster of points with similar properties. K-Means gives the best results under the following conditions:
- The data's distribution is not skewed.
- The data is standardised (i.e. mean of 0 and standard deviation of 1).
The data is highly skewed, therefore we apply log transformations to reduce the skewness of each variable. We add a small constant first, because the log transformation requires all values to be positive.
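The two preprocessing conditions can be sketched as follows, assuming NumPy, pandas, and scikit-learn with made-up RFM values. The constant of 1 (via `np.log1p`) is an assumption; the text only says that a small constant is added.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical, right-skewed RFM values (note the zero in Recency).
rfm = pd.DataFrame({
    "Recency":   [0, 15, 40, 60, 90, 120, 200, 365],
    "Frequency": [20, 12, 5, 8, 3, 4, 1, 2],
    "Monetary":  [9000.0, 500.0, 150.0, 300.0, 80.0, 110.0, 20.0, 35.0],
})

# log1p computes log(1 + x): the added constant of 1 keeps zero values valid.
rfm_log = np.log1p(rfm)

# Standardise each column to mean 0 and standard deviation 1.
scaler = StandardScaler()
rfm_scaled = pd.DataFrame(
    scaler.fit_transform(rfm_log), columns=rfm.columns, index=rfm.index
)

print(rfm.skew().round(2))      # skewness before the transformation
print(rfm_log.skew().round(2))  # skewness after the transformation
```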
The number of clusters at which the decrease in inertia levels off can be chosen as the right number of clusters for our data. Looking at the elbow curve, we can choose any number of clusters from 3 to 5.
From the flattened graphs and the snake plots, it is evident that 4 clusters segment our customers well. We could also choose a higher number of clusters; it depends entirely on how the company wants to segment its customers.
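The inertia values behind an elbow curve can be collected as in the sketch below. It assumes scikit-learn and uses synthetic, well-separated blobs as a stand-in for the scaled RFM matrix, so the elbow here is much sharper than real customer data is likely to show.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled RFM matrix: four well-separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[-3, -3, -3], [-3, 3, 3], [3, -3, 3], [3, 3, -3]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

inertia = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = model.inertia_  # within-cluster sum of squared distances

# The "elbow" is the k after which further decreases become small.
for k, value in inertia.items():
    print(k, round(value, 1))
```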
1. Davies-Bouldin Score
The Davies-Bouldin score is a metric for evaluating clustering algorithms. The lower the Davies-Bouldin score, the better the clustering.
2. Silhouette Score
The silhouette score is a metric for evaluating clustering algorithms. The higher the silhouette score, the better the clustering.
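Both metrics are available in scikit-learn and can be compared across candidate values of K. The following is a minimal sketch on synthetic blobs rather than the real RFM data; it picks 4 clusters only because the toy data was generated with four groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the scaled RFM matrix: four well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[-3, -3, -3], [-3, 3, 3], [3, -3, 3], [3, 3, -3]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),          # higher is better
    }

best_db = min(scores, key=lambda k: scores[k]["davies_bouldin"])
best_sil = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_db, best_sil)
```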
1. Davies-Bouldin Score
K-Means with 4 clusters has the lowest Davies-Bouldin score of all the cluster counts tested. Therefore the optimum number of clusters is 4.
2. Silhouette Score
K-Means with 4 clusters has the highest silhouette score of all the cluster counts tested. Therefore the optimum number of clusters is 4.
Based on the 4 clusters, we could formulate marketing strategies relevant to each cluster: