Exploring Sales data with RFM Analysis

Scrollable Code BlockFlourish Map Embed

RFM Summary

RFM Analysis is a measurement technique that uses existing customer behavior data, based on Recency, Frequency and Monetary value of purchases, to predict how customers will act in the future as well as to group them into behavioral segments.

RFM can help you answer these questions and more:

Who can you upsell higher-value products to?
Who can you send personalized messages to keep them from churning?
Who is most likely to engage with your brand?
Who can you offer limited-time offers to and try to reactivate?
...

For our visualization, three dimensions of Recency, Frequency, and Monetary coincide with 3D graphics. However, it doesn't work effectively and difficult to understand the result. By combining Frequency and Monetary and turning it to 2D, it is much easier to understand because when a customer subscribes to a month for 100 dollar 12 times in 12 months, it is equivalent to when customers subscribe for $1200 per year all at once. At that time, Recency represented the customer's engagement with our brand.

Having 11 customer segments have the same behavior as the table that we will group them into:

Customer segment	RFM scores	Activity	Action Recommend
Champions	555, 554, 544, 545, 454, 455, 445	Bought recently, buy often and spend the most!	Reward them. Can be early adopters for new products. Will promote your brand.
Loyal customers	543, 444, 435, 355, 354, 345, 344, 335	Spend good money with us often. Responsive to promotions.	Upsell higher value products. Ask for reviews. Engage them.
Potential loyalist	553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323	Recent customers, but spent a good amount and bought more than once.	Offer membership / loyalty program, recommend other products.
Recent customers	512, 511, 422, 421, 412, 411, 311	Bought most recently, but not often.	Provide on-boarding support, give them early success, start building relationship.
Promising	525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313	Recent shoppers, but haven’t spent much.	Create brand awareness, offer free trials
Customers Needing Attention	535, 534, 443, 434, 343, 334, 325, 324	Above average recency, frequency and monetary values. May not have bought very recently though.	Make limited time offers, Recommend based on past purchases. Reactivate them.
About to sleep	331, 321, 312, 221, 213	Below average recency, frequency and monetary values. Will lose them if not reactivated.	Share valuable resources, recommend popular products / renewals at discount, reconnect with them
At risk	255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124	Spent big money and purchased often. But long time ago. Need to bring them back!	Send personalized emails to reconnect, offer renewals, provide helpful resources.
Can’t lose them	155, 154, 144, 214, 215, 115, 114, 113	Made biggest purchases, and often. But haven’t returned for a long time.	Win them back via renewals or newer products, don’t lose them to competition, talk to them.
Hibernating	332, 322, 231, 241, 251, 233, 232, 223, 222, 132, 123, 122, 212, 211	Last purchase was long back, low spenders and low number of orders.	Offer other relevant products and special discounts. Recreate brand value.
Lost	111, 112, 121, 131, 141, 151	Lowest recency, frequency and monetary scores.	Revive interest with reach out campaign, ignore otherwise.

Checking the data

Here our data set for the project. And let's jump right in by checking the data

Input [1]:

SELECT * FROM sales;

Output [1]:

Row_ID	Order_ID	Order_Date	Ship_Date	Ship_Mode	Customer_ID	Customer_Name	Segment	Country	City	State	Postal_Code	Region	Product_ID	Category	Sub_Category	Product_Name	Sales
1	CA-2017-152156	08-11-17	11-11-17	Second Class	CG-12520	Claire Gute	Consumer	United States	Henderson	Kentucky	42420	South	FUR-BO-10001798	Furniture	Bookcases	Bush Somerset Collection Bookcase	261.9599915
2	CA-2017-152156	08-11-17	11-11-17	Second Class	CG-12520	Claire Gute	Consumer	United States	Henderson	Kentucky	42420	South	FUR-CH-10000454	Furniture	Chairs	Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back	731.9400024
3	CA-2017-138688	12-06-17	16-06-17	Second Class	DV-13045	Darrin Van Huff	Corporate	United States	Los Angeles	California	90036	West	OFF-LA-10000240	Office Supplies	Labels	Self-Adhesive Address Labels for Typewriters by Universal	14.61999989
4	US-2016-108966	11-10-16	18-10-16	Standard Class	SO-20335	Sean O'Donnell	Consumer	United States	Fort Lauderdale	Florida	33311	South	FUR-TA-10000577	Furniture	Tables	Bretford CR4500 Series Slim Rectangular Table	957.5775146
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9797	CA-2016-128608	12-01-16	17-01-16	Standard Class	CS-12490	Cindy Schnelling	Corporate	United States	Toledo	Ohio	43615	East	OFF-AR-10001374	Office Supplies	Art	BIC Brite Liner Highlighters, Chisel Tip	10.36800003
9798	CA-2016-128608	12-01-16	17-01-16	Standard Class	CS-12490	Cindy Schnelling	Corporate	United States	Toledo	Ohio	43615	East	TEC-PH-10004977	Technology	Phones	GE 30524EE4	235.1880035
9799	CA-2016-128608	12-01-16	17-01-16	Standard Class	CS-12490	Cindy Schnelling	Corporate	United States	Toledo	Ohio	43615	East	TEC-PH-10000912	Technology	Phones	Anker 24W Portable Micro USB Car Charger	26.37599945
9800	CA-2016-128608	12-01-16	17-01-16	Standard Class	CS-12490	Cindy Schnelling	Corporate	United States	Toledo	Ohio	43615	East	TEC-AC-10000487	Technology	Accessories	SanDisk Cruzer 4 GB USB Flash Drive	10.38399982

9800 rows x 18 columns

Before taking further steps, we need to ensure our data’s integrity by counting the null values we have missed, so as to avoid coming to the wrong conclusion about the data.

Input [2]:

SELECT SUM(
    CASE WHEN CONCAT_WS('', 
    Row_ID, 
    Order_ID, 
    Order_Date, 
    Ship_Date, 
    Ship_Mode, 
    Customer_ID, 
    Customer_Name, 
    Segment, 
    Country, 
    City, 
    State, 
    Postal_Code, 
    Region, 
    Product_ID, 
    Category, 
    Sub_Category, 
    Product_Name, 
    Sales) IS NULL THEN 1 ELSE 0 END 
) AS count_null
FROM sales;

Output [2]:

count_null
0

Calculating metrics

After reviewing the data, we may realize some crucial attributes in the table for analyzing and comparing each type with each other, such as:

Ship mode includes Standard Class, Second Class, First Class, and Same Day, representing the delivery methods that we have served.
Segment includes Consumer, Corporate, and Home Office, which represent our customer segments.
Category includes Furniture, Office Supplies, and Technology.
Each Sub-category includes Chairs, Bookcases, Tables, Art, Paper, Copiers, Accessories, and Phones,... every single sub-category has numerous items from different brands.
State represents 49 states in the US excluding Hawaii, which our store isn't located.
Year: 2015 - 2018

Let's calculate crucial metrics for the data that we have explored based on its characteristics.

Code and Table Layout

Sales by
Ship Mode

Input [3]:

SELECT Ship_Mode, 
      ROUND(SUM(sales),2) AS shipmode_sales 
FROM sales 
GROUP BY Ship_Mode
ORDER BY 2 DESC;

Outnput [3]:

Ship_Mode	shipmode_sales
Standard Class	1340831.31
Second Class	449914.18
First Class	345572.26
Same Day	125219.04

Sales by
Segment

Input [4]:

SELECT Segment, 
      ROUND(SUM(sales),2) AS segment_sales
FROM sales 
GROUP BY Segment
ORDER BY 2 DESC;

Output [4]:

Segment	segment_sales
Consumer	1148060.53
Corporate	688494.07
Home Office	424982.18

Sales by
Category

Input [5]:

SELECT Category, 
      ROUND(SUM(sales),2) AS category_sales 
FROM sales
GROUP BY Category
ORDER BY 2 DESC;

Output [5]:

Category	category_sales
Technology	827455.87
Furniture	728658.57
Office Supplies	705422.33

Sales by
Sub-
Category

Input [6]:

SELECT Sub_Category, 
      ROUND(SUM(sales),2) AS sub_sales 
FROM sales
GROUP BY Sub_Category
ORDER BY 2 DESC;

Output [6]:

#	Sub Category	sub_category _sales
1	Phones	327782.45
2	Chairs	322822.73
3	Storage	219343.39
...	...	...
16	Labels	12347.73
17	Fasteners	3001.96

Sales by
Year

Input [7]:

SELECT	YEAR(Order_Date) as Year,
        MONTH(Order_Date) as Month, 
        ROUND(SUM(Sales),2) Revenue
FROM sales
GROUP BY YEAR(Order_Date), MONTH(Order_Date)
ORDER BY 1,2;

Output [7]:

Year	Month	Revenue
2015	1	14205.71
2015	2	4519.89
...	...	...
2018	11	117938.15
2018	12	83030.39

Sales by
State

Input [8]:

SELECT  State, 
        ROUND(SUM(sales),2) Revenue
FROM sales
GROUP BY State
ORDER BY 2 DESC;

Output [8]:

#	State	Revenue
1	California	446306.46
2	New York	306361.15
3	Texas	168572.53
...	...	...
48	West Virginia	1209.82
49	North Dakota	919.91

RFM Analysis

After calculating metrics for core characteristics, the most important section of the RFM technique is grouping our customer segments and determining how much engagement our brand has with the consumer.

Input [9]:

WITH rfm AS 
  (
    SELECT 
      Customer_Name, 
      SUM(sales) MonetaryValue,
      AVG(sales) AvgMonetaryValue,
      COUNT(Order_ID) Frequency,
      MAX(Order_Date) last_order_date,
      (SELECT MAX(Order_Date) FROM sales) max_order_date,
      DATEDIFF(DD, MAX(Order_Date), (SELECT MAX(Order_Date) FROM sales)) Recency
    FROM sales
    GROUP BY Customer_Name
  ),
  rfm_calc AS
  (
      SELECT r.*,
      NTILE(5) OVER (ORDER BY Recency DESC) rfm_recency,
      NTILE(5) OVER (ORDER BY Frequency) rfm_frequency,
      NTILE(5) OVER (ORDER BY MonetaryValue) rfm_monetary
    FROM rfm r
  )
  SELECT 
    c.*, rfm_recency+ rfm_frequency+ rfm_monetary AS rfm_cell,
    CAST(rfm_recency AS VARCHAR) + CAST(rfm_frequency AS VARCHAR) + CAST(rfm_monetary  AS VARCHAR)rfm_cell_string
  INTO #rfm
  FROM rfm_calc c
  
  SELECT Customer_Name , rfm_recency, rfm_frequency, rfm_monetary,
    CASE
      WHEN rfm_cell_string IN (555, 554, 544, 545, 454, 455, 445) THEN 'Champions'
      WHEN rfm_cell_string IN (543, 444, 435, 355, 354, 345, 344, 335) THEN 'Loyal customers'
      WHEN rfm_cell_string IN (553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323) THEN 'Potential loyalist'
      WHEN rfm_cell_string IN (512, 511, 422, 421, 412, 411, 311) THEN 'Recent customers'
      WHEN rfm_cell_string IN (525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313) THEN 'Promising'
      WHEN rfm_cell_string IN (535, 534, 443, 434, 343, 334, 325, 324) THEN 'Customers needing attention' 
      WHEN rfm_cell_string IN (331, 321, 312, 221, 213) THEN 'About to sleep' 
      WHEN rfm_cell_string IN (255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124) THEN 'At risk'
      WHEN rfm_cell_string IN (155, 154, 144, 214, 215, 115, 114, 113) THEN 'Can’t lose them'
      WHEN rfm_cell_string IN (332, 322, 231, 241, 251, 233, 232, 223, 222, 132, 123, 122, 212, 211) THEN 'Hibernating' 
      WHEN rfm_cell_string IN (111, 112, 121, 131, 141, 151) THEN 'Lost'
    END rfm_segment
  FROM #rfm
  ORDER BY rfm_recency DESC, rfm_frequency DESC ,rfm_monetary DESC;

Output [9]:

#	Customer_Name	rfm_recency	rfm_frequency	rfm_monetary	rfm_segment
1	Arianne Irving	5	5	5	Champions
2	Suzanne McNair	5	5	5	Champions
3	Rick Bensley	5	5	5	Champions
4	Mike Pelletier	5	5	5	Champions
...	...	...	...	...
790	Anna Chung	1	1	1	Lost
791	Carlos Meador	1	1	1	Lost
792	Bobby Trafton	1	1	1	Lost
793	Shirley Schmidt	1	1	1	Lost

According to the characteristics of each group, let's segment them again into larger groups for easier analysis:

Loyal Group as Group 1 includes Champions, Loyal customers.
Vulnerable Group as Group 2 includes Potential loyalist, Promising, Customers needing action, At risk, Can't lose them and Recent customers.
Lost Group as Group 3 includes About to sleep, Hibernating and Lost.

As the result shows, our supermarket has the number of customers in Group 1 at 32.83%, who pay over average and also shop frequently. With the highest engagement, they can be early adopters of new products, need to reward them by asking for reviews to get discounts or coupons on the next purchase. We could lessen our focus on them because they trusted to buy our products.

Instead, Group 2 is the group that needs the most concentration because their vulnerability makes it easy to get influenced or attracted by competitor products or on-sale products from others and leave our brand as presented by our characteristics. They are 51.37% of our consumer base, that much force we must focus on and take action on, such as a mix of methods like discounts on products, email marketing and social advertising to get their attention and bring them back, depending on our financial situation, for instance that group in them we may consider focusing on as Potential loyalist and At risk has over 2000 customers, which is approximately 55% of vulnerable customers.

And at least focusing is Lost Group (Group 3),which represents around 15% of our consumer base. Characteristics of this group include variety-seeking buying behavior and comparing with other products, making it challenging to re-engage them after a long period of not interacting with or trusting alternative products. 15% seems pretty large, however the reason for least focusing is because our resources must be spent on Group 2 much more than Group 3 to achieve maximum effectiveness.

Further Steps

Great results and conclusions from RFM Analysis helps us to determine which customer group we have to focus on.

Imagining after considering all of the conditions, we need the Orders data to answer the question 'Which item comes together the most when customers choose an order that has two items in?" that represents which category and sub-category have purchased the most when customers only buy exactly 2 items per order to make decisions.

Input [10]:

SELECT DISTINCT Order_ID, 
  STUFF(
  (SELECT ',' + Category 
  FROM sales p
  WHERE Order_ID IN (
            SELECT Order_ID 
            FROM (SELECT DISTINCT Order_ID, COUNT(*) row_number
                FROM sales 
                GROUP BY Order_ID
                )m 
            WHERE row_number = 2
            AND p.Order_ID = s.Order_ID
                        ) 
  FOR XML PATH ('')),1,1,'') AS Category,
  STUFF(
  (SELECT ',' + Sub_Category
  FROM sales p
  WHERE Order_ID in (
            SELECT Order_ID 
            FROM (SELECT DISTINCT Order_ID, COUNT(*) row_number
                FROM sales 
                GROUP BY Order_ID
                )m 
            WHERE row_number = 2
            AND p.Order_ID = s.Order_ID
                        )
  FOR XML PATH ('')),1,1,'') AS Sub_Category
FROM sales s
ORDER BY 2 DESC;

Output [10]:

#	Order_ID	Category	Sub_Category
1	CA-2015-134726	Technology,Technology	Accessories,Phones
2	CA-2015-135993	Technology,Technology	Phones,Accessories
3	CA-2015-139192	Technology,Technology	Phones,Accessories
4	CA-2015-143840	Technology,Technology	Phones,Phones
...	...	...	...
4919	US-2018-168613	NULL	NULL
4920	US-2018-168690	NULL	NULL
4921	US-2018-168802	NULL	NULL
4922	US-2018-169551	NULL	NULL

The length of the table is too long and a brief presentation is required, the Output [10] cannot comprehensively represent all results. Because of that, I added an extra file that contains all of the Output [10] in my repo to make it easier for analysis.

Let's query again to check how many different categories come together and which sub-category is more frequently purchased by customers at a time.

Note: As presented in Calculating Metrics, we have 3 different categories, allowing customers to buy 3 × 3 = 9 different combinations when customers buy exactly 2 items per order. However, Sub-Category isn't that easy, we have 17 different sub-categories leading to customers having 17 × 17 = 289 different combination options, it's too much that I can present it in a table, and we also brief it as previous ones.

Category
by number
of Orders

Input [11]:

SELECT category, 
      COUNT(*) AS count_order 
FROM extra
GROUP BY category
ORDER BY 2 DESC;

Output [11]:

Category	count_order
Office Supplies, Office Supplies	846
Office Supplies, Technology	312
Furniture, Office Supplies	296
Technology, Office Supplies	266
Office Supplies, Furniture	264
Furniture,Technology	118
Furniture,Furniture	108
Technology,Furniture	108
Technology,Technology	78

Sub-
Category
by number
of Orders

Input [12]:

SELECT category, sub_category, 
      COUNT(*) AS count_order 
FROM extra
GROUP BY category, sub_category
ORDER BY 1, 3 DESC;

Output [12]:

#	Category	Sub Category	Count Order
1	Furniture, Furniture	Furnishings, Furnishings	28
2	Furniture, Furniture	Chairs, Furnishings	18
...	...	...	...
219	Technology, Technology	Machines, Accessories	2
220	Technology, Technology	Phones, Copiers	2

Rank 1 each
Sub-Category
in Category

Input [13]:

WITH rank_item AS (
  SELECT  category, sub_category, 
          COUNT(*) AS count_order,
          RANK() OVER(PARTITION BY category 
          ORDER BY COUNT(*) DESC) AS rank
  FROM extra
  GROUP BY category,sub_category)
SELECT category, sub_category, 
      count_order, rank
FROM rank_item
WHERE rank = 1
ORDER BY 1;

Output [13]:

category	sub category	count _order	rank
Furniture, Furniture	Furnishings, Furnishings	28	1
Furniture, Office Supplies	Furnishings, Binders	44	1
Furniture, Technology	Furnishings, Phones	30	1
Office Supplies, Furniture	Paper, Furnishings	30	1
Office Supplies, Office Supplies	Paper, Binders	58	1
Office Supplies, Technology	Paper, Accessories	38	1
Office Supplies, Technology	Binders, Accessories	53	1
Technology, Furniture	Furnishings, Phones	24	1
Technology, Office Supplies	Phones, Binders	34	1
Technology, Technology	Accessories, Phones	38	1

Here is the list of every sub-category with the highest number of orders by customer in 9 combination options per category. Based on that, for each program in each different sub-category, we all have data to make decisions more accurately.

Furthermore, we can adjust or add more rank in the query to get more data in a category such as Technology-Technology, to determine which will be combination sub-category in our top selections.

Limitation

The limitation of the query is that the tables in SQL cannot detect "Phones,Accessories" and "Acessories,Phones" as the same, leading to the data being missed on the reverse side and not counted in the query. And the same thing happens with other sub-categories, one side that is not recognized by SQL is missing. This means that our data integrity was not ensured.

Because of the limitations of capabilities, I have not yet been able to dig deeper into this complex task. I'm going to upgrade my skills, complete it and update as soon as possible.

Share on

Twitter Facebook LinkedIn

RFM Summary

Checking the data

Calculating metrics

Sales by Ship Mode

Sales by Segment

Sales by Category

Sales by Sub-Category

Sales by Year

Sales by State