Exploring Sales data with RFM Analysis
RFM Summary
RFM Analysis is a measurement technique that uses existing customer behavior data, based on Recency, Frequency and Monetary value of purchases, to predict how customers will act in the future as well as to group them into behavioral segments.
RFM can help you answer these questions and more:
- Who can you upsell higher-value products to?
- Who can you send personalized messages to keep them from churning?
- Who is most likely to engage with your brand?
- Who can you offer limited-time offers to and try to reactivate?
- ...
For our visualization, three dimensions of Recency, Frequency, and Monetary coincide with 3D graphics. However, it doesn't work effectively and difficult to understand the result. By combining Frequency and Monetary and turning it to 2D, it is much easier to understand because when a customer subscribes to a month for 100 dollar 12 times in 12 months, it is equivalent to when customers subscribe for $1200 per year all at once. At that time, Recency represented the customer's engagement with our brand.
Having 11 customer segments have the same behavior as the table that we will group them into:
Customer segment | RFM scores | Activity | Action Recommend |
---|---|---|---|
Champions | 555, 554, 544, 545, 454, 455, 445 | Bought recently, buy often and spend the most! | Reward them. Can be early adopters for new products. Will promote your brand. |
Loyal customers | 543, 444, 435, 355, 354, 345, 344, 335 | Spend good money with us often. Responsive to promotions. | Upsell higher value products. Ask for reviews. Engage them. |
Potential loyalist | 553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323 | Recent customers, but spent a good amount and bought more than once. | Offer membership / loyalty program, recommend other products. |
Recent customers | 512, 511, 422, 421, 412, 411, 311 | Bought most recently, but not often. | Provide on-boarding support, give them early success, start building relationship. |
Promising | 525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313 | Recent shoppers, but haven’t spent much. | Create brand awareness, offer free trials |
Customers Needing Attention | 535, 534, 443, 434, 343, 334, 325, 324 | Above average recency, frequency and monetary values. May not have bought very recently though. | Make limited time offers, Recommend based on past purchases. Reactivate them. |
About to sleep | 331, 321, 312, 221, 213 | Below average recency, frequency and monetary values. Will lose them if not reactivated. | Share valuable resources, recommend popular products / renewals at discount, reconnect with them |
At risk | 255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124 | Spent big money and purchased often. But long time ago. Need to bring them back! | Send personalized emails to reconnect, offer renewals, provide helpful resources. |
Can’t lose them | 155, 154, 144, 214, 215, 115, 114, 113 | Made biggest purchases, and often. But haven’t returned for a long time. | Win them back via renewals or newer products, don’t lose them to competition, talk to them. |
Hibernating | 332, 322, 231, 241, 251, 233, 232, 223, 222, 132, 123, 122, 212, 211 | Last purchase was long back, low spenders and low number of orders. | Offer other relevant products and special discounts. Recreate brand value. |
Lost | 111, 112, 121, 131, 141, 151 | Lowest recency, frequency and monetary scores. | Revive interest with reach out campaign, ignore otherwise. |
Checking the data
Here our data set for the project. And let's jump right in by checking the dataInput [1]:
SELECT * FROM sales;
Output [1]:
Row_ID | Order_ID | Order_Date | Ship_Date | Ship_Mode | Customer_ID | Customer_Name | Segment | Country | City | State | Postal_Code | Region | Product_ID | Category | Sub_Category | Product_Name | Sales |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | CA-2017-152156 | 08-11-17 | 11-11-17 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9599915 |
2 | CA-2017-152156 | 08-11-17 | 11-11-17 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back | 731.9400024 |
3 | CA-2017-138688 | 12-06-17 | 16-06-17 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters by Universal | 14.61999989 |
4 | US-2016-108966 | 11-10-16 | 18-10-16 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775146 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9797 | CA-2016-128608 | 12-01-16 | 17-01-16 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615 | East | OFF-AR-10001374 | Office Supplies | Art | BIC Brite Liner Highlighters, Chisel Tip | 10.36800003 |
9798 | CA-2016-128608 | 12-01-16 | 17-01-16 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615 | East | TEC-PH-10004977 | Technology | Phones | GE 30524EE4 | 235.1880035 |
9799 | CA-2016-128608 | 12-01-16 | 17-01-16 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615 | East | TEC-PH-10000912 | Technology | Phones | Anker 24W Portable Micro USB Car Charger | 26.37599945 |
9800 | CA-2016-128608 | 12-01-16 | 17-01-16 | Standard Class | CS-12490 | Cindy Schnelling | Corporate | United States | Toledo | Ohio | 43615 | East | TEC-AC-10000487 | Technology | Accessories | SanDisk Cruzer 4 GB USB Flash Drive | 10.38399982 |
Before taking further steps, we need to ensure our data’s integrity by counting the null values we have missed, so as to avoid coming to the wrong conclusion about the data.
Input [2]:
SELECT SUM(
CASE WHEN CONCAT_WS('',
Row_ID,
Order_ID,
Order_Date,
Ship_Date,
Ship_Mode,
Customer_ID,
Customer_Name,
Segment,
Country,
City,
State,
Postal_Code,
Region,
Product_ID,
Category,
Sub_Category,
Product_Name,
Sales) IS NULL THEN 1 ELSE 0 END
) AS count_null
FROM sales;
Output [2]:
count_null |
---|
0 |
Calculating metrics
After reviewing the data, we may realize some crucial attributes in the table for analyzing and comparing each type with each other, such as:
- Ship mode includes Standard Class, Second Class, First Class, and Same Day, representing the delivery methods that we have served.
- Segment includes Consumer, Corporate, and Home Office, which represent our customer segments.
- Category includes Furniture, Office Supplies, and Technology.
- Each Sub-category includes Chairs, Bookcases, Tables, Art, Paper, Copiers, Accessories, and Phones,... every single sub-category has numerous items from different brands.
- State represents 49 states in the US excluding Hawaii, which our store isn't located.
- Year: 2015 - 2018
Let's calculate crucial metrics for the data that we have explored based on its characteristics.
Sales by
Ship Mode
Input [3]:
SELECT Ship_Mode,
ROUND(SUM(sales),2) AS shipmode_sales
FROM sales
GROUP BY Ship_Mode
ORDER BY 2 DESC;
Outnput [3]:
Ship_Mode | shipmode_sales |
---|---|
Standard Class | 1340831.31 |
Second Class | 449914.18 |
First Class | 345572.26 |
Same Day | 125219.04 |
Sales by
Segment
Input [4]:
SELECT Segment,
ROUND(SUM(sales),2) AS segment_sales
FROM sales
GROUP BY Segment
ORDER BY 2 DESC;
Output [4]:
Segment | segment_sales |
---|---|
Consumer | 1148060.53 |
Corporate | 688494.07 |
Home Office | 424982.18 |
Sales by
Category
Input [5]:
SELECT Category,
ROUND(SUM(sales),2) AS category_sales
FROM sales
GROUP BY Category
ORDER BY 2 DESC;
Output [5]:
Category | category_sales |
---|---|
Technology | 827455.87 |
Furniture | 728658.57 |
Office Supplies | 705422.33 |
Sales by
Sub-
Category
Input [6]:
SELECT Sub_Category,
ROUND(SUM(sales),2) AS sub_sales
FROM sales
GROUP BY Sub_Category
ORDER BY 2 DESC;
Output [6]:
# | Sub Category | sub_category _sales |
---|---|---|
1 | Phones | 327782.45 |
2 | Chairs | 322822.73 |
3 | Storage | 219343.39 |
... | ... | ... |
16 | Labels | 12347.73 |
17 | Fasteners | 3001.96 |
Sales by
Year
Input [7]:
SELECT YEAR(Order_Date) as Year,
MONTH(Order_Date) as Month,
ROUND(SUM(Sales),2) Revenue
FROM sales
GROUP BY YEAR(Order_Date), MONTH(Order_Date)
ORDER BY 1,2;
Output [7]:
Year | Month | Revenue |
---|---|---|
2015 | 1 | 14205.71 |
2015 | 2 | 4519.89 |
... | ... | ... |
2018 | 11 | 117938.15 |
2018 | 12 | 83030.39 |
Sales by
State
Input [8]:
SELECT State,
ROUND(SUM(sales),2) Revenue
FROM sales
GROUP BY State
ORDER BY 2 DESC;
Output [8]:
# | State | Revenue |
---|---|---|
1 | California | 446306.46 |
2 | New York | 306361.15 |
3 | Texas | 168572.53 |
... | ... | ... |
48 | West Virginia | 1209.82 |
49 | North Dakota | 919.91 |
RFM Analysis
After calculating metrics for core characteristics, the most important section of the RFM technique is grouping our customer segments and determining how much engagement our brand has with the consumer.
Input [9]:
WITH rfm AS
(
SELECT
Customer_Name,
SUM(sales) MonetaryValue,
AVG(sales) AvgMonetaryValue,
COUNT(Order_ID) Frequency,
MAX(Order_Date) last_order_date,
(SELECT MAX(Order_Date) FROM sales) max_order_date,
DATEDIFF(DD, MAX(Order_Date), (SELECT MAX(Order_Date) FROM sales)) Recency
FROM sales
GROUP BY Customer_Name
),
rfm_calc AS
(
SELECT r.*,
NTILE(5) OVER (ORDER BY Recency DESC) rfm_recency,
NTILE(5) OVER (ORDER BY Frequency) rfm_frequency,
NTILE(5) OVER (ORDER BY MonetaryValue) rfm_monetary
FROM rfm r
)
SELECT
c.*, rfm_recency+ rfm_frequency+ rfm_monetary AS rfm_cell,
CAST(rfm_recency AS VARCHAR) + CAST(rfm_frequency AS VARCHAR) + CAST(rfm_monetary AS VARCHAR)rfm_cell_string
INTO #rfm
FROM rfm_calc c
SELECT Customer_Name , rfm_recency, rfm_frequency, rfm_monetary,
CASE
WHEN rfm_cell_string IN (555, 554, 544, 545, 454, 455, 445) THEN 'Champions'
WHEN rfm_cell_string IN (543, 444, 435, 355, 354, 345, 344, 335) THEN 'Loyal customers'
WHEN rfm_cell_string IN (553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323) THEN 'Potential loyalist'
WHEN rfm_cell_string IN (512, 511, 422, 421, 412, 411, 311) THEN 'Recent customers'
WHEN rfm_cell_string IN (525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313) THEN 'Promising'
WHEN rfm_cell_string IN (535, 534, 443, 434, 343, 334, 325, 324) THEN 'Customers needing attention'
WHEN rfm_cell_string IN (331, 321, 312, 221, 213) THEN 'About to sleep'
WHEN rfm_cell_string IN (255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124) THEN 'At risk'
WHEN rfm_cell_string IN (155, 154, 144, 214, 215, 115, 114, 113) THEN 'Can’t lose them'
WHEN rfm_cell_string IN (332, 322, 231, 241, 251, 233, 232, 223, 222, 132, 123, 122, 212, 211) THEN 'Hibernating'
WHEN rfm_cell_string IN (111, 112, 121, 131, 141, 151) THEN 'Lost'
END rfm_segment
FROM #rfm
ORDER BY rfm_recency DESC, rfm_frequency DESC ,rfm_monetary DESC;
Output [9]:
# | Customer_Name | rfm_recency | rfm_frequency | rfm_monetary | rfm_segment |
---|---|---|---|---|---|
1 | Arianne Irving | 5 | 5 | 5 | Champions |
2 | Suzanne McNair | 5 | 5 | 5 | Champions |
3 | Rick Bensley | 5 | 5 | 5 | Champions |
4 | Mike Pelletier | 5 | 5 | 5 | Champions |
... | ... | ... | ... | ... | |
790 | Anna Chung | 1 | 1 | 1 | Lost |
791 | Carlos Meador | 1 | 1 | 1 | Lost |
792 | Bobby Trafton | 1 | 1 | 1 | Lost |
793 | Shirley Schmidt | 1 | 1 | 1 | Lost |
According to the characteristics of each group, let's segment them again into larger groups for easier analysis:
- Loyal Group as Group 1 includes Champions, Loyal customers.
- Vulnerable Group as Group 2 includes Potential loyalist, Promising, Customers needing action, At risk, Can't lose them and Recent customers.
- Lost Group as Group 3 includes About to sleep, Hibernating and Lost.
As the result shows, our supermarket has the number of customers in Group 1 at 32.83%, who pay over average and also shop frequently. With the highest engagement, they can be early adopters of new products, need to reward them by asking for reviews to get discounts or coupons on the next purchase. We could lessen our focus on them because they trusted to buy our products.
Instead, Group 2 is the group that needs the most concentration because their vulnerability makes it easy to get influenced or attracted by competitor products or on-sale products from others and leave our brand as presented by our characteristics. They are 51.37% of our consumer base, that much force we must focus on and take action on, such as a mix of methods like discounts on products, email marketing and social advertising to get their attention and bring them back, depending on our financial situation, for instance that group in them we may consider focusing on as Potential loyalist and At risk has over 2000 customers, which is approximately 55% of vulnerable customers.
And at least focusing is Lost Group (Group 3),which represents around 15% of our consumer base. Characteristics of this group include variety-seeking buying behavior and comparing with other products, making it challenging to re-engage them after a long period of not interacting with or trusting alternative products. 15% seems pretty large, however the reason for least focusing is because our resources must be spent on Group 2 much more than Group 3 to achieve maximum effectiveness.
Further Steps
Great results and conclusions from RFM Analysis helps us to determine which customer group we have to focus on.
Imagining after considering all of the conditions, we need the Orders data to answer the question 'Which item comes together the most when customers choose an order that has two items in?" that represents which category and sub-category have purchased the most when customers only buy exactly 2 items per order to make decisions.
Input [10]:
SELECT DISTINCT Order_ID,
STUFF(
(SELECT ',' + Category
FROM sales p
WHERE Order_ID IN (
SELECT Order_ID
FROM (SELECT DISTINCT Order_ID, COUNT(*) row_number
FROM sales
GROUP BY Order_ID
)m
WHERE row_number = 2
AND p.Order_ID = s.Order_ID
)
FOR XML PATH ('')),1,1,'') AS Category,
STUFF(
(SELECT ',' + Sub_Category
FROM sales p
WHERE Order_ID in (
SELECT Order_ID
FROM (SELECT DISTINCT Order_ID, COUNT(*) row_number
FROM sales
GROUP BY Order_ID
)m
WHERE row_number = 2
AND p.Order_ID = s.Order_ID
)
FOR XML PATH ('')),1,1,'') AS Sub_Category
FROM sales s
ORDER BY 2 DESC;
Output [10]:
# | Order_ID | Category | Sub_Category |
---|---|---|---|
1 | CA-2015-134726 | Technology,Technology | Accessories,Phones |
2 | CA-2015-135993 | Technology,Technology | Phones,Accessories |
3 | CA-2015-139192 | Technology,Technology | Phones,Accessories |
4 | CA-2015-143840 | Technology,Technology | Phones,Phones |
... | ... | ... | ... |
4919 | US-2018-168613 | NULL | NULL |
4920 | US-2018-168690 | NULL | NULL |
4921 | US-2018-168802 | NULL | NULL |
4922 | US-2018-169551 | NULL | NULL |
The length of the table is too long and a brief presentation is required, the Output [10] cannot comprehensively represent all results. Because of that, I added an extra file that contains all of the Output [10] in my repo to make it easier for analysis.
Let's query again to check how many different categories come together and which sub-category is more frequently purchased by customers at a time.
Note: As presented in Calculating Metrics, we have 3 different categories, allowing customers to buy 3 × 3 = 9 different combinations when customers buy exactly 2 items per order. However, Sub-Category isn't that easy, we have 17 different sub-categories leading to customers having 17 × 17 = 289 different combination options, it's too much that I can present it in a table, and we also brief it as previous ones.
Category
by number
of Orders
Input [11]:
SELECT category,
COUNT(*) AS count_order
FROM extra
GROUP BY category
ORDER BY 2 DESC;
Output [11]:
Category | count_order |
---|---|
Office Supplies, Office Supplies | 846 |
Office Supplies, Technology | 312 |
Furniture, Office Supplies | 296 |
Technology, Office Supplies | 266 |
Office Supplies, Furniture | 264 |
Furniture,Technology | 118 |
Furniture,Furniture | 108 |
Technology,Furniture | 108 |
Technology,Technology | 78 |
Sub-
Category
by number
of Orders
Input [12]:
SELECT category, sub_category,
COUNT(*) AS count_order
FROM extra
GROUP BY category, sub_category
ORDER BY 1, 3 DESC;
Output [12]:
# | Category | Sub Category | Count Order |
---|---|---|---|
1 | Furniture, Furniture | Furnishings, Furnishings | 28 |
2 | Furniture, Furniture | Chairs, Furnishings | 18 |
... | ... | ... | ... |
219 | Technology, Technology | Machines, Accessories | 2 |
220 | Technology, Technology | Phones, Copiers | 2 |
Rank 1 each
Sub-Category
in Category
Input [13]:
WITH rank_item AS (
SELECT category, sub_category,
COUNT(*) AS count_order,
RANK() OVER(PARTITION BY category
ORDER BY COUNT(*) DESC) AS rank
FROM extra
GROUP BY category,sub_category)
SELECT category, sub_category,
count_order, rank
FROM rank_item
WHERE rank = 1
ORDER BY 1;
Output [13]:
category | sub category | count _order | rank |
---|---|---|---|
Furniture, Furniture | Furnishings, Furnishings | 28 | 1 |
Furniture, Office Supplies | Furnishings, Binders | 44 | 1 |
Furniture, Technology | Furnishings, Phones | 30 | 1 |
Office Supplies, Furniture | Paper, Furnishings | 30 | 1 |
Office Supplies, Office Supplies | Paper, Binders | 58 | 1 |
Office Supplies, Technology | Paper, Accessories | 38 | 1 |
Office Supplies, Technology | Binders, Accessories | 53 | 1 |
Technology, Furniture | Furnishings, Phones | 24 | 1 |
Technology, Office Supplies | Phones, Binders | 34 | 1 |
Technology, Technology | Accessories, Phones | 38 | 1 |
Here is the list of every sub-category with the highest number of orders by customer in 9 combination options per category. Based on that, for each program in each different sub-category, we all have data to make decisions more accurately.
Furthermore, we can adjust or add more rank in the query to get more data in a category such as Technology-Technology, to determine which will be combination sub-category in our top selections.
Limitation
The limitation of the query is that the tables in SQL cannot detect "Phones,Accessories" and "Acessories,Phones" as the same, leading to the data being missed on the reverse side and not counted in the query. And the same thing happens with other sub-categories, one side that is not recognized by SQL is missing. This means that our data integrity was not ensured.
Because of the limitations of capabilities, I have not yet been able to dig deeper into this complex task. I'm going to upgrade my skills, complete it and update as soon as possible.