Exploring Sales data with RFM Analysis

Scrollable Code BlockFlourish Map Embed

    RFM Summary

    RFM Analysis is a measurement technique that uses existing customer behavior data, based on Recency, Frequency and Monetary value of purchases, to predict how customers will act in the future as well as to group them into behavioral segments.

    RFM can help you answer these questions and more:

    • Who can you upsell higher-value products to?
    • Who can you send personalized messages to keep them from churning?
    • Who is most likely to engage with your brand?
    • Who can you offer limited-time offers to and try to reactivate?
    • ...

    For our visualization, three dimensions of Recency, Frequency, and Monetary coincide with 3D graphics. However, it doesn't work effectively and difficult to understand the result. By combining Frequency and Monetary and turning it to 2D, it is much easier to understand because when a customer subscribes to a month for 100 dollar 12 times in 12 months, it is equivalent to when customers subscribe for $1200 per year all at once. At that time, Recency represented the customer's engagement with our brand.

    Having 11 customer segments have the same behavior as the table that we will group them into:

    Customer segmentRFM scoresActivityAction Recommend
    Champions555, 554, 544, 545, 454, 455, 445Bought recently, buy often and spend the most!Reward them. Can be early adopters for new products. Will promote your brand.
    Loyal customers543, 444, 435, 355, 354, 345, 344, 335Spend good money with us often. Responsive to promotions.Upsell higher value products. Ask for reviews. Engage them.
    Potential loyalist553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323Recent customers, but spent a good amount and bought more than once.Offer membership / loyalty program, recommend other products.
    Recent customers512, 511, 422, 421, 412, 411, 311Bought most recently, but not often.Provide on-boarding support, give them early success, start building relationship.
    Promising525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313Recent shoppers, but haven’t spent much.Create brand awareness, offer free trials
    Customers Needing Attention535, 534, 443, 434, 343, 334, 325, 324Above average recency, frequency and monetary values. May not have bought very recently though.Make limited time offers, Recommend based on past purchases. Reactivate them.
    About to sleep331, 321, 312, 221, 213Below average recency, frequency and monetary values. Will lose them if not reactivated.Share valuable resources, recommend popular products / renewals at discount, reconnect with them
    At risk255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124Spent big money and purchased often. But long time ago. Need to bring them back!Send personalized emails to reconnect, offer renewals, provide helpful resources.
    Can’t lose them155, 154, 144, 214, 215, 115, 114, 113Made biggest purchases, and often. But haven’t returned for a long time.Win them back via renewals or newer products, don’t lose them to competition, talk to them.
    Hibernating332, 322, 231, 241, 251, 233, 232, 223, 222, 132, 123, 122, 212, 211Last purchase was long back, low spenders and low number of orders.Offer other relevant products and special discounts. Recreate brand value.
    Lost111, 112, 121, 131, 141, 151Lowest recency, frequency and monetary scores.Revive interest with reach out campaign, ignore otherwise.

    Checking the data

    Here our data set for the project. And let's jump right in by checking the data

    Input [1]:

    SELECT * FROM sales;
    

    Output [1]:

    Row_IDOrder_IDOrder_DateShip_DateShip_ModeCustomer_IDCustomer_NameSegmentCountryCityStatePostal_CodeRegionProduct_IDCategorySub_CategoryProduct_NameSales
    1CA-2017-15215608-11-1711-11-17Second ClassCG-12520Claire GuteConsumerUnited StatesHendersonKentucky42420SouthFUR-BO-10001798FurnitureBookcasesBush Somerset Collection Bookcase261.9599915
    2CA-2017-15215608-11-1711-11-17Second ClassCG-12520Claire GuteConsumerUnited StatesHendersonKentucky42420SouthFUR-CH-10000454FurnitureChairsHon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back731.9400024
    3CA-2017-13868812-06-1716-06-17Second ClassDV-13045Darrin Van HuffCorporateUnited StatesLos AngelesCalifornia90036WestOFF-LA-10000240Office SuppliesLabelsSelf-Adhesive Address Labels for Typewriters by Universal14.61999989
    4US-2016-10896611-10-1618-10-16Standard ClassSO-20335Sean O'DonnellConsumerUnited StatesFort LauderdaleFlorida33311SouthFUR-TA-10000577FurnitureTablesBretford CR4500 Series Slim Rectangular Table957.5775146
    ......................................................
    9797CA-2016-12860812-01-1617-01-16Standard ClassCS-12490Cindy SchnellingCorporateUnited StatesToledoOhio43615EastOFF-AR-10001374Office SuppliesArtBIC Brite Liner Highlighters, Chisel Tip10.36800003
    9798CA-2016-12860812-01-1617-01-16Standard ClassCS-12490Cindy SchnellingCorporateUnited StatesToledoOhio43615EastTEC-PH-10004977TechnologyPhonesGE 30524EE4235.1880035
    9799CA-2016-12860812-01-1617-01-16Standard ClassCS-12490Cindy SchnellingCorporateUnited StatesToledoOhio43615EastTEC-PH-10000912TechnologyPhonesAnker 24W Portable Micro USB Car Charger26.37599945
    9800CA-2016-12860812-01-1617-01-16Standard ClassCS-12490Cindy SchnellingCorporateUnited StatesToledoOhio43615EastTEC-AC-10000487TechnologyAccessoriesSanDisk Cruzer 4 GB USB Flash Drive10.38399982
    9800 rows x 18 columns

    Before taking further steps, we need to ensure our data’s integrity by counting the null values we have missed, so as to avoid coming to the wrong conclusion about the data. 

    Input [2]:

    SELECT SUM(
        CASE WHEN CONCAT_WS('', 
        Row_ID, 
        Order_ID, 
        Order_Date, 
        Ship_Date, 
        Ship_Mode, 
        Customer_ID, 
        Customer_Name, 
        Segment, 
        Country, 
        City, 
        State, 
        Postal_Code, 
        Region, 
        Product_ID, 
        Category, 
        Sub_Category, 
        Product_Name, 
        Sales) IS NULL THEN 1 ELSE 0 END 
    ) AS count_null
    FROM sales;

    Output [2]:

    count_null
    0

    Calculating metrics

    After reviewing the data, we may realize some crucial attributes in the table for analyzing and comparing each type with each other, such as:

    • Ship mode includes Standard Class, Second Class, First Class, and Same Day, representing the delivery methods that we have served.
    • Segment includes Consumer, Corporate, and Home Office, which represent our customer segments.
    • Category includes Furniture, Office Supplies, and Technology.
    • Each Sub-category includes Chairs, Bookcases, Tables, Art, Paper, Copiers, Accessories, and Phones,... every single sub-category has numerous items from different brands.
    • State represents 49 states in the US excluding Hawaii, which our store isn't located.
    • Year: 2015 - 2018

    Let's calculate crucial metrics for the data that we have explored based on its characteristics.

    Code and Table Layout

    Sales by
    Ship Mode

    Input [3]:

    SELECT Ship_Mode, 
          ROUND(SUM(sales),2) AS shipmode_sales 
    FROM sales 
    GROUP BY Ship_Mode
    ORDER BY 2 DESC;

    Outnput [3]:

    Ship_Modeshipmode_sales
    Standard Class1340831.31
    Second Class449914.18
    First Class345572.26
    Same Day125219.04

    Sales by
    Segment

    Input [4]:

    SELECT Segment, 
          ROUND(SUM(sales),2) AS segment_sales
    FROM sales 
    GROUP BY Segment
    ORDER BY 2 DESC;

    Output [4]:

    Segmentsegment_sales
    Consumer1148060.53
    Corporate688494.07
    Home Office424982.18

    Sales by
    Category

    Input [5]:

    SELECT Category, 
          ROUND(SUM(sales),2) AS category_sales 
    FROM sales
    GROUP BY Category
    ORDER BY 2 DESC;

    Output [5]:

    Categorycategory_sales
    Technology827455.87
    Furniture728658.57
    Office Supplies705422.33

    Sales by
    Sub-
    Category

    Input [6]:

    SELECT Sub_Category, 
          ROUND(SUM(sales),2) AS sub_sales 
    FROM sales
    GROUP BY Sub_Category
    ORDER BY 2 DESC;

    Output [6]:

    #Sub
    Category
    sub_category
    _sales
    1Phones327782.45
    2Chairs322822.73
    3Storage219343.39
    .........
    16Labels12347.73
    17Fasteners3001.96

    Sales by
    Year

    Input [7]:

    SELECT	YEAR(Order_Date) as Year,
            MONTH(Order_Date) as Month, 
            ROUND(SUM(Sales),2) Revenue
    FROM sales
    GROUP BY YEAR(Order_Date), MONTH(Order_Date)
    ORDER BY 1,2;

    Output [7]:

    YearMonthRevenue
    2015114205.71
    201524519.89
    .........
    201811117938.15
    20181283030.39



    Sales by
    State

    Input [8]:

    SELECT  State, 
            ROUND(SUM(sales),2) Revenue
    FROM sales
    GROUP BY State
    ORDER BY 2 DESC;

    Output [8]:

    #StateRevenue
    1California446306.46
    2New York306361.15
    3Texas168572.53
    .........
    48West Virginia1209.82
    49North Dakota919.91

    RFM Analysis

    After calculating metrics for core characteristics, the most important section of the RFM technique is grouping our customer segments and determining how much engagement our brand has with the consumer.

    Input [9]:

    WITH rfm AS 
      (
        SELECT 
          Customer_Name, 
          SUM(sales) MonetaryValue,
          AVG(sales) AvgMonetaryValue,
          COUNT(Order_ID) Frequency,
          MAX(Order_Date) last_order_date,
          (SELECT MAX(Order_Date) FROM sales) max_order_date,
          DATEDIFF(DD, MAX(Order_Date), (SELECT MAX(Order_Date) FROM sales)) Recency
        FROM sales
        GROUP BY Customer_Name
      ),
      rfm_calc AS
      (
          SELECT r.*,
          NTILE(5) OVER (ORDER BY Recency DESC) rfm_recency,
          NTILE(5) OVER (ORDER BY Frequency) rfm_frequency,
          NTILE(5) OVER (ORDER BY MonetaryValue) rfm_monetary
        FROM rfm r
      )
      SELECT 
        c.*, rfm_recency+ rfm_frequency+ rfm_monetary AS rfm_cell,
        CAST(rfm_recency AS VARCHAR) + CAST(rfm_frequency AS VARCHAR) + CAST(rfm_monetary  AS VARCHAR)rfm_cell_string
      INTO #rfm
      FROM rfm_calc c
      
      SELECT Customer_Name , rfm_recency, rfm_frequency, rfm_monetary,
        CASE
          WHEN rfm_cell_string IN (555, 554, 544, 545, 454, 455, 445) THEN 'Champions'
          WHEN rfm_cell_string IN (543, 444, 435, 355, 354, 345, 344, 335) THEN 'Loyal customers'
          WHEN rfm_cell_string IN (553, 551, 552, 541, 542, 533, 532, 531, 452, 451, 442, 441, 431, 453, 433, 432, 423, 353, 352, 351, 342, 341, 333, 323) THEN 'Potential loyalist'
          WHEN rfm_cell_string IN (512, 511, 422, 421, 412, 411, 311) THEN 'Recent customers'
          WHEN rfm_cell_string IN (525, 524, 523, 522, 521, 515, 514, 513, 425, 424, 413, 414, 415, 315, 314, 313) THEN 'Promising'
          WHEN rfm_cell_string IN (535, 534, 443, 434, 343, 334, 325, 324) THEN 'Customers needing attention' 
          WHEN rfm_cell_string IN (331, 321, 312, 221, 213) THEN 'About to sleep' 
          WHEN rfm_cell_string IN (255, 254, 245, 244, 253, 252, 243, 242, 235, 234, 225, 224, 153, 152, 145, 143, 142, 135, 134, 133, 125, 124) THEN 'At risk'
          WHEN rfm_cell_string IN (155, 154, 144, 214, 215, 115, 114, 113) THEN 'Can’t lose them'
          WHEN rfm_cell_string IN (332, 322, 231, 241, 251, 233, 232, 223, 222, 132, 123, 122, 212, 211) THEN 'Hibernating' 
          WHEN rfm_cell_string IN (111, 112, 121, 131, 141, 151) THEN 'Lost'
        END rfm_segment
      FROM #rfm
      ORDER BY rfm_recency DESC, rfm_frequency DESC ,rfm_monetary DESC;

    Output [9]:

    #Customer_Namerfm_recencyrfm_frequencyrfm_monetaryrfm_segment
    1Arianne Irving555Champions
    2Suzanne McNair555Champions
    3Rick Bensley555Champions
    4Mike Pelletier555Champions
    ...............
    790Anna Chung111Lost
    791Carlos Meador111Lost
    792Bobby Trafton111Lost
    793Shirley Schmidt111Lost



    According to the characteristics of each group, let's segment them again into larger groups for easier analysis:

    • Loyal Group as Group 1 includes Champions, Loyal customers.
    • Vulnerable Group as Group 2 includes Potential loyalist, Promising, Customers needing action, At risk, Can't lose them and Recent customers.
    • Lost Group as Group 3 includes About to sleep, Hibernating and Lost.

    As the result shows, our supermarket has the number of customers in Group 1 at 32.83%, who pay over average and also shop frequently. With the highest engagement, they can be early adopters of new products, need to reward them by asking for reviews to get discounts or coupons on the next purchase. We could lessen our focus on them because they trusted to buy our products.

    Instead, Group 2 is the group that needs the most concentration because their vulnerability makes it easy to get influenced or attracted by competitor products or on-sale products from others and leave our brand as presented by our characteristics. They are 51.37% of our consumer base, that much force we must focus on and take action on, such as a mix of methods like discounts on products, email marketing and social advertising to get their attention and bring them back, depending on our financial situation, for instance that group in them we may consider focusing on as Potential loyalist and At risk has over 2000 customers, which is approximately 55% of vulnerable customers.

    And at least focusing is Lost Group (Group 3),which represents around 15% of our consumer base. Characteristics of this group include variety-seeking buying behavior and comparing with other products, making it challenging to re-engage them after a long period of not interacting with or trusting alternative products. 15% seems pretty large, however the reason for least focusing is because our resources must be spent on Group 2 much more than Group 3 to achieve maximum effectiveness.

    Further Steps

    Great results and conclusions from RFM Analysis helps us to determine which customer group we have to focus on.

    Imagining after considering all of the conditions, we need the Orders data to answer the question 'Which item comes together the most when customers choose an order that has two items in?" that represents which category and sub-category have purchased the most when customers only buy exactly 2 items per order to make decisions.

    Input [10]:

    SELECT DISTINCT Order_ID, 
      STUFF(
      (SELECT ',' + Category 
      FROM sales p
      WHERE Order_ID IN (
                SELECT Order_ID 
                FROM (SELECT DISTINCT Order_ID, COUNT(*) row_number
                    FROM sales 
                    GROUP BY Order_ID
                    )m 
                WHERE row_number = 2
                AND p.Order_ID = s.Order_ID
                            ) 
      FOR XML PATH ('')),1,1,'') AS Category,
      STUFF(
      (SELECT ',' + Sub_Category
      FROM sales p
      WHERE Order_ID in (
                SELECT Order_ID 
                FROM (SELECT DISTINCT Order_ID, COUNT(*) row_number
                    FROM sales 
                    GROUP BY Order_ID
                    )m 
                WHERE row_number = 2
                AND p.Order_ID = s.Order_ID
                            )
      FOR XML PATH ('')),1,1,'') AS Sub_Category
    FROM sales s
    ORDER BY 2 DESC;
    

    Output [10]:

    #Order_IDCategorySub_Category
    1CA-2015-134726Technology,TechnologyAccessories,Phones
    2CA-2015-135993Technology,TechnologyPhones,Accessories
    3CA-2015-139192Technology,TechnologyPhones,Accessories
    4CA-2015-143840Technology,TechnologyPhones,Phones
    ............
    4919US-2018-168613NULLNULL
    4920US-2018-168690NULLNULL
    4921US-2018-168802NULLNULL
    4922US-2018-169551NULLNULL

    The length of the table is too long and a brief presentation is required, the Output [10] cannot comprehensively represent all results. Because of that, I added an extra file that contains all of the Output [10] in my repo to make it easier for analysis.

    Let's query again to check how many different categories come together and which sub-category is more frequently purchased by customers at a time.

    Note: As presented in Calculating Metrics, we have 3 different categories, allowing customers to buy 3 × 3 = 9 different combinations when customers buy exactly 2 items per order. However, Sub-Category isn't that easy, we have 17 different sub-categories leading to customers having 17 × 17 = 289 different combination options, it's too much that I can present it in a table, and we also brief it as previous ones.

    Category
    by number
    of Orders

    Input [11]:

    SELECT category, 
          COUNT(*) AS count_order 
    FROM extra
    GROUP BY category
    ORDER BY 2 DESC;

    Output [11]:

    Categorycount_order
    Office Supplies,
    Office Supplies
    846
    Office Supplies,
    Technology
    312
    Furniture,
    Office Supplies
    296
    Technology,
    Office Supplies
    266
    Office Supplies,
    Furniture
    264
    Furniture,Technology118
    Furniture,Furniture108
    Technology,Furniture108
    Technology,Technology78

    Sub-
    Category
    by number
    of Orders

    Input [12]:

    SELECT category, sub_category, 
          COUNT(*) AS count_order 
    FROM extra
    GROUP BY category, sub_category
    ORDER BY 1, 3 DESC;

    Output [12]:

    #CategorySub CategoryCount Order
    1Furniture,
    Furniture
    Furnishings,
    Furnishings
    28
    2Furniture,
    Furniture
    Chairs,
    Furnishings
    18
    ............
    219Technology,
    Technology
    Machines,
    Accessories
    2
    220Technology,
    Technology
    Phones,
    Copiers
    2

    Rank 1 each
    Sub-Category
    in Category

    Input [13]:

    WITH rank_item AS (
      SELECT  category, sub_category, 
              COUNT(*) AS count_order,
              RANK() OVER(PARTITION BY category 
              ORDER BY COUNT(*) DESC) AS rank
      FROM extra
      GROUP BY category,sub_category)
    SELECT category, sub_category, 
          count_order, rank
    FROM rank_item
    WHERE rank = 1
    ORDER BY 1;

    Output [13]:

    categorysub categorycount
    _order
    rank
    Furniture,
    Furniture
    Furnishings,
    Furnishings
    281
    Furniture,
    Office Supplies
    Furnishings,
    Binders
    441
    Furniture,
    Technology
    Furnishings,
    Phones
    301
    Office Supplies,
    Furniture
    Paper,
    Furnishings
    301
    Office Supplies,
    Office Supplies
    Paper,
    Binders
    581
    Office Supplies,
    Technology
    Paper,
    Accessories
    381
    Office Supplies,
    Technology
    Binders,
    Accessories
    531
    Technology,
    Furniture
    Furnishings,
    Phones
    241
    Technology,
    Office Supplies
    Phones,
    Binders
    341
    Technology,
    Technology
    Accessories,
    Phones
    381

    Here is the list of every sub-category with the highest number of orders by customer in 9 combination options per category. Based on that, for each program in each different sub-category, we all have data to make decisions more accurately.

    Furthermore, we can adjust or add more rank in the query to get more data in a category such as Technology-Technology, to determine which will be combination sub-category in our top selections.

    Limitation

    The limitation of the query is that the tables in SQL cannot detect "Phones,Accessories" and "Acessories,Phones" as the same, leading to the data being missed on the reverse side and not counted in the query. And the same thing happens with other sub-categories, one side that is not recognized by SQL is missing. This means that our data integrity was not ensured.

    Because of the limitations of capabilities, I have not yet been able to dig deeper into this complex task. I'm going to upgrade my skills, complete it and update as soon as possible.