Advanced SQL for Cohort Analysis

This lesson dives deep into advanced SQL techniques for cohort analysis, equipping you with the tools to segment users, analyze their behavior patterns over time, and even build simple predictive models. You'll learn to move beyond basic cohort calculations and into the realm of time-series analysis and forecasting using the power of SQL.

Learning Objectives

  • Segment users based on various attributes and behavioral data using SQL.
  • Calculate cohort retention rates using advanced SQL techniques, including monthly and weekly cohort analysis.
  • Perform time-series analysis to identify trends, seasonality, and other patterns in user behavior.
  • Develop basic forecasting models using SQL to predict future user behavior and retention.

Text-to-Speech

Listen to the lesson content

Lesson Content

Advanced Segmentation in SQL

Building upon Day 2's foundation, this section focuses on more sophisticated user segmentation. We'll explore segmenting users based on multiple criteria, including recency, frequency, and monetary value (RFM) analysis.

Example: Segment users based on their purchase frequency and recency.

WITH user_purchase_summary AS (
    SELECT
        user_id,
        MAX(purchase_date) AS last_purchase_date,
        COUNT(*) AS purchase_count
    FROM
        purchases
    GROUP BY
        user_id
)
SELECT
    user_id,
    last_purchase_date,
    purchase_count,
    CASE
        WHEN last_purchase_date >= DATE('now', '-30 days') THEN 'Recent'
        WHEN last_purchase_date BETWEEN DATE('now', '-90 days') AND DATE('now', '-31 days') THEN 'Medium'
        ELSE 'Old'
    END AS recency_segment,
    CASE
        WHEN purchase_count >= 5 THEN 'High Frequency'
        WHEN purchase_count BETWEEN 2 AND 4 THEN 'Medium Frequency'
        ELSE 'Low Frequency'
    END AS frequency_segment
FROM
    user_purchase_summary;

Cohort Retention Analysis: Beyond the Basics

We'll delve deeper into cohort retention, moving beyond simple monthly cohorts to incorporate weekly cohorts and explore more nuanced retention metrics. This includes calculating retention rates by specific user segments.

Example: Calculate the 3-month retention rate for users who signed up in January 2023, broken down by their acquisition source.

WITH cohort_data AS (
    SELECT
        DATE(signup_date, 'start of month') AS cohort_month,
        DATE(signup_date) AS signup_day,
        user_id,
        acquisition_source
    FROM
        users
    WHERE
        STRFTIME('%Y', signup_date) = '2023'
        AND STRFTIME('%m', signup_date) = '01'
),
cohort_activity AS (
    SELECT
        cd.cohort_month,
        cd.signup_day,
        cd.user_id,
        cd.acquisition_source,
        DATE(activity_date, 'start of month') AS activity_month
    FROM
        cohort_data cd
    LEFT JOIN
        activities a ON cd.user_id = a.user_id
    WHERE activity_date IS NOT NULL
),
retention_table AS (
    SELECT
        cohort_month,
        signup_day,
        acquisition_source,
        activity_month,
        COUNT(DISTINCT user_id) AS retained_users,
        COUNT(DISTINCT user_id) / CAST(SUM(COUNT(DISTINCT user_id)) OVER (PARTITION BY cohort_month, acquisition_source) AS REAL) AS retention_rate
    FROM
        cohort_activity
    GROUP BY
        cohort_month, signup_day, acquisition_source, activity_month
)
SELECT
    cohort_month,
    signup_day,
    acquisition_source,
    activity_month,
    retained_users,
    retention_rate
FROM
    retention_table
ORDER BY
    cohort_month, signup_day, acquisition_source, activity_month;

Time-Series Analysis in SQL

Explore techniques for analyzing time-series data using SQL, including identifying trends (growth/decline), seasonality (cyclical patterns), and outliers. We'll utilize window functions for smoothing and calculating moving averages.

Example: Calculate a 7-day moving average of daily active users (DAU).

SELECT
    activity_date,
    COUNT(DISTINCT user_id) AS daily_active_users,
    AVG(COUNT(DISTINCT user_id)) OVER (ORDER BY activity_date ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_average_dau
FROM
    activities
GROUP BY
    activity_date
ORDER BY
    activity_date;

Forecasting with SQL (Basic)

Introduce basic forecasting using SQL. This section will demonstrate how to perform simple linear regression and calculate future values based on historical trends. This is a simplified approach, suitable for educational purposes. For more robust forecasting, dedicated statistical tools are often preferred.

Example: Estimate future user signups using linear regression. (Note: This is simplified. Production forecasting would involve more complex algorithms).

-- Create a CTE to calculate signup information over time
WITH signup_data AS (
    SELECT
        DATE(signup_date) AS signup_day,
        COUNT(*) AS signup_count
    FROM
        users
    GROUP BY
        signup_day
),
-- Calculate some simple linear regression coefficients (Intercept and Slope) on the signup data
linear_regression AS (
    SELECT
        -- Calculate basic statistics needed for the slope and intercept (This is a simplified version)
        SUM(strftime('%J', signup_day) * signup_count) AS sum_xy,
        SUM(strftime('%J', signup_day)) AS sum_x,  -- Day number within the year (Julian day)  
        SUM(signup_count) AS sum_y,
        SUM(strftime('%J', signup_day) * strftime('%J', signup_day)) AS sum_x_squared,
        COUNT(*) AS n
    FROM signup_data
),
-- Calculate the slope and intercept from the sums
regression_coefficients AS (
    SELECT
        (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x * sum_x) AS slope,
        (sum_y - slope * sum_x) / n AS intercept
    FROM linear_regression
)
-- Predict signup counts for the next few days
SELECT
    DATE(DATE('now', '+' || (day_number - 1) || ' days')) AS forecast_date,
    (slope * day_number + intercept) AS predicted_signups
FROM (SELECT GENERATE_SERIES(1, 7) as day_number)  -- Generates series of days for forecast.
CROSS JOIN regression_coefficients;
Progress
0%