Advanced SQL for People Analytics – Window Functions and Common Table Expressions (CTEs)

This advanced SQL lesson dives deep into powerful techniques crucial for People Analytics: Window Functions and Common Table Expressions (CTEs). You will learn how to perform sophisticated calculations, analyze trends, and structure complex queries efficiently for insightful workforce analysis.

Learning Objectives

  • Understand and apply various window functions (e.g., RANK, DENSE_RANK, ROW_NUMBER, LAG, LEAD, SUM, AVG) for advanced data analysis.
  • Master the use of Common Table Expressions (CTEs) to simplify complex queries and improve readability.
  • Utilize CTEs and window functions to solve real-world People Analytics problems, such as identifying top performers, analyzing employee churn, and tracking salary progression.
  • Write optimized and efficient SQL queries for performance and scalability in large datasets.

Lesson Content

Introduction to Window Functions

Window functions are powerful tools in SQL that allow you to perform calculations across a set of table rows that are related to the current row. Unlike aggregate functions (e.g., SUM, AVG) that collapse multiple rows into a single output row, window functions return a value for each row, based on calculations across the defined window (partition) of rows. This capability is essential for trends analysis, ranking, and comparisons within specific groups.

Syntax:

function_name(expression) OVER (
    PARTITION BY partition_expression, ...
    ORDER BY order_expression [ASC | DESC], ...
    [frame_clause]
)
  • function_name: The window function to apply (e.g., SUM, AVG, RANK, ROW_NUMBER).
  • expression: The column or calculation to perform on the window.
  • PARTITION BY: Divides the rows into partitions to which the window function is applied separately. This is optional; if omitted, the entire table is treated as one partition.
  • ORDER BY: Specifies the order within each partition. Crucial for ranking and calculating running totals.
  • frame_clause: (Optional) Further refines the window, e.g., to calculate a moving average or sum for a specific number of rows before and after the current row.

Example: Calculating the salary rank within each department.

SELECT
    employee_id,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM
    employees;

Quick Check: Which window function assigns a unique sequential integer to each row within a partition?

Window Function Types & Use Cases

Let's explore common window functions with examples relevant to People Analytics:

  • Ranking Functions:

    • RANK(): Assigns a rank to each row within a partition based on the order. Skips ranks if there are ties.
    • DENSE_RANK(): Similar to RANK(), but doesn't skip ranks for ties (e.g., 1, 2, 2, 3).
    • ROW_NUMBER(): Assigns a unique, sequential integer to each row. Useful for numbering rows within a partition.

    Example (Top Performers):
    sql SELECT employee_id, performance_score, RANK() OVER (ORDER BY performance_score DESC) AS performance_rank FROM performance_reviews WHERE review_year = 2023;

  • Lead and Lag Functions: Access data from preceding or subsequent rows within the partition.

    • LAG(column, offset, default): Retrieves the value from a preceding row. offset specifies how many rows back (default is 1). default is the value to return if there is no preceding row.
    • LEAD(column, offset, default): Retrieves the value from a subsequent row.

    Example (Churn Analysis):
    sql SELECT employee_id, hire_date, termination_date, LAG(termination_date, 1, hire_date) OVER (PARTITION BY employee_id ORDER BY hire_date) AS previous_termination_date FROM employee_history;

  • Aggregate Functions with Windowing: SUM(), AVG(), COUNT(), etc., used in conjunction with OVER() to compute running totals, moving averages, etc.
    Example (Salary Progression):
    sql SELECT employee_id, salary, salary_date, SUM(salary) OVER (PARTITION BY employee_id ORDER BY salary_date) AS cumulative_salary FROM salary_history;

Quick Check: What is the primary benefit of using Common Table Expressions (CTEs)?

Introduction to Common Table Expressions (CTEs)

CTEs are temporary result sets defined within the scope of a single SQL statement. They improve query readability, organization, and reusability, especially when dealing with complex logic. Think of them as named subqueries that you can reference within a larger query.

Syntax:

WITH cte_name AS (
    SELECT statement
)
SELECT statement
FROM cte_name;

You can chain multiple CTEs using commas:

WITH cte1 AS (...
), cte2 AS (...
)
SELECT ...
FROM cte1
JOIN cte2 ON ...;

Benefits:
* Readability: Break down complex queries into logical blocks.
* Organization: Easier to understand and maintain.
* Reusability: Avoid repeating the same subquery multiple times within a single query.
* Optimization: The query optimizer can often optimize queries involving CTEs effectively.

Example (Calculating Bonus Based on Performance):

WITH performance_summary AS (
    SELECT
        employee_id,
        AVG(performance_score) AS avg_performance
    FROM
        performance_reviews
    GROUP BY
        employee_id
),
bonus_calculation AS (
    SELECT
        employee_id,
        CASE
            WHEN avg_performance >= 90 THEN salary * 0.10  -- 10% bonus
            WHEN avg_performance >= 70 THEN salary * 0.05  -- 5% bonus
            ELSE 0  -- No bonus
        END AS bonus_amount
    FROM
        performance_summary
    JOIN employees ON performance_summary.employee_id = employees.employee_id
)
SELECT
    employees.employee_id,
    employees.salary,
    COALESCE(bonus_calculation.bonus_amount,0) AS bonus_amount
FROM employees
LEFT JOIN bonus_calculation ON employees.employee_id = bonus_calculation.employee_id;

Quick Check: Which window function would you use to find the salary of the employee *before* the current row in a sorted salary history?

Using CTEs and Window Functions Together

CTEs and window functions are extremely powerful when combined. CTEs can be used to prepare data for window function calculations, making complex analyses manageable. This is a common pattern in People Analytics for things like identifying top performers and calculating compensation changes.

Example (Identify Top 10% Paid Employees):

WITH employee_salaries_ranked AS (
    SELECT
        employee_id,
        salary,
        RANK() OVER (ORDER BY salary DESC) as salary_rank,
        COUNT(*) OVER() AS total_employees
    FROM
        employees
),
top_10_percent AS (
    SELECT
        employee_id,
        salary
    FROM
        employee_salaries_ranked
    WHERE
        salary_rank <= (total_employees * 0.10)  --  Top 10%
)
SELECT
    * -- Consider explicitly listing columns for better readability
FROM
    top_10_percent;

Quick Check: What does the `PARTITION BY` clause do within a window function?

Progress
0%