Advanced Data Profiling and Data Quality Assessment

This advanced lesson covers sophisticated data profiling and rigorous data quality assessment, essential skills for any data scientist. You will move beyond basic descriptive statistics to explore data distributions, identify subtle data quality issues, and measure their impact on model performance using real-world datasets.

Learning Objectives

  • Master advanced data profiling techniques, including identifying complex data types and understanding data distributions.
  • Apply specialized plots (e.g., QQ plots, KDE plots) for in-depth data exploration.
  • Analyze data quality across different dimensions (completeness, validity, accuracy, consistency, timeliness) and quantify their impact on model performance.
  • Implement custom profiling functions for specialized analysis and outlier detection.

Lesson Content

Advanced Data Profiling Techniques

Beyond basic descriptive statistics, advanced profiling involves understanding data types, distributions, and potential anomalies. This requires using specialized libraries and custom functions.

1. Identifying Complex Data Types: While Pandas and other libraries automatically infer data types, manually inspecting and verifying them is crucial. This is particularly important with time series, geographical data, and unstructured data (text).

  • Example: Consider a dataset with a 'date' column that Pandas initially reads as dtype 'object'. Convert it with pd.to_datetime(), then inspect its components with the .dt accessor (e.g., df['date'].dt.year).
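
The conversion described above can be sketched as follows. The sample values, the '%Y-%m-%d' format, and the choice of errors='coerce' are illustrative assumptions, not part of the lesson's dataset. Coercing turns unparseable entries into NaT so they can be counted rather than stopping the conversion.

```python
import pandas as pd

# Hypothetical sample: dates stored as strings, with one malformed entry.
df = pd.DataFrame({'date': ['2023-01-15', '2023-02-20', 'not a date', '2023-04-05']})
print(df['date'].dtype)  # a string/object dtype before conversion

# errors='coerce' replaces entries that do not match the format with NaT
# instead of raising an exception.
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d', errors='coerce')
print(df['date'].dtype)

# Count the entries that failed to parse, then inspect components via .dt.
n_unparsed = df['date'].isna().sum()
print(f'Unparsed entries: {n_unparsed}')
print(df['date'].dt.year)
print(df['date'].dt.month)
```

A nonzero count of unparsed entries is itself a profiling result: it points to rows whose format should be investigated before any time-based analysis.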

2. Understanding Data Distributions: Visualizing data distributions is vital for understanding data characteristics.

  • Histograms: Useful for understanding the central tendency, spread, and shape of numerical data.
  • Kernel Density Estimation (KDE) Plots: Offer a smoother representation of the distribution than histograms, especially useful for identifying multimodal distributions.
  • QQ Plots (Quantile-Quantile Plots): Compare the quantiles of your data to those of a theoretical distribution (e.g., normal). Points that deviate from the straight reference line indicate a departure from that distribution (non-normality, when the reference is normal).

  • Example:
    ```python
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy.stats as stats

    # Sample data (replace with your dataset)
    np.random.seed(42)
    data = pd.DataFrame({'value': np.random.normal(0, 1, 1000)})

    # Histogram with a KDE overlay
    plt.figure(figsize=(8, 6))
    sns.histplot(data['value'], kde=True)
    plt.title('Histogram with KDE')
    plt.show()

    # KDE plot
    plt.figure(figsize=(8, 6))
    sns.kdeplot(data['value'])
    plt.title('KDE Plot')
    plt.show()

    # QQ plot against a normal distribution
    plt.figure(figsize=(8, 6))
    stats.probplot(data['value'], dist="norm", plot=plt)
    plt.title('QQ Plot')
    plt.show()
    ```

3. Custom Profiling Functions: Develop functions for specific data exploration needs.

  • Example: Create a function to identify potential outliers based on the Interquartile Range (IQR).
    ```python
    def identify_outliers_iqr(series):
        """Return the values lying more than 1.5 * IQR beyond the quartiles."""
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers = series[(series < lower_bound) | (series > upper_bound)]
        return outliers

    # Apply the function to the sample data from the distribution example
    outliers = identify_outliers_iqr(data['value'])
    print(f'Outliers: {outliers}')
    ```
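
The same idea extends to a function that gathers several profiling statistics at once. The sketch below is one possible design, not a standard API: the name profile_numeric and the choice of statistics are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def profile_numeric(series):
    """Return a small dictionary of profiling statistics for a numeric Series.

    Hypothetical helper: the selection of statistics is illustrative only.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN evaluate to False, so missing values are not
    # counted as outliers.
    outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return {
        'count': int(series.count()),               # non-missing values
        'missing_pct': float(series.isna().mean() * 100),
        'n_unique': int(series.nunique()),          # distinct non-missing values
        'skew': float(series.skew()),
        'n_outliers_iqr': int(outside.sum()),
    }

# Made-up sample with one missing value and one extreme value.
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, np.nan, 100.0])
profile = profile_numeric(sample)
print(profile)
```

Returning a plain dictionary makes it easy to apply the function to every numeric column and collect the results in a DataFrame for side-by-side comparison.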

Data Quality Dimensions and Analysis

Data quality is multi-faceted. Understanding and assessing different dimensions is vital.

1. Completeness: Assessing the presence of missing values. This involves measuring the percentage of missing values in each column, understanding the mechanism behind the missingness (missing completely at random, MCAR; missing at random, MAR; or missing not at random, MNAR), and deciding on an imputation strategy.

  • Example: Use df.isnull().sum() for the count and df.isnull().mean() for the fraction of missing values per column, and visualize missing-data patterns with the missingno library (install it with pip install missingno).
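
A minimal sketch of the count-and-fraction summary, using only pandas. The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in two columns.
df = pd.DataFrame({
    'age': [25, np.nan, 40, 31, np.nan],
    'city': ['Oslo', 'Lima', None, 'Pune', 'Kyiv'],
    'income': [52000, 61000, 58000, 47000, 75000],
})

# Per-column count and percentage of missing values, worst columns first.
missing_summary = pd.DataFrame({
    'n_missing': df.isnull().sum(),
    'pct_missing': df.isnull().mean() * 100,
}).sort_values('pct_missing', ascending=False)
print(missing_summary)

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isnull().any(axis=1).sum())
print(f'Rows with any missing value: {rows_with_missing}')
```

The row-level count complements the per-column view: it shows how much data would be lost if incomplete rows were simply dropped.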

2. Validity: Ensuring data conforms to defined constraints (e.g., data type, range, format). This could involve checking for invalid entries, such as negative ages or dates outside a valid range.

  • Example: Validating the 'age' column. Check that all values are non-negative and within a reasonable range (e.g., 0-120). Use conditional filtering to find invalid values, for example df[(df['age'] < 0) | (df['age'] > 120)].
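
The range check can be sketched with Series.between, which is inclusive at both ends by default. The sample ages are made up, and the 0-120 bounds follow the example above.

```python
import pandas as pd

# Hypothetical ages, including a negative and an implausibly large value.
df = pd.DataFrame({'age': [34, -2, 57, 130, 0, 120]})

# True where the value lies in [0, 120]. Note that between() returns False
# for missing values, so NaN would also be reported as invalid here.
valid_mask = df['age'].between(0, 120)

invalid_rows = df[~valid_mask]
validity_rate = valid_mask.mean()

print(invalid_rows)
print(f'Share of valid ages: {validity_rate:.2%}')
```

Reporting the validity rate alongside the offending rows gives both a metric to track over time and the concrete records to investigate.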

3. Accuracy: Evaluating the correctness of data values. This could involve comparing data to an external source or cross-validating values against each other.

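  • Example: Comparing postal codes to a trusted reference database to confirm that they exist and match the recorded city. For checks that involve geometry (e.g., confirming that coordinates fall inside a postal-code area), a library such as geopandas can help.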
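
One way to sketch the comparison against a reference source is a left merge with indicator=True. Both tables below are tiny made-up stand-ins; in practice the reference would come from an authoritative postal database, and the column names here are assumptions.

```python
import pandas as pd

# Hypothetical records to be checked.
records = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'postal_code': ['10001', '94105', '00000', '60601'],
    'city': ['New York', 'San Francisco', 'Springfield', 'Boston'],
})

# Hypothetical reference table (stand-in for a trusted external source).
reference = pd.DataFrame({
    'postal_code': ['10001', '94105', '60601'],
    'ref_city': ['New York', 'San Francisco', 'Chicago'],
})

# A left merge keeps every record; the _merge column shows whether the
# postal code was found in the reference table.
checked = records.merge(reference, on='postal_code', how='left', indicator=True)
checked['code_known'] = checked['_merge'] == 'both'
# Comparing against a missing ref_city yields False, so unknown codes
# are also flagged as mismatches.
checked['city_matches'] = checked['city'] == checked['ref_city']

print(checked[['customer_id', 'postal_code', 'city', 'ref_city',
               'code_known', 'city_matches']])
```

Separating "code not found" from "code found but city differs" helps distinguish records that are unverifiable from records that are likely wrong.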

4. Consistency: Assessing the uniformity of data across different datasets or within a dataset (e.g., same units, standardized formats).

  • Example: Checking for inconsistent units in a 'temperature' column (e.g., both Celsius and Fahrenheit). You would need to convert to a consistent format.
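
A minimal sketch of the unit conversion, assuming the dataset carries a separate 'unit' column that marks each reading as 'C' or 'F'. That column is an assumption for illustration; without such a marker, mixed units would have to be inferred from value ranges, which is far less reliable.

```python
import pandas as pd

# Hypothetical readings recorded in mixed units.
df = pd.DataFrame({
    'temperature': [21.5, 70.7, 19.0, 98.6],
    'unit': ['C', 'F', 'C', 'F'],
})

# Surface unexpected unit labels before converting anything.
print(df['unit'].value_counts())

# Keep Celsius readings as they are; convert Fahrenheit with (F - 32) * 5 / 9.
is_fahrenheit = df['unit'] == 'F'
df['temperature_c'] = df['temperature'].where(
    ~is_fahrenheit, (df['temperature'] - 32) * 5 / 9
)

print(df)
```

Writing the result to a new column keeps the original readings available for auditing the conversion.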

5. Timeliness: Evaluating the age of the data and its relevance to current needs.

  • Example: Analyzing the lag between data collection and usage, particularly important for time-sensitive applications like financial modeling.
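
The lag can be sketched as the difference between a reference time and each record's collection timestamp. The 'collected_at' column, the fixed reference date, and the 30-day staleness threshold are all illustrative assumptions; a suitable threshold depends on the application.

```python
import pandas as pd

# Hypothetical collection timestamps.
df = pd.DataFrame({
    'collected_at': pd.to_datetime(['2024-01-01', '2024-01-10', '2024-02-01']),
})

# A fixed reference time keeps the example reproducible; in practice this
# would typically be the time of use, e.g. pd.Timestamp.now().
as_of = pd.Timestamp('2024-02-15')

# Age of each record in whole days.
df['age_days'] = (as_of - df['collected_at']).dt.days

# Flag records older than a hypothetical 30-day threshold.
max_age_days = 30
df['is_stale'] = df['age_days'] > max_age_days

print(df)
print(f"Share of stale records: {df['is_stale'].mean():.2%}")
```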

Quantifying the impact on model performance: Train a simple model (e.g., linear regression or a classifier) on clean data to establish a baseline, then inject different types of data errors (missing data, incorrect values) and compare the resulting performance metrics (e.g., RMSE, accuracy) with that baseline.

  • Example:
    ```python
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    import numpy as np
    import pandas as pd

    # Generate sample data
    np.random.seed(42)
    X = np.random.rand(100, 1) * 10
    y = 2 * X.flatten() + 1 + np.random.randn(100)  # Linear relationship with noise
    df = pd.DataFrame({'X': X.flatten(), 'y': y})

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(df[['X']], df['y'], test_size=0.2, random_state=42)

    # Train a model and record the baseline RMSE on clean test data
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse_original = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'Original RMSE: {rmse_original}')

    # Inject errors: introduce missing values in X (one example of an error type)
    X_test_missing = X_test.copy()
    missing_indices = np.random.choice(X_test_missing.index, size=10, replace=False)
    X_test_missing.loc[missing_indices, 'X'] = np.nan

    # Impute missing values with the training-set mean, so that no
    # statistics are computed from the test set
    X_test_missing_imputed = X_test_missing.fillna(X_train.mean())

    # Make predictions with the modified test set
    y_pred_missing = model.predict(X_test_missing_imputed)
    rmse_missing = np.sqrt(mean_squared_error(y_test, y_pred_missing))
    print(f'RMSE after introducing missing data and imputing: {rmse_missing}')

    # Inject another type of error: incorrect values
    X_test_incorrect = X_test.copy()
    incorrect_indices = np.random.choice(X_test_incorrect.index, size=5, replace=False)  # Choose random rows
    X_test_incorrect.loc[incorrect_indices, 'X'] = X_test_incorrect.loc[incorrect_indices, 'X'] * 10  # Scale the chosen values by 10 to make them incorrect

    y_pred_incorrect = model.predict(X_test_incorrect)
    rmse_incorrect = np.sqrt(mean_squared_error(y_test, y_pred_incorrect))
    print(f'RMSE after introducing incorrect values: {rmse_incorrect}')
    ```
