Feature Engineering

This lesson dives into advanced feature engineering techniques, equipping you with the skills to extract more meaningful information from your data. You'll learn how to create domain-specific features, automate feature generation, and select the most impactful features for improved model performance.

Learning Objectives

  • Apply advanced feature engineering techniques to time series, text, and geospatial data.
  • Implement feature interaction and automated feature selection methods.
  • Build feature engineering pipelines using scikit-learn or featuretools.
  • Understand the advantages and limitations of various feature engineering approaches.

Lesson Content

Domain-Specific Feature Engineering

Feature engineering becomes powerful when tailored to the specific domain of your data. Let's explore techniques for different data types:

1. Time Series Data:

  • Rolling Statistics: Calculate moving averages, standard deviations, and other statistics over a rolling window, for example with the rolling() function in pandas. This helps capture trends and seasonality.
    ```python
    import pandas as pd

    # Assuming 'data' is a time series DataFrame with a 'value' column
    data['rolling_mean_7'] = data['value'].rolling(window=7).mean()  # 7-day rolling mean
    data['rolling_std_30'] = data['value'].rolling(window=30).std()  # 30-day rolling std
    ```
  • Lags: Create lagged features by shifting the data by a specified number of time periods. This allows the model to learn from past values.
    ```python
    data['lag_1'] = data['value'].shift(1)  # Value from the previous time step
    data['lag_7'] = data['value'].shift(7)  # Value from 7 time steps ago
    ```
  • Date/Time Features: Extract components from the datetime index (e.g., year, month, day, hour, day of the week). These can capture cyclical patterns.
    ```python
    data['year'] = data.index.year
    data['month'] = data.index.month
    data['dayofweek'] = data.index.dayofweek  # Monday is 0, Sunday is 6
    ```
  • Exponential Weighted Moving Average (EWMA): Gives more weight to recent observations, capturing short-term trends better than simple moving averages. The ewm() function in pandas can do this.
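
The EWMA bullet above has no example, so here is a minimal sketch of how it might look with ewm(). The sample values, the column names, and the choice of span=3 are assumptions made for illustration; the extra shift(1) is one possible way to make sure each row only uses past observations.

```python
import pandas as pd

# Hypothetical daily series; replace with your own data.
data = pd.DataFrame(
    {"value": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# span=3 corresponds to a smoothing factor alpha = 2 / (span + 1) = 0.5.
# adjust=False uses the recursive form:
#   y_t = (1 - alpha) * y_{t-1} + alpha * x_t
data["ewma_3"] = data["value"].ewm(span=3, adjust=False).mean()

# Shift by one step so each row only sees earlier values. This matters
# when 'value' is also the quantity you are trying to forecast.
data["ewma_3_lag1"] = data["ewma_3"].shift(1)

print(data)
```

A smaller span reacts faster to recent changes, while a larger span behaves more like a long moving average.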

2. Text Data:

  • TF-IDF (Term Frequency-Inverse Document Frequency): Measures the importance of a word in a document relative to a corpus of documents. Useful for capturing the relevance of words. Scikit-learn's TfidfVectorizer is your friend.
    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Assuming 'text_data' is a list of text documents
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(text_data)  # Sparse matrix: documents x terms
    # Converting to a dense DataFrame is only practical for small vocabularies
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    ```
  • Word Embeddings (Word2Vec, GloVe, FastText): Represent words as dense vectors that capture semantic relationships. Pre-trained models can be used (e.g., with Gensim or SpaCy) or you can train your own.
    ```python
    # Example using Gensim to train a Word2Vec model
    from gensim.models import Word2Vec

    # Assuming 'sentences' is a list of tokenized sentences (lists of strings)
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

    # Get the vector for a word; raises KeyError if the word is not in the vocabulary
    word_vector = model.wv['word']
    ```
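  • Sentiment Analysis: Use libraries like NLTK or TextBlob to extract sentiment scores (positive, negative, neutral) from text. Part-of-speech (POS) tagging and named entity recognition (NER) can supply further features, such as counts of each tag or entity type.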
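
Word vectors describe single words, while most models need one feature row per document. A common way to bridge that gap is to average the vectors of the words in each document. The sketch below illustrates the idea with NumPy only; the 3-dimensional vectors in toy_vectors are invented for illustration, and document_vector is a hypothetical helper. In practice the lookup would come from a trained model such as model.wv in the Word2Vec example.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad": np.array([-0.7, 0.1, 0.2]),
    "movie": np.array([0.0, 0.9, 0.3]),
}


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Each document is a list of tokens; unknown tokens are skipped.
docs = [["good", "movie"], ["bad", "movie", "plot"], ["unknown"]]
doc_features = np.vstack([document_vector(d, toy_vectors, dim=3) for d in docs])

print(doc_features.shape)  # one row per document, one column per dimension
```

Averaging discards word order, so it is best seen as a simple baseline rather than a full document representation.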

3. Geospatial Data:

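  • Distance Calculations: Calculate distances between locations given as latitude/longitude coordinates, for example with the Haversine formula (great-circle distance on a sphere). The geopy library is helpful: geopy.distance.great_circle uses a spherical model, while geopy.distance.geodesic, used here, uses an ellipsoidal model of the Earth. A common use is computing the distance from each record to key landmarks.
    ```python
    from geopy.distance import geodesic

    # Assuming you have latitude/longitude coordinates (in degrees) for two points
    point1 = (latitude1, longitude1)
    point2 = (latitude2, longitude2)
    distance = geodesic(point1, point2).km  # Distance in kilometers
    ```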
  • Coordinate Systems: Understanding different coordinate systems (e.g., latitude/longitude, UTM) and how to convert between them is crucial.
  • Spatial Joins: Combine geospatial data with other datasets by performing spatial joins (e.g., finding the nearest city to a location).
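  • Geometry Features: Libraries such as Shapely (geometric operations) and Fiona (reading and writing geospatial files) can be used to derive further features, such as areas, lengths, or point-in-polygon flags.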
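
The Haversine formula mentioned above can also be written directly, which is handy when geopy is not available or when you want to see what the calculation does. This is a minimal sketch using only the standard library: haversine_km is a hypothetical helper, the Earth radius is an assumed mean value, and the coordinates are approximate and purely illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # Assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance-to-landmark feature: approximate coordinates, for illustration only.
landmark = (48.8584, 2.2945)  # Roughly the Eiffel Tower
points = [(48.8606, 2.3376), (51.5074, -0.1278)]  # Roughly the Louvre, London
dist_to_landmark = [haversine_km(lat, lon, *landmark) for lat, lon in points]

print(dist_to_landmark)
```

Because it treats the Earth as a sphere, the result can differ slightly from geopy's geodesic distance, which is usually acceptable for feature engineering.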

Feature Interaction

Feature interaction involves creating new features by combining existing ones. This can expose non-linear relationships that a model, especially a linear one, would otherwise miss, and it often improves model performance. Common methods include:

  • Polynomial Features: Generate combinations of features raised to various powers. Scikit-learn's PolynomialFeatures is your tool: it creates all polynomial combinations of the features up to a specified degree.
    ```python
    from sklearn.preprocessing import PolynomialFeatures

    # Assuming 'X' is your feature matrix
    # degree=2 adds squared terms and all pairwise products to the original features
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)
    ```
  • Interaction Terms: Multiply two or more features together. This is useful when the effect of one variable depends on the value of another.
    ```python
    # Example: create an interaction term for feature1 and feature2
    # Assuming 'data' is a DataFrame with numeric columns 'feature1' and 'feature2'
    data['interaction'] = data['feature1'] * data['feature2']
    ```
  • Binning/One-Hot Encoding of Interacted Features: First create interaction terms, then bin them and one-hot encode the bins. This is particularly useful when the interaction's effect on the target is non-linear. You can also interact features that have already been binned.
  • Ratio Features: Create the ratio of two features when you think the relationship between them is better represented by the ratio than by the absolute values.
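
The last two ideas in this list come without code, so the following sketch shows one way they might look in pandas. The DataFrame, its column names, and the choice of three equal-width bins are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are illustrative only.
data = pd.DataFrame({
    "income": [30000.0, 60000.0, 90000.0, 120000.0],
    "debt": [15000.0, 0.0, 45000.0, 30000.0],
    "household_size": [1, 2, 0, 4],
})

# Ratio feature: debt relative to income.
data["debt_to_income"] = data["debt"] / data["income"]

# Guard against division by zero by turning zero denominators into NaN.
data["income_per_person"] = data["income"] / data["household_size"].replace(0, np.nan)

# Interaction term, then equal-width binning, then one-hot encoding of the bins.
data["income_x_debt"] = data["income"] * data["debt"]
data["income_x_debt_bin"] = pd.cut(
    data["income_x_debt"], bins=3, labels=["low", "mid", "high"]
)
binned = pd.get_dummies(data["income_x_debt_bin"], prefix="ixd")

print(data[["debt_to_income", "income_per_person", "income_x_debt_bin"]])
print(binned)
```

For skewed products like this one, quantile-based bins (pd.qcut) may spread the rows more evenly than equal-width bins.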

Automated Feature Selection

Selecting the most relevant features is essential for preventing overfitting, improving model performance, and reducing computational cost. Here are some techniques:

  • Filter Methods: These methods select features based on their statistical properties, independent of the model. They are fast and computationally inexpensive.
    • Variance Threshold: Remove features with low variance (little variability). Useful for removing constant features or near-constant features.
      ```python
      from sklearn.feature_selection import VarianceThreshold

      threshold = 0.1  # Remove features with variance less than 0.1
      selector = VarianceThreshold(threshold=threshold)
      X_filtered = selector.fit_transform(X)
      ```
    • Correlation-Based Selection: Remove features that are highly correlated with other features (e.g., using Pearson correlation). Keeping such redundant features can cause multicollinearity.
      ```python
      import numpy as np

      # Assuming 'X' is a pandas DataFrame of numeric features
      corr_matrix = X.corr().abs()  # Absolute values of the correlation matrix
      # Keep only the upper triangle so each pair of features is considered once
      upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
      # Find columns whose correlation with an earlier column exceeds 0.95
      to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
      X_filtered = X.drop(columns=to_drop)
      ```
    • Univariate Feature Selection: Select features based on their individual relationship with the target variable (e.g., chi2 for non-negative features such as counts or one-hot encodings in classification, f_classif for numerical features in classification, and f_regression for regression targets). Scikit-learn provides SelectKBest with various scoring functions.
      ```python
      from sklearn.feature_selection import SelectKBest, f_regression

      # Assuming 'X' has at least 10 features and 'y' is a continuous target
      selector = SelectKBest(score_func=f_regression, k=10)  # Select the best 10 features
      X_selected = selector.fit_transform(X, y)
      ```
  • Wrapper Methods: These methods evaluate subsets of features by training and testing a model. They are computationally more expensive, but can identify feature combinations that filter methods might miss.
    • Recursive Feature Elimination (RFE): Repeatedly fits the model, ranks features by their coefficients or importance scores, and removes the weakest ones until the desired number of features remains. Scikit-learn's RFE is your tool; RFECV additionally uses cross-validation to choose the number of features.
      ```python
      from sklearn.feature_selection import RFE
      from sklearn.linear_model import LogisticRegression  # Or any estimator exposing coef_ or feature_importances_

      model = LogisticRegression(solver='liblinear')
      rfe = RFE(model, n_features_to_select=10)  # Select 10 features
      X_rfe = rfe.fit_transform(X, y)
      ```
  • Embedded Methods: These methods perform feature selection as part of the model training process. Examples include L1 regularization (Lasso regression), which shrinks the coefficients of irrelevant features to zero, and tree-based models (e.g., Random Forest, Gradient Boosting), which can provide feature importance scores.
    ```python
    from sklearn.linear_model import Lasso

    lasso = Lasso(alpha=0.1)  # alpha is the regularization strength
    lasso.fit(X, y)
    # The coefficients of the Lasso model indicate feature importance.
    # Features with coefficients close to zero are less important.
    ```
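
The tree-based side of embedded selection is described above without code. The sketch below shows one way it might be done with scikit-learn's SelectFromModel wrapped around a random forest. The synthetic dataset from make_regression, the forest settings, and the "median" threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, of which 5 actually drive the target.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=0.1, random_state=0
)

# Fit a forest and keep the features whose importance is at least the median.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median")
X_selected = selector.fit_transform(X, y)

# Importance scores from the fitted forest (they sum to 1).
importances = selector.estimator_.feature_importances_
kept = selector.get_support(indices=True)  # Column indices of kept features

print("Kept feature indices:", kept)
print("Shape after selection:", X_selected.shape)
```

Impurity-based importances can favor high-cardinality features, so it is worth checking the selection against a held-out score before relying on it.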

Feature Engineering Pipelines (scikit-learn and featuretools)

Automating feature engineering is crucial for efficiency and reproducibility.

  • Scikit-learn Pipelines: Allow you to chain multiple feature engineering steps (e.g., imputation, scaling, feature creation) with a model, ensuring consistent transformations and preventing data leakage during cross-validation.
    ```python
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.impute import SimpleImputer  # For handling missing values
    from sklearn.linear_model import LogisticRegression

    # Define the pipeline
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
        ('scaler', StandardScaler()),  # Scale the data
        ('model', LogisticRegression(solver='liblinear'))  # Use a logistic regression model
    ])

    # Fit and use the pipeline (assuming X_train, y_train, and X_test already exist)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    ```
    You can include preprocessors, feature transformers (e.g., PolynomialFeatures), and models within the same pipeline.

  • Featuretools: A library designed for automated feature engineering, especially for relational data, where it can automatically create features across multiple tables. Install it with pip install featuretools. Its core concepts are the entity (a single table, called a dataframe in recent versions) and the entity set (a collection of related tables, similar to a relational database). Featuretools creates features that describe relationships between your entities.
    ```python
    import featuretools as ft

    # Assuming 'customers_df' and 'transactions_df' are pandas DataFrames that
    # share a 'customer_id' column. The calls below follow the Featuretools 1.x
    # API; older releases used different names (e.g., target_entity).

    # Create an empty entity set
    es = ft.EntitySet(id='ecommerce')

    # Add each DataFrame together with its index column
    es.add_dataframe(dataframe_name='customers', dataframe=customers_df, index='customer_id')
    es.add_dataframe(dataframe_name='transactions', dataframe=transactions_df, index='transaction_id')

    # Add the relationship: one customer (parent) has many transactions (child)
    es.add_relationship('customers', 'customer_id', 'transactions', 'customer_id')

    # Automatic feature engineering with Deep Feature Synthesis
    feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='customers')
    print(feature_matrix.head())
    ```
    Featuretools is particularly useful for creating features like aggregate statistics (e.g., total purchase amount per customer) and time-based features.
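
To make the idea of aggregate features concrete, here is a hand-written pandas equivalent of a few such features. It is only a sketch of the kind of output Deep Feature Synthesis automates, not a reproduction of what Featuretools generates. The two small DataFrames, the 'amount' column, and the feature names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical tables matching the customers/transactions setup above.
customers_df = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions_df = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 1],
    "amount": [20.0, 35.0, 50.0, 5.0],
})

# Aggregate the child table (transactions) per parent key (customer).
agg = (
    transactions_df.groupby("customer_id")["amount"]
    .agg(total_amount="sum", mean_amount="mean", n_transactions="count")
    .reset_index()
)

# Left join so that customers without any transactions are kept.
manual_features = customers_df.merge(agg, on="customer_id", how="left")
manual_features["n_transactions"] = (
    manual_features["n_transactions"].fillna(0).astype(int)
)

print(manual_features)
```

Writing a handful of aggregates by hand like this is manageable; the appeal of automated tools is that they generate many such combinations across several related tables.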
