Advanced R: Machine Learning and Statistical Modeling

This lesson dives into building and deploying production-level machine learning models in R. You'll learn advanced feature engineering techniques, model evaluation strategies, and ensemble methods, culminating in the deployment of your models using tools like `plumber` and `shiny`.

Learning Objectives

  • Apply advanced feature engineering techniques to improve model performance.
  • Build and evaluate multiple machine learning models using packages like `caret`, `xgboost`, and `ranger`.
  • Implement cross-validation, hyperparameter tuning, and ensemble methods for optimal model accuracy.
  • Deploy machine learning models using `plumber` for API creation and `shiny` for interactive applications.

Lesson Content

Advanced Feature Engineering

Feature engineering is crucial for model performance. We'll move beyond simple feature creation to explore more sophisticated techniques.

1. Domain Knowledge: Leverage your understanding of the data. For example, in a credit risk model, the length of employment might be transformed into categorical bins (e.g., 'less than 1 year', '1-3 years', '3+ years').

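# Example: Convert continuous age into categories (the same approach applies to employment length)
# include.lowest = TRUE keeps an age of exactly 0 in the first bin instead of returning NA
data$age_category <- cut(data$age, breaks = c(0, 18, 30, 50, Inf), labels = c('Teen', 'Young Adult', 'Middle Aged', 'Senior'), include.lowest = TRUE)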
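
The employment-length binning mentioned above could be sketched as follows. The data frame and the column name `employment_years` are assumptions made up for illustration; only `cut()` itself comes from base R.

```r
# Hypothetical applicant data; `employment_years` is an assumed column name
applicants <- data.frame(employment_years = c(0.5, 1, 2.5, 3, 7))

# right = FALSE makes each bin include its lower bound: [0, 1), [1, 3), [3, Inf)
applicants$employment_band <- cut(
  applicants$employment_years,
  breaks = c(0, 1, 3, Inf),
  labels = c("less than 1 year", "1-3 years", "3+ years"),
  right = FALSE
)

print(table(applicants$employment_band))
```

Choosing `right = FALSE` is a design decision: it places someone with exactly 3 years in the '3+ years' band, which matches how the bins are usually read.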

2. Interactions: Create new features by combining existing ones. For example, multiply two features together to capture their combined effect, or create ratios.

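# Example: Create an interaction term (the product of two different features)
data$income_x_age <- data$income * data$age
# Example: Polynomial term (captures a non-linear effect of a single feature)
data$income_squared <- data$income^2
# Example: Ratio of features
data$debt_to_income <- data$total_debt / data$annual_income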
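
As a small self-contained sketch of the same ideas, the snippet below builds a ratio feature that guards against division by zero and shows how R's formula syntax generates an interaction column. The toy data frame `loans` and its values are invented for illustration.

```r
# Toy data; column names and values are illustrative assumptions
loans <- data.frame(
  total_debt    = c(5000, 20000, 12000),
  annual_income = c(40000, 50000, 0),
  age           = c(25, 40, 58)
)

# Guard the ratio against division by zero: zero income becomes NA instead of Inf
loans$debt_to_income <- ifelse(loans$annual_income > 0,
                               loans$total_debt / loans$annual_income,
                               NA_real_)

# In a model formula, a:b adds only the product term; a*b adds a, b, and a:b
design <- model.matrix(~ annual_income * age, data = loans)
print(colnames(design))
```

Letting the formula create the interaction keeps the raw data frame unchanged, which can be convenient when several candidate interactions are being compared.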

3. Transformations: Apply mathematical transformations (log, square root, etc.) to address skewness or improve model fit. Log transformations are common for handling right-skewed data like income.

# Example: Log transformation
data$log_income <- log1p(data$income) # log1p(x) computes log(1 + x), so zero values are handled
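
To see the effect of the transformation, one option is to compare skewness before and after. The sketch below uses simulated log-normal incomes (not real data) and a hand-written `skewness()` helper, since base R does not ship one.

```r
set.seed(42)
# Simulated right-skewed incomes (log-normal), purely for illustration
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# Sample skewness: the third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(income)
skew_log <- skewness(log1p(income))  # log1p(x) computes log(1 + x)

cat("Skewness before:", round(skew_raw, 2), "after:", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation suggests the log scale is a reasonable choice for this feature.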

4. Feature Selection: Select the most relevant features using techniques like Recursive Feature Elimination (RFE) or feature importance from tree-based models.

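# Example: RFE using caret
# Assumes x is a data frame of predictors and y is the outcome vector
library(caret)
set.seed(123)  # make the cross-validation folds reproducible
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
# sizes lists the subset sizes to evaluate; keep them no larger than ncol(x)
rfe_result <- rfe(x, y, sizes = 1:10, rfeControl = ctrl)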
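
Feature importance does not have to come from a tree-based model. The sketch below uses permutation importance with a linear model, a model-agnostic alternative swapped in here so the example runs in base R; the simulated data and coefficients are assumptions for illustration.

```r
set.seed(1)
n <- 300
# Simulated data: x1 drives y strongly, x2 weakly, x3 is pure noise
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

fit <- lm(y ~ ., data = sim)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
baseline <- rmse(sim$y, predict(fit, sim))

# Importance of a feature = increase in RMSE after shuffling that feature
perm_importance <- sapply(c("x1", "x2", "x3"), function(col) {
  shuffled <- sim
  shuffled[[col]] <- sample(shuffled[[col]])
  rmse(sim$y, predict(fit, shuffled)) - baseline
})

print(sort(perm_importance, decreasing = TRUE))
```

Features whose shuffling barely changes the error are candidates for removal. In practice the importance would be measured on held-out data rather than on the training rows used here.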

Model Building and Evaluation

We'll explore several popular machine learning algorithms in R and emphasize rigorous evaluation. The caret package is a powerful tool for this purpose.

1. Model Selection: Choose algorithms appropriate for your problem (classification, regression, etc.). Consider the data characteristics and desired performance.

2. Data Preprocessing: Ensure data is ready for modeling (handling missing values, scaling features).

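# Example: Preprocessing using caret
library(caret)
# Estimate centering and scaling parameters from the training set only
preProcValues <- preProcess(train, method = c("center", "scale"))
trainTransformed <- predict(preProcValues, train)
# Apply the same training-set parameters to the test set
testTransformed <- predict(preProcValues, test)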
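
What centering and scaling do can be reproduced by hand in base R. The sketch below uses simulated data and made-up names (`train_df`, `test_df`) and shows the key point: the parameters are learned from the training rows and then reused for the test rows.

```r
set.seed(7)
# Toy numeric data split into training and test rows
df <- data.frame(income = rnorm(100, 50000, 15000), age = rnorm(100, 40, 10))
train_idx <- sample(nrow(df), 80)
train_df <- df[train_idx, ]
test_df  <- df[-train_idx, ]

# Learn centering and scaling parameters from the training data only
centers <- colMeans(train_df)
scales  <- apply(train_df, 2, sd)

# Apply the same parameters to both sets so the test set never influences them
train_scaled <- scale(train_df, center = centers, scale = scales)
test_scaled  <- scale(test_df,  center = centers, scale = scales)

print(round(colMeans(train_scaled), 3))  # approximately 0 by construction
print(round(colMeans(test_scaled), 3))   # close to 0, but generally not exact
```

Estimating the parameters on the full dataset instead would leak information from the test set into training.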


# Example: Handling missing values with multiple imputation
library(mice)
md.pattern(data)  # Examine missing data patterns
# Create m = 5 imputed datasets using predictive mean matching (pmm)
imputed_data <- mice(data, method = "pmm", m = 5, seed = 123)
completed_data <- complete(imputed_data, 1)  # Extract the first imputed dataset

3. Model Training: Train models on a training dataset.

# Example: Training a Random Forest model using caret
model <- train(target ~ ., data = trainTransformed, method = "rf", trControl = trainControl(method = "cv", number = 10), metric = "Accuracy")

4. Cross-Validation: Use cross-validation (e.g., k-fold) to assess model performance more reliably.

# Example: Cross-validation
fitControl <- trainControl(method = "cv",  number = 10)
model <- train(target ~ ., data = trainTransformed, method = "glm", trControl = fitControl)
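
To make explicit what `trainControl(method = "cv")` automates, here is a hand-rolled k-fold loop in base R. It is a sketch on the built-in `mtcars` data with a linear regression, chosen only because it runs without extra packages.

```r
set.seed(123)
k <- 5
# Assign each row to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

fold_rmse <- sapply(1:k, function(i) {
  train_fold <- mtcars[folds != i, ]   # fit on k - 1 folds
  valid_fold <- mtcars[folds == i, ]   # evaluate on the held-out fold
  fit <- lm(mpg ~ wt + hp, data = train_fold)
  pred <- predict(fit, newdata = valid_fold)
  sqrt(mean((valid_fold$mpg - pred)^2))
})

cat("RMSE per fold:", round(fold_rmse, 2), "\n")
cat("Mean CV RMSE:", round(mean(fold_rmse), 2), "\n")
```

The spread of the per-fold values gives a rough sense of how much the estimate depends on the particular split.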

5. Hyperparameter Tuning: Optimize model parameters to improve performance (e.g., using tuneGrid in caret).

# Example: Hyperparameter tuning with caret
tuneGrid <- expand.grid(mtry = c(2, 4, 6, 8, 10))
model <- train(target ~ ., data = trainTransformed, method = "rf", trControl = trainControl(method = "cv", number = 10), tuneGrid = tuneGrid)
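
Conceptually, a tuning grid is a loop over candidate values with cross-validation inside. The base R sketch below tunes a different hyperparameter than `mtry`, namely the degree of a polynomial regression on `mtcars`, purely to keep the example dependency-free.

```r
set.seed(2024)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
grid <- data.frame(degree = 1:4)  # candidate hyperparameter values

# Cross-validated RMSE for one candidate degree
cv_rmse <- function(degree) {
  errs <- sapply(1:k, function(i) {
    fit <- lm(mpg ~ poly(hp, degree), data = mtcars[folds != i, ])
    pred <- predict(fit, newdata = mtcars[folds == i, ])
    sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
  })
  mean(errs)
}

grid$rmse <- sapply(grid$degree, cv_rmse)
best <- grid[which.min(grid$rmse), ]
print(grid)
cat("Selected degree:", best$degree, "\n")
```

Reusing the same folds for every candidate makes the comparison fairer, because all candidates are scored on identical splits.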

6. Evaluation Metrics: Select evaluation metrics that match your problem: accuracy, precision, recall, or AUC for classification, and RMSE for regression. For classification, caret's confusionMatrix() function reports accuracy, sensitivity, and specificity in a single call.

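# Example: Evaluating model performance
# Preprocess the test set with the parameters learned from the training set
predictions <- predict(model, newdata = predict(preProcValues, test))
confusionMatrix(predictions, test$target)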
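
The metrics named above can be derived from a confusion table by hand, which helps when checking what a library reports. The labels below are hand-made so the arithmetic can be verified by eye.

```r
# Hand-made labels; "yes" is treated as the positive class
actual    <- factor(c("yes", "yes", "yes", "no", "no",  "no", "no", "yes"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no",  "yes", "no", "yes", "no", "no", "yes"),
                    levels = c("no", "yes"))

cm <- table(Predicted = predicted, Actual = actual)
tp <- cm["yes", "yes"]  # true positives
fp <- cm["yes", "no"]   # false positives
fn <- cm["no", "yes"]   # false negatives
tn <- cm["no", "no"]    # true negatives

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(cm)
print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```

Precision and recall matter most when classes are imbalanced, where accuracy alone can look high for a model that rarely predicts the positive class.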

Ensemble Methods

Ensemble methods combine multiple models to improve predictive accuracy and robustness.

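1. Stacking: Train a meta-learner on the predictions of base learners. The meta-learner must be trained on predictions for rows the base learners did not see during training (a separate holdout set or out-of-fold predictions); otherwise it learns from overly optimistic predictions.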

# Example: Stacking (simplified illustration; a full implementation uses out-of-fold predictions)
library(caret)
# Define base learners (e.g., glm and random forest)
model_glm <- train(target ~ ., data = trainTransformed, method = "glm", trControl = trainControl(method = "cv", number = 10))
model_rf <- train(target ~ ., data = trainTransformed, method = "rf", trControl = trainControl(method = "cv", number = 10))

# Get predictions from base learners on a holdout set.
# Assumption: `holdout` is a separate split, preprocessed like the training data, that was
# not used to fit the base learners and is not the final test set. Training the
# meta-learner on `test` would leak test labels into the ensemble.
pred_glm <- predict(model_glm, newdata = holdout)
pred_rf <- predict(model_rf, newdata = holdout)

# Combine the predictions (as columns) into a single data frame
stacked_data <- data.frame(glm = pred_glm, rf = pred_rf, target = holdout$target)

# Train a meta-learner (e.g., logistic regression) on the stacked data
meta_learner <- train(target ~ ., data = stacked_data, method = "glm", trControl = trainControl(method = "cv", number = 10))
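
The data splitting that stacking requires can be sketched with out-of-fold predictions. The example below is a base R illustration on `mtcars` with two linear models standing in as base learners; the model choices are assumptions made to keep it short and runnable.

```r
set.seed(99)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Out-of-fold predictions: each row is predicted by models that never saw it
oof <- data.frame(m1 = rep(NA_real_, nrow(mtcars)),
                  m2 = NA_real_,
                  mpg = mtcars$mpg)

for (i in 1:k) {
  train_fold <- mtcars[folds != i, ]
  valid_rows <- folds == i
  base1 <- lm(mpg ~ wt, data = train_fold)         # base learner 1
  base2 <- lm(mpg ~ hp + qsec, data = train_fold)  # base learner 2
  oof$m1[valid_rows] <- predict(base1, newdata = mtcars[valid_rows, ])
  oof$m2[valid_rows] <- predict(base2, newdata = mtcars[valid_rows, ])
}

# The meta-learner is fit only on the out-of-fold predictions
meta <- lm(mpg ~ m1 + m2, data = oof)
print(coef(meta))
```

For new data, the base learners would typically be refit on the full training set, and their predictions passed to `meta`.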

2. Blending: Similar to stacking but simpler. Instead of training a meta-learner, it combines the base learners' predictions directly, for example with a (weighted) average.

# Example: Blending
# Assume we have numeric predictions (e.g., class probabilities or regression outputs)
# from two models on the test set: pred_model1, pred_model2
# Blend the predictions by averaging with equal weights
blended_predictions <- 0.5 * pred_model1 + 0.5 * pred_model2
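
Equal weights are only one choice. A possible refinement, sketched below with invented validation-set probabilities, is to try a few weights and keep the one with the lowest Brier score (the mean squared error of predicted probabilities).

```r
# Hypothetical predicted probabilities from two models on a validation set
pred_model1 <- c(0.9, 0.2, 0.6, 0.4)
pred_model2 <- c(0.7, 0.4, 0.8, 0.1)
actual      <- c(1, 0, 1, 0)

# Try several blend weights and score each with the Brier score
weights <- seq(0, 1, by = 0.25)
brier <- sapply(weights, function(w) {
  blended <- w * pred_model1 + (1 - w) * pred_model2
  mean((blended - actual)^2)
})

best_weight <- weights[which.min(brier)]
print(data.frame(weight = weights, brier = round(brier, 4)))
cat("Best weight for model 1:", best_weight, "\n")
```

The weights should be chosen on validation data, not on the test set, for the same reason the meta-learner in stacking needs unseen predictions.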

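3. Bagging (Bootstrap Aggregating): Trains multiple instances of the same model on bootstrap samples of the training data (rows drawn with replacement) and aggregates their predictions by averaging or voting. Random Forest is an example; it additionally considers a random subset of features at each split.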
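
Bagging can be written out by hand in a few lines. The sketch below bags linear models on `mtcars`; linear regression is used only to stay within base R, whereas bagging usually pays off most with high-variance learners such as decision trees.

```r
set.seed(11)
n_models <- 25

# Fit one model per bootstrap sample (rows drawn with replacement)
bagged_fits <- lapply(seq_len(n_models), function(b) {
  boot_rows <- sample(nrow(mtcars), replace = TRUE)
  lm(mpg ~ wt + hp, data = mtcars[boot_rows, ])
})

# Aggregate: average the predictions of all bootstrap models
all_preds <- sapply(bagged_fits, predict, newdata = mtcars)  # rows x models matrix
bagged_pred <- rowMeans(all_preds)

cat("Correlation with observed mpg:", round(cor(bagged_pred, mtcars$mpg), 3), "\n")
```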

4. Boosting: Sequentially trains models, with each model focusing on correcting the errors of the previous ones. XGBoost is a powerful boosting algorithm.

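# Example: XGBoost
library(xgboost)
# xgboost needs a numeric matrix; this assumes every feature column is already numeric
feature_cols <- setdiff(names(trainTransformed), "target")
testTransformed <- predict(preProcValues, test)  # same preprocessing as the training set
# binary:logistic expects 0/1 labels; as.numeric() on a two-level factor yields 1/2
dtrain <- xgb.DMatrix(data = as.matrix(trainTransformed[, feature_cols]), label = as.numeric(trainTransformed$target) - 1)
dtest <- xgb.DMatrix(data = as.matrix(testTransformed[, feature_cols]), label = as.numeric(testTransformed$target) - 1)

# Set parameters
params <- list(objective = "binary:logistic", eval_metric = "auc")

# Train model and report AUC on both sets after each round
# (newer xgboost releases may expect `evals` instead of `watchlist`)
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100, watchlist = list(eval = dtest, train = dtrain))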
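
The sequential error-correcting idea behind boosting can be illustrated without any package. The sketch below is a hand-rolled gradient boosting loop for squared error using one-split stumps on `mtcars`; it is a teaching illustration, not a description of how XGBoost is implemented, and the helper names are made up.

```r
# Fit a one-split regression stump on predictor x against residuals r
fit_stump <- function(x, r) {
  candidates <- sort(unique(x))
  splits <- head(candidates, -1) + diff(candidates) / 2  # midpoints between values
  sse <- sapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

x <- mtcars$wt
y <- mtcars$mpg
learning_rate <- 0.1
n_rounds <- 50

pred <- rep(mean(y), length(y))  # start from the mean prediction
train_rmse <- numeric(n_rounds)
for (m in seq_len(n_rounds)) {
  residuals_m <- y - pred                     # errors of the current ensemble
  stump <- fit_stump(x, residuals_m)          # next learner targets those errors
  pred <- pred + learning_rate * predict_stump(stump, x)
  train_rmse[m] <- sqrt(mean((y - pred)^2))
}

cat("Training RMSE after round 1:", round(train_rmse[1], 3), "\n")
cat("Training RMSE after round", n_rounds, ":", round(train_rmse[n_rounds], 3), "\n")
```

The learning rate shrinks each stump's contribution, which slows fitting but usually generalizes better. Training error alone will keep falling, so the number of rounds should be chosen on validation data.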

Model Deployment with Plumber and Shiny

Deploying models allows you to make them accessible and useful in real-world applications. We'll cover two common deployment methods.

1. Plumber (API Creation): Transforms your R code into an API that can be accessed by other applications.

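# Example: Basic Plumber API
# plumber.R
# Load the trained model once at startup.
# Assumption: the model was saved with saveRDS(); "my_model.rds" is a hypothetical file name.
my_model <- readRDS("my_model.rds")

#* Return a prediction for a single numeric input
#* @param x Numeric value to predict
#* @get /predict
function(x) {
  # Query parameters arrive as character strings, so convert explicitly
  x <- as.numeric(x)
  prediction <- predict(my_model, newdata = data.frame(x = x))
  list(prediction = prediction)
}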

To deploy with plumber:
1. Save the code to a file (e.g., plumber.R).
2. Install the package: install.packages("plumber").
3. Start the API from your R console: library(plumber); pr <- plumber::plumb("plumber.R"); pr$run(port=8000).
4. Access the API (e.g., using http://localhost:8000/predict?x=5).
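
An API process needs access to the trained model, and one common way to hand it over is to serialize it to disk. The sketch below shows a save-and-restore round trip with base R; the stand-in model and the temporary file location are assumptions for illustration.

```r
# Train a small stand-in model; in practice this would be the tuned model
my_model <- lm(mpg ~ wt, data = mtcars)

# Persist the model so a separate API process can load it at startup
model_path <- file.path(tempdir(), "my_model.rds")  # hypothetical location
saveRDS(my_model, model_path)

# Inside the API script, the model would be restored once, outside the endpoint function
restored <- readRDS(model_path)

new_obs <- data.frame(wt = 3)
prediction <- predict(restored, newdata = new_obs)
cat("Prediction from restored model:", round(prediction, 2), "\n")
```

Loading the model once at startup, rather than inside the endpoint function, avoids re-reading the file on every request.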

2. Shiny (Interactive Web Applications): Allows you to build interactive dashboards and web apps for model exploration and deployment.

# Example: Simple Shiny app
library(shiny)
# Define UI: a numeric input and a text output for the prediction
ui <- fluidPage(
    titlePanel("Model Prediction App"),
    sidebarLayout(
        sidebarPanel(
            numericInput("input_value", "Enter a value:", value = 0)
        ),
        mainPanel(
            textOutput("prediction")
        )
    )
)

# Define server logic that returns a prediction for the entered value
server <- function(input, output) {
    output$prediction <- renderText({
        # Wait until the input holds a value (it is NA while the box is empty)
        req(input$input_value)
        # Assuming you have a trained model called 'my_model' loaded in this session
        prediction <- predict(my_model, newdata = data.frame(x = input$input_value))
        paste("Prediction:", prediction)
    })
}

# Run the application
shinyApp(ui = ui, server = server)

To deploy with Shiny, save the code to a file (e.g., app.R) and run it from your R console: shiny::runApp("app.R"). For production deployments, consider services like Shiny Server or shinyapps.io.

3. Model Monitoring and Retraining: Set up pipelines for ongoing monitoring of model performance and retraining as needed. Use automated testing and continuous integration/continuous deployment (CI/CD) practices with tools like GitHub Actions or Jenkins.
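
A monitoring pipeline ultimately needs a rule that decides when to retrain. The sketch below shows one very simple rule: compare accuracy on a recent batch against a baseline. The function `needs_retraining()`, the baseline, and the tolerance are all hypothetical choices, not part of any package.

```r
# Hypothetical helper: flag retraining when recent accuracy drops too far
# recent_correct: logical vector, TRUE where the deployed model's prediction was right
needs_retraining <- function(recent_correct, baseline_accuracy, tolerance = 0.05) {
  recent_accuracy <- mean(recent_correct)
  recent_accuracy < baseline_accuracy - tolerance
}

baseline_accuracy <- 0.90  # assumed accuracy measured at deployment time

healthy_batch  <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
degraded_batch <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

cat("Healthy batch triggers retraining: ",
    needs_retraining(healthy_batch, baseline_accuracy), "\n")
cat("Degraded batch triggers retraining:",
    needs_retraining(degraded_batch, baseline_accuracy), "\n")
```

A check like this could run on a schedule in a CI/CD job, with the retraining step triggered only when it returns TRUE.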
