Lesson 7: Scikit-learn and a First Simple Machine Learning Model: Linear Regression

Lesson Content

Introduction to Scikit-learn

Scikit-learn (sklearn) is a free, open-source Python library for machine learning. It provides tools for a wide range of machine learning tasks, including classification, regression, clustering, and dimensionality reduction. It's designed to be accessible to beginners, with clear documentation and a consistent API. To get started, you'll need to install it. If you're using Anaconda, Scikit-learn is likely already installed. Otherwise, you can install it with pip: `pip install scikit-learn`
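
A quick way to confirm the installation worked is to import the library and print its version. This is only a small sanity check; the version you see depends on your environment.

```python
import sklearn

# __version__ holds the installed release as a string
version = sklearn.__version__
print(f"scikit-learn version: {version}")
```

If the import fails with ModuleNotFoundError, the package is not installed in the Python environment you are running.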

Why use Scikit-learn?
* Ease of Use: Simple and consistent interface.
* Efficiency: Optimized for performance.
* Variety: Provides a large number of machine learning algorithms.
* Integration: Integrates well with other Python libraries like NumPy and Pandas.

Data Preparation: Loading and Understanding

Before building a model, you need data! For this lesson, we'll use a simple dataset representing the relationship between the number of hours studied and exam scores. You can imagine this data as a file (e.g., CSV) containing two columns: hours_studied and exam_score.
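
If the data did live in a CSV file, loading it could look roughly like the sketch below. The CSV text here is a hypothetical three-row stand-in embedded in the script so the example runs on its own, and Pandas is assumed to be installed; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """hours_studied,exam_score
1,60
2,65
3,70
"""

df = pd.read_csv(io.StringIO(csv_text))

# Double brackets keep the feature as a 2D table (rows = samples, columns = features)
X = df[["hours_studied"]].to_numpy()
y = df["exam_score"].to_numpy()

print(df.head())
print(X.shape, y.shape)
```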

Here's how you might represent a small dataset in Python using NumPy:

import numpy as np

# Sample data (hours studied, exam score)
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
exam_score = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100])

# Reshape hours_studied for the model
hours_studied = hours_studied.reshape(-1, 1)  # Important: sklearn expects a 2D array for features

Notice the reshape() method. Scikit-learn expects your input features (in this case, hours_studied) to be in a 2D array, where each row represents a sample, and each column represents a feature. reshape(-1, 1) transforms a 1D array into a column vector.
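
To see what reshape(-1, 1) does, it can help to compare array shapes before and after. The short sketch below uses a made-up four-element array; the -1 tells NumPy to work out the number of rows from the array's length.

```python
import numpy as np

hours_1d = np.array([1, 2, 3, 4])
hours_2d = hours_1d.reshape(-1, 1)  # -1: infer the row count, 1: one column

print(hours_1d.shape)  # (4,)
print(hours_2d.shape)  # (4, 1)
print(hours_2d)
```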

Building a Linear Regression Model

Linear regression models the relationship between a dependent variable (exam_score) and one or more independent variables (hours_studied) by fitting a linear equation to the observed data. The equation takes the form: exam_score = b0 + b1 * hours_studied, where b0 is the intercept and b1 is the coefficient.
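
For a single feature, the best-fitting b0 and b1 can be computed by hand with the standard least-squares formulas: b1 is the sum of products of deviations divided by the sum of squared deviations of the feature, and b0 follows from the means. The sketch below applies them to the sample data with plain NumPy; it illustrates the idea and is not how Scikit-learn is implemented internally.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# Slope: how much y changes per unit of x
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: chosen so the line passes through (x_mean, y_mean)
b0 = y_mean - b1 * x_mean

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 55.00, b1 = 5.00
```

In words: each extra hour of study adds 5 points, starting from a baseline of 55.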

Here's how you build and train a linear regression model using Scikit-learn:

from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Train the model (fit the model to the data)
model.fit(hours_studied, exam_score)

The fit() method trains the model using your data. The model learns the best values for the intercept (b0) and the coefficient (b1).
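
After fitting, the learned values are stored on the model as the attributes intercept_ and coef_. The sketch below repeats the data setup so it runs on its own and then prints both values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
exam_score = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100])

model = LinearRegression()
model.fit(hours_studied, exam_score)

intercept = model.intercept_  # b0
slope = model.coef_[0]        # b1 (coef_ holds one entry per feature)

print(f"Intercept (b0): {intercept:.2f}")
print(f"Coefficient (b1): {slope:.2f}")
```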

Making Predictions and Evaluating the Model

Once the model is trained, you can use it to make predictions on new data.

# Predict exam scores for new hours studied
new_hours = np.array([[7.5], [3.5]])  # New data, already 2D: one row per sample, one column per feature
predicted_scores = model.predict(new_hours)
print(f"Predicted scores for {new_hours.flatten()} hours: {predicted_scores}")

To evaluate the model's performance, you can use metrics like Mean Squared Error (MSE) or R-squared. For simplicity, we'll focus on R-squared, which represents the proportion of variance in the dependent variable that can be predicted from the independent variables.

from sklearn.metrics import r2_score

# Make predictions on the original data
predictions = model.predict(hours_studied)

# Calculate R-squared
r_squared = r2_score(exam_score, predictions)
print(f"R-squared: {r_squared}")

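On the data the model was trained on, R-squared ranges from 0 to 1, with higher values indicating a better fit. A value of 1 means the model perfectly predicts the data; because the sample data here is perfectly linear, expect a value of 1.0 (up to floating-point rounding). On other data, such as a held-out test set, r2_score can also be negative, which means the model predicts worse than always guessing the mean.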
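
R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that formula against r2_score using made-up true values and two made-up sets of predictions (one close, one poor). The helper r2_manual is a hypothetical name used only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_good = np.array([61.0, 69.0, 81.0, 89.0])  # close to the truth
y_bad = np.array([90.0, 60.0, 95.0, 55.0])   # worse than guessing the mean


def r2_manual(actual, predicted):
    """R-squared as 1 - SS_res / SS_tot (illustrative helper)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


good_r2 = r2_manual(y_true, y_good)
bad_r2 = r2_manual(y_true, y_bad)

print(f"Good predictions: {good_r2:.3f} (sklearn: {r2_score(y_true, y_good):.3f})")
print(f"Poor predictions: {bad_r2:.3f} (sklearn: {r2_score(y_true, y_bad):.3f})")
```

The second score comes out below zero, which shows that predictions can be worse than simply predicting the mean of the true values.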

Splitting Data into Training and Testing Sets

To get a more realistic assessment of your model's performance, you should split your data into training and testing sets. The model is trained on the training data and then evaluated on the unseen testing data. This helps you understand how well your model generalizes to new, unseen data. Scikit-learn provides a convenient function for this.

from sklearn.model_selection import train_test_split

# Split data into training and testing sets (e.g., 80% training, 20% testing)
hours_train, hours_test, score_train, score_test = train_test_split(hours_studied, exam_score, test_size=0.2, random_state=42) # random_state for reproducibility

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(hours_train, score_train)

# Predict on the test data
score_pred = model.predict(hours_test)

# Evaluate the model on the test data
r_squared = r2_score(score_test, score_pred)
print(f"R-squared (Test Set): {r_squared}")

The train_test_split() function randomly splits your data. The test_size parameter determines the proportion of the data to use for testing. random_state ensures that the split is reproducible – using the same value will result in the same split each time you run the code. The code then trains the model on the hours_train and score_train data and calculates the R-squared score on the unseen hours_test and score_test data.
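
To see the effect of random_state, you can split the same data twice with the same seed and compare the results. The sketch below rebuilds the sample data so it runs on its own; with 9 samples and test_size=0.2, the test set size is rounded up to 2 samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 10).reshape(-1, 1)  # hours 1..9 as a column
y = 55 + 5 * X.ravel()               # matching exam scores

X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("First split test hours: ", X_test_a.ravel())
print("Second split test hours:", X_test_b.ravel())
print("Train size:", len(X_train_a), "Test size:", len(X_test_a))
```

With only two test samples, a test-set score is a very rough estimate, so on a dataset this small treat it as a demonstration of the workflow rather than a reliable measurement.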

Deep Dive

Explore advanced insights, examples, and bonus exercises to deepen understanding.

Day 7: Data Scientist - Machine Learning Fundamentals - Extended Learning

Lesson Recap: Linear Regression with Scikit-learn

Today, you built your first machine learning model: a linear regression model. You prepared a small dataset with NumPy, then used Scikit-learn to train the model and evaluate its performance. You've experienced the basic workflow: data preparation, model selection, training, and evaluation.

Deep Dive Section: Beyond the Basics of Linear Regression

1. Assumptions of Linear Regression

Linear regression, while powerful, makes several assumptions about the data. Understanding these assumptions is critical for interpreting your model's results and knowing when to use it:

* Linearity: The relationship between the independent and dependent variables is linear. (If it isn't, consider a transformation or a different model.)
* Independence of Errors: Errors (residuals) are independent of each other. (Check for patterns in your residuals.)
* Homoscedasticity: The variance of the errors is constant across all levels of the independent variables. (Use a residual plot to check.)
* Normality of Errors: The errors are normally distributed. (Assess using histograms or QQ-plots of residuals.)

Failing to meet these assumptions can lead to inaccurate predictions or misleading interpretations.
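
A first numeric look at residuals might go as follows. This sketch uses synthetic data (an assumption made for illustration: a linear trend plus normally distributed noise from a seeded generator), so the assumptions hold by construction. The correlation between the predictions and the absolute residuals is only a rough hint about changing spread, not a formal test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: linear trend plus Gaussian noise (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(1, 9, size=(50, 1))
y = 55 + 5 * X.ravel() + rng.normal(0, 3, size=50)

model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

# With an intercept, least-squares residuals on the training data average to ~0
print(f"Mean residual: {residuals.mean():.6f}")
print(f"Residual std:  {residuals.std():.3f}")

# Rough heteroscedasticity hint: does residual size grow with the prediction?
spread_corr = np.corrcoef(predicted, np.abs(residuals))[0, 1]
print(f"Correlation of |residual| with prediction: {spread_corr:.3f}")
```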

2. Feature Scaling (Normalization and Standardization)

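Many machine learning algorithms perform better when features are on comparable scales, because features with larger values can otherwise dominate the model. This matters most for distance-based, gradient-based, and regularized methods. The predictions of plain `LinearRegression` (ordinary least squares) do not change when features are rescaled, although the coefficients do. Scikit-learn offers several scaling methods: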

* Normalization (Min-Max Scaling): Scales features to a range between 0 and 1. Useful when the data distribution is not Gaussian. Use `sklearn.preprocessing.MinMaxScaler`.
* Standardization (Z-score Scaling): Scales features so they have a mean of 0 and a standard deviation of 1. This is a good choice when the data is roughly normally distributed. Use `sklearn.preprocessing.StandardScaler`.
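
The sketch below applies both scalers to a small made-up feature column. It follows a common convention that this lesson does not spell out: fit the scaler on the training data only, then reuse it to transform the test data, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature values for illustration
X_train = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])
X_test = np.array([[2.0], [10.0]])

minmax = MinMaxScaler().fit(X_train)      # learns min=1, max=9 from training data
standard = StandardScaler().fit(X_train)  # learns mean=5 and the standard deviation

train_minmax = minmax.transform(X_train)
test_minmax = minmax.transform(X_test)
train_standard = standard.transform(X_train)

print("Min-max (train):", train_minmax.ravel())
print("Min-max (test): ", test_minmax.ravel())
print("Standardized (train):", np.round(train_standard.ravel(), 3))
```

Test values outside the training range can land outside 0 to 1 after min-max scaling, as the value 10 does here.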

Bonus Exercises

Exercise 1: Data Preparation and Feature Scaling

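1. Load a regression dataset from Scikit-learn (e.g., `load_diabetes` or `fetch_california_housing`). Note that `load_boston` was removed in Scikit-learn 1.2, and `load_iris` is a classification dataset, so neither suits this exercise.
2. Divide the dataset into training and testing sets.
3. Apply either `MinMaxScaler` or `StandardScaler` to your features: fit the scaler on the training data, then transform both the training and testing data.
4. Train a linear regression model.
5. Evaluate the model's performance on the testing set and compare it to the original model (without scaling).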

Exercise 2: Residual Analysis

1. Train a linear regression model on a dataset.
2. Calculate the residuals (the difference between the actual and predicted values).
3. Create a scatter plot of the residuals against the predicted values.
4. Create a histogram or QQ-plot of the residuals.
5. Analyze the plots. Do you see any patterns suggesting violations of the linear regression assumptions? What might they imply?

Real-World Connections

Linear regression and the concepts you've learned are applied in various fields:

* Predicting House Prices: Real estate agents use linear regression, considering factors like square footage, location, and number of bedrooms.
* Sales Forecasting: Businesses predict future sales based on historical data and marketing spend.
* Finance: Analyzing stock prices and predicting future trends.
* Medical Research: Analyzing relationships between patient characteristics and treatment outcomes.
* Weather Prediction: Using linear regression models to understand and predict weather conditions based on factors like pressure and temperature.

Challenge Yourself

Try building a linear regression model with multiple independent variables (multiple linear regression). Investigate how the different features contribute to the prediction (coefficients of the model). Experiment with using polynomial features (e.g., squaring a feature) to capture non-linear relationships.
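
As one possible starting point for the polynomial part of this challenge, the sketch below chains `PolynomialFeatures` and `LinearRegression` with `make_pipeline`. The data is made up and follows y = 1 + 2x + 3x^2 exactly, so the fitted values should recover those numbers; real data will be noisier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up quadratic data: y = 1 + 2x + 3x^2
X = np.arange(0, 6, dtype=float).reshape(-1, 1)
y = 1 + 2 * X.ravel() + 3 * X.ravel() ** 2

# degree=2 adds an x^2 column; include_bias=False leaves the constant
# term to LinearRegression's own intercept
pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipeline.fit(X, y)

linear_step = pipeline.named_steps["linearregression"]
coefficients = linear_step.coef_  # one coefficient each for x and x^2
intercept = linear_step.intercept_
prediction = pipeline.predict(np.array([[6.0]]))[0]

print("Coefficients (x, x^2):", np.round(coefficients, 3))
print("Intercept:", round(float(intercept), 3))
print("Prediction at x = 6:", round(float(prediction), 3))
```

The model is still linear in its coefficients; only the features are non-linear, which is why `LinearRegression` can fit a curve here.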

Further Learning

Explore these topics to deepen your understanding:

* Regularization Techniques: Ridge, Lasso, and Elastic Net regression (which add penalties to the coefficients to prevent overfitting).
* Polynomial Regression: Modeling non-linear relationships using polynomial features.
* Model Evaluation Metrics: Explore different metrics beyond R-squared, such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
* Data Visualization: Learn more advanced plotting techniques to visualize your data and model results effectively. Consider using libraries like Matplotlib and Seaborn.
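
As a small taste of the evaluation metrics mentioned above, the sketch below computes MSE, RMSE, and MAE on made-up values. RMSE is taken here as the square root of MSE with NumPy, which works across Scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true scores and predictions for illustration
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 68.0, 83.0, 89.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # average absolute error

print(f"MSE:  {mse:.3f}")   # MSE:  4.500
print(f"RMSE: {rmse:.3f}")  # RMSE: 2.121
print(f"MAE:  {mae:.3f}")   # MAE:  2.000
```

Unlike R-squared, these metrics are expressed in the units of the target (or its square for MSE), so lower values are better.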

Interactive Exercises

Exercise 1: Data Exploration

Create a scatter plot visualizing the relationship between `hours_studied` and `exam_score` from the sample data. This helps you visually confirm the linear relationship. Use libraries like Matplotlib or Seaborn (if available).

Exercise 2: Model Training and Prediction

Using the sample data, train a linear regression model, make a prediction for a student studying for 6.5 hours, and print the predicted exam score.

Exercise 3: Model Evaluation and Interpretation

Calculate and print the R-squared score for the trained model. Interpret the meaning of the R-squared value in the context of the relationship between hours studied and exam score.

Exercise 4: Data Splitting and Model Evaluation

Split the sample data into training and testing sets using `train_test_split()`. Train a linear regression model using the training data, make predictions on the testing data, and then calculate and print the R-squared score using the testing data. Compare the R-squared score on the testing set to the R-squared score on the whole dataset.
