Scikit-learn and a First Simple Machine Learning Model: Linear Regression

In this lesson, you'll get started with machine learning using Scikit-learn, a widely used Python library. You'll build and evaluate a simple linear regression model, gaining hands-on practice with data preparation, model training, and evaluation.

Learning Objectives

  • Understand the purpose and functionality of Scikit-learn.
  • Load and prepare a simple dataset for linear regression.
  • Train a linear regression model using Scikit-learn.
  • Evaluate the performance of the trained model.

Lesson Content

Introduction to Scikit-learn

Scikit-learn is a free, open-source Python library for machine learning. It provides tools for classification, regression, clustering, and dimensionality reduction, and its clear documentation and consistent API make it accessible to beginners. If you're using Anaconda, Scikit-learn is likely already installed. Otherwise, install it with pip: pip install scikit-learn. Note that the package is installed under the name scikit-learn but imported in code as sklearn.
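A quick way to confirm the installation worked is to import the library and print its version. This is a minimal sketch; the variable name installed_version is only illustrative.

```python
import sklearn

# If this import succeeds, scikit-learn is installed in the active environment
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```

If the import raises ModuleNotFoundError, the library is not installed in the Python environment you are running.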

Why use Scikit-learn?
* Ease of Use: Simple and consistent interface.
* Efficiency: Optimized for performance.
* Variety: Provides a large number of machine learning algorithms.
* Integration: Integrates well with other Python libraries like NumPy and Pandas.

Data Preparation: Loading and Understanding

Before building a model, you need data! For this lesson, we'll use a simple dataset representing the relationship between the number of hours studied and exam scores. You can imagine this data as a file (e.g., CSV) containing two columns: hours_studied and exam_score.
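If the data did live in a CSV file, loading it could look roughly like the sketch below. The CSV content here is hypothetical and is held in a string so the example runs on its own; with a real file you would pass its path to np.loadtxt instead.

```python
import io

import numpy as np

# Hypothetical CSV content (assumption: a header row, then hours and score)
csv_text = """hours_studied,exam_score
1,60
2,65
3,70
"""

# skiprows=1 skips the header line; each remaining row becomes [hours, score]
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

hours_from_csv = data[:, 0]   # first column: hours studied
scores_from_csv = data[:, 1]  # second column: exam scores

print(hours_from_csv)
print(scores_from_csv)
```

np.loadtxt returns floating-point values by default, which is what Scikit-learn's estimators work with internally.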

Here's how you might represent a small dataset in Python using NumPy:

import numpy as np

# Sample data (hours studied, exam score)
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
exam_score = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100])

# Reshape hours_studied for the model
hours_studied = hours_studied.reshape(-1, 1)  # Important: sklearn expects a 2D array for features

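Notice the reshape() method. Scikit-learn expects your input features (in this case, hours_studied) to be in a 2D array, where each row represents a sample and each column represents a feature. reshape(-1, 1) transforms a 1D array into a single-column 2D array: the 1 asks for one column, and the -1 tells NumPy to infer the number of rows from the array's length. The target (exam_score) can stay 1D.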
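To see the effect of the reshape, it can help to compare array shapes before and after. The names hours_1d and hours_2d below are illustrative only.

```python
import numpy as np

hours_1d = np.array([1, 2, 3, 4])
print(hours_1d.shape)  # (4,)

# One column, with the number of rows inferred from the array length
hours_2d = hours_1d.reshape(-1, 1)
print(hours_2d.shape)  # (4, 1)

# Indexing with np.newaxis produces the same column layout
same_layout = hours_1d[:, np.newaxis]
print(np.array_equal(hours_2d, same_layout))  # True
```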

Building a Linear Regression Model

Linear regression models the relationship between a dependent variable (exam_score) and one or more independent variables (hours_studied) by fitting a linear equation to the observed data. With a single feature, the equation takes the form: exam_score = b0 + b1 * hours_studied, where b0 is the intercept (the predicted score at zero hours) and b1 is the coefficient, or slope (the change in predicted score per additional hour).
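For a single feature, the least-squares values of b0 and b1 have a simple closed form, which the sketch below computes directly with NumPy on the lesson's sample data. This is shown only to illustrate what "fitting a line" means; it is not necessarily how Scikit-learn computes the result internally.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
scores = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100], dtype=float)

# Deviations of each value from its mean
hours_dev = hours - hours.mean()
scores_dev = scores - scores.mean()

# Slope: how the deviations move together, relative to the spread of hours
b1 = np.sum(hours_dev * scores_dev) / np.sum(hours_dev ** 2)

# Intercept: forces the line through the point (mean hours, mean score)
b0 = scores.mean() - b1 * hours.mean()

print(f"b0 = {b0}, b1 = {b1}")  # b0 = 55.0, b1 = 5.0
```

So for this dataset the fitted line is exam_score = 55 + 5 * hours_studied.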

Here's how you build and train a linear regression model using Scikit-learn:

from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()

# Train the model (fit the model to the data)
model.fit(hours_studied, exam_score)

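The fit() method trains the model using your data. LinearRegression uses ordinary least squares: it chooses the intercept (b0) and the coefficient (b1) that minimize the sum of squared differences between the observed and predicted exam scores.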
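After fitting, the learned values are available as attributes on the model. The sketch below repeats the lesson's data so it runs on its own, reads intercept_ and coef_, and checks one prediction by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
exam_score = np.array([60, 65, 70, 75, 80, 85, 90, 95, 100])

model = LinearRegression()
model.fit(hours_studied, exam_score)

b0 = model.intercept_  # a single number, because the target is 1D
b1 = model.coef_[0]    # coef_ holds one entry per feature
print(f"intercept: {b0:.2f}, slope: {b1:.2f}")

# A prediction is just the fitted equation applied to the input
manual_prediction = b0 + b1 * 7.5
model_prediction = model.predict(np.array([[7.5]]))[0]
print(np.isclose(manual_prediction, model_prediction))  # True
```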

Making Predictions and Evaluating the Model

Once the model is trained, you can use it to make predictions on new data.

# Predict exam scores for new hours studied
new_hours = np.array([[7.5], [3.5]])  # New data, already 2D: one row per sample, one column per feature
predicted_scores = model.predict(new_hours)
print(f"Predicted scores for {new_hours.flatten()} hours: {predicted_scores}")

To evaluate the model's performance, you can use metrics like Mean Squared Error (MSE) or R-squared. For simplicity, we'll focus on R-squared, which represents the proportion of variance in the dependent variable that can be predicted from the independent variables.

from sklearn.metrics import r2_score

# Make predictions on the same data the model was trained on
predictions = model.predict(hours_studied)

# Calculate R-squared
r_squared = r2_score(exam_score, predictions)
print(f"R-squared: {r_squared}")

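On the data a linear regression with an intercept was trained on, R-squared falls between 0 and 1, with higher values indicating a better fit; a value of 1 means the model predicts the data perfectly. On new data the score can be negative, which means the model does worse than always predicting the mean of the observed values. Because the sample data in this lesson lies exactly on a straight line, the code above prints an R-squared of 1.0 (up to floating-point rounding).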
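R-squared and MSE can both be computed by hand from the residuals, which makes clear what the metrics measure. The values below are hypothetical and chosen so the fit is imperfect; the last lines show that r2_score can return a negative value when predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed values and predictions (not from the lesson's dataset)
y_true = np.array([60.0, 65.0, 70.0, 75.0])
y_pred = np.array([62.0, 64.0, 71.0, 73.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean

r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y_true)

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))             # True
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))  # True

# Deliberately bad predictions: the scores in reverse order
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)
print(r2_bad < 0)  # True
```

MSE is expressed in squared units of the target (here, squared exam points), while R-squared is unitless, which is one reason the two are often reported together.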

Splitting Data into Training and Testing Sets

To get a more realistic assessment of your model's performance, you should split your data into training and testing sets. The model is trained on the training data and then evaluated on the unseen testing data. This helps you understand how well your model generalizes to new, unseen data. Scikit-learn provides a convenient function for this.

from sklearn.model_selection import train_test_split

# Split data into training and testing sets (e.g., 80% training, 20% testing)
hours_train, hours_test, score_train, score_test = train_test_split(
    hours_studied, exam_score,
    test_size=0.2,    # hold out 20% of the samples for testing
    random_state=42,  # fixed seed so the split is reproducible
)

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(hours_train, score_train)

# Predict on the test data
score_pred = model.predict(hours_test)

# Evaluate the model on the test data
r_squared = r2_score(score_test, score_pred)
print(f"R-squared (Test Set): {r_squared}")

The train_test_split() function shuffles your data and splits it at random. The test_size parameter determines the proportion of the data to use for testing. random_state makes the split reproducible: using the same value gives the same split each time you run the code. The code then trains the model on hours_train and score_train and calculates the R-squared score on the unseen hours_test and score_test. With only nine samples, a test_size of 0.2 holds out just two of them, so treat the resulting score as an illustration of the workflow rather than a reliable estimate.
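The sketch below checks these two properties directly: the sizes of the resulting sets and the reproducibility that random_state provides. The variable names (X, y, and the _a/_b suffixes) are illustrative, and the data is the lesson's nine-point example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 10).reshape(-1, 1)  # hours 1..9 as a column
y = 55 + 5 * X.ravel()               # matching scores 60..100

# Two separate calls with the same random_state
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train_a), len(X_test_a))       # 7 2
print(np.array_equal(X_test_a, X_test_b))  # True
```

Changing random_state to a different integer would generally select different samples for the test set.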
