Unsupervised Learning: Advanced Clustering and Dimensionality Reduction

This lesson dives deep into advanced unsupervised learning techniques, focusing on complex clustering algorithms and powerful dimensionality reduction methods. You'll learn how to apply these techniques to real-world datasets, evaluate their performance, and select the best approach for different scenarios.

Learning Objectives

  • Implement and interpret advanced clustering algorithms such as DBSCAN, OPTICS, and spectral clustering.
  • Apply t-SNE and UMAP to visualize high-dimensional data and to extract features.
  • Evaluate clustering performance using both internal and external validation metrics, and understand their limitations.
  • Experiment with data pre-processing and feature selection techniques to optimize unsupervised learning pipelines.

Lesson Content

Advanced Clustering Algorithms

This section goes beyond k-means and hierarchical clustering to cover density-based and graph-based algorithms.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN groups points that lie in dense regions and marks points in low-density regions as outliers (scikit-learn assigns them the label -1). It is particularly useful for finding clusters of arbitrary shape and for handling noise, and it does not need the number of clusters in advance. Key parameters: eps (the maximum distance between two samples for them to count as neighbors) and min_samples (the minimum number of samples in a point's neighborhood, including the point itself, for it to be a core point).

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=150, noise=0.05, random_state=0)
dbs = DBSCAN(eps=0.3, min_samples=5)
clusters = dbs.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=clusters)
plt.title('DBSCAN Clustering')
plt.show()
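Choosing eps is usually the hard part. One common heuristic is to sort every point's distance to its min_samples-th nearest neighbor and look for the "elbow" in that curve. The sketch below approximates that idea; using the 95th percentile in place of reading the elbow off a plot is an assumption made only to keep the example short, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=150, noise=0.05, random_state=0)
min_samples = 5

# Distance from each point to its min_samples-th nearest neighbor.
# Querying with the training data returns the point itself as the first
# neighbor (distance 0), which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Assumption: a high percentile stands in for the visually chosen elbow.
eps_guess = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X)
n_noise = int(np.sum(labels == -1))  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"eps guess: {eps_guess:.3f}, clusters: {n_clusters}, noise points: {n_noise}")
```

In practice it is worth plotting k_distances and checking whether the percentile lands near the bend before trusting the value.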

OPTICS (Ordering Points To Identify the Clustering Structure): OPTICS generalizes DBSCAN. Instead of committing to a single eps, it orders the points and records a reachability distance for each one. Plotting those distances in order gives a reachability plot, in which valleys correspond to clusters, so clusters of varying density can be identified. Key parameters: min_samples (as in DBSCAN) and max_eps (the largest neighborhood radius considered; in scikit-learn it defaults to infinity, so it can be omitted).

from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=10)
optics.fit(X)

reachability = optics.reachability_[optics.ordering_]
labels = optics.labels_[optics.ordering_]

plt.plot(reachability)
plt.title('OPTICS Reachability Plot')
plt.show()
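Because the ordering and reachability distances are computed once, DBSCAN-style labels can be read off for several eps values without refitting. The sketch below uses scikit-learn's cluster_optics_dbscan helper for this; the two eps values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=150, noise=0.05, random_state=0)
optics = OPTICS(min_samples=10).fit(X)

results = {}
for eps in (0.15, 0.3):  # arbitrary example radii
    # Reuse the fitted reachability information; no refit is needed.
    labels_eps = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    results[eps] = labels_eps
    n_clusters = len(set(labels_eps)) - (1 if -1 in labels_eps else 0)
    n_noise = int(np.sum(labels_eps == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```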

Spectral Clustering: Spectral clustering builds a similarity graph over the data, takes the leading eigenvectors of the graph Laplacian derived from the similarity matrix as a low-dimensional embedding, and then clusters the points in that embedding (typically with k-means). It is effective for non-convex clusters and often performs well on data with complex structure. It requires a similarity function (e.g., a Gaussian kernel or a nearest-neighbor graph) and the number of clusters.

from sklearn.cluster import SpectralClustering

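# A nearest-neighbor graph tends to follow curved shapes such as the two moons
# better than the default RBF affinity.
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=0)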
clusters = sc.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=clusters)
plt.title('Spectral Clustering')
plt.show()
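To make the steps concrete, here is a from-scratch sketch of the pipeline: similarity graph, normalized Laplacian, eigenvectors, then k-means. It is a simplified illustration under assumed choices (a symmetrized 10-nearest-neighbor graph and row-normalized eigenvectors) and is not claimed to match scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=150, noise=0.05, random_state=0)
n_clusters = 2

# 1. Similarity graph: symmetric k-nearest-neighbor connectivity matrix.
A = kneighbors_graph(X, n_neighbors=10, include_self=False).toarray()
A = np.maximum(A, A.T)

# 2. Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
degree = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degree)
L = np.eye(len(X)) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]

# 3. Embedding: eigenvectors belonging to the smallest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
embedding = eigvecs[:, :n_clusters]
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

# 4. Cluster the embedded points with k-means.
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(labels))
```

The dense eigendecomposition is fine for a few hundred points; larger datasets would call for sparse matrices and a sparse eigensolver.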

Dimensionality Reduction: t-SNE and UMAP

These algorithms are designed primarily for visualization, especially of high-dimensional datasets in 2D or 3D. However, they can also be used for feature extraction.

t-SNE (t-distributed Stochastic Neighbor Embedding): t-SNE is a non-linear dimensionality reduction technique that is particularly good at revealing local cluster structure in high-dimensional data. It minimizes the Kullback-Leibler divergence between a probability distribution over pairs of high-dimensional points and a distribution over the corresponding low-dimensional pairs. Distances between well-separated clusters and cluster sizes in the resulting plot are not reliable, so interpret them with care. Key parameters: perplexity (roughly the effective number of neighbors each point considers) and the number of optimization iterations (n_iter in older scikit-learn releases, max_iter in newer ones).

from sklearn.manifold import TSNE
import numpy as np

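rng = np.random.default_rng(0)
data = rng.random((100, 100))  # Example 100-dimensional data (uniform noise, so no real clusters are expected)
# The iteration count is left at its default: n_iter was renamed max_iter in recent scikit-learn releases
tsne = TSNE(n_components=2, perplexity=30, random_state=0)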
reduced_data = tsne.fit_transform(data)

plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.title('t-SNE Visualization')
plt.show()
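Since perplexity changes the embedding noticeably, it helps to compare a few values on data that does have structure. The sketch below uses a 300-sample subset of scikit-learn's digits dataset (an assumption chosen to keep the run short) and scores each embedding with trustworthiness, which measures how well local neighborhoods are preserved on a 0 to 1 scale.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

# Small subset so the example runs quickly.
X_digits = load_digits().data[:300]

scores = {}
for perplexity in (5, 30):
    embedding = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=0
    ).fit_transform(X_digits)
    # Higher trustworthiness means nearest neighbors in the embedding
    # were also close in the original space.
    scores[perplexity] = trustworthiness(X_digits, embedding, n_neighbors=5)

for perplexity, score in scores.items():
    print(f"perplexity={perplexity}: trustworthiness={score:.3f}")
```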

UMAP (Uniform Manifold Approximation and Projection): UMAP is a more recent algorithm that is often faster than t-SNE and tends to preserve global structure better. It builds a fuzzy topological representation of the high-dimensional data and then optimizes a low-dimensional representation to preserve this structure. Key parameters: n_neighbors (similar to perplexity in t-SNE), min_dist (controls how tightly the low-dimensional points are packed together).

import umap  # provided by the third-party umap-learn package

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
reduced_data = reducer.fit_transform(data)

plt.scatter(reduced_data[:, 0], reduced_data[:, 1])
plt.title('UMAP Visualization')
plt.show()

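Feature Extraction using Dimensionality Reduction: The reduced dimensions can be used as input features for other machine learning models, which lowers the cost of training and can help when the original feature space is very high-dimensional. UMAP suits this better than t-SNE: a fitted UMAP model can transform unseen data, whereas scikit-learn's TSNE has no transform method and can only embed the data it was fitted on. Because both methods distort distances, check downstream performance rather than assuming an improvement.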
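The pattern of "fit a reducer on training data, then reuse it on new data" can be sketched as a pipeline. PCA is swapped in here as a stand-in reducer purely to avoid the extra umap-learn dependency; umap.UMAP could take its place because it also offers fit and transform. The dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0, stratify=y_digits
)

# scikit-learn's TSNE cannot embed unseen samples, so it cannot be a pipeline step.
tsne_can_transform = hasattr(TSNE(), "transform")
print(f"TSNE has transform(): {tsne_can_transform}")

# Stand-in reducer (PCA) followed by a simple classifier on the reduced features.
model = make_pipeline(PCA(n_components=10, random_state=0), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy on 10 extracted features: {accuracy:.3f}")
```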

Clustering Evaluation Metrics

Evaluating unsupervised learning models requires different techniques than supervised learning. The choice of metric depends on the context and goal.

Internal Validation Metrics:

  • Silhouette Score: Measures how similar each point is to its own cluster compared with the nearest other cluster. Ranges from -1 to 1; higher is better. Requires cluster assignments with at least two clusters, but no ground truth labels.
    from sklearn.metrics import silhouette_score
    print(f'Silhouette Score: {silhouette_score(X, clusters)}')
  • Davies-Bouldin Index: Measures the average similarity between each cluster and its most similar cluster. The minimum is 0; lower is better. Requires cluster assignments, but no ground truth labels.
    from sklearn.metrics import davies_bouldin_score
    print(f'Davies-Bouldin Index: {davies_bouldin_score(X, clusters)}')
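A typical use of internal metrics is choosing the number of clusters. The sketch below scores k-means for several values of k on synthetic blobs; the blob centers and the range of k are assumptions picked so that the right answer is known.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Three well-separated synthetic clusters (centers chosen for illustration).
X_blobs, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=1.0, random_state=0
)

silhouettes = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    silhouettes[k] = silhouette_score(X_blobs, labels_k)
    dbi = davies_bouldin_score(X_blobs, labels_k)
    print(f"k={k}: silhouette={silhouettes[k]:.3f}, Davies-Bouldin={dbi:.3f}")

best_k = max(silhouettes, key=silhouettes.get)
print(f"Best k by silhouette: {best_k}")
```

When the labels come from DBSCAN or OPTICS, note that both metrics treat the noise label -1 as just another cluster, so it is usually better to drop noise points before scoring.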

External Validation Metrics (if ground truth labels are available):

  • Adjusted Rand Index (ARI): Measures the similarity between the clustering and the ground truth, corrected for chance. A score of 1 indicates perfect agreement, scores near 0 indicate agreement no better than random labeling, and negative scores indicate agreement worse than chance. Requires ground truth labels.
    from sklearn.metrics import adjusted_rand_score
    print(f'Adjusted Rand Index: {adjusted_rand_score(y, clusters)}')  # y holds the ground truth labels
  • Normalized Mutual Information (NMI): Measures the mutual information between the clustering and the ground truth, normalized to lie between 0 and 1. Higher is better. Requires ground truth labels.
    from sklearn.metrics import normalized_mutual_info_score
    print(f'Normalized Mutual Information: {normalized_mutual_info_score(y, clusters)}')  # y holds the ground truth labels
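Both external metrics compare groupings, not label values, so a clustering that uses different cluster IDs than the ground truth is not penalized. A minimal sketch with hand-written toy labels (the label lists are made up for illustration):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]      # same grouping, cluster IDs swapped
one_cluster = [0, 0, 0, 0, 0, 0]  # everything lumped into one cluster

ari_renamed = adjusted_rand_score(truth, renamed)
nmi_renamed = normalized_mutual_info_score(truth, renamed)
ari_single = adjusted_rand_score(truth, one_cluster)
nmi_single = normalized_mutual_info_score(truth, one_cluster)

print(f"Swapped IDs:  ARI={ari_renamed:.2f}, NMI={nmi_renamed:.2f}")  # both 1.00
print(f"One cluster:  ARI={ari_single:.2f}, NMI={nmi_single:.2f}")    # both 0.00
```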

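Choosing the right metric: Internal metrics are the only option when ground truth labels are unavailable, while external metrics give a more direct assessment when labels exist. Each has limitations. Silhouette and Davies-Bouldin reward compact, roughly convex clusters, so a correct density-based result on non-convex shapes (such as the two moons above) can score worse than an incorrect k-means partition. External metrics depend on how well the available labels reflect the structure you care about. Consider the data characteristics and the desired outcome (e.g., separating well-defined clusters, identifying outliers).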

Data Pre-processing and Feature Selection for Unsupervised Learning

Data preparation is critical for unsupervised learning. Appropriate pre-processing and feature selection techniques can significantly improve clustering and dimensionality reduction results.

Pre-processing:

  • Scaling: Crucial when features have different scales, because distance-based algorithms are otherwise dominated by the features with the largest numeric range. Common techniques: StandardScaler, MinMaxScaler, RobustScaler (less sensitive to outliers). Scale your features before feeding them to the clustering or dimensionality reduction algorithm.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
  • Handling Missing Values: Impute missing values using methods like mean, median, or more advanced techniques before applying unsupervised algorithms.
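Imputation and scaling are easiest to keep consistent when chained in a pipeline. The sketch below builds a small synthetic dataset with two features on very different scales and a few missing values (all of it made up for illustration), then prepares it for clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical features: one with unit scale, one a thousand times larger.
X_raw = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
X_raw[rng.choice(200, size=10, replace=False), 0] = np.nan  # inject missing values

# Median imputation first, then standardization.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_prepared = prep.fit_transform(X_raw)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_prepared)
print("Any NaN left:", bool(np.isnan(X_prepared).any()))
print("Column std after scaling:", X_prepared.std(axis=0).round(3))
```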

Feature Selection:

  • VarianceThreshold: Removes features whose variance falls below a threshold, since near-constant features carry little information. Apply it to unstandardized data (or after MinMaxScaler): StandardScaler gives every non-constant feature unit variance, so a threshold applied afterwards removes nothing.
    from sklearn.feature_selection import VarianceThreshold
    selector = VarianceThreshold(threshold=0.1)  # example threshold, in the features' original units
    X_selected = selector.fit_transform(X)  # X before standardization
  • Feature Scoring with Mutual Information: Estimates the mutual information between each feature and a target variable. This is a supervised technique: mutual_info_classif (or mutual_info_regression) requires y, so it applies only when labels exist, for example when selecting features ahead of a downstream supervised task.
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif  # or mutual_info_regression for a continuous target
    mi_scores = mutual_info_classif(X_scaled, y, random_state=0)  # y is required
    feature_names = [f'feature_{i}' for i in range(X_scaled.shape[1])]  # placeholder names; X here is a NumPy array
    feature_scores = pd.Series(mi_scores, index=feature_names)
    print(feature_scores.sort_values(ascending=False))  # display in descending order
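  • Principal Component Analysis (PCA) for Feature Extraction: PCA is feature extraction rather than feature selection. Each principal component is a linear combination of all original features, so keeping the top components retains the directions of greatest variance without keeping any original feature unchanged. Standardize the data first so that no single feature dominates the components.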
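How many components to keep is often decided from the cumulative explained variance. The sketch below asks PCA for enough components to cover 95% of the variance on scikit-learn's wine dataset; the dataset and the 95% target are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine = StandardScaler().fit_transform(load_wine().data)

# A float n_components between 0 and 1 keeps the fewest components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_wine)

cumulative = pca.explained_variance_ratio_.cumsum()
print(f"Kept {pca.n_components_} of {X_wine.shape[1]} dimensions")
print("Cumulative explained variance:", cumulative.round(3))

# Each row of components_ holds one component's weights over ALL original features.
print("Loadings shape:", pca.components_.shape)
```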

Iterative Improvement: Experiment with different pre-processing and feature selection combinations to optimize your unsupervised learning pipeline. Use evaluation metrics to compare the results of different configurations.
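One way to structure such experiments is a loop over candidate configurations with a single metric for comparison. The sketch below compares three scaling options for k-means on scikit-learn's wine dataset, using ARI because ground truth labels happen to be available there; the configuration names and dataset are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

configs = {
    "none": FunctionTransformer(),  # identity transform, i.e. no scaling
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
}

results = {}
for name, scaler in configs.items():
    pipe = make_pipeline(scaler, KMeans(n_clusters=3, n_init=10, random_state=0))
    labels = pipe.fit_predict(X_wine)
    results[name] = adjusted_rand_score(y_wine, labels)

for name, ari in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>8}: ARI={ari:.3f}")
```

Without ground truth, an internal metric can take ARI's place, but scores computed in differently scaled feature spaces are not directly comparable, so treat such comparisons as a rough guide.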
