Spectral Clustering Explained: How Eigenvectors Reveal Complex Cluster Structures

Dataemia


Eigenvalues and eigenvectors are key concepts in linear algebra that also play an important role in data science and machine learning. Previously, we discussed how dimensionality reduction can be performed with eigenvalues and eigenvectors of the covariance matrix.

Today, we’re going to discuss another interesting application: How eigenvalues and eigenvectors can be used to perform spectral clustering, which works well with complex cluster structures.

In this article, we’ll explore how eigenvalues and eigenvectors make spectral clustering possible and why this method can outperform traditional K-means.

We’ll begin with a simple visualization that will show you the importance of spectral clustering and motivate you to continue learning how spectral clustering can be performed with eigenvalues and eigenvectors.

Motivation for Spectral Clustering

A great way to learn spectral clustering is to compare it with a traditional clustering algorithm like K-means on a dataset where K-means struggles to perform well. 

Here, we use an artificially generated two-moon dataset where the clusters are curved. The Scikit-learn make_moons function generates two interleaving half-moons in 2-dimensional space. Then, we use the Scikit-learn KMeans and SpectralClustering estimators to perform K-means and spectral clustering. Finally, we compare the cluster visualizations.

Making moon data

# Make moon data
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.05,
                  random_state=0)

plt.figure(figsize=[4.2, 3])
plt.scatter(X[:,0], X[:,1], s=20)
plt.title("Original Moon Data")
plt.savefig("Moon data.png")
Original moon data (Image by author)

The original dataset has two curved cluster structures called moons. That’s why we call it moon data.

Applying K-means to the moon data

# Apply K-means
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)

# Predict cluster index for each data point
labels_kmeans = kmeans.fit_predict(X)

# Visualize Clusters
plt.figure(figsize=[4.2, 3])
plt.scatter(X[:,0], X[:,1], c=labels_kmeans, s=20)
plt.title("K-Means Clustering")
plt.savefig("K-means.png")
K-means clustering on moon data (Image by author)

K-means groups the moon data incorrectly: it mixes points from the two moons because it assigns each point to the nearest cluster center, which favors compact, convex clusters.

Applying spectral clustering to the moon data

# Apply spectral clustering
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(n_clusters=2,
                              affinity='nearest_neighbors',
                              random_state=0)

# Predict cluster index for each data point
labels_spectral = spectral.fit_predict(X)

# Visualize Clusters
plt.figure(figsize=[4.2, 3])
plt.scatter(X[:,0], X[:,1], c=labels_spectral, s=20)
plt.title("Spectral Clustering")
plt.savefig("Spectral.png")
Spectral clustering on moon data (Image by author)

Now the data points are correctly assigned to the two moons, matching the original structure. Spectral clustering works well on complex cluster structures because the eigenvectors of the Laplacian matrix capture how the points are connected.

So far, we have implemented spectral clustering using the built-in SpectralClustering algorithm in Scikit-learn. Next, you will learn how to implement spectral clustering from scratch. This will help you understand how eigenvalues and eigenvectors work behind the scenes in the algorithm.

What is Spectral Clustering?

Spectral clustering groups data points based on their similarities instead of raw distances. This allows it to reveal non-linear, complex cluster structures without relying on the compactness assumptions of traditional K-means clustering.

The intuition behind performing spectral clustering is as follows:

Steps to perform spectral clustering

  1. Get data
  2. Build the similarity matrix
  3. Build the degree matrix
  4. Build the Laplacian matrix (graph Laplacian)
  5. Find eigenvalues and eigenvectors of the Laplacian matrix. Eigenvectors reveal cluster structure (how data points group together), acting as new features, and eigenvalues indicate the strength of cluster separation
  6. Select the most important eigenvectors to embed the data in a lower dimension (dimensionality reduction)
  7. Apply K-means on the new feature space (clustering)

Spectral clustering combines dimensionality reduction and K-means clustering. We embed the data in a lower-dimensional space (where clusters are easier to separate) and then perform K-means clustering on the new feature space. In summary, K-means clustering works in the original feature space while spectral clustering works in the new reduced feature space.

Implementing Spectral Clustering — Step by Step

We’ve summarized the steps of performing spectral clustering with eigenvalues and eigenvectors of the Laplacian matrix. Let’s implement these steps with Python. 

1. Get data

We’ll use the same data as previously used.

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.05,
                  random_state=0)

2. Build the similarity (affinity) matrix

Spectral clustering groups data points based on their similarities. Therefore, we need to measure similarity between data points and include these values in a matrix. This matrix is called the similarity matrix (W). Here, we measure similarity using a Gaussian kernel.

If you have n data points, the shape of W is (n, n). Each value represents the similarity between two data points; higher values mean the points are more similar.

from sklearn.metrics.pairwise import rbf_kernel

W = rbf_kernel(X, gamma=20)
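As a quick sanity check (a sketch reusing the same X and gamma as above), we can confirm what rbf_kernel returns: a symmetric (n, n) matrix whose (i, j) entry is the Gaussian similarity exp(-gamma * ||xi - xj||^2).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)

# W is (n, n) with 1.0 on the diagonal
# (every point is maximally similar to itself)
print(W.shape)                       # (400, 400)
print(np.allclose(np.diag(W), 1.0))  # True

# Entry (i, j) is the Gaussian similarity exp(-gamma * ||x_i - x_j||^2)
manual = np.exp(-20 * np.sum((X[0] - X[1]) ** 2))
print(np.isclose(W[0, 1], manual))   # True
```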

3. Build the degree matrix

The degree matrix (D) contains the sum of similarities for each node. This is a diagonal matrix and each diagonal value shows the total similarity of that point to all other points. All off-diagonal elements are zero. The shape of the degree matrix is also (n, n).

import numpy as np

D = np.diag(np.sum(W, axis=1))

np.sum(W, axis=1) sums each row of the similarity matrix.
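A small check (reusing the W built above) confirms the two properties just described: D is diagonal, and each diagonal entry is that point's total similarity to all points.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
D = np.diag(np.sum(W, axis=1))

# Off-diagonal entries are all zero
print(np.count_nonzero(D - np.diag(np.diag(D))))  # 0

# Each diagonal entry is that point's total similarity
print(np.isclose(D[0, 0], W[0].sum()))  # True
```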

4. Build the Laplacian matrix

The Laplacian matrix (L) represents the structure of the similarity graph, where nodes represent each data point, and edges connect similar points. So, this matrix is also called the graph Laplacian and is defined as follows. 

Calculating the Laplacian matrix (Image by author)

In Python, it is 

L = D - W

Defining L as D - W mathematically ensures that spectral clustering will find groups of data points that are strongly connected within a group but weakly connected to other groups.

The Laplacian matrix (L) is also an (n, n) square matrix. This property is important for L as eigendecomposition is defined only for square matrices.
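Before moving on, we can verify some useful properties of L numerically (a sketch reusing the matrices built above): L is symmetric, which is what lets us use np.linalg.eigh in the next step, and every row sums to zero, which means the all-ones vector is always an eigenvector with eigenvalue 0.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
D = np.diag(np.sum(W, axis=1))
L = D - W

# L is symmetric, so np.linalg.eigh applies and all eigenvalues are real
print(np.allclose(L, L.T))            # True

# Each row sums to zero: D[i, i] is exactly the sum of row i of W
print(np.allclose(L.sum(axis=1), 0))  # True

# Hence the all-ones vector is an eigenvector with eigenvalue 0
print(np.allclose(L @ np.ones(len(X)), 0))  # True
```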

5. Eigendecomposition of the Laplacian matrix

Eigendecomposition of the Laplacian matrix is the process of decomposing (factorizing) that matrix into eigenvalues and eigenvectors [ref: Eigendecomposition of a Covariance Matrix with NumPy].

If the Laplacian matrix (L) has n eigenvectors, we can decompose it as:

Eigendecomposition of L (Image by author)

Where:

  • X = matrix of eigenvectors
  • Λ = diagonal matrix of eigenvalues

The matrices X and Λ can be represented as follows:

Matrices of eigenvectors and eigenvalues (Image by author)

The vectors x1, x2, …, xn are the eigenvectors and λ1, λ2, …, λn are their corresponding eigenvalues.

The eigenvalues and eigenvectors come in pairs; such a pair is known as an eigenpair. So, the matrix L can have multiple eigenpairs [ref: Eigendecomposition of a Covariance Matrix with NumPy].

The following eigenvalue equation shows the relationship between L and one of its eigenpairs.

Eigenvalue equation of L (Image by author)

Where:

  • L = Laplacian matrix (should be a square matrix)
  • x = eigenvector
  • λ = eigenvalue (scaling factor)

Let’s calculate all eigenpairs of the Laplacian matrix.

eigenvalues, eigenvectors = np.linalg.eigh(L)
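We can confirm numerically that each returned column really satisfies the eigenvalue equation, and that eigh() sorts the eigenvalues in ascending order (a quick check on the decomposition above).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
L = np.diag(W.sum(axis=1)) - W

eigenvalues, eigenvectors = np.linalg.eigh(L)

# Column i of `eigenvectors` pairs with eigenvalues[i] and satisfies
# the eigenvalue equation L @ x = lambda * x
x, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(L @ x, lam * x))        # True

# eigh() returns eigenvalues sorted in ascending order
print(np.all(np.diff(eigenvalues) >= 0))  # True
```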

6. Select the most important eigenvectors

In spectral clustering, the algorithm uses the eigenvectors corresponding to the smallest eigenvalues of the Laplacian matrix, so we need to select those columns from the eigenvectors matrix.

The eigh() function returns eigenvalues (and the matching eigenvectors) in ascending order, so we only need to look at the first few entries of the eigenvalues vector.

print(eigenvalues)
First few eigenvalues of L (Image by author)

We pay attention to the difference between consecutive eigenvalues, known as the eigengap. According to the eigengap heuristic, the optimal number of clusters k is chosen at the point where the largest jump occurs between successive eigenvalues.

If there are k very small eigenvalues followed by a large gap, there will be k clusters! In our example, the first two eigenvalues are close to zero and then a clear gap follows, suggesting two clusters, which is exactly what we expect. This is the role of eigenvalues in spectral clustering: they help us decide both the number of clusters and which eigenvectors to keep.
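The eigengap heuristic can be sketched in a few lines (the cut-off of ten eigenvalues here is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
W = rbf_kernel(X, gamma=20)
L = np.diag(W.sum(axis=1)) - W
eigenvalues, _ = np.linalg.eigh(L)

# Gaps between consecutive eigenvalues, among the first few
gaps = np.diff(eigenvalues[:10])

# The largest gap after the first k eigenvalues suggests k clusters
k = int(np.argmax(gaps)) + 1
print("Suggested number of clusters:", k)
```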

We select the first two eigenvectors corresponding to these small eigenvalues.

k = 2
U = eigenvectors[:, :k]
U contains two eigenvectors (Image by author)

These two eigenvectors in the matrix U represent a new feature space called spectral embedding, where clusters become linearly separable. Here is the visualization of spectral embedding. 

import matplotlib.pyplot as plt

plt.figure(figsize=[4.2, 3])
plt.scatter(U[:,0], U[:,1], s=20)
plt.title("Spectral Embedding")
plt.xlabel("Eigenvector 1")
plt.ylabel("Eigenvector 2")
plt.savefig("Spectral embedding.png")
Visualization of spectral embedding (Image by author)

This plot shows how eigenvectors transform the original data into a new space where clusters become linearly separable. 

7. Apply K-means on spectral embedding

Now, we can simply apply K-means on the spectral embedding (the new eigenvector space) to get cluster labels, and then assign those labels to the original data points to form the final clusters. K-means works well here because the clusters are linearly separable in this space.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=k, random_state=0)
labels_spectral = kmeans.fit_predict(U)
# U represents spectral embedding

plt.figure(figsize=[4.2, 3])
# Assign cluster labels to original data
plt.scatter(X[:,0], X[:,1], c=labels_spectral, s=20)
plt.title("Spectral Clustering")
plt.savefig("Spectral Manual.png")
Spectral clustering from eigendecomposition (Image by author)

This is the same as what we got from the Scikit-learn version! 

Choosing the Right Value for Gamma

When creating the similarity matrix or measuring similarity using a Gaussian kernel, we need to define the right value for the gamma hyperparameter, which controls how quickly similarity decreases with distance between data points.

from sklearn.metrics.pairwise import rbf_kernel

W = rbf_kernel(X, gamma=?)

For small gamma values, similarity decreases slowly with distance, so many points appear similar and the clusters blur into each other. This results in incorrect cluster structures.

For large gamma values, similarity decreases very fast and only very close points remain connected, so the similarity graph can fragment and clusters may break apart.

For medium values, you’ll get balanced clusters.

It is better to try multiple values, such as 0.1, 0.5, 1, 5, 10, 15, and visualize the clustering results to choose the best.
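One way to run such a sweep (a sketch: since this toy dataset has known labels y, we score each gamma with the adjusted Rand index; on real data there is no ground truth, so you would inspect the cluster plots instead):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

scores = []
for gamma in [0.1, 0.5, 1, 5, 10, 15]:
    # affinity='rbf' builds the Gaussian similarity matrix internally
    model = SpectralClustering(n_clusters=2, affinity='rbf',
                               gamma=gamma, random_state=0)
    labels = model.fit_predict(X)
    # Score against the known labels; 1.0 means a perfect match
    scores.append(adjusted_rand_score(y, labels))
    print(f"gamma={gamma}: ARI={scores[-1]:.3f}")
```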

Closing Thoughts

In spectral clustering, a dataset is represented as a graph instead of a collection of points. In that graph, each data point is a node and the lines (edges) between nodes define how similar points connect together.

Moon dataset as a graph (Image by author)

The spectral clustering algorithm needs this graph representation in a mathematical form. That’s why we’ve constructed a similarity (affinity) matrix (W). Each value in that matrix measures the similarity between data points. Large values in the matrix mean two points are very similar, while small values mean two points are very different.

Next, we’ve built the degree matrix (D), which is a diagonal matrix where each diagonal value shows the total similarity of that point to all other points.

Using the degree matrix and the similarity matrix, we’ve constructed the graph Laplacian matrix, which captures the structure of the graph and is essential for spectral clustering. 

We’ve computed the eigenvalues and eigenvectors of the Laplacian matrix. The eigenvalues help choose the number of clusters and identify which eigenvectors to keep; they also indicate the strength of cluster separation. The eigenvectors reveal the cluster structure (how data points group together) and define a new feature space in which strongly connected points in the graph end up close together. Clusters become easier to separate, and K-means works well in this new space.

Here is the complete workflow of spectral clustering. 

Dataset → Similarity Graph → Graph Laplacian → Eigenvectors → Clusters
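This workflow can be condensed into one short function (a sketch using the same choices as this article: a Gaussian similarity matrix, the unnormalized Laplacian L = D - W, and a fixed k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, k=2, gamma=20):
    # Dataset -> similarity graph -> graph Laplacian -> eigenvectors -> clusters
    W = rbf_kernel(X, gamma=gamma)        # similarity graph
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigenvectors[:, :k]               # spectral embedding
    return KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(U)

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = spectral_clustering(X)
```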


This is the end of today’s article.

Please let me know if you have any questions or feedback.

See you in the next article. Happy learning to you!

Designed and written by: 
Rukshan Pramoditha

2025-03-08


