Mind of Machines Series: Clustering for Insights: K-Means and Hierarchical Clustering
Clustering is a powerful unsupervised learning technique used to identify natural groupings within data. Unlike supervised learning algorithms, clustering doesn’t require labeled data, making it useful for exploring datasets where the structure is not known. In this article, we’ll dive into two of the most popular clustering algorithms: K-Means and Hierarchical Clustering. Both methods are widely used for uncovering patterns in data and gaining insights into its structure.
What is Clustering?
Clustering algorithms aim to group data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. It’s a useful technique in a variety of fields, from customer segmentation to image compression.
We will explore two prominent clustering algorithms:
- K-Means Clustering: A centroid-based approach that partitions the data into k clusters.
- Hierarchical Clustering: A tree-like approach that builds a hierarchy of clusters either by merging or splitting them.
K-Means Clustering
K-Means clustering is one of the simplest and most commonly used clustering algorithms. It partitions the data into k clusters by minimizing the distance between data points and their respective cluster centroids. The number of clusters, k, is defined beforehand.
How K-Means Works
- Choose the number of clusters k.
- Initialize the centroids randomly.
- Assign each data point to the nearest centroid, forming clusters.
- Recalculate the centroids of the clusters.
- Repeat steps 3 and 4 until the centroids no longer change or a maximum number of iterations is reached.
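To make these steps concrete, here is a minimal NumPy-only sketch of the update loop. The function name kmeans_sketch and its arguments are illustrative only, not part of any library, and the sketch assumes no cluster ever ends up empty.

# Minimal NumPy-only sketch of the K-Means loop described above.
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids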
Let’s see how to implement K-Means clustering in Python using scikit-learn.
Example: K-Means Clustering in Python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Create a KMeans model with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# Predict cluster labels for the data points
y_kmeans = kmeans.predict(X)

# Plot the clustered data
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot the centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title("K-Means Clustering Example")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
In this example, we generate synthetic data and cluster it into four groups using K-Means. The centroids of each cluster are marked in red. This method works well when you know the number of clusters in advance and is efficient on large datasets.
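In practice you often don't know k ahead of time. One common heuristic, the elbow method, runs K-Means for several values of k and plots the inertia (within-cluster sum of squared distances), looking for the point where improvement levels off. The range of k values below is an arbitrary choice for illustration.

# Elbow method sketch: plot inertia for several candidate values of k
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.title("Elbow Method for Choosing k")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()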
Hierarchical Clustering
Unlike K-Means, Hierarchical Clustering doesn’t require the number of clusters to be specified in advance. It builds a hierarchy of clusters either by agglomerating smaller clusters (agglomerative clustering) or by splitting larger clusters (divisive clustering).
Agglomerative Clustering
In agglomerative clustering, each data point starts as its own cluster. The algorithm then iteratively merges the closest clusters until all data points belong to a single cluster or the desired number of clusters is reached. This method is commonly visualized using a dendrogram, which shows the hierarchy of merges.
How Hierarchical Clustering Works
- Assign each data point to its own cluster.
- Merge the two closest clusters based on a distance metric (e.g., Euclidean distance).
- Repeat step 2 until a single cluster remains or the desired number of clusters is achieved.
Here’s how we can implement hierarchical clustering in Python using the scipy and scikit-learn libraries.
Example: Hierarchical Clustering in Python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Perform hierarchical clustering
Z = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.show()
In this example, we generate synthetic data and apply agglomerative hierarchical clustering. The dendrogram shows the hierarchy of clusters and helps visualize how the data points are merged at each step.
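If you also want concrete cluster labels rather than just the tree, the dendrogram can be cut into flat clusters with scipy's fcluster. The sketch below rebuilds the same synthetic data and linkage matrix as the example above; cutting into 4 clusters is an assumption made to match the K-Means example.

# Cut the hierarchy into flat cluster labels with fcluster
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
Z = linkage(X, 'ward')

# Ask for exactly 4 flat clusters from the hierarchy (assumed to match the K-Means example)
labels = fcluster(Z, t=4, criterion='maxclust')

plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.title("Flat Clusters Cut from the Dendrogram")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()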
K-Means vs. Hierarchical Clustering
Both K-Means and Hierarchical Clustering are effective clustering techniques, but they have different strengths and weaknesses:
- Number of Clusters: K-Means requires specifying the number of clusters beforehand, while Hierarchical Clustering does not.
- Scalability: K-Means is fast and scales well to large datasets, whereas standard agglomerative clustering works from pairwise distances, so its cost grows roughly quadratically or worse with the number of points.
- Cluster Shapes: K-Means implicitly assumes roughly spherical, similarly sized clusters, which is not always true. Hierarchical Clustering, with an appropriate linkage, can capture more varied cluster structures.
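As a rough way to compare the two methods on the same data, you can compute the silhouette score for each clustering (higher generally indicates better-separated clusters). This is only a sketch on the synthetic data from earlier, not a definitive benchmark; the exact numbers depend on the data and parameters.

# Compare K-Means and agglomerative clustering with the silhouette score
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)

print("K-Means silhouette:      ", silhouette_score(X, kmeans_labels))
print("Agglomerative silhouette:", silhouette_score(X, agglo_labels))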
Conclusion
Clustering is a valuable tool for exploring data and uncovering hidden patterns. K-Means is great for large datasets and cases where you have a sense of how many clusters exist. Hierarchical Clustering, on the other hand, is useful for smaller datasets and when you want to visualize the clustering process through dendrograms. Both methods provide important insights into the structure of your data.