
    Mind of Machines Series: Clustering for Insights: K-Means and Hierarchical Clustering

    09 June 2023 - Raviteja Gullapalli




    Clustering is a powerful unsupervised learning technique used to identify natural groupings within data. Unlike supervised learning algorithms, clustering doesn’t require labeled data, making it useful for exploring datasets where the structure is not known. In this article, we’ll dive into two of the most popular clustering algorithms: K-Means and Hierarchical Clustering. Both methods are widely used for uncovering patterns in data and gaining insights into its structure.

    What is Clustering?

    Clustering algorithms aim to group data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. It’s a useful technique in a variety of fields, from customer segmentation to image compression.

    We will explore two prominent clustering algorithms:

    • K-Means Clustering: A centroid-based approach that partitions the data into k clusters.
    • Hierarchical Clustering: A tree-like approach that builds a hierarchy of clusters either by merging or splitting them.

    K-Means Clustering

    K-Means clustering is one of the simplest and most commonly used clustering algorithms. It partitions the data into k clusters by minimizing the distance between data points and their respective cluster centroids. The number of clusters, k, is defined beforehand.
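    In other words, K-Means seeks the assignment of points to clusters that minimizes the within-cluster sum of squared distances, the standard objective (scikit-learn exposes this quantity as the inertia_ attribute):

    \min_{C_1, \ldots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

    where \mu_i is the mean (centroid) of the points in cluster C_i.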

    How K-Means Works

    1. Choose the number of clusters k.
    2. Initialize the centroids randomly.
    3. Assign each data point to the nearest centroid, forming clusters.
    4. Recalculate the centroids of the clusters.
    5. Repeat steps 3 and 4 until the centroids no longer change or a maximum number of iterations is reached.
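    Before turning to a library, here is a minimal NumPy sketch of these five steps. It is illustrative only (it uses naive random initialization and does not handle the edge case of a cluster losing all its points); production implementations such as scikit-learn's add smarter initialization (k-means++) and multiple restarts.

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=42):
        rng = np.random.default_rng(seed)
        # Step 2: initialize centroids by picking k distinct data points at random
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Step 3: assign each point to its nearest centroid
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Step 4: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            # Step 5: stop once the centroids no longer move
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids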

    Let’s see how to implement K-Means clustering in Python using scikit-learn.

    Example: K-Means Clustering in Python

    # Import necessary libraries
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    
    # Generate synthetic data
    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
    
    # Create a KMeans model with 4 clusters
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # n_init pinned for reproducibility across scikit-learn versions
    kmeans.fit(X)
    
    # Predict cluster labels for the data points
    y_kmeans = kmeans.predict(X)
    
    # Plot the clustered data
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
    
    # Plot the centroids
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
    plt.title("K-Means Clustering Example")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()
    

    In this example, we generate synthetic data and cluster it into four groups using K-Means. The centroids of each cluster are marked in red. This method works well when you know the number of clusters in advance and is efficient on large datasets.
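    If you don't know the number of clusters in advance, a common heuristic is the elbow method: fit K-Means for a range of k values, plot the inertia (the within-cluster sum of squared distances, exposed by scikit-learn as the inertia_ attribute), and look for the "elbow" where the curve stops dropping sharply. A quick sketch, reusing X from the example above:

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    # Fit K-Means for k = 1..9 and record the inertia of each fit
    k_values = range(1, 10)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
                for k in k_values]

    plt.plot(k_values, inertias, marker='o')
    plt.title("Elbow Method for Choosing k")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Inertia")
    plt.show()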

    Hierarchical Clustering

    Unlike K-Means, Hierarchical Clustering doesn’t require the number of clusters to be specified in advance. It builds a hierarchy of clusters either by agglomerating smaller clusters (agglomerative clustering) or by splitting larger clusters (divisive clustering).

    Agglomerative Clustering

    In agglomerative clustering, each data point starts as its own cluster. The algorithm then iteratively merges the closest clusters until all data points belong to a single cluster or the desired number of clusters is reached. This method is commonly visualized using a dendrogram, which shows the hierarchy of merges.

    How Hierarchical Clustering Works

    1. Assign each data point to its own cluster.
    2. Merge the two closest clusters based on a distance metric (e.g., Euclidean distance).
    3. Repeat step 2 until a single cluster remains or the desired number of clusters is achieved.

    Here’s how we can implement hierarchical clustering in Python using the scipy and scikit-learn libraries.

    Example: Hierarchical Clustering in Python

    # Import necessary libraries
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.datasets import make_blobs
    
    # Generate synthetic data
    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
    
    # Perform hierarchical clustering
    Z = linkage(X, 'ward')
    
    # Plot the dendrogram
    plt.figure(figsize=(10, 7))
    dendrogram(Z)
    plt.title("Hierarchical Clustering Dendrogram")
    plt.xlabel("Data Points")
    plt.ylabel("Distance")
    plt.show()
    

    In this example, we generate synthetic data and apply agglomerative hierarchical clustering. The dendrogram shows the hierarchy of clusters and helps visualize how the data points are merged at each step.
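    The dendrogram itself is only a visualization; to get flat cluster labels you "cut" the tree at some level. SciPy's fcluster does this directly from the linkage matrix Z computed above; with criterion='maxclust' it cuts so that at most the requested number of clusters remains:

    from scipy.cluster.hierarchy import fcluster

    # Cut the dendrogram into (at most) 4 flat clusters
    labels = fcluster(Z, t=4, criterion='maxclust')
    print(labels[:10])  # cluster label (1 to 4) for the first ten points

    scikit-learn offers the same capability through AgglomerativeClustering, which follows the familiar fit/predict interface.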

    K-Means vs. Hierarchical Clustering

    Both K-Means and Hierarchical Clustering are effective clustering techniques, but they have different strengths and weaknesses:

    • Number of Clusters: K-Means requires specifying the number of clusters beforehand, while Hierarchical Clustering does not.
    • Scalability: K-Means is fast and scales well to large datasets, whereas standard agglomerative implementations build a full pairwise-distance matrix, so their time and memory costs grow at least quadratically with the number of points.
    • Cluster Shapes: K-Means assumes clusters are roughly spherical and similarly sized (it assigns points by Euclidean distance to centroids), which is not always true. Hierarchical Clustering, especially with linkage criteria other than Ward, can capture more complex cluster shapes. A quick way to compare the two on the same data is shown below.
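    A simple way to put both methods side by side is an internal quality metric such as the silhouette score (ranges from -1 to 1, higher is better). A sketch using scikit-learn, reusing X from the examples above:

    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.metrics import silhouette_score

    # Cluster the same data with both algorithms, then score the results
    km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
    agg_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)

    print("K-Means silhouette:      ", silhouette_score(X, km_labels))
    print("Agglomerative silhouette:", silhouette_score(X, agg_labels))

    On well-separated blobs like these, the two scores will typically be very close; differences tend to show up on data with elongated or unevenly sized clusters.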

    Conclusion

    Clustering is a valuable tool for exploring data and uncovering hidden patterns. K-Means is great for large datasets and cases where you have a sense of how many clusters exist. Hierarchical Clustering, on the other hand, is useful for smaller datasets and when you want to visualize the clustering process through dendrograms. Both methods provide important insights into the structure of your data.
