Lecture 6 explains clustering in Data Mining, including K-Means, K-Medoids, hierarchical clustering, and DBSCAN with diagrams, algorithms, examples, Python code, real-world applications, and comparisons for BS CS, BS AI, BS IT & Data Science students.
Clustering is one of the most fundamental techniques in unsupervised learning. Unlike supervised models that rely on labeled datasets, clustering finds natural groups within data without prior knowledge of categories. Whether it’s grouping customers, identifying abnormal behavior, segmenting images, or analyzing biological data, clustering algorithms are used everywhere.
This lecture dives deep into the four most important clustering algorithms:
- K-Means
- K-Medoids
- Hierarchical Clustering
- DBSCAN
You will learn how they work, when to use them, their strengths, weaknesses, and real-world scenarios.
Introduction to Clustering
What Is Clustering?
Clustering is the process of grouping data points based on similarity.
Each group is called a cluster.
Examples:
- Grouping customers by behavior
- Grouping documents by similarity
- Grouping images by features
Why Use Clustering?
- Discover hidden structures
- Reduce data complexity
- Improve decision-making
- Preprocessing step for other algorithms
Clustering vs Classification
| Classification | Clustering |
|---|---|
| Labeled data | Unlabeled data |
| Predict categories | Discover groups |
| Supervised | Unsupervised |
| Used for known outcomes | Used for exploration |
Types of Clustering Methods
Clustering algorithms are divided into three main families.
1. Partitioning Methods
They divide data into K clusters:
- K-Means
- K-Medoids
2. Hierarchical Methods
They build a tree of clusters:
- Agglomerative
- Divisive
3. Density-Based Methods
They find clusters of arbitrary shapes:
- DBSCAN
- OPTICS
K-Means Clustering
K-Means is the most famous clustering algorithm.
How K-Means Works (Step-by-Step)
- Choose the number of clusters K
- Randomly select K centroids
- Assign each data point to the nearest centroid
- Update centroids (mean of assigned points)
- Repeat until convergence
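To make these steps concrete, here is a minimal from-scratch sketch in NumPy (the feature matrix X, the value of K, and the helper name kmeans are illustrative assumptions; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 5: stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```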
Distance Metrics
Common metrics:
- Euclidean distance
- Manhattan distance
Euclidean formula:
d = √((x1 - x2)² + (y1 - y2)²)
Manhattan formula:
d = |x1 - x2| + |y1 - y2|
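As a quick illustration of both metrics, here is how the distances between two example points, (1, 1) and (4, 5), would be computed with NumPy (the points are chosen arbitrarily):

```python
import numpy as np

p, q = np.array([1, 1]), np.array([4, 5])
euclidean = np.sqrt(((p - q) ** 2).sum())   # √(3² + 4²) = 5.0
manhattan = np.abs(p - q).sum()             # |3| + |4| = 7
print(euclidean, manhattan)
```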
Choosing K (Elbow Method)
Plot K against the inertia (the within-cluster sum of squared distances).
Look for the "elbow": the point where the improvement slows down.
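One common way to produce this plot is to fit K-Means for a range of K values and record the inertia; a sketch with scikit-learn and matplotlib, assuming X is your feature matrix:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    # inertia_ is the within-cluster sum of squared distances for this K
    inertias.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.show()
```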
Advantages of K-Means
- Fast
- Easy to understand
- Works well with large datasets
Limitations
- Requires the number of clusters K to be chosen in advance
- Sensitive to outliers
- Works well only with roughly spherical clusters
Numeric Example (Simple Demonstration)
Dataset:
Points: (1,1), (2,1), (4,3), (5,4)
K = 2
Initial centroids:
C1 = (1,1)
C2 = (5,4)
Clusters form as:
Cluster A: (1,1), (2,1)
Cluster B: (4,3), (5,4)
Centroids update:
New C1 = (1.5,1)
New C2 = (4.5,3.5)
Repeat until stable.
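As a quick sanity check, the same example can be reproduced with scikit-learn by passing the two starting centroids explicitly (a small illustrative script):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
init = np.array([[1, 1], [5, 4]])           # C1 and C2 from the example

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 0 1 1] -> Cluster A and Cluster B
print(km.cluster_centers_)  # [[1.5 1. ] [4.5 3.5]]
```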
K-Medoids (PAM Algorithm)
K-Medoids is similar to K-Means but more robust to noise.
Difference from K-Means
| K-Means | K-Medoids |
|---|---|
| Uses mean | Uses medoid (actual data point) |
| Sensitive to outliers | Robust to outliers |
PAM Algorithm Steps
- Choose K points as medoids
- Assign each data point to the nearest medoid
- Swap medoids with non-medoids to reduce cost
- Continue until no improvement
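The cost that the swap step tries to reduce is the total distance of all points to their nearest medoid. A minimal sketch of that cost function, assuming X is a NumPy array of points and medoid_idx holds the indices of the current medoids:

```python
import numpy as np

def total_cost(X, medoid_idx):
    # distance from every point to every current medoid
    dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    # each point contributes its distance to the closest medoid
    return dists.min(axis=1).sum()

# PAM tries every (medoid, non-medoid) swap and keeps a swap only if it
# lowers total_cost; it stops when no swap brings an improvement.
```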
When to Use K-Medoids
- When dataset contains noise
- When outliers may distort centroids
- When you need guaranteed actual data representatives
Hierarchical Clustering
Hierarchical clustering builds a tree-like structure called a dendrogram.
Two Types
1. Agglomerative
Starts with each point as its own cluster, then merges clusters.
2. Divisive
Starts with one cluster, then splits into smaller clusters.
Dendrogram Explanation
```
    ┌──────────── A
  ┌─┤
  │ └──────────── B
──┤
  │ ┌────────── C
  └─┤
    └────────── D
```
Cutting the dendrogram at different heights gives different numbers of clusters.
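A typical way to build and cut a dendrogram is with SciPy's hierarchy module; the sketch below assumes X is your feature matrix and uses Ward linkage as an example:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# agglomerative merge tree using Ward linkage
Z = linkage(X, method="ward")

dendrogram(Z)
plt.ylabel("Merge distance")
plt.show()

# "cutting" the tree: e.g. ask for exactly 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```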
DBSCAN (Density-Based Clustering)
DBSCAN is powerful for discovering clusters of any shape.
Key Concepts
- Core Point: has at least minPts neighbors within a radius of ε
- Border Point: within ε of a core point, but not a core point itself
- Noise Point: neither core nor border (an outlier)
How DBSCAN Works
- Choose parameters:
  - ε (epsilon) → neighborhood radius
  - minPts → minimum number of neighbors for a core point
- Find core points
- Form clusters around them
- Mark remaining points as noise
Advantages of DBSCAN
- Finds arbitrarily shaped clusters
- Handles noise
- No need to specify K
Limitations
- Difficult to choose ε
- Struggles with clusters of varying density
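A common heuristic for choosing ε is the k-distance plot: sort every point's distance to its k-th nearest neighbor (with k = minPts) and look for the knee of the curve. A sketch with scikit-learn, assuming X is your feature matrix and minPts = 5:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_pts = 5
# distance from each point to its min_pts-th nearest neighbor
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
kth_dist = np.sort(nn.kneighbors(X)[0][:, -1])

plt.plot(kth_dist)
plt.ylabel(f"Distance to {min_pts}th nearest neighbor")
plt.show()  # the "knee" of this curve is a reasonable choice for ε
```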
Comparison Table of Clustering Algorithms
| Algorithm | Needs K? | Handles Noise | Cluster Shape | Speed |
|---|---|---|---|---|
| K-Means | Yes | Poor | Spherical | Fast |
| K-Medoids | Yes | Good | Spherical | Medium |
| Hierarchical | No | Medium | Depends on linkage | Slow |
| DBSCAN | No | Excellent | Arbitrary | Medium |
Practical Case Studies
1. Customer Segmentation
K-Means is commonly used to group customers based on:
- Spending habits
- Demographics
- Browsing behavior
2. Image Segmentation
Hierarchical clustering groups similar pixels.
3. Fraud Detection
DBSCAN identifies abnormal transactions as noise.
Python Code Examples
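All snippets below assume a numeric feature matrix X. As a purely illustrative setup, synthetic data can be generated with scikit-learn:

```python
from sklearn.datasets import make_blobs

# 300 synthetic 2-D points arranged in 3 blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
```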
K-Means
```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)   # K = 3
kmeans.fit(X)
labels = kmeans.labels_         # cluster index for each point
```
K-Medoids
```python
from sklearn_extra.cluster import KMedoids  # requires the scikit-learn-extra package
model = KMedoids(n_clusters=3)
model.fit(X)
```
Hierarchical
```python
from sklearn.cluster import AgglomerativeClustering
hier = AgglomerativeClustering(n_clusters=3)
hier.fit(X)  # hier.labels_ holds the cluster assignments
```
DBSCAN
```python
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)  # eps = ε, min_samples = minPts
db.fit(X)
```
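In scikit-learn, DBSCAN marks noise points with the label -1, which is exactly what the fraud-detection use case above relies on:

```python
labels = db.labels_
noise_points = X[labels == -1]   # points DBSCAN flagged as noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```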
Common Mistakes in Clustering
- Using K-Means without scaling data
- Choosing wrong K
- Using K-Means for non-spherical clusters
- Misinterpreting dendrograms
- Incorrect DBSCAN parameter selection
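For the first mistake in particular, a common remedy is to chain scaling and clustering in a single pipeline (a sketch with scikit-learn, assuming X is your feature matrix):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale features first so no single feature dominates the distance
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10))
labels = pipe.fit_predict(X)
```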
Summary
Lecture 6 covered four primary clustering methods used in Data Mining: K-Means, K-Medoids, Hierarchical Clustering, and DBSCAN. You learned how each algorithm works, the mathematics behind them, their real-world applications, strengths, limitations, and Python implementations. These clustering techniques form the backbone of unsupervised learning.
People also ask:
What is clustering?
Clustering groups similar data points together without labels.
Which clustering algorithm is best?
No single algorithm is best; the right choice depends on the dataset's structure.
Why is K-Means so widely used?
Because it is simple, fast, and works well with large datasets.
When should you use DBSCAN instead of K-Means?
When clusters have irregular shapes or the dataset contains noise.
What is a dendrogram?
A tree-like diagram used in hierarchical clustering.




