Lecture 7 explains Outlier and Anomaly Detection in Data Mining including statistical methods (Z-score, IQR), density-based LOF, clustering techniques, and machine learning approaches like Isolation Forest. Includes diagrams, examples, Python code, and case studies.
Outlier and Anomaly Detection plays a critical role in Data Mining and Machine Learning. While previous lectures focused on patterns and clusters, this lecture focuses on identifying data points that do NOT follow patterns. These unusual points often indicate:
- Fraud
- Cyber attacks
- Medical emergencies
- Sensor failures
- Transaction errors
- Equipment breakdowns
Understanding anomalies is crucial for real-world applications from banking security to industrial safety. This lecture explores statistical, distance-based, density-based, and machine learning approaches to detect outliers.
Introduction to Outliers & Anomalies
What Is an Outlier?
An outlier is a data point that is significantly different from the majority of data.
Examples:
- A credit card transaction of $10,000 when typical transactions are $20–$200
- A patient heart rate of 210 bpm
- A sudden spike in server traffic
What Is an Anomaly?
An anomaly is a rare pattern that does not conform to expected behavior.
Outlier = unusual data point
Anomaly = unusual pattern / behavior
Outlier vs Noise vs Novelty
| Term | Meaning |
|---|---|
| Outlier | Data point far away from others |
| Noise | Random error; does not follow distribution |
| Novelty | A new pattern never seen before |
Example:
A new type of cyber-attack → novelty detection.
Types of Outliers
1. Point Outliers
A single abnormal point.
Example:
Temperature: 34°C, 35°C, 36°C, 80°C.
2. Contextual Outliers
A value that is abnormal only in a specific context (such as time, season, or location) and would be normal elsewhere.
Example:
Temperature 30°C is normal in summer but abnormal in winter.
3. Collective Outliers
A group of data points that behaves unusually together, even if each point looks normal on its own.
Example:
10,000 failed login attempts in 5 minutes.
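The contextual case can be sketched by running the same check once on all readings and once per context. This is only an illustration under stated assumptions: the temperature readings are made up, `iqr_outliers` is a hypothetical helper, and the check used is the 1.5 × IQR rule covered later in this lecture. The 30°C reading only stands out once the winter values are examined on their own.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR fences (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]

# Made-up readings in °C, grouped by context (season)
readings = {
    "summer": [29, 30, 31, 30, 32, 28],
    "winter": [2, 4, 3, 5, 30, 1, 3, 4],
}

# Ignoring context: 30°C sits inside the overall range, so nothing is flagged
all_values = readings["summer"] + readings["winter"]
global_outliers = iqr_outliers(all_values)

# Per context: the same 30°C reading is flagged among the winter values
contextual_outliers = {season: iqr_outliers(v) for season, v in readings.items()}

print("global:", global_outliers)
print("per season:", contextual_outliers)
```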
Why Outlier Detection Is Important
Fraud Detection
Banks detect:
- Strange credit card purchases
- Abnormal transaction patterns
Cybersecurity
Detect:
- Intrusion attempts
- Malware behavior
- Sudden spikes in traffic
Healthcare
Anomalies may indicate:
- Disease
- Sensor malfunction
- Critical emergencies
IoT & Industrial Systems
Abnormal vibration readings → machine failure
Unexpected temperature rise → fire risk
Statistical Outlier Detection
The simplest methods assume the data follows a known distribution, typically a normal (Gaussian) one, and flag points that are unlikely under it.
Z-Score Method
Z-score measures how far a data point is from the mean, in units of standard deviation.
Formula:
Z = (X – μ) / σ
Rule:
- Z > 3 → Outlier
- Z < -3 → Outlier
Example:
Dataset:
10, 12, 11, 13, 50
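Mean = 19.2
Sample std dev ≈ 17.3
Z-score of 50:
(50 – 19.2) / 17.3 ≈ 1.8
The value 50 is not flagged even though it is clearly unusual: in a small dataset the outlier itself inflates the mean and the standard deviation, which pulls its own Z-score down.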
IQR Method (Recommended for skewed data)
Steps:
- Compute Q1 (25% point)
- Compute Q3 (75% point)
- IQR = Q3 – Q1
- Outliers are:
X < Q1 – 1.5(IQR)
X > Q3 + 1.5(IQR)
Example:
Dataset:
4, 5, 6, 7, 100
Q1 = 5
Q3 = 7
IQR = 2
Upper bound:
7 + 3 = 10
100 > 10 → Outlier
Modified Z-Score
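Replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), so a single extreme value cannot inflate the scale. This makes it more robust than the standard Z-score, especially on small datasets.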
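The lecture does not give a formula for the modified Z-score. A minimal sketch, assuming the commonly cited form M = 0.6745 × (x – median) / MAD with a cutoff of 3.5, could look like this. The helper name `modified_z_scores` and the cutoff are assumptions for illustration, and the data is the same small dataset used in the Z-score example.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score based on the median and MAD (assumed formulation)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (values - median) / mad

data = np.array([10, 12, 11, 13, 50])
scores = modified_z_scores(data)
flagged = data[np.abs(scores) > 3.5]  # 3.5 is a commonly used cutoff, not a fixed rule

print(scores)
print(flagged)
```

Because the median and MAD barely move when 50 is added, this variant flags the value that the plain Z-score misses on the same data.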
Distance-Based Outlier Detection
Euclidean Distance
Points far away from others → outlier.
Useful in KNN-based detection.
KNN Outlier Detection
Compute distance to K nearest neighbors.
High average distance → outlier.
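The two steps above can be sketched with scikit-learn's `NearestNeighbors`. The data, the helper name `knn_outlier_scores`, and the choice k = 3 are assumptions made for this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=3):
    """Average distance from each point to its k nearest neighbors."""
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return distances[:, 1:].mean(axis=1)

# Made-up 2-D data: a tight group plus one distant point
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2], [0.8, 0.9], [8.0, 8.0]])
scores = knn_outlier_scores(X, k=3)
most_unusual = int(np.argmax(scores))

print(scores.round(2))
print("highest score at index", most_unusual)
```

How high a score must be to count as an outlier depends on the dataset; a percentile of the scores is one possible cutoff.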
Density-Based Outlier Detection
Local Outlier Factor (LOF)
LOF compares the local density of a point with the local densities of its nearest neighbors.
Interpretation:
- LOF ≈ 1 → normal
- LOF > 1.5 → outlier
- LOF > 2 → strong outlier
LOF Text Diagram
Dense Cluster: ●●●●●●●●
Outlier: ○ (low density)
Outlier ○ is far from the dense cluster.
Cluster-Based Outlier Detection
K-Means Residual Outliers
After clustering:
- Points far from centroids are outliers.
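A minimal sketch of this idea, assuming synthetic data and a 99th-percentile cutoff on the distance to the assigned centroid. Both the data and the cutoff are illustrative choices, not part of the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Made-up data: two compact clusters plus one point far from both
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
X = np.vstack([cluster_a, cluster_b, [[2.5, 9.0]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of every point to the centroid of its assigned cluster
residuals = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Assumed cutoff: flag the points whose residual is above the 99th percentile
threshold = np.percentile(residuals, 99)
outlier_idx = np.where(residuals > threshold)[0]

print(outlier_idx, residuals[outlier_idx].round(2))
```

A percentile cutoff always flags some points, so in practice the threshold should be checked against domain knowledge.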
DBSCAN Outliers
DBSCAN labels low-density points as:
Noise = Outlier
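In scikit-learn, `DBSCAN` marks noise points with the label -1, so they can be read directly from the fitted labels. The sketch below assumes synthetic data and hand-picked `eps` and `min_samples` values; both parameters need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# Made-up data: one dense cluster plus two isolated points (indices 100 and 101)
dense = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
isolated = np.array([[4.0, 4.0], [-4.0, 3.5]])
X = np.vstack([dense, isolated])

# eps and min_samples are assumed values chosen for this toy data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

noise_idx = np.where(labels == -1)[0]  # DBSCAN marks noise with the label -1
print(noise_idx)
```

The two isolated points appear among the noise indices; sparse points on the edge of the dense cluster may be labeled as noise as well, depending on `eps`.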
Machine Learning Outlier Detection
Isolation Forest
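Works by repeatedly splitting the data at random: each split picks a random feature and a random threshold between that feature's minimum and maximum.
Outliers are isolated in fewer splits than normal points.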
Why it works:
- Dense points → hard to isolate
- Outliers → easy to isolate
Isolation Forest Text Diagram
Split 1: Divide data
Split 2: Separate dense region
Split 3: Outlier isolated early → anomaly
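To make the "fewer splits" idea concrete, here is a toy one-dimensional illustration in plain Python. It is not scikit-learn's implementation; `isolation_depth` and `average_depth` are hypothetical helpers, and the data is the small dataset from the Z-score example.

```python
import random

def isolation_depth(point, values, rng, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    depth = 0
    current = list(values)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def average_depth(point, values, n_trees=200, seed=0):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, values, rng) for _ in range(n_trees)) / n_trees

values = [10, 12, 11, 13, 50]
for v in values:
    print(v, average_depth(v, values))
```

The value 50 is usually separated by the very first split, while values inside the dense group need several splits, which is the signal Isolation Forest turns into an anomaly score.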
Autoencoders (Deep Learning)
Steps:
- Autoencoder learns to reconstruct normal data
- Outliers reconstruct poorly
- High reconstruction error → outlier
Used in:
- Fraud detection
- Network security
- Medical imaging
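The reconstruction-error idea can be sketched without a deep learning framework. The example below is a stand-in, not a production autoencoder: it assumes scikit-learn's `MLPRegressor` with a 2-unit bottleneck trained to reproduce its own input, and synthetic data in which four features are driven by two hidden factors. The last test row breaks that relationship and should reconstruct poorly.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)

def make_normal(n):
    """Made-up 'normal' data: 4 features driven by 2 hidden factors plus small noise."""
    z = rng.normal(size=(n, 2))
    X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + z[:, 1], z[:, 0] - z[:, 1]])
    return X + rng.normal(scale=0.05, size=X.shape)

X_train = make_normal(500)
# Test set: 20 normal rows plus one row that breaks the relationship between features
X_test = np.vstack([make_normal(20), [[-4.0, -4.0, 4.0, 0.0]]])

# A 2-unit bottleneck forces the network to compress; the target is the input itself
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                           max_iter=2000, random_state=0)
autoencoder.fit(X_train, X_train)

# Reconstruction error per row (mean squared difference between input and output)
errors = np.mean((X_test - autoencoder.predict(X_test)) ** 2, axis=1)

print(errors.round(3))
print("largest error at index", int(np.argmax(errors)))
```

Real applications typically use deeper non-linear networks and choose the error threshold from a validation set of known-normal data.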
Text-Based Diagram Explanations
Z-Score Distribution Diagram
|--------|--------|
-3σ      μ      +3σ
Points beyond ±3σ → outliers
IQR Boxplot Diagram
|----[ Q1 ]-------[ Median ]-------[ Q3 ]----|
○ (outlier)
DBSCAN Diagram
####### Dense Cluster #######
●●●●●●●●●●●●●
Noise (Outlier):
○
Real-World Case Studies
1. Banking Fraud Detection
A customer normally purchases:
$20, $40, $80
Suddenly:
$5000 purchase → Outlier → Fraud alert
2. Network Intrusion Detection
Normal:
50–100 requests/min
Attack:
10,000 requests/min → Anomaly
3. Healthcare
Normal heart rate:
60–100 bpm
Reading:
185 bpm → Outlier → emergency
Python Examples for Outlier Detection
1. Z-Score
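import numpy as np
data = np.array([10, 12, 11, 13, 50])
# ddof=1 gives the sample standard deviation, matching the worked example
z = (data - np.mean(data)) / np.std(data, ddof=1)
print(z)  # the Z-score of 50 is about 1.8, below the usual cutoff of 3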
2. IQR
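import numpy as np
data = np.array([4, 5, 6, 7, 100])  # dataset from the IQR example
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
# Keep the values outside the 1.5 * IQR fences
outliers = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]
print(outliers)  # [100]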
3. LOF
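import numpy as np
from sklearn.neighbors import LocalOutlierFactor
X = np.random.RandomState(0).normal(size=(100, 2))  # placeholder data; use your own feature matrix
lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_  # LOF values; well above 1 suggests an outlier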
4. Isolation Forest
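import numpy as np
from sklearn.ensemble import IsolationForest
X = np.random.RandomState(0).normal(size=(100, 2))  # placeholder data; use your own feature matrix
model = IsolationForest(random_state=0)
model.fit(X)
labels = model.predict(X)  # -1 = outlier, 1 = inlier
outliers = X[labels == -1]  # rows flagged as outliers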
Common Mistakes in Outlier Detection
- Removing all outliers blindly
- Using Z-score for skewed data
- Using one method for all datasets
- Misinterpreting anomalies
- Forgetting to scale data for distance-based methods
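The scaling mistake is easy to demonstrate. The sketch below uses made-up (age, income) rows and a hypothetical helper `mean_knn_distance`, and compares a KNN-distance score on raw features with the same score after `StandardScaler`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def mean_knn_distance(X, k=2):
    """Average distance to the k nearest neighbors (excluding the point itself)."""
    distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return distances[:, 1:].mean(axis=1)

# Made-up rows of (age in years, income in dollars); the last age is clearly unusual
X = np.array([[30, 50000], [32, 51000], [31, 49000], [29, 50500], [90, 50200]], dtype=float)

raw_scores = mean_knn_distance(X)
scaled_scores = mean_knn_distance(StandardScaler().fit_transform(X))

# Without scaling, income differences swamp the age difference
print("top outlier, raw features:   ", int(np.argmax(raw_scores)))
print("top outlier, scaled features:", int(np.argmax(scaled_scores)))
```

On the raw features the row with the lowest income gets the highest score because income is measured in much larger units; after scaling, the row with age 90 is ranked first.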
Summary
Lecture 7 explained Outlier & Anomaly Detection using statistical, distance-based, density-based, clustering-based, and machine learning methods. You learned about Z-score, IQR, LOF, Isolation Forest, deep learning approaches, diagrams, examples, and Python implementations. These methods are crucial for fraud detection, cybersecurity, healthcare, and IoT systems.




