Lecture 7 - Outlier and Anomaly Detection in Data Mining

Lecture 7 explains Outlier and Anomaly Detection in Data Mining including statistical methods (Z-score, IQR), density-based LOF, clustering techniques, and machine learning approaches like Isolation Forest. Includes diagrams, examples, Python code, and case studies.

Outlier and Anomaly Detection plays a critical role in Data Mining and Machine Learning. While previous lectures focused on patterns and clusters, this lecture focuses on identifying data points that do NOT follow patterns. These unusual points often indicate:

Fraud
Cyber attacks
Medical emergencies
Sensor failures
Transaction errors
Equipment breakdowns

Understanding anomalies is crucial for real-world applications from banking security to industrial safety. This lecture explores statistical, distance-based, density-based, and machine learning approaches to detect outliers.

Introduction to Outliers & Anomalies

What Is an Outlier?

An outlier is a data point that is significantly different from the majority of data.

Examples:

A credit card transaction of $10,000 when typical transactions are $20–$200
A patient heart rate of 210 bpm
A sudden spike in server traffic

What Is an Anomaly?

An anomaly is a rare pattern that does not conform to expected behavior.

Outlier = unusual data point
Anomaly = unusual pattern / behavior

Outlier vs Noise vs Novelty

Term	Meaning
Outlier	Data point far away from others
Noise	Random error or measurement variation that carries no useful information
Novelty	A new pattern never seen before

Example:
A new type of cyber-attack → novelty detection.

Types of Outliers

1. Point Outliers

A single abnormal point.

Example:
Temperature: 34°C, 35°C, 36°C, 80°C.

2. Contextual Outliers

A value that is abnormal only in a particular context (such as time or location).

Example:
Temperature 30°C is normal in summer but abnormal in winter.
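
A minimal sketch of contextual detection, assuming hypothetical temperature readings that each carry a season label. It applies the IQR rule (covered later in this lecture) separately within each season. The function name `contextual_outliers` and the data are illustrative, not part of the lecture.

```python
import numpy as np

# Hypothetical readings (°C) and the season in which each was taken.
temps = np.array([28.0, 29.0, 30.0, 31.0, 32.0, 3.0, 4.0, 5.0, 6.0, 7.0, 30.0])
season = np.array(["summer"] * 5 + ["winter"] * 6)

def contextual_outliers(values, context, k=1.5):
    """Flag values outside the IQR fences of their own context group."""
    flags = np.zeros(len(values), dtype=bool)
    for label in np.unique(context):
        mask = context == label
        q1, q3 = np.percentile(values[mask], [25, 75])
        iqr = q3 - q1
        flags[mask] = (values[mask] < q1 - k * iqr) | (values[mask] > q3 + k * iqr)
    return flags

flags = contextual_outliers(temps, season)
print(list(zip(temps[flags], season[flags])))
```

The same 30°C reading is accepted in the summer group and flagged in the winter group, because each value is judged only against readings from its own context.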

3. Collective Outliers

A group of data points that is unusual together, even if each point looks normal on its own.

Example:
10,000 failed login attempts in 5 minutes.
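
One way to catch a collective outlier like this is to count events in a sliding time window. The sketch below assumes event timestamps in seconds; the function name `burst_alerts`, the 5-minute window, and the limit of 100 events are illustrative choices, not values from the lecture.

```python
from collections import deque

def burst_alerts(timestamps, window_seconds=300, max_events=100):
    """Return the timestamps at which the trailing window holds more than max_events events."""
    window = deque()
    alerts = []
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that have fallen out of the trailing window.
        while window[0] <= t - window_seconds:
            window.popleft()
        if len(window) > max_events:
            alerts.append(t)
    return alerts

# Hypothetical log: one failed login per minute for an hour, then 500 within 50 seconds.
normal = [60 * i for i in range(60)]
burst = [3600 + 0.1 * i for i in range(500)]
alerts = burst_alerts(normal + burst)
print(len(alerts), alerts[0])
```

No single failed login is abnormal here; only the count inside the window is.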

Why Outlier Detection Is Important

Fraud Detection

Banks detect:

Strange credit card purchases
Abnormal transaction patterns

Cybersecurity

Detect:

Intrusion attempts
Malware behavior
Sudden spikes in traffic

Healthcare

Anomalies may indicate:

Disease
Sensor malfunction
Critical emergencies

IoT & Industrial Systems

Abnormal vibration readings → machine failure
Unexpected temperature rise → fire risk

Statistical Outlier Detection

The simplest methods summarize the distribution of the data (mean and spread, or quartiles) and flag points that fall far outside it.

Z-Score Method

Z-score measures how far a data point is from the mean.

Formula:

Z = (X – μ) / σ

Rule:

Z > 3 → Outlier
Z < -3 → Outlier

Example:
Dataset:

10, 12, 11, 13, 50

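Mean = 19.2
Std dev ≈ 17.3 (sample standard deviation)

Z-score of 50:

(50 – 19.2) / 17.3 ≈ 1.8

This is below the threshold of 3, so the rule does not flag 50 even though it is clearly unusual. The outlier itself inflates both the mean and the standard deviation, which makes Z-scores unreliable on small samples.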

IQR Method (Recommended for skewed data)

Steps:

Compute Q1 (25% point)
Compute Q3 (75% point)
IQR = Q3 – Q1
Outliers are:

X < Q1 – 1.5(IQR)
X > Q3 + 1.5(IQR)

Example:

Dataset:

4, 5, 6, 7, 100

Q1 = 5
Q3 = 7
IQR = 2

Upper bound:

7 + 3 = 10

100 > 10 → Outlier

Modified Z-Score

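Replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), so a single extreme value cannot distort the score. This makes it more robust than the plain Z-score, especially for small datasets.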
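
A minimal sketch of one common formulation (attributed to Iglewicz and Hoaglin): M = 0.6745 × (X – median) / MAD, where MAD is the median absolute deviation, with |M| > 3.5 as a frequently used cutoff. The function name `modified_z_scores` is illustrative, and the data is the Z-score example dataset.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score based on the median and the median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined for this data")
    return 0.6745 * (values - median) / mad

data = [10, 12, 11, 13, 50]
scores = modified_z_scores(data)
flags = np.abs(scores) > 3.5   # commonly used cutoff, not a universal rule
print(scores)
print(flags)
```

On this dataset the median is 12 and the MAD is 1, so the value 50 scores about 25.6 and is flagged, while the plain Z-score stays below 3.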

Distance-Based Outlier Detection

Euclidean Distance

Points far away from others → outlier.

Useful in KNN-based detection.

KNN Outlier Detection

Compute distance to K nearest neighbors.

High average distance → outlier.
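
A small sketch of this idea using scikit-learn's `NearestNeighbors`. The data and the choice k = 3 are illustrative assumptions. Calling `kneighbors()` without arguments returns the neighbors of each training point and excludes the point itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data: a tight group of five points and one distant point.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [1.1, 1.0], [1.0, 1.2], [8.0, 8.0]])

k = 3
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors()          # shape (n_samples, k), self excluded
knn_score = distances.mean(axis=1)      # average distance to the k nearest neighbors

print(knn_score)
print("Most isolated point:", X[knn_score.argmax()])
```

Turning the score into a yes/no decision still needs a threshold, for example a percentile of the scores, and that choice depends on the dataset.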

Density-Based Outlier Detection

Local Outlier Factor (LOF)

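LOF compares the local density around a point with the local density around its neighbors.

Interpretation:

LOF ≈ 1 → normal (density similar to its neighbors)
LOF > 1.5 → outlier
LOF > 2 → strong outlier

These cutoffs are rules of thumb; a suitable threshold depends on the dataset and on the number of neighbors used.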
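
To inspect LOF values directly, scikit-learn exposes them through the `negative_outlier_factor_` attribute, which stores the negated LOF of each training point. The sketch below assumes synthetic data: 100 clustered points plus one distant point.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data: one dense cluster and a single far-away point (index 100).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_   # flip the sign to get LOF values

print("Median LOF:", np.median(lof_scores))
print("LOF of the distant point:", lof_scores[100])
```

Most points score close to 1, while the distant point scores far higher.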

LOF Text Diagram

Dense Cluster: ●●●●●●●●
Outlier:               ○ (low density)

The outlier (○) lies far from the dense cluster.

Cluster-Based Outlier Detection

K-Means Residual Outliers

After clustering:

Points far from centroids are outliers.
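
A minimal sketch of this approach, assuming two synthetic clusters and one point placed between them. The cutoff (mean distance plus three standard deviations) is an illustrative choice, not a rule from the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two compact clusters and one point between them (index 100).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([10, 10], 0.5, size=(50, 2)),
    [[5.0, 5.0]],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to the centroid of the cluster it was assigned to.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = dist.mean() + 3 * dist.std()   # assumed cutoff
outlier_idx = np.where(dist > threshold)[0]
print(outlier_idx)
```

Because K-Means assigns every point to some cluster, the outlier decision comes entirely from the distance threshold.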

DBSCAN Outliers

DBSCAN labels low-density points as:

Noise = Outlier
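
In scikit-learn, `DBSCAN` marks noise points with the label -1. The sketch below assumes synthetic data, and the `eps` and `min_samples` values are illustrative. Both parameters need tuning for real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: one dense cluster and a single isolated point (index 100).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), [[5.0, 5.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_idx = np.where(labels == -1)[0]   # label -1 means noise

print("Noise points:", noise_idx)
```

Points on the sparse fringe of a cluster can also end up labeled as noise, so the result is sensitive to `eps`.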

Machine Learning Outlier Detection

Isolation Forest

Works by randomly splitting the data.
Outliers are isolated in fewer splits.

Why it works:

Dense points → hard to isolate
Outliers → easy to isolate
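
The intuition can be checked with a toy simulation. This is a simplified one-dimensional sketch, not scikit-learn's implementation: a real Isolation Forest also picks random features and builds many trees on subsamples. The function name `isolation_depth` is illustrative.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    current = np.asarray(values, dtype=float)
    depth = 0
    while len(current) > 1:
        lo, hi = current.min(), current.max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = current[current < split]
        else:
            current = current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
data = np.array([10.0, 12.0, 11.0, 13.0, 50.0])

# Average depth over many random trials for each value.
avg_depth = {x: np.mean([isolation_depth(data, x, rng) for _ in range(1000)])
             for x in data}
print(avg_depth)
```

The value 50 is usually separated by the very first split, while values inside the dense group need more splits on average.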

Isolation Forest Text Diagram

Split 1: Divide data
Split 2: Separate dense region
Split 3: Outlier isolated early → anomaly

Autoencoders (Deep Learning)

Steps:

Autoencoder learns to reconstruct normal data
Outliers reconstruct poorly
High reconstruction error → outlier

Used in:

Fraud detection
Network security
Medical imaging
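
The steps above can be sketched on a very small scale. As an assumption for illustration, scikit-learn's `MLPRegressor` stands in for a deep learning framework, with a single linear hidden unit as the bottleneck. The data, network size, and the 99th-percentile threshold are all hypothetical choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical "normal" data: 3-D points lying close to a single line.
t = rng.uniform(-1, 1, size=(300, 1))
X_normal = t * np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.05, size=(300, 3))

# One hidden unit is the bottleneck; the network is trained to output its own input.
autoencoder = MLPRegressor(hidden_layer_sizes=(1,), activation="identity",
                           solver="lbfgs", max_iter=2000, random_state=0)
autoencoder.fit(X_normal, X_normal)

def reconstruction_error(model, X):
    """Mean squared difference between each input row and its reconstruction."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)

# Assumed threshold: 99th percentile of the error on normal training data.
threshold = np.percentile(reconstruction_error(autoencoder, X_normal), 99)

X_new = np.array([[0.5, 1.0, 1.5],     # follows the normal pattern
                  [1.0, -2.0, 3.0]])   # breaks the pattern
errors = reconstruction_error(autoencoder, X_new)
is_outlier = errors > threshold
print(errors, is_outlier)
```

Practical autoencoders use deeper, non-linear networks, but the decision rule is the same: train on normal data, then flag inputs with high reconstruction error.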

Text-Based Diagram Explanations

Z-Score Distribution Diagram

|-----------|-----------|
-3σ         μ         +3σ

Points beyond ±3σ → outliers

IQR Boxplot Diagram

|----[ Q1 ]-------[ Median ]-------[ Q3 ]----|
                 ○ (outlier)

DBSCAN Diagram

####### Dense Cluster #######
●●●●●●●●●●●●●

Noise (Outlier):
           ○

Real-World Case Studies

1. Banking Fraud Detection

A customer normally purchases:

$20, $40, $80

Suddenly:

$5000 purchase → Outlier → Fraud alert

2. Network Intrusion Detection

Normal:

50–100 requests/min

Attack:

10,000 requests/min → Anomaly

3. Healthcare

Normal heart rate:

60–100 bpm

Reading:

185 bpm → Outlier → emergency

Python Examples for Outlier Detection

1. Z-Score

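import numpy as np

data = np.array([10, 12, 11, 13, 50])
# ddof=1 gives the sample standard deviation (≈ 17.3); NumPy's default ddof=0 gives ≈ 15.4
z = (data - np.mean(data)) / np.std(data, ddof=1)
print(z)                    # z-score of each value
print(data[np.abs(z) > 3])  # empty: the 3-sigma rule misses 50 in this small sample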

2. IQR

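import numpy as np

data = np.array([4, 5, 6, 7, 100])  # dataset from the IQR example above

Q1 = np.percentile(data, 25)  # 5.0
Q3 = np.percentile(data, 75)  # 7.0
IQR = Q3 - Q1

# Keep the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR
outliers = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]
print(outliers)  # [100]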

3. LOF

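import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data: 100 clustered 2-D points plus one distant point
X = np.vstack([np.random.default_rng(0).normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X)  # -1 = outlier, 1 = inlier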

4. Isolation Forest

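import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data: 100 clustered 2-D points plus one distant point
X = np.vstack([np.random.default_rng(0).normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

model = IsolationForest(random_state=0)
model.fit(X)
labels = model.predict(X)   # -1 = outlier, 1 = inlier
outliers = X[labels == -1]  # rows flagged as outliers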

Common Mistakes in Outlier Detection

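Removing all outliers blindly (some are the signal you are looking for, such as fraud)
Using Z-score for skewed data
Using one method for all datasets
Treating every flagged point as an error without checking the domain context
Forgetting to scale data for distance-based methods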
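
The last mistake can be illustrated with a short sketch. It assumes synthetic data with one large-scale feature and one small-scale feature, and uses the average distance to the k nearest neighbors as the outlier score. The helper name `knn_score` and the data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 is on a scale of tens of thousands, feature 1 on a scale below 1.
X = np.column_stack([rng.normal(50_000, 10_000, 200),
                     rng.normal(0.5, 0.1, 200)])
X[0] = [50_000, 3.0]   # abnormal only in the small-scale feature

def knn_score(data, k=5):
    """Average distance from each point to its k nearest neighbors."""
    dist, _ = NearestNeighbors(n_neighbors=k).fit(data).kneighbors()
    return dist.mean(axis=1)

raw_top = knn_score(X).argmax()
scaled_top = knn_score(StandardScaler().fit_transform(X)).argmax()
print("Top outlier without scaling:", raw_top)
print("Top outlier with scaling:", scaled_top)
```

Without scaling, distances are dominated by the large-scale feature and the abnormal row is missed; after standardization it receives the highest score.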

Summary

Lecture 7 explained Outlier & Anomaly Detection using statistical, distance-based, density-based, clustering-based, and machine learning methods. You learned about Z-score, IQR, LOF, Isolation Forest, deep learning approaches, diagrams, examples, and Python implementations. These methods are crucial for fraud detection, cybersecurity, healthcare, and IoT systems.

Next: Lecture 8 – Web Mining & Social Network Mining
