Lecture 7 – Outlier and Anomaly Detection in Data Mining

Lecture 7 explains Outlier and Anomaly Detection in Data Mining including statistical methods (Z-score, IQR), density-based LOF, clustering techniques, and machine learning approaches like Isolation Forest. Includes diagrams, examples, Python code, and case studies.

Outlier and Anomaly Detection plays a critical role in Data Mining and Machine Learning. While previous lectures focused on patterns and clusters, this lecture focuses on identifying data points that do NOT follow patterns. These unusual points often indicate:

  • Fraud
  • Cyber attacks
  • Medical emergencies
  • Sensor failures
  • Transaction errors
  • Equipment breakdowns

Understanding anomalies is crucial for real-world applications from banking security to industrial safety. This lecture explores statistical, distance-based, density-based, and machine learning approaches to detect outliers.

Introduction to Outliers & Anomalies

What Is an Outlier?

An outlier is a data point that is significantly different from the majority of data.

Examples:

  • A credit card transaction of $10,000 when typical transactions are $20–$200
  • A patient heart rate of 210 bpm
  • A sudden spike in server traffic

What Is an Anomaly?

An anomaly is a rare pattern that does not conform to expected behavior.

Outlier = unusual data point
Anomaly = unusual pattern / behavior

Outlier vs Noise vs Novelty

Term    | Meaning
Outlier | Data point far away from others
Noise   | Random error; does not follow distribution
Novelty | A new pattern never seen before

Example:
A new type of cyber-attack → novelty detection.

Types of Outliers

1. Point Outliers

A single abnormal point.

Example:
Temperature: 34°C, 35°C, 36°C, 80°C.

2. Contextual Outliers

Whether a value is an outlier depends on its context.

Example:
Temperature 30°C is normal in summer but abnormal in winter.
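
A minimal sketch of the contextual idea, assuming hypothetical temperature readings that each carry a season label. Every reading is compared with the mean and spread of its own season rather than with the whole dataset. The data, the `contextual_flags` helper, and the cutoff of 2.5 are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Hypothetical readings (°C) with a season label as the context.
temps = np.array([29.0, 31.0, 30.0, 32.0, 28.0, 30.0, 31.0, 29.0, 30.0, 31.0,
                  2.0, 4.0, 3.0, 5.0, 1.0, 3.0, 4.0, 2.0, 3.0, 30.0])
seasons = np.array(["summer"] * 10 + ["winter"] * 10)

def contextual_flags(values, context, threshold=2.5):
    """Flag values whose Z-score within their own context group exceeds the threshold."""
    flags = np.zeros(len(values), dtype=bool)
    for group in np.unique(context):
        mask = context == group
        group_vals = values[mask]
        z = (group_vals - group_vals.mean()) / group_vals.std(ddof=1)
        flags[mask] = np.abs(z) > threshold
    return flags

flags = contextual_flags(temps, seasons)
print(temps[flags], seasons[flags])  # only the 30°C winter reading is flagged
```

The same 30°C value appears several times in summer without being flagged; only the winter occurrence stands out, because the comparison is made within each season.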

3. Collective Outliers

A group of data points behaves unusually together, even if each point looks unremarkable on its own.

Example:
10,000 failed login attempts in 5 minutes.
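
One way to catch this kind of collective outlier is to count events in a sliding time window. The sketch below assumes hypothetical per-minute counts of failed logins and an assumed alert limit of 1,000 per 5-minute window; both would need to be chosen from real traffic.

```python
import numpy as np

# Hypothetical failed-login counts per minute; the burst in the middle is the anomaly.
failed_per_min = np.array([3, 2, 4, 1, 3, 2000, 2100, 1900, 2050, 1950, 2, 3, 1])

window = 5  # minutes, matching the 5-minute example above
# Rolling sum over each 5-minute window via convolution with a window of ones.
rolling = np.convolve(failed_per_min, np.ones(window, dtype=int), mode="valid")

limit = 1000  # assumed alert threshold for failed logins per window
alert_windows = np.flatnonzero(rolling > limit)

print(rolling)
print("windows (by start minute) over the limit:", alert_windows)
```

A single failed login is not suspicious; the signal is the volume inside a window, which is why the check runs on the rolling sum rather than on individual events.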


Why Outlier Detection Is Important

Fraud Detection

Banks detect:

  • Strange credit card purchases
  • Abnormal transaction patterns

Cybersecurity

Detect:

  • Intrusion attempts
  • Malware behavior
  • Sudden spikes in traffic

Healthcare

Anomalies may indicate:

  • Disease
  • Sensor malfunction
  • Critical emergencies

IoT & Industrial Systems

Abnormal vibration readings → machine failure
Unexpected temperature rise → fire risk


Statistical Outlier Detection

The simplest methods assume the data follows a known distribution, typically the normal distribution.

Z-Score Method

Z-score measures how far a data point is from the mean, in units of standard deviation.

Formula:
Z = (X – μ) / σ
Rule:
  • Z > 3 → Outlier
  • Z < -3 → Outlier

Example:
Dataset:

10, 12, 11, 13, 50

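Mean = 19.2
Sample std dev ≈ 17.25

Z-score of 50:

(50 – 19.2) / 17.25 ≈ 1.8

This is below the threshold of 3, so 50 is not flagged even though it is clearly unusual. The extreme value inflates both the mean and the standard deviation, which partly hides it. The Z-score rule is therefore unreliable on very small datasets.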
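
The result for this small dataset can be checked numerically. The sketch below assumes the sample standard deviation (`ddof=1`). It also computes a known upper bound: with n points, no |Z| can exceed (n – 1)/√n, which is about 1.79 for n = 5, so the Z > 3 rule can never fire here.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 50])
n = len(data)

z = (data - data.mean()) / data.std(ddof=1)  # sample standard deviation
max_possible = (n - 1) / np.sqrt(n)          # upper bound on |Z| for n points

print(np.round(z, 2))          # Z-score of each value
print(round(max_possible, 2))  # largest |Z| any point could reach with n = 5
print(np.any(np.abs(z) > 3))   # whether anything crosses the threshold
```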

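IQR Method

The interquartile range (IQR) method uses quartiles instead of the mean and standard deviation, so extreme values affect it far less.

Steps: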

  1. Compute Q1 (25% point)
  2. Compute Q3 (75% point)
  3. IQR = Q3 – Q1
  4. Outliers are:
     X < Q1 – 1.5 × IQR
     X > Q3 + 1.5 × IQR

Example:

Dataset:

4, 5, 6, 7, 100

Q1 = 5
Q3 = 7
IQR = 2

Lower bound:

Q1 – 1.5 × IQR = 5 – 3 = 2

Upper bound:

Q3 + 1.5 × IQR = 7 + 3 = 10

100 > 10 → Outlier

Modified Z-Score

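The modified Z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD), which extreme values barely affect. This makes it more robust than the standard Z-score, especially for small datasets.

Formula:
M = 0.6745 × (X – median) / MAD
Rule:
  • |M| > 3.5 → Outlier (a commonly used cutoff)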
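
A minimal sketch of the modified Z-score, assuming the commonly used definition based on the median and the median absolute deviation (MAD), with the constant 0.6745 and a cutoff of 3.5. The `modified_z_scores` helper is a hypothetical name, and the dataset is the one from the Z-score example.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined for this data")
    return 0.6745 * (values - median) / mad

data = [10, 12, 11, 13, 50]
m = modified_z_scores(data)
print(np.round(m, 2))
print(np.abs(m) > 3.5)  # only the value 50 exceeds the cutoff
```

On this dataset the standard Z-score of 50 stays below 3, while the modified score is far above 3.5, because the median and MAD are not pulled toward the extreme value.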

Distance-Based Outlier Detection

Euclidean Distance

Points far away from others → outlier.

Useful in KNN-based detection.

KNN Outlier Detection

Compute distance to K nearest neighbors.

High average distance → outlier.
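
A minimal sketch of this score using scikit-learn's `NearestNeighbors`. The synthetic data, the choice of k = 5, and the `knn_score` name are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical 2-D data: 50 points around the origin plus one distant point.
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])

k = 5
# Ask for k + 1 neighbours because each point is its own nearest neighbour (distance 0).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
knn_score = distances[:, 1:].mean(axis=1)  # average distance to the k nearest neighbours

print("most isolated index:", int(np.argmax(knn_score)))
```

How high a score must be to count as an outlier is a separate decision, for example a percentile of the scores or a domain-specific limit.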

Density-Based Outlier Detection

Local Outlier Factor (LOF)

LOF compares density of a point with its neighbors.

Interpretation:

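  • LOF ≈ 1 → normal (density similar to its neighbors)
  • LOF > 1.5 → outlier
  • LOF > 2 → strong outlier

These cutoffs are rules of thumb rather than fixed thresholds; suitable values depend on the dataset.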

LOF Text Diagram

Dense Cluster: ●●●●●●●●
Outlier:               ○ (low density)

The outlier (○) is far from the dense cluster.

Cluster-Based Outlier Detection

K-Means Residual Outliers

After clustering:

  • Points far from centroids are outliers.
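
A minimal sketch of this idea with scikit-learn's `KMeans`. The synthetic data, the two clusters, and the 99th-percentile cutoff are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: two tight clusters plus one point far from both.
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([10, 10], 0.5, size=(50, 2)),
    [[5.0, -6.0]],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance from each point to the centroid of its assigned cluster.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Assumed cutoff: flag points beyond the 99th percentile of distances.
cutoff = np.percentile(dist, 99)
outlier_idx = np.flatnonzero(dist > cutoff)
print(outlier_idx)
```

K-Means assigns every point to some cluster, so outliers only become visible through the distance step; they also pull centroids toward themselves, which is one reason to treat this as a rough screen.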

DBSCAN Outliers

DBSCAN labels low-density points as:

Noise = Outlier
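
In scikit-learn, `DBSCAN` marks noise points with the label -1. The sketch below assumes synthetic data and illustrative values for `eps` and `min_samples`, which normally need tuning for each dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical data: one dense cluster plus one isolated point.
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), [[5.0, 5.0]]])

# eps and min_samples are illustrative choices, not recommended defaults.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
noise_idx = np.flatnonzero(db.labels_ == -1)  # label -1 means noise
print(noise_idx)
```

Points on the sparse fringe of a cluster can also be labelled noise, so the result depends strongly on the two parameters.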

Machine Learning Outlier Detection

Isolation Forest

Works by randomly splitting the data.
Outliers are isolated in fewer splits.

Why it works:

  • Dense points → hard to isolate
  • Outliers → easy to isolate

Isolation Forest Text Diagram

Split 1: Divide data
Split 2: Separate dense region
Split 3: Outlier isolated early → anomaly

Autoencoders (Deep Learning)

Steps:

  1. Autoencoder learns to reconstruct normal data
  2. Outliers reconstruct poorly
  3. High reconstruction error → outlier

Used in:

  • Fraud detection
  • Network security
  • Medical imaging
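
The three steps can be sketched without a deep learning framework. The example below is a stand-in, not a deep autoencoder: it uses scikit-learn's `MLPRegressor` with a one-unit linear bottleneck, trained to reproduce its own input. The synthetic data, the network size, and the 99th-percentile cutoff are assumptions; a production model would typically be built in a framework such as Keras or PyTorch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Hypothetical "normal" data: 3 features driven by one hidden factor plus small noise.
t = rng.normal(size=(300, 1))
X_normal = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(300, 3))

# Step 1: a tiny linear autoencoder (3 inputs -> 1-unit bottleneck -> 3 outputs)
# learns to reconstruct normal data.
ae = MLPRegressor(hidden_layer_sizes=(1,), activation="identity",
                  solver="lbfgs", max_iter=2000, random_state=0)
ae.fit(X_normal, X_normal)

def reconstruction_error(model, X):
    """Mean squared reconstruction error per row."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)

# Step 2: a point that breaks the learned relationship reconstructs poorly.
train_err = reconstruction_error(ae, X_normal)
x_odd = np.array([[2.0, -4.0, 1.0]])
odd_err = reconstruction_error(ae, x_odd)[0]

# Step 3: compare against an assumed cutoff taken from the training errors.
threshold = np.percentile(train_err, 99)
print(odd_err > threshold)
```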

Text-Based Diagram Explanations

Z-Score Distribution Diagram

|--------|--------|
-3σ      μ        +3σ

Points beyond ±3σ → outliers

IQR Boxplot Diagram

|----[ Q1 ]-------[ Median ]-------[ Q3 ]----|        ○ (outlier)

Points beyond the whiskers → outliers

DBSCAN Diagram

####### Dense Cluster #######
●●●●●●●●●●●●●

Noise (Outlier):
           ○

Real-World Case Studies

1. Banking Fraud Detection

A customer normally purchases:

$20, $40, $80

Suddenly:

$5000 purchase → Outlier → Fraud alert

2. Network Intrusion Detection

Normal:

50–100 requests/min

Attack:

10,000 requests/min → Anomaly

3. Healthcare

Normal heart rate:

60–100 bpm

Reading:

185 bpm → Outlier → emergency

Python Examples for Outlier Detection

1. Z-Score

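import numpy as np

data = np.array([10, 12, 11, 13, 50])

# ddof=1 gives the sample standard deviation (≈ 17.25), matching the worked example;
# NumPy's default (ddof=0) gives the population value (≈ 15.43).
z = (data - np.mean(data)) / np.std(data, ddof=1)
print(z)                    # Z-score of each value
print(data[np.abs(z) > 3])  # values flagged by the |Z| > 3 rule (none here)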

2. IQR

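import numpy as np

data = np.array([4, 5, 6, 7, 100])  # dataset from the IQR example

Q1 = np.percentile(data, 25)  # first quartile
Q3 = np.percentile(data, 75)  # third quartile
IQR = Q3 - Q1

# Keep values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = data[(data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)]
print(outliers)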

3. LOF

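import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Assumed example data: 100 clustered points plus one distant point
X = np.vstack([np.random.default_rng(0).normal(size=(100, 2)), [[8, 8]]])

lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X)  # -1 = outlier, 1 = inlier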

4. Isolation Forest

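import numpy as np
from sklearn.ensemble import IsolationForest

# Assumed example data: 100 clustered points plus one distant point
X = np.vstack([np.random.default_rng(0).normal(size=(100, 2)), [[8, 8]]])

model = IsolationForest(random_state=0)  # fixed seed for reproducible results
model.fit(X)
outliers = model.predict(X)  # -1 = outlier, 1 = inlier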

Common Mistakes in Outlier Detection

  • Removing all outliers blindly
  • Using Z-score for skewed data
  • Using one method for all datasets
  • Misinterpreting anomalies
  • Forgetting to scale data for distance-based methods
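
The scaling mistake is easy to demonstrate. The sketch below assumes two hypothetical features on very different scales (age in years, income in dollars) and compares Euclidean distances before and after scikit-learn's `StandardScaler`. The data and the `dist_to_first` helper are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age in years, income in dollars]; the last row has an unusual age.
X = np.array([[25, 40_000], [30, 42_000], [35, 41_000], [90, 41_500]], dtype=float)

def dist_to_first(A):
    """Euclidean distance from every row to the first row."""
    return np.linalg.norm(A - A[0], axis=1)

raw = dist_to_first(X)
scaled = dist_to_first(StandardScaler().fit_transform(X))

print(np.round(raw, 1))     # dominated by income differences
print(np.round(scaled, 2))  # age now contributes on an equal footing
```

Before scaling, the row with the largest income gap looks farthest away and the 90-year-old is overlooked; after scaling, that row becomes the most distant one.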

Summary

Lecture 7 explained Outlier & Anomaly Detection using statistical, distance-based, density-based, clustering-based, and machine learning methods. You learned about Z-score, IQR, LOF, Isolation Forest, deep learning approaches, diagrams, examples, and Python implementations. These methods are crucial for fraud detection, cybersecurity, healthcare, and IoT systems.

Next: Lecture 8 – Web Mining & Social Network Mining

People also ask:

What is an outlier?

A data point very different from others.

What is the difference between anomaly and outlier?

Outlier = unusual data point
Anomaly = unusual pattern or behavior

Which method is best for big data?

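Isolation Forest is usually a good starting point because it builds its trees on small random subsamples and scales well with data size. DBSCAN can also be used, but its running time can grow quadratically in the worst case.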

Is Z-score good for skewed data?

No. The Z-score assumes roughly symmetric, normally distributed data; use IQR or LOF instead.

Is anomaly detection used in banking?

Yes, especially for fraud detection.
