Lecture 5 – Classification: Decision Trees, Naive Bayes, KNN & Logistic Regression

Lecture 5 covers classification in Data Mining, with explanations of Decision Trees, Naive Bayes, KNN, and Logistic Regression. It includes formulas, diagrams, examples, code, real-world applications, and comparisons.

Classification is one of the most widely used techniques in Data Mining and Machine Learning. It is a supervised learning method where the goal is to learn patterns from labeled data and use those patterns to classify new, unseen data.

From email spam filtering to disease diagnosis, classification plays a critical role in decision-making and predictive analytics.

Introduction to Classification

What Is Classification?

Classification is the task of predicting the category or class label of an input based on past labeled data.

Examples:

  • Predicting whether an email is Spam or Not Spam
  • Predicting if a customer will Buy or Not Buy
  • Predicting whether a tumor is Benign or Malignant

Why Classification Is Important

  • Helps organizations make data-driven decisions
  • Automates repetitive decision tasks
  • Detects patterns that humans may overlook
  • Improves business intelligence and forecasting

Classification vs Prediction

Classification | Prediction
Categories (Spam/Not Spam) | Numeric values (Sales, Temperature)
Output is discrete | Output is continuous

Here "prediction" means numeric prediction, which is usually called regression.

Real-World Uses of Classification

1. Healthcare Diagnosis

Algorithms classify whether a patient has a disease based on symptoms and medical tests.

2. Spam Detection

Email service providers classify messages as spam or legitimate, for example with Naive Bayes applied to the text of the message.

3. Fraud Detection

Banks classify whether a transaction is legitimate or fraudulent.

4. Customer Churn Prediction

Telecom companies classify whether a customer will leave the service.

Components of a Classification Model

Feature Space

Input variables used to train the model.

Target Variable

The label we want to predict.

Training vs Testing

  • Training set teaches the algorithm
  • Testing set evaluates performance
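
The split itself can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name train_test_split_simple and the toy rows are invented for the example. In practice, scikit-learn's train_test_split does the same job.

```python
import random

def train_test_split_simple(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows, then hold out a fraction of them for testing."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (age, income, buys?)
data = [
    (22, "High", "No"), (45, "Low", "Yes"), (30, "High", "Yes"), (51, "Low", "Yes"),
    (27, "High", "No"), (38, "Low", "Yes"), (33, "High", "Yes"), (24, "Low", "No"),
]

train, test = train_test_split_simple(data, test_fraction=0.25, seed=42)
print(len(train), len(test))  # 6 2
```

Shuffling before splitting matters when the rows are stored in some order (for example, sorted by label), because otherwise the test set may contain only one class.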

Decision Trees (ID3, C4.5, CART)

Decision Trees are powerful and easy-to-understand classification models.

How Decision Trees Work

The tree splits data into branches using questions like:

Is age > 30?
Is income high?
Has previous purchase history?

Key Concept: Entropy

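Entropy measures the impurity of a set of examples S. It is 0 when every example belongs to the same class and highest when the classes are evenly mixed.

Formula:

Entropy(S) = - Σ p(i) log₂ p(i)

where p(i) is the proportion of examples in S that belong to class i.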
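
As a rough sketch, the formula translates directly into Python. The function name entropy and the label lists are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    # p(i) = count of class i / total number of examples
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["Yes", "Yes", "No", "No"]))             # 1.0 (evenly mixed)
print(round(entropy(["Yes", "Yes", "Yes", "No"]), 3))  # 0.811 (mostly one class)
```

A set containing a single class has entropy 0, which is the state a decision tree tries to reach in its leaves.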

Information Gain

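Measures how much entropy is reduced after a split. ID3 chooses the attribute with the highest information gain at each node.

Gain(S, A) = Entropy(S) – Σ (|Sᵥ| / |S|) × Entropy(Sᵥ)

where Sᵥ is the subset of S in which attribute A takes the value v, so each child's entropy is weighted by its share of the examples.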
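
The calculation can be sketched as follows, assuming the split has already been made and each child is given as a list of labels. The names information_gain, perfect_split, and useless_split are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["Yes", "Yes", "No", "No"]
perfect_split = [["Yes", "Yes"], ["No", "No"]]  # each child is pure
useless_split = [["Yes", "No"], ["Yes", "No"]]  # children are as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree builder would compute this value for every candidate attribute and split on the one with the largest gain.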

Gini Index

Used by the CART algorithm. Like entropy, it is 0 for a pure node and grows as the classes become more mixed.

Gini = 1 – Σ p(i)²
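
A short sketch of the same formula in Python, with a hypothetical helper named gini:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "Yes", "No", "No"]))  # 0.5 (maximum for two classes)
print(gini(["Yes", "Yes", "Yes"]))       # 0.0 (pure node)
```

Gini avoids the logarithm, so it is slightly cheaper to compute than entropy, and the two measures usually rank candidate splits similarly.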

Example of Tree Construction

Dataset:

Age | Income | Buy?
22     High      No
45     Low       Yes
30     High      Yes

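Tree:

Income alone does not separate the classes here: the two High-income customers (ages 22 and 30) have different labels, so that branch needs a second split on Age. The threshold of 26 is one possible choice (the midpoint between 22 and 30).

            Income?
           /       \
       High         Low
        |            |
    Age ≤ 26?     Buy = Yes
     /     \
   Yes      No
    |        |
 Buy = No  Buy = Yes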

Naive Bayes Classifier

Naive Bayes is a probabilistic classifier built on Bayes' theorem.

Bayes' Theorem

P(A|B) = [P(B|A) * P(A)] / P(B)
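
A worked example may help. The probabilities below are invented for illustration: A is "the email is spam" and B is "the email contains the word free".

```python
# Hypothetical numbers, chosen only to illustrate the formula
p_spam = 0.2                  # P(A): prior probability of spam
p_free_given_spam = 0.6       # P(B|A)
p_free_given_not_spam = 0.05  # P(B|not A)

# P(B) via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))  # 0.75
```

Seeing the word raises the probability of spam from the 20% prior to 75%, because the word is far more common in spam than in legitimate mail under these assumed numbers.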

Why “Naive”?

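Because it assumes all features are independent of one another given the class. This rarely holds exactly in real data, but it makes the probabilities simple to estimate.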

Text Classification Example

Email:

"Win a free iPhone now!"

Naive Bayes calculates:

  • P(Spam | Words)
  • P(Not Spam | Words)

The email is assigned to whichever class has the higher probability.
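
The comparison can be sketched from scratch as a small word-count Naive Bayes. Everything here is an assumption made for illustration: the four training messages are invented, add-one (Laplace) smoothing is used so unseen words do not zero out a class, and log probabilities are summed instead of multiplying raw probabilities.

```python
import re
from collections import Counter
from math import log

# Hypothetical training messages, invented for this example
train = [
    ("win a free prize now", "Spam"),
    ("free iphone offer win now", "Spam"),
    ("meeting at noon tomorrow", "Not Spam"),
    ("lunch tomorrow with the team", "Not Spam"),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def fit(examples):
    """Count documents per class and words per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = log(n_docs / total_docs)  # log prior, log P(class)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # add-one smoothed estimate of P(word | class)
                score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)  # class with the higher score

model = fit(train)
print(predict("Win a free iPhone now!", *model))    # Spam
print(predict("Meeting tomorrow at noon", *model))  # Not Spam
```

The scores are proportional to P(class | words) on a log scale, so comparing them gives the same decision as comparing the probabilities themselves.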

K-Nearest Neighbors (KNN)

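KNN is a simple, instance-based algorithm: it stores the training data and classifies a new point by looking at the labeled points closest to it.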

How KNN Works

  1. Choose K
  2. Calculate the distance between the new point and every training point
  3. Select the K closest points
  4. Assign the majority class among those K neighbors
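
The four steps can be sketched in plain Python. The function name knn_predict and the 2-D points are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_predict(train_points, train_labels, query, k=3):
    # Step 2: distance from the query to every training point
    distances = [(dist(p, query), label) for p, label in zip(train_points, train_labels)]
    # Step 3: keep the k closest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: majority vote among their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two small clusters
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2, 2), k=3))  # A
print(knn_predict(points, labels, (6, 5), k=3))  # B
```

With two classes, an odd K avoids tied votes. There is no training step: all of the work happens at prediction time, which is why KNN slows down as the training set grows.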

Distance Metrics

  • Euclidean
  • Manhattan
  • Minkowski
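
The three metrics are closely related: Minkowski distance with parameter p reduces to Manhattan distance at p = 1 and Euclidean distance at p = 2. A minimal sketch, with function names chosen for this example:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)  # sum of absolute differences

def euclidean(a, b):
    return minkowski(a, b, 2)  # straight-line distance

a, b = (0, 0), (3, 4)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
print(minkowski(a, b, 3))
```

Larger values of p give more weight to the single largest coordinate difference.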

Choosing K

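  • Small K → Overfitting (the prediction follows noise in individual points)
  • Large K → Underfitting (the prediction drifts toward the overall majority class)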

Logistic Regression

Despite its name, logistic regression is used for classification, not regression.

Sigmoid Function

Converts the model's linear output (a weighted sum of the features) into a value between 0 and 1 that is read as a probability:

σ(x) = 1 / (1 + e⁻ˣ)
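
A small sketch of the function in Python. The two-branch form is an implementation choice made here so that large negative inputs do not overflow; mathematically both branches equal the formula above.

```python
from math import exp

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1 / (1 + exp(-x))
    z = exp(x)  # for negative x, use the equivalent form e^x / (1 + e^x)
    return z / (1 + z)

print(sigmoid(0))             # 0.5
print(round(sigmoid(4), 3))   # 0.982
print(round(sigmoid(-4), 3))  # 0.018
```

The output is exactly 0.5 at x = 0 and approaches 1 or 0 as x becomes large and positive or large and negative.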

Decision Boundary

If probability ≥ 0.5 → Class 1
Else → Class 0
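
Put together, prediction is a weighted sum, a sigmoid, and a threshold. The weights and bias below are hypothetical values picked for illustration, not coefficients fitted to any data, and the features are assumed to be age and salary in thousands.

```python
from math import exp

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for (age, salary in thousands)
weights = [0.05, 0.04]
bias = -4.0

print(predict_class((45, 60), weights, bias))  # 1
print(predict_class((22, 20), weights, bias))  # 0
```

Because the sigmoid equals 0.5 exactly when its input is 0, the rule "probability ≥ 0.5" is the same as "weighted sum ≥ 0", which is why the decision boundary of logistic regression is a straight line (or hyperplane) in feature space.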

Real-World Example

Predict whether a customer will purchase a product based on:

  • Age
  • Salary
  • Browsing activity

Algorithm Comparison Table

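Algorithm | Type | Strengths | Weaknesses
Decision Tree | Rule-based | Easy to understand | Prone to overfitting
Naive Bayes | Probabilistic | Fast, handles text well | Assumes feature independence
KNN | Distance-based | Simple, no training phase | Slow prediction on large datasets
Logistic Regression | Statistical | Outputs well-defined probabilities | Linear decision boundary; misses non-linear patterns unless features are transformed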

Python Coding Examples

Decision Tree

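from sklearn.tree import DecisionTreeClassifier

# X_train, y_train: training features and labels (e.g. from train_test_split)
clf = DecisionTreeClassifier()  # uses the Gini index by default
clf.fit(X_train, y_train)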

Naive Bayes

from sklearn.naive_bayes import GaussianNB

# GaussianNB suits continuous features; MultinomialNB is the usual choice for word counts
model = GaussianNB()
model.fit(X_train, y_train)

KNN

from sklearn.neighbors import KNeighborsClassifier

# Scale the features first: KNN is distance-based
knn = KNeighborsClassifier(n_neighbors=5)  # K = 5, Euclidean distance by default
knn.fit(X_train, y_train)

Logistic Regression

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# predict_proba() returns class probabilities; predict() returns the predicted labels

Common Issues in Classification

  • Imbalanced datasets
  • Multicollinearity (highly correlated features), which makes logistic regression coefficients unstable
  • Overfitting decision trees
  • Choosing a poor value of K
  • Unscaled features for KNN
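
The scaling problem is easy to demonstrate. In the sketch below, the (age, salary) rows are hypothetical and min_max_scale is a hand-written helper. scikit-learn's MinMaxScaler and StandardScaler are the usual tools for this in practice.

```python
from math import dist

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest_index(rows, i):
    """Index of the row closest to rows[i], excluding itself."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: dist(rows[i], rows[j]))

# Hypothetical customers: (age, salary)
rows = [(25, 50_000), (60, 50_500), (26, 51_000), (40, 58_000)]
scaled = min_max_scale(rows)

print(nearest_index(rows, 0))    # 1: salary dominates, age is effectively ignored
print(nearest_index(scaled, 0))  # 2: the customer of similar age and salary
```

On the raw data the 25-year-old's nearest neighbor is the 60-year-old, simply because salary is measured in much larger numbers than age. After scaling, both features contribute comparably to the distance.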

Summary

Lecture 5 explored classification in depth, covering decision trees, Naive Bayes, KNN, and logistic regression. You learned how each algorithm works, where it is used, and how to calculate entropy, information gain, class probabilities, distances, and the logistic (sigmoid) function. Real-world examples and Python code made the concepts practical and applicable.

Next: Lecture 6 – Clustering: K-Means, K-Medoids, Hierarchical & DBSCAN

People also ask:

What is classification in Data Mining?

It is predicting a category based on labeled training data.

Which classification algorithm is easiest for beginners?

Decision Trees.

Why is Naive Bayes called “naive”?

Because it assumes feature independence.

What is the role of K in KNN?

K determines how many neighbors influence classification.

Why is logistic regression used for classification?

Because it outputs class probabilities using the sigmoid function, which are then compared with a threshold (typically 0.5) to assign a class.
