Lecture 17 – Final Exam Bank for Data Mining (Massive Question Set)

This Lecture 17 final question bank for Data Mining contains 100 solved MCQs, 30 short questions, and 15 long questions, all syllabus-aligned and focused on exam preparation.

SECTION A – MCQs (WITH ANSWERS)

  1. Data Mining is a step within which larger process?
    a) Data Warehousing
    b) KDD
    c) ETL
    d) OLAP
  2. The main goal of Data Mining is to:
    a) Store data
    b) Remove data
    c) Discover patterns
    d) Replicate data
  3. The KDD process begins with:
    a) Evaluation
    b) Data Cleaning
    c) Deployment
    d) Pattern recognition
  4. Structured data is stored in:
    a) Images
    b) Text files
    c) Databases
    d) Audio signals
  5. Semi-structured data example:
    a) JSON
    b) Tables
    c) Excel sheet
    d) Image
  6. Unstructured data DOES NOT include:
    a) Images
    b) Text
    c) Videos
    d) SQL tables
  7. The main challenge in KDD is:
    a) Too few tools
    b) Massive noisy data
    c) No storage
    d) Too many algorithms
  8. Data redundancy means:
    a) Missing data
    b) Repeated data
    c) Noisy data
    d) Scattered data
  9. An example of a data repository is:
    a) LIDAR
    b) Data Warehouse
    c) Drone GPS
    d) RAM
  10. Metadata describes:
    a) Data about data
    b) Images
    c) SQL queries
    d) Dashboards
  11. KDD stands for:
    a) Knowledge Discovery in Databases
    b) Knowledge Data Development
    c) Key Data Discovery
    d) Knowledge Driven Data
  12. Data integration merges data from:
    a) One source
    b) Multiple sources
    c) Only sensors
    d) Only databases
  13. OLAP is used for:
    a) Transactions
    b) Analytics
    c) Networking
    d) File transfer
  14. An example of categorical data:
    a) Age
    b) Height
    c) Gender
    d) Salary
  15. Quantitative data means:
    a) Non-numeric
    b) Numeric
    c) Images
    d) Text
  16. ETL stands for:
    a) Extract–Translate–Load
    b) Extract–Transform–Load
    c) Evaluate–Transfer–Load
    d) Extract–Transfer–Link
  17. Which is NOT a stage of KDD?
    a) Data selection
    b) Data cleaning
    c) Data loading
    d) Pattern discovery
  18. A row in a dataset is called a:
    a) Feature
    b) Attribute
    c) Record
    d) Class
  19. A column in a dataset is called a:
    a) Sample
    b) Tuple
    c) Feature
    d) Instance
  20. Data preprocessing is required because:
    a) Data is always perfect
    b) Data has issues
    c) Algorithms need no cleaning
    d) Storage is cheap
  21. Removing noise from data is called:
    a) Integration
    b) Cleaning
    c) Normalization
    d) Reduction
  22. Normalization is used to:
    a) Increase noise
    b) Scale data
    c) Add missing values
    d) Delete rows
  23. Min-Max scaling converts values to:
    a) −∞ to +∞
    b) 0 to 1
    c) 1 to 100
    d) −1 to +1
  24. Z-score normalization uses:
    a) Mean & Std
    b) Max only
    c) Min only
    d) Mode
  25. Outliers are:
    a) Normal data
    b) Extreme values
    c) Mean values
    d) Balanced data
  26. Missing values can be handled using:
    a) Deletion
    b) Imputation
    c) Both
    d) None
  27. PCA is used for:
    a) Clustering
    b) Dimensionality reduction
    c) Classification
    d) Association
  28. Binning is a:
    a) Smoothing technique
    b) Clustering technique
    c) Classification technique
    d) Prediction technique
  29. Log transform reduces:
    a) Variance
    b) Bias
    c) Noise only
    d) Mean
  30. Data reduction techniques include:
    a) Scaling
    b) Aggregation
    c) Standardization
    d) Cleaning
  31. Normalization improves:
    a) Model speed
    b) Data storage
    c) User interface
    d) None
  32. A common type of data cleaning:
    a) Noise removal
    b) Adding noise
    c) Breaking data
    d) Random deletion
  33. Smoothing is used for:
    a) Adding spikes
    b) Removing irregularities
    c) Adding labels
    d) Storing data
  34. Data transformation includes:
    a) Normalization
    b) Smoothing
    c) Discretization
    d) All options
  35. Attribute construction means:
    a) Creating new features
    b) Removing features
    c) Merging datasets
    d) Scaling data
  36. Missing values are common in:
    a) Clean datasets
    b) Real-world data
    c) Random samples
    d) Sensor errors only
  37. Sampling is used for:
    a) Making data larger
    b) Speeding up processing
    c) Removing features
    d) Cleaning labels
  38. Which is NOT a preprocessing step?
    a) Cleaning
    b) Integration
    c) Transformation
    d) Deployment
  39. Feature extraction is needed because:
    a) Models need fewer features
    b) More features always slow performance
    c) Both
    d) None
  40. KNN requires:
    a) Categorical scaling
    b) Distance-based scaling
    c) No scaling
    d) Random features
  41. Association rule mining finds:
    a) Sequential patterns
    b) Item relationships
    c) Clusters
    d) Classes
  42. Support measures:
    a) Popularity
    b) Confidence
    c) Errors
    d) Accuracy
  43. Confidence measures:
    a) Strength of implication
    b) Frequency only
    c) Distance
    d) Noise
  44. Lift > 1 means:
    a) Strong positive association
    b) Weak rule
    c) Negative correlation
    d) Independent variables
  45. Apriori uses:
    a) Top-down pruning
    b) Bottom-up candidate generation
    c) Random selection
    d) None
  46. Apriori property:
    a) All subsets of frequent itemsets must be frequent
    b) Subsets can be infrequent
    c) Items must be numeric
    d) Requires clustering
  47. FP-growth avoids:
    a) Candidate generation
    b) Tree building
    c) Support counting
    d) Item sorting
  48. Frequent pattern tree (FP-tree) stores:
    a) Transactions
    b) Patterns
    c) Noise
    d) None
  49. Association rules are used in:
    a) Market basket analysis
    b) PCA
    c) Regression
    d) SVM
  50. If Support = 0.3, it means:
    a) 30% of transactions contain itemset
    b) 30 items
    c) 3 items
    d) None
  51. Association rules follow:
    a) X → Y
    b) Y → Y
    c) X → X
    d) None
  52. Confidence formula:
    a) support(X ∪ Y) / support(X)
    b) support(X) / support(Y)
    c) support(Y) / support(X)
    d) length(X)/length(Y)
  53. Apriori suffers from:
    a) High time cost
    b) Low memory
    c) No correctness
    d) No scalability
  54. FP-Growth is faster because:
    a) Compresses dataset
    b) Ignores patterns
    c) Removes items
    d) Stores everything
  55. An example of association:
    a) If bread then butter
    b) If cat then dog
    c) If image then CNN
    d) If data then noise
  56. Market basket analysis is used by:
    a) Banks
    b) Retailers
    c) Hospitals
    d) All
  57. Confidence close to 0 means:
    a) Weak rule
    b) Strong rule
    c) No support
    d) High lift
  58. For strong rules, lift must be:
    a) > 1
    b) < 1
    c) = 1
    d) = 0
  59. Conviction measures:
    a) Correlation
    b) Surprise
    c) Support
    d) Frequency
  60. FP-growth uses:
    a) Divide and conquer
    b) Convolution
    c) Linear regression
    d) Random forests
  61. Classification is used for:
    a) Predicting a category
    b) Finding patterns
    c) Clustering data
    d) None
  62. Regression predicts:
    a) Numeric values
    b) Labels
    c) Clusters
    d) Distances
  63. The simplest classifier is:
    a) Decision tree
    b) Naive Bayes
    c) KNN
    d) SVM
  64. KNN relies on:
    a) Nearest neighbors
    b) Decision rules
    c) Probability
    d) Hyperplanes
  65. SVM finds:
    a) Maximum margin hyperplane
    b) Minimum hyperplane
    c) No separation
    d) Random splits
  66. Decision tree splits nodes using:
    a) Entropy
    b) Gini Index
    c) Information gain
    d) All
  67. Random forest is an ensemble of:
    a) Many decision trees
    b) SVMs
    c) Neural networks
    d) PCA models
  68. Clustering groups data by:
    a) Similarity
    b) Labels
    c) Rules
    d) Trees
  69. KMeans minimizes:
    a) Squared distances
    b) Lift
    c) Confidence
    d) Support
  70. KMeans requires:
    a) K clusters
    b) Labels
    c) Prediction
    d) Deep learning
  71. Hierarchical clustering builds:
    a) Dendrogram
    b) FP-tree
    c) CNN
    d) ARIMA
  72. DBSCAN identifies:
    a) Density clusters
    b) Hyperplanes
    c) Neural activations
    d) Trees
  73. Outlier detection in clustering uses:
    a) LOF
    b) SVM
    c) CNN
    d) PCA
  74. Prediction is:
    a) Future estimation
    b) Pattern discovery
    c) Noise detection
    d) Deletion
  75. ROC curve evaluates:
    a) Classification
    b) Clustering
    c) Association
    d) FP-tree
  76. A confusion matrix shows:
    a) True/false predictions
    b) Clusters
    c) Patterns
    d) Rules
  77. Overfitting occurs when model:
    a) Learns noise
    b) Performs well
    c) Ignores data
    d) Removes patterns
  78. Underfitting means:
    a) Too simple
    b) Too complex
    c) Perfect
    d) No training
  79. Ensemble learning improves:
    a) Accuracy
    b) Noise
    c) Class imbalance
    d) None
  80. Boosting reduces:
    a) Bias
    b) Variance
    c) Data
    d) Features
  81. Deep learning learns:
    a) Manual features
    b) Automatic features
    c) No features
    d) Random values
  82. CNNs are best for:
    a) Images
    b) Audio
    c) SQL queries
    d) Random data
  83. RNNs handle:
    a) Sequential data
    b) Images
    c) Noise
    d) Random points
  84. LSTM solves:
    a) Vanishing gradients
    b) Regression
    c) Clustering
    d) PCA
  85. Autoencoders perform:
    a) Dimensionality reduction
    b) Clustering
    c) Classification
    d) Regression
  86. Big Data is defined by:
    a) 5 Vs
    b) 2 Vs
    c) 1 V
    d) None
  87. Hadoop stores data using:
    a) HDFS
    b) CNN
    c) SQL only
    d) Neural memory
  88. Spark is faster than Hadoop because of:
    a) In-memory processing
    b) Disk-based mapping
    c) Random access
    d) No cluster
  89. MLlib is part of:
    a) Spark
    b) Hadoop
    c) SQL
    d) Robotics
  90. MapReduce has:
    a) Map + Reduce phases
    b) CNN layers
    c) RNN loops
    d) SQL joins
  91. Robotics uses data mining for:
    a) Perception
    b) Navigation
    c) Motion
    d) All
  92. SLAM stands for:
    a) Simultaneous Localization and Mapping
    b) Single Location Map
    c) Sensor Location Analysis Model
    d) Synthetic Learning Algorithm Model
  93. LIDAR is used for:
    a) 3D mapping
    b) Audio
    c) SQL queries
    d) Classification
  94. Drones use data mining for:
    a) Obstacle avoidance
    b) Route optimization
    c) Detection
    d) All
  95. Edge computing helps robots by:
    a) Reducing latency
    b) Increasing delay
    c) Storing old data
    d) Removing sensors
  96. Reinforcement Learning helps robots:
    a) Learn actions
    b) Learn images
    c) Process SQL
    d) Track metadata
  97. Industry 4.0 uses robotics for:
    a) Smart manufacturing
    b) Random work
    c) Storing files
    d) Emails
  98. Predictive maintenance uses:
    a) Sensor mining
    b) Email mining
    c) Web mining
    d) SQL joins
  99. Automation increases:
    a) Efficiency
    b) Cost only
    c) Errors
    d) Waste
  100. Intelligent robots combine:
    a) AI + Sensors + Data Mining
    b) Only sensors
    c) Only motors
    d) Only software

ANSWER KEY (MCQs)

1–20: B, C, B, C, A, D, B, B, B, A, A, B, B, C, B, B, C, C, C, B
21–40: B, B, B, A, B, C, B, A, A, B, A, A, B, D, A, B, B, D, C, B
41–60: B, A, A, A, B, A, A, B, A, A, A, A, A, A, A, D, A, A, B, A
61–80: A, A, B, A, A, D, A, A, A, A, A, A, A, A, A, A, A, A, A, A
81–100: B, A, A, A, A, A, A, A, A, A, D, A, A, D, A, A, A, A, A, A


SECTION B – 30 Short Questions (WITH ANSWERS)

1. Define Data Mining in one sentence.

Data Mining is the process of discovering meaningful patterns, trends, and relationships from large datasets.

2. What is the difference between KDD and Data Mining?

KDD is the entire process of extracting knowledge from data, while Data Mining is the specific step where algorithms discover patterns.

3. List three challenges of real-world data.

Real-world data is often noisy, incomplete, and inconsistent.

4. What is data cleaning?

Data cleaning is the process of correcting errors, removing noise, and handling missing or inconsistent values.

5. Define normalization.

Normalization is the process of scaling numeric values into a standardized range to improve model performance.

6. What is Min–Max scaling?

Min–Max scaling transforms values to a fixed range, typically 0 to 1, using the formula: $X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$

7. Give an example of categorical and numerical data.

Categorical: Gender (Male/Female)
Numerical: Age (25, 30, 45)

8. What is the purpose of association rule mining?

To discover relationships and co-occurrence patterns among items in transactional datasets.

9. Define support and confidence.

Support: Frequency of an itemset in the dataset.
Confidence: Strength of implication in a rule, i.e., how often Y appears when X appears.
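A quick Python illustration (a minimal sketch over a hypothetical four-transaction basket; item names are made up):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
]
def support(itemset, transactions):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)
X, Y = {"bread"}, {"butter"}
print(support(X | Y, transactions))                             # support(X ∪ Y) = 0.5
print(support(X | Y, transactions) / support(X, transactions))  # confidence(X → Y) ≈ 0.67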

10. What is Apriori property?

All subsets of a frequent itemset must also be frequent.

11. What is FP-growth?

FP-growth is a fast algorithm that mines frequent itemsets using a compact FP-tree without generating candidates.

12. Define classification.

Classification is the process of predicting categories or class labels based on input features.

13. What is the difference between classification and prediction?

Classification predicts categorical labels, while prediction (regression) predicts numeric values.

14. What is KNN used for?

KNN is used for classification or regression by comparing new samples to their closest neighbors.

15. Explain entropy in decision trees.

Entropy measures impurity or disorder in data; higher entropy means more mixed classes.

16. What is overfitting?

Overfitting occurs when a model learns noise instead of patterns and performs poorly on new data.

17. What is the purpose of clustering?

Clustering groups similar data points together without using predefined labels.

18. Define KMeans clustering.

KMeans partitions data into k clusters by assigning points to the nearest centroid and updating centroids iteratively.

19. What is hierarchical clustering?

A clustering technique that builds a tree-like structure (dendrogram) through merging or splitting clusters.

20. What is an outlier?

An outlier is a data point that significantly deviates from normal patterns.

21. Define deep learning.

Deep learning is a subset of machine learning that uses multi-layer neural networks to learn complex features automatically.

22. What is a convolutional layer?

A convolutional layer uses filters to extract spatial features from images or grid-like data.

23. What is an autoencoder?

An autoencoder is a neural network that compresses data into a lower dimension and reconstructs it to learn representations.

24. Define Big Data.

Big Data refers to extremely large and complex datasets that cannot be processed by traditional systems.

25. What is HDFS?

HDFS (Hadoop Distributed File System) is a distributed storage system that stores large data files across multiple machines.

26. What is Spark MLlib?

Spark MLlib is Apache Spark’s distributed machine-learning library for classification, clustering, regression, and feature extraction.

27. Define SLAM.

SLAM (Simultaneous Localization and Mapping) enables robots to build a map of their environment while tracking their own location.

28. What is sensor fusion?

Sensor fusion combines data from multiple sensors to improve accuracy, robustness, and perception.

29. What is reinforcement learning?

Reinforcement learning is a learning method where an agent learns optimal actions through rewards and penalties.

30. Define predictive maintenance.

Predictive maintenance uses sensor data and analytics to predict equipment failures before they occur.


SECTION C – 15 Long Questions (WITH ANSWERS)

1. Explain the full KDD process with a clean diagram.

Answer:
KDD (Knowledge Discovery in Databases) is the complete end-to-end process of transforming raw data into useful knowledge. It contains multiple stages designed to clean, prepare, transform, mine, and evaluate patterns.

Steps of KDD:
  1. Data Selection
    • Identify relevant data sources.
    • Extract the subset needed for analysis.
  2. Data Cleaning
    • Handle missing values
    • Remove noise
    • Fix inconsistencies
    • Correct errors
  3. Data Integration
    • Merge databases, files, IoT sensors, logs, or external datasets
    • Resolve schema conflicts and redundancy
  4. Data Transformation
    • Normalize data
    • Reduce dimensionality
    • Convert data into forms suitable for mining (scaling, aggregation)
  5. Data Mining
    • Apply algorithms such as classification, clustering, association rules, pattern extraction, etc.
  6. Pattern Evaluation
    • Evaluate interestingness
    • Filter valuable patterns
    • Remove irrelevant results
  7. Knowledge Presentation
    • Use visualization, dashboards, reports, and charts to deliver insights.
Diagram (Text-Based):
Data Selection → Cleaning → Integration → Transformation → Data Mining → Evaluation → Knowledge
2. Describe all data preprocessing steps with examples.

Answer:
Data preprocessing improves data quality and prepares it for mining; a short code sketch follows the steps below.

Major Steps:
  1. Data Cleaning
    • Fix missing values (mean, median, interpolation)
    • Remove noise using smoothing
    • Correct inconsistent entries
    • Example: Replace null age values with the median age.
  2. Data Integration
    • Combine multiple sources
    • Resolve duplicates
    • Merge tables (SQL JOIN)
    • Example: Merge customer database with purchase logs.
  3. Data Transformation
    • Normalization (Min–Max scaling, Z-score)
    • Aggregation (daily → weekly sales)
    • Discretization (convert age → age groups)
  4. Data Reduction
    • Dimensionality reduction (PCA)
    • Sampling (random sample to reduce size)
    • Data compression
  5. Data Discretization
    • Convert numeric data to categorical bins
    • Example: Convert income into Low, Medium, High.
  6. Feature Engineering
    • Construct new features from existing ones
    • Example: BMI = weight / height².
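A minimal sketch of several of these steps with pandas and scikit-learn (the table and column names are hypothetical):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({
    "age":    [25, None, 45, 30],
    "income": [30000, 52000, None, 41000],
})
# Cleaning: impute missing values with the column median
df = df.fillna(df.median(numeric_only=True))
# Transformation: Min-Max scale numeric columns into [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
# Discretization: bin scaled income into Low / Medium / High
df["income_band"] = pd.cut(df["income"], bins=3, labels=["Low", "Medium", "High"])
print(df)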
3. Explain normalization techniques (Min–Max, Z-score, Decimal scaling).

Answer:

1. Min–Max Normalization

Scales values between 0 and 1:

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

Use Case: Algorithms sensitive to distances (KNN, SVM).

2. Z-Score Normalization (Standardization)

Uses the mean and standard deviation:

$x' = \frac{x - \mu}{\sigma}$

Use Case: When data has outliers or unknown ranges.

3. Decimal Scaling

Divides values by a power of 10:

$x' = \frac{x}{10^j}$

Use Case: When values are large (like population data).
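The three techniques side by side in NumPy (a sketch; the sample values are arbitrary):

import numpy as np
x = np.array([120.0, 250.0, 480.0, 1000.0])
# Min-Max normalization -> values in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())
# Z-score standardization -> mean 0, standard deviation 1
zscore = (x - x.mean()) / x.std()
# Decimal scaling: divide by 10^j so that max(|x'|) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal = x / 10**j
print(minmax, zscore, decimal, sep="\n")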

4. Describe the working of the Apriori algorithm with an example.

Answer:
Apriori is a bottom-up algorithm that finds frequent itemsets in iterative, level-wise passes; a minimal Python sketch follows the worked example below.

Working Steps:
  1. Count support of each item → keep frequent items.
  2. Generate candidate pairs → count support → prune infrequent.
  3. Generate 3-itemsets from frequent pairs → repeat.
  4. Continue until no more frequent sets exist.
Example Dataset (Transactions):

TID | Items
1   | A, B, C
2   | A, C
3   | B, C
4   | A, B

If minimum support = 50%:
  • Frequent 1-itemsets: A, B, C (each appears in 3/4 transactions)
  • Frequent 2-itemsets: AB, AC, BC (each appears in 2/4 transactions → support = 50%)
  • The 3-itemset ABC appears in only 1/4 transactions (support = 25%), so it is pruned as infrequent

Apriori outputs patterns like:

  • A → C
  • B → C
  • A → B
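A minimal, unoptimized Python sketch of the level-wise procedure above, reproducing the example's result:

from itertools import combinations
transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
min_support = 0.5
def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    k, current, result = 1, [frozenset([i]) for i in items], {}
    while current:
        # count support of this level's candidates and keep the frequent ones
        level = {c: sum(c <= t for t in transactions) / len(transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        result.update(level)
        # join frequent k-itemsets into (k+1)-item candidates for the next pass
        k += 1
        current = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    return result
for itemset, s in frequent_itemsets(transactions, min_support).items():
    print(sorted(itemset), s)   # A, B, C, AB, AC, BC survive; ABC is generated but pruned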
5. Compare Apriori vs FP-Growth.
Feature         | Apriori                      | FP-Growth
Method          | Generates candidate itemsets | No candidate generation
Speed           | Slow                         | Much faster
Memory          | High                         | Low
Data structure  | Candidate lists / hash trees | FP-tree
Database scans  | Many                         | Two
Best for        | Small datasets               | Large datasets

Conclusion: FP-Growth is more efficient and scalable than Apriori, and typically much faster.
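If the mlxtend library is available, both algorithms can be run on the same one-hot encoded transactions and should return identical frequent itemsets (a sketch, assuming mlxtend is installed):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth
transactions = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["A", "B"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(apriori(df, min_support=0.5, use_colnames=True))   # candidate-generation approach
print(fpgrowth(df, min_support=0.5, use_colnames=True))  # FP-tree approach, same output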

6. Explain classification vs prediction with real-world examples.

Classification predicts categorical labels.

Examples:

  • Spam or Not Spam
  • Disease Yes/No
  • Loan Approved/Rejected

Prediction (Regression) predicts numeric values; a small code contrast follows the examples.

Examples:

  • Predict house price
  • Predict temperature
  • Predict sales for next month
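The contrast in scikit-learn terms: the same features, but a categorical target versus a numeric one (a sketch with made-up house data):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
X = [[1200], [1500], [2000], [2400]]                    # house size (sq ft)
y_label = ["cheap", "cheap", "expensive", "expensive"]  # classification target
y_price = [100_000, 140_000, 210_000, 260_000]          # regression target
clf = DecisionTreeClassifier().fit(X, y_label)
reg = DecisionTreeRegressor().fit(X, y_price)
print(clf.predict([[1900]]))  # a category, e.g. ['expensive']
print(reg.predict([[1900]]))  # a number, e.g. [210000.]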
7. Describe Decision Trees, Gini Index, and Entropy with formulas.

Decision Trees:
A supervised model that splits data based on attribute importance to create a tree structure.

Entropy (Impurity Measure)

$\text{Entropy}(S) = -p_1\log_2(p_1) - p_2\log_2(p_2)$

Used in ID3 and C4.5.


Information Gain

$IG = \text{Entropy}(parent) - \sum \frac{|child|}{|parent|} \cdot \text{Entropy}(child)$

Gini Index

Used in CART:

$Gini = 1 - \sum p_i^2$

Lower Gini = purer node.
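Both impurity measures in a few lines of NumPy (p is a node's class-probability vector):

import numpy as np
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # skip zero-probability classes to avoid log2(0)
    return -(p * np.log2(p)).sum()
def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5 -> maximally impure two-class node
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # 0.0 0.0 -> pure node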

8. Explain KMeans clustering with step-by-step working.

Steps:

  1. Choose K cluster centers (random).
  2. Assign each point to nearest centroid.
  3. Recalculate centroids.
  4. Reassign points.
  5. Repeat until centroids stabilize.
Example:

For K=2, the algorithm splits the data into two groups based on minimum distance to the two centroids; a scikit-learn sketch follows.
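A minimal scikit-learn example on toy 2-D points (coordinates are arbitrary):

import numpy as np
from sklearn.cluster import KMeans
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # e.g. [1 1 0 0] -- two groups by distance
print(km.cluster_centers_)  # the two final centroids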

9. Describe DBSCAN and its advantages for anomaly detection.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Groups points based on density; a short scikit-learn sketch follows the advantages list.

Key Terms:
  • ε (eps): neighborhood radius
  • MinPts: minimum number of points required within an ε-neighborhood
  • Core Point: has at least MinPts points within ε
  • Border Point: lies within ε of a core point but is not itself core
  • Noise Point: neither core nor border; treated as an outlier
Advantages:
  • Detects arbitrary shaped clusters
  • Automatically detects outliers
  • Works well with noisy data
  • No need to pre-specify number of clusters
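A sketch with scikit-learn, showing how low-density points are flagged as noise (label −1); the coordinates are arbitrary:

import numpy as np
from sklearn.cluster import DBSCAN
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.0],   # dense region -> cluster 0
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.0],   # dense region -> cluster 1
              [25.0, 30.0]])                        # isolated point -> noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks the detected outlier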
10. Explain deep learning architecture: input layer, hidden layers, activation functions.
Input Layer:

Receives raw data (pixels, words, numbers).

Hidden Layers:

Perform transformations:

  • Convolution
  • Recurrent loops
  • Feature extraction
  • Nonlinear modeling
Activation Functions:

Allow networks to learn nonlinear patterns.

Examples:

  • ReLU
  • Sigmoid
  • Softmax
  • Tanh

Deep networks stack multiple layers to learn high-level features.
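A bare-bones forward pass in NumPy tying the three ideas together (random weights, for illustration only):

import numpy as np
rng = np.random.default_rng(0)
x = rng.random(4)                              # input layer: 4 raw features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hidden layer parameters
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # output layer parameters
h = np.maximum(0, W1 @ x + b1)                 # hidden layer with ReLU activation
z = W2 @ h + b2                                # output logits
probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax activation
print(probs)                                   # three class probabilities summing to 1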

11. Compare CNN, RNN, LSTM, and Autoencoders with use cases.
Model       | Description               | Best Use Case
CNN         | Extracts spatial features | Images, vision
RNN         | Sequential modeling       | Text, speech
LSTM        | Long-term memory          | Time-series, translation
Autoencoder | Compress & reconstruct    | Anomaly detection, compression
12. Describe Hadoop architecture: HDFS, NameNode, DataNode, replication.

HDFS: Distributed file system storing massive datasets.

NameNode:
  • Master server
  • Stores file metadata
  • Tracks file locations
DataNode:
  • Stores actual blocks
  • Handles read/write operations
Replication:

Each block stored across multiple nodes (default replication factor = 3) for fault tolerance.

13. Explain Apache Spark architecture and MLlib workflow.

Spark Architecture:

  • Driver Program: Controls the application
  • Cluster Manager: Allocates resources
  • Executors: Run tasks on worker nodes
  • RDD/DataFrames: Distributed datasets
MLlib Workflow (a code sketch follows the steps):
  1. Load data into DataFrames
  2. Clean & preprocess
  3. Build pipeline
  4. Train model (e.g., Logistic Regression)
  5. Evaluate
  6. Save model
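A PySpark sketch of this workflow; the file path and the column names f1, f2, and label are placeholders, not part of any real dataset:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)   # 1. load into a DataFrame
df = df.dropna()                                                 # 2. clean & preprocess
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(df)                 # 3-4. build pipeline + train
model.transform(df).select("label", "prediction").show()         # 5. inspect predictions
model.write().overwrite().save("lr_model")                       # 6. save model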
14. Describe SLAM, mapping, localization, and navigation in robotics.

SLAM = Simultaneous Localization and Mapping.

Mapping:

Building a map of surrounding environment using LIDAR, camera, or depth sensors.

Localization:

Determining the robot’s position within the map.

Navigation:

Planning a safe path from A → B while avoiding obstacles.

SLAM enables robots to operate autonomously in unknown environments.

15. Explain Industry 4.0 and the role of robotics + data mining in smart factories.

Industry 4.0 = Fourth Industrial Revolution.

Key Technologies:
  • IoT sensors
  • Cloud computing
  • Robotics
  • AI and Data Mining
  • Automation
  • Predictive maintenance
Role of Robotics + Data Mining:
  • Automated production lines
  • Quality control through image mining
  • Predict machine failures
  • Optimize workflow
  • Real-time analytics
  • Collaborative robots (cobots)

Smart factories learn, adapt, and optimize automatically.
