Lecture 17 final question bank with 100 MCQs, 30 short questions, and 15 long questions for Data Mining. Exam-focused, solved, and syllabus-aligned.
SECTION A – MCQs (WITH ANSWERS)
1. Data Mining is a step within which larger process?
a) Data Warehousing  b) KDD  c) ETL  d) OLAP
2. The main goal of Data Mining is to:
a) Store data  b) Remove data  c) Discover patterns  d) Replicate data
3. The KDD process begins with:
a) Evaluation  b) Data Cleaning  c) Deployment  d) Pattern recognition
4. Structured data is stored in:
a) Images  b) Text files  c) Databases  d) Audio signals
5. Semi-structured data example:
a) JSON  b) Tables  c) Excel sheet  d) Image
6. Unstructured data DOES NOT include:
a) Images  b) Text  c) Videos  d) SQL tables
7. The main challenge in KDD is:
a) Too few tools  b) Massive noisy data  c) No storage  d) Too many algorithms
8. Data redundancy means:
a) Missing data  b) Repeated data  c) Noisy data  d) Scattered data
9. An example of a data repository is:
a) LIDAR  b) Data Warehouse  c) Drone GPS  d) RAM
10. Metadata describes:
a) Data about data  b) Images  c) SQL queries  d) Dashboards
11. KDD stands for:
a) Knowledge Discovery in Databases  b) Knowledge Data Development  c) Key Data Discovery  d) Knowledge Driven Data
12. Data integration merges data from:
a) One source  b) Multiple sources  c) Only sensors  d) Only databases
13. OLAP is used for:
a) Transactions  b) Analytics  c) Networking  d) File transfer
14. An example of categorical data:
a) Age  b) Height  c) Gender  d) Salary
15. Quantitative data means:
a) Non-numeric  b) Numeric  c) Images  d) Text
16. ETL stands for:
a) Extract–Translate–Load  b) Extract–Transform–Load  c) Evaluate–Transfer–Load  d) Extract–Transfer–Link
17. Which is NOT a stage of KDD?
a) Data selection  b) Data cleaning  c) Data loading  d) Pattern discovery
18. A row in a dataset is called a:
a) Feature  b) Attribute  c) Record  d) Class
19. A column in a dataset is called a:
a) Sample  b) Tuple  c) Feature  d) Instance
20. Data preprocessing is required because:
a) Data is always perfect  b) Data has issues  c) Algorithms need no cleaning  d) Storage is cheap
21. Removing noise from data is called:
a) Integration  b) Cleaning  c) Normalization  d) Reduction
22. Normalization is used to:
a) Increase noise  b) Scale data  c) Add missing values  d) Delete rows
23. Min-Max scaling converts values to:
a) −∞ to +∞  b) 0 to 1  c) 1 to 100  d) −1 to +1
24. Z-score normalization uses:
a) Mean & Std  b) Max only  c) Min only  d) Mode
25. Outliers are:
a) Normal data  b) Extreme values  c) Mean values  d) Balanced data
26. Missing values can be handled using:
a) Deletion  b) Imputation  c) Both  d) None
27. PCA is used for:
a) Clustering  b) Dimensionality reduction  c) Classification  d) Association
28. Binning is a:
a) Smoothing technique  b) Clustering technique  c) Classification technique  d) Prediction technique
29. Log transform reduces:
a) Variance  b) Bias  c) Noise only  d) Mean
30. Data reduction techniques include:
a) Scaling  b) Aggregation  c) Standardization  d) Cleaning
31. Normalization improves:
a) Model speed  b) Data storage  c) User interface  d) None
32. A common type of data cleaning:
a) Noise removal  b) Adding noise  c) Breaking data  d) Random deletion
33. Smoothing is used for:
a) Adding spikes  b) Removing irregularities  c) Adding labels  d) Storing data
34. Data transformation includes:
a) Normalization  b) Smoothing  c) Discretization  d) All options
35. Attribute construction means:
a) Creating new features  b) Removing features  c) Merging datasets  d) Scaling data
36. Missing values are common in:
a) Clean datasets  b) Real-world data  c) Random samples  d) Sensor errors only
37. Sampling is used for:
a) Making data larger  b) Speeding up processing  c) Removing features  d) Cleaning labels
38. Which is NOT a preprocessing step?
a) Cleaning  b) Integration  c) Transformation  d) Deployment
39. Feature extraction is needed because:
a) Models need fewer features  b) More features always slow performance  c) Both  d) None
40. KNN requires:
a) Categorical scaling  b) Distance-based scaling  c) No scaling  d) Random features
41. Association rule mining finds:
a) Sequential patterns  b) Item relationships  c) Clusters  d) Classes
42. Support measures:
a) Popularity  b) Confidence  c) Errors  d) Accuracy
43. Confidence measures:
a) Strength of implication  b) Frequency only  c) Distance  d) Noise
44. Lift > 1 means:
a) Strong positive association  b) Weak rule  c) Negative correlation  d) Independent variables
45. Apriori uses:
a) Top-down pruning  b) Bottom-up candidate generation  c) Random selection  d) None
46. Apriori property:
a) All subsets of frequent itemsets must be frequent  b) Subsets can be infrequent  c) Items must be numeric  d) Requires clustering
47. FP-growth avoids:
a) Candidate generation  b) Tree building  c) Support counting  d) Item sorting
48. Frequent pattern tree (FP-tree) stores:
a) Transactions  b) Patterns  c) Noise  d) None
49. Association rules are used in:
a) Market basket analysis  b) PCA  c) Regression  d) SVM
50. If Support = 0.3, it means:
a) 30% of transactions contain itemset  b) 30 items  c) 3 items  d) None
51. Association rules follow:
a) X → Y  b) Y → Y  c) X → X  d) None
52. Confidence formula:
a) support(X ∪ Y) / support(X)  b) support(X) / support(Y)  c) support(Y) / support(X)  d) length(X)/length(Y)
53. Apriori suffers from:
a) High time cost  b) Low memory  c) No correctness  d) No scalability
54. FP-Growth is faster because:
a) Compresses dataset  b) Ignores patterns  c) Removes items  d) Stores everything
55. An example of association:
a) If bread then butter  b) If cat then dog  c) If image then CNN  d) If data then noise
56. Market basket analysis is used by:
a) Banks  b) Retailers  c) Hospitals  d) All
57. Confidence close to 0 means:
a) Weak rule  b) Strong rule  c) No support  d) High lift
58. For strong rules, lift must be:
a) > 1  b) < 1  c) = 1  d) = 0
59. Conviction measures:
a) Correlation  b) Surprise  c) Support  d) Frequency
60. FP-growth uses:
a) Divide and conquer  b) Convolution  c) Linear regression  d) Random forests
61. Classification is used for:
a) Predicting a category  b) Finding patterns  c) Clustering data  d) None
62. Regression predicts:
a) Numeric values  b) Labels  c) Clusters  d) Distances
63. The simplest classifier is:
a) Decision tree  b) Naive Bayes  c) KNN  d) SVM
64. KNN relies on:
a) Nearest neighbors  b) Decision rules  c) Probability  d) Hyperplanes
65. SVM finds:
a) Maximum margin hyperplane  b) Minimum hyperplane  c) No separation  d) Random splits
66. Decision tree splits nodes using:
a) Entropy  b) Gini Index  c) Information gain  d) All
67. Random forest is an ensemble of:
a) Many decision trees  b) SVMs  c) Neural networks  d) PCA models
68. Clustering groups data by:
a) Similarity  b) Labels  c) Rules  d) Trees
69. KMeans minimizes:
a) Squared distances  b) Lift  c) Confidence  d) Support
70. KMeans requires:
a) K clusters  b) Labels  c) Prediction  d) Deep learning
71. Hierarchical clustering builds:
a) Dendrogram  b) FP-tree  c) CNN  d) ARIMA
72. DBSCAN identifies:
a) Density clusters  b) Hyperplanes  c) Neural activations  d) Trees
73. Outlier detection in clustering uses:
a) LOF  b) SVM  c) CNN  d) PCA
74. Prediction is:
a) Future estimation  b) Pattern discovery  c) Noise detection  d) Deletion
75. ROC curve evaluates:
a) Classification  b) Clustering  c) Association  d) FP-tree
76. A confusion matrix shows:
a) True/false predictions  b) Clusters  c) Patterns  d) Rules
77. Overfitting occurs when model:
a) Learns noise  b) Performs well  c) Ignores data  d) Removes patterns
78. Underfitting means:
a) Too simple  b) Too complex  c) Perfect  d) No training
79. Ensemble learning improves:
a) Accuracy  b) Noise  c) Class imbalance  d) None
80. Boosting reduces:
a) Bias  b) Variance  c) Data  d) Features
81. Deep learning learns:
a) Manual features  b) Automatic features  c) No features  d) Random values
82. CNNs are best for:
a) Images  b) Audio  c) SQL queries  d) Random data
83. RNNs handle:
a) Sequential data  b) Images  c) Noise  d) Random points
84. LSTM solves:
a) Vanishing gradients  b) Regression  c) Clustering  d) PCA
85. Autoencoders perform:
a) Dimensionality reduction  b) Clustering  c) Classification  d) Regression
86. Big Data is defined by:
a) 5 Vs  b) 2 Vs  c) 1 V  d) None
87. Hadoop stores data using:
a) HDFS  b) CNN  c) SQL only  d) Neural memory
88. Spark is faster than Hadoop because of:
a) In-memory processing  b) Disk-based mapping  c) Random access  d) No cluster
89. MLlib is part of:
a) Spark  b) Hadoop  c) SQL  d) Robotics
90. MapReduce has:
a) Map + Reduce phases  b) CNN layers  c) RNN loops  d) SQL joins
91. Robotics uses data mining for:
a) Perception  b) Navigation  c) Motion  d) All
92. SLAM stands for:
a) Simultaneous Localization and Mapping  b) Single Location Map  c) Sensor Location Analysis Model  d) Synthetic Learning Algorithm Model
93. LIDAR is used for:
a) 3D mapping  b) Audio  c) SQL queries  d) Classification
94. Drones use data mining for:
a) Obstacle avoidance  b) Route optimization  c) Detection  d) All
95. Edge computing helps robots by:
a) Reducing latency  b) Increasing delay  c) Storing old data  d) Removing sensors
96. Reinforcement Learning helps robots:
a) Learn actions  b) Learn images  c) Process SQL  d) Track metadata
97. Industry 4.0 uses robotics for:
a) Smart manufacturing  b) Random work  c) Storing files  d) Emails
98. Predictive maintenance uses:
a) Sensor mining  b) Email mining  c) Web mining  d) SQL joins
99. Automation increases:
a) Efficiency  b) Cost only  c) Errors  d) Waste
100. Intelligent robots combine:
a) AI + Sensors + Data Mining  b) Only sensors  c) Only motors  d) Only software
ANSWER KEY (MCQs)
1–20: B, C, B, C, A, D, B, B, B, A, A, B, B, C, B, B, C, C, C, B
21–40: B, B, B, A, B, C, B, A, A, B, A, A, B, D, A, B, B, D, C, B
41–60: B, A, A, A, B, A, A, B, A, A, A, A, A, A, A, D, A, A, B, A
61–80: A, A, B, A, A, D, A, A, A, A, A, A, A, A, A, A, A, A, A, A
81–100: B, A, A, A, A, A, A, A, A, A, D, A, A, D, A, A, A, A, A, A
SECTION B – 30 Short Questions (WITH ANSWERS)
1. Define Data Mining in one sentence.
Data Mining is the process of discovering meaningful patterns, trends, and relationships from large datasets.
2. What is the difference between KDD and Data Mining?
KDD is the entire process of extracting knowledge from data, while Data Mining is the specific step where algorithms discover patterns.
3. List three challenges of real-world data.
Real-world data is often noisy, incomplete, and inconsistent.
4. What is data cleaning?
Data cleaning is the process of correcting errors, removing noise, and handling missing or inconsistent values.
5. Define normalization.
Normalization is the process of scaling numeric values into a standardized range to improve model performance.
6. What is Min–Max scaling?
Min–Max scaling transforms values to a fixed range, typically 0 to 1, using the formula x' = (x − min) / (max − min). For example, with min = 10 and max = 50, the value 30 maps to (30 − 10) / (50 − 10) = 0.5.
7. Give an example of categorical and numerical data.
Categorical: Gender (Male/Female)
Numerical: Age (25, 30, 45)
8. What is the purpose of association rule mining?
To discover relationships and co-occurrence patterns among items in transactional datasets.
9. Define support and confidence.
Support: Frequency of an itemset in the dataset.
Confidence: Strength of implication in a rule, i.e., how often Y appears when X appears.
10. What is the Apriori property?
All subsets of a frequent itemset must also be frequent.
11. What is FP-growth?
FP-growth is a fast algorithm that mines frequent itemsets using a compact FP-tree without generating candidates.
12. Define classification.
Classification is the process of predicting categories or class labels based on input features.
13. What is the difference between classification and prediction?
Classification predicts categorical labels, while prediction (regression) predicts numeric values.
14. What is KNN used for?
KNN is used for classification or regression by comparing new samples to their closest neighbors.
15. Explain entropy in decision trees.
Entropy measures impurity or disorder in data; higher entropy means more mixed classes.
16. What is overfitting?
Overfitting occurs when a model learns noise instead of patterns and performs poorly on new data.
17. What is the purpose of clustering?
Clustering groups similar data points together without using predefined labels.
18. Define KMeans clustering.
KMeans partitions data into k clusters by assigning points to the nearest centroid and updating centroids iteratively.
19. What is hierarchical clustering?
A clustering technique that builds a tree-like structure (dendrogram) through merging or splitting clusters.
20. What is an outlier?
An outlier is a data point that significantly deviates from normal patterns.
21. Define deep learning.
Deep learning is a subset of AI that uses multi-layer neural networks to learn complex features automatically.
22. What is a convolutional layer?
A convolutional layer uses filters to extract spatial features from images or grid-like data.
23. What is an autoencoder?
An autoencoder is a neural network that compresses data into a lower dimension and reconstructs it to learn representations.
24. Define Big Data.
Big Data refers to extremely large and complex datasets that cannot be processed by traditional systems.
25. What is HDFS?
HDFS (Hadoop Distributed File System) is a distributed storage system that stores large data files across multiple machines.
26. What is Spark MLlib?
Spark MLlib is Apache Spark’s distributed machine-learning library for classification, clustering, regression, and feature extraction.
27. Define SLAM.
SLAM (Simultaneous Localization and Mapping) enables robots to build a map of their environment while tracking their own location.
28. What is sensor fusion?
Sensor fusion combines data from multiple sensors to improve accuracy, robustness, and perception.
29. What is reinforcement learning?
Reinforcement learning is a learning method where an agent learns optimal actions through rewards and penalties.
30. Define predictive maintenance.
Predictive maintenance uses sensor data and analytics to predict equipment failures before they occur.
SECTION C – 15 Long Questions (WITH ANSWERS)
1. Explain the full KDD process with a clean diagram.
Answer:
KDD (Knowledge Discovery in Databases) is the complete end-to-end process of transforming raw data into useful knowledge. It contains multiple stages designed to clean, prepare, transform, mine, and evaluate patterns.
Steps of KDD:
1. Data Selection
   - Identify relevant data sources.
   - Extract the subset needed for analysis.
2. Data Cleaning
   - Handle missing values
   - Remove noise
   - Fix inconsistencies and correct errors
3. Data Integration
   - Merge databases, files, IoT sensors, logs, or external datasets
   - Resolve schema conflicts and redundancy
4. Data Transformation
   - Normalize data
   - Reduce dimensionality
   - Convert data into mining-ready forms (scaling, aggregation)
5. Data Mining
   - Apply algorithms such as classification, clustering, association rule mining, and pattern extraction
6. Pattern Evaluation
   - Evaluate interestingness
   - Filter valuable patterns and remove irrelevant results
7. Knowledge Presentation
   - Use visualization, dashboards, reports, and charts to deliver insights.
Diagram (Text-Based):
Data Selection → Cleaning → Integration → Transformation → Data Mining → Evaluation → Knowledge
2. Describe all data preprocessing steps with examples.
Answer:
Data preprocessing improves data quality and prepares it for mining.
Major Steps:
1. Data Cleaning
   - Fix missing values (mean, median, interpolation)
   - Remove noise using smoothing
   - Correct inconsistent entries
   - Example: Replace null age values with the median age.
2. Data Integration
   - Combine multiple sources and resolve duplicates
   - Merge tables (SQL JOIN)
   - Example: Merge the customer database with purchase logs.
3. Data Transformation
   - Normalization (Min–Max scaling, Z-score)
   - Aggregation (daily → weekly sales)
   - Discretization (convert age → age groups)
4. Data Reduction
   - Dimensionality reduction (PCA)
   - Sampling (random sample to reduce size)
   - Data compression
5. Data Discretization
   - Convert numeric data to categorical bins
   - Example: Convert income into Low, Medium, High.
6. Feature Engineering
   - Construct new features from existing ones
   - Example: BMI = weight / height² (see the sketch below).
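A minimal code sketch of several of these steps, assuming pandas and scikit-learn are available; the file name and column names (age, income, weight, height) are hypothetical:

```python
# Hypothetical "patients.csv" with age, income, weight, height columns.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("patients.csv")

# Cleaning: fill missing ages with the median, drop exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())
df = df.drop_duplicates()

# Transformation: Min-Max normalization of income into [0, 1]
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Discretization: numeric income -> Low / Medium / High bins
df["income_band"] = pd.cut(df["income"], bins=3, labels=["Low", "Medium", "High"])

# Feature engineering: construct BMI from existing attributes
df["bmi"] = df["weight"] / df["height"] ** 2
```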
3. Explain normalization techniques (Min–Max, Z-score, Decimal scaling).
Answer:
1. Min–Max Normalization
Scales values between 0 and 1:
x' = (x − min) / (max − min)
Use Case: Algorithms sensitive to distances (KNN, SVM).
2. Z-Score Normalization (Standardization)
Uses mean and standard deviation:
z = (x − μ) / σ
Use Case: When data has outliers or unknown ranges.
3. Decimal Scaling
Divides values by powers of 10:
x' = x / 10^j, where j is the smallest integer such that max(|x'|) < 1
Use Case: When values are large (like population data).
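A small NumPy sketch of the three techniques on a made-up array (values chosen only for illustration):

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 1000.0])

min_max = (x - x.min()) / (x.max() - x.min())   # values rescaled into [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, standard deviation 1

j = int(np.floor(np.log10(np.abs(x).max()))) + 1  # smallest j with max(|x / 10**j|) < 1
decimal = x / 10 ** j                             # decimal scaling

print(min_max, z_score, decimal, sep="\n")
```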
4. Describe the working of the Apriori algorithm with an example.
Answer:
Apriori is a bottom-up algorithm that finds frequent itemsets using iterative passes.
Working Steps:
- Count support of each item → keep frequent items.
- Generate candidate pairs → count support → prune infrequent.
- Generate 3-itemsets from frequent pairs → repeat.
- Continue until no more frequent sets exist.
Example Dataset (Transactions):
| TID | Items |
|---|---|
| 1 | A, B, C |
| 2 | A, C |
| 3 | B, C |
| 4 | A, B |
If minimum support = 50%
- Frequent 1-itemsets: A, B, C
- Frequent 2-itemsets: AB, AC, BC
- 3-itemset ABC is not frequent: it appears in only 1 of 4 transactions (support = 25% < 50%), so candidate generation stops here.
Apriori outputs patterns like:
- A → C
- B → C
- A → B
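A quick way to check the supports and confidences in this example is a few lines of plain Python (no mining library assumed); the helper function below is purely illustrative:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# Frequent itemsets at min_support = 0.5
items = {"A", "B", "C"}
for size in (1, 2, 3):
    frequent = [set(c) for c in combinations(sorted(items), size) if support(set(c)) >= 0.5]
    print(size, frequent)   # size 3 prints [] because support({A,B,C}) = 0.25

# Confidence of A -> C: support(A ∪ C) / support(A) = 0.5 / 0.75 ≈ 0.67
print(support({"A", "C"}) / support({"A"}))
```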
5. Compare Apriori vs FP-Growth.
| Feature | Apriori | FP-Growth |
|---|---|---|
| Method | Generates candidate itemsets | No candidate generation |
| Speed | Slow | Much faster |
| Memory | High | Low |
| Data Structure | Hash tree of candidate itemsets | FP-tree |
| Scans | Many | Two scans |
| Best for | Small datasets | Large datasets |
Conclusion: FP-Growth is more efficient, scalable, and faster than Apriori.
6. Explain classification vs prediction with real-world examples.
Classification predicts categorical labels.
Examples:
- Spam or Not Spam
- Disease Yes/No
- Loan Approved/Rejected
Prediction (Regression) predicts numeric values.
Examples:
- Predict house price
- Predict temperature
- Predict sales for next month
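A toy scikit-learn sketch of the distinction, assuming scikit-learn is installed; the feature values and prices are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a category (spam = 1 / not spam = 0) from two made-up features
X_cls = [[0.1, 5], [0.9, 40], [0.2, 3], [0.8, 35]]
y_cls = [0, 1, 0, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[0.85, 38]]))      # -> a class label, e.g. [1]

# Prediction (regression): predict a numeric value (house price) from size in m²
X_reg = [[50], [80], [120], [200]]
y_reg = [100_000, 160_000, 240_000, 400_000]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))           # -> a number, here 200000.0
```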
7. Describe Decision Trees, Gini Index, and Entropy with formulas.
Decision Trees:
A supervised model that splits data based on attribute importance to create a tree structure.
Entropy (Impurity Measure)
Used in ID3 and C4.5:
Entropy(S) = − Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i in node S.
Information Gain
Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ), i.e., the reduction in entropy after splitting S on attribute A.
Gini Index
Used in CART:
Gini(S) = 1 − Σ pᵢ²
Lower Gini = purer node.
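A short sketch of how these impurity values are computed for a single node, assuming a hypothetical node with 9 positive and 5 negative examples:

```python
import math

counts = [9, 5]                       # class counts at the node (hypothetical)
total = sum(counts)
p = [c / total for c in counts]       # class proportions

entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)   # ≈ 0.940
gini = 1 - sum(pi ** 2 for pi in p)                         # ≈ 0.459
print(entropy, gini)
```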
8. Explain KMeans clustering with step-by-step working.
Steps:
- Choose K cluster centers (random).
- Assign each point to nearest centroid.
- Recalculate centroids.
- Reassign points.
- Repeat until centroids stabilize.
Example:
For K=2, algorithm splits data into two groups based on minimum distance to two centroids.
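A minimal KMeans run on made-up 2-D points (scikit-learn assumed; K = 2 as in the example above):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 1.8], [8, 8], [9, 9], [8.5, 9.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)           # cluster index assigned to each point, e.g. [0 0 0 1 1 1]
print(km.cluster_centers_)  # the two centroids found by the iterative updates
```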
9. Describe DBSCAN and its advantages for anomaly detection.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Groups points based on density.
Key Terms:
- ε (eps): neighborhood radius
- MinPts: minimum number of points required in an ε-neighborhood
- Core Point: has at least MinPts points within ε
- Border Point: lies within ε of a core point but is not itself a core point
- Noise Point: belongs to no cluster (outlier)
Advantages:
- Detects arbitrary shaped clusters
- Automatically detects outliers
- Works well with noisy data
- No need to pre-specify number of clusters
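A small DBSCAN sketch on toy points, assuming scikit-learn; eps and min_samples (scikit-learn's name for MinPts) are illustrative values, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3], [8, 8], [8.1, 8.2], [50, 50]])
db = DBSCAN(eps=1.0, min_samples=2).fit(points)

print(db.labels_)   # e.g. [0 0 0 1 1 -1]; the label -1 marks a noise point (outlier)
```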
10. Explain deep learning architecture: input layer, hidden layers, activation functions.
Input Layer:
Receives raw data (pixels, words, numbers).
Hidden Layers:
Perform transformations:
- Convolution
- Recurrent loops
- Feature extraction
- Nonlinear modeling
Activation Functions:
Allow networks to learn nonlinear patterns.
Examples:
- ReLU
- Sigmoid
- Softmax
- Tanh
Deep networks stack multiple layers to learn high-level features.
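A minimal stacked network matching this description, assuming TensorFlow/Keras is available; the layer sizes and the 784-pixel input are arbitrary choices for illustration:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                      # input layer: e.g. a 28x28 image flattened to 784 values
    tf.keras.layers.Dense(128, activation="relu"),     # hidden layer with ReLU nonlinearity
    tf.keras.layers.Dense(64, activation="relu"),      # second hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),   # output layer: probabilities over 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```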
11. Compare CNN, RNN, LSTM, and Autoencoders with use cases.
| Model | Description | Best Use Case |
|---|---|---|
| CNN | Extracts spatial features | Images, vision |
| RNN | Sequential modeling | Text, speech |
| LSTM | Long-term memory | Time-series, translation |
| Autoencoders | Compress & reconstruct | Anomaly detection, compression |
12. Describe Hadoop architecture: HDFS, NameNode, DataNode, replication.
HDFS: Distributed file system storing massive datasets.
NameNode:
- Master server
- Stores file metadata
- Tracks file locations
DataNode:
- Stores actual blocks
- Handles read/write operations
Replication:
Each block stored across multiple nodes (default replication factor = 3) for fault tolerance.
13. Explain Apache Spark architecture and MLlib workflow.
Spark Architecture:
- Driver Program: Controls the application
- Cluster Manager: Allocates resources
- Executors: Run tasks on worker nodes
- RDD/DataFrames: Distributed datasets
MLlib Workflow:
- Load data into DataFrames
- Clean & preprocess
- Build pipeline
- Train model (e.g., Logistic Regression)
- Evaluate
- Save model
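A sketch of this workflow in PySpark, assuming a working Spark installation; the CSV file, column names, and save path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# 1. Load data into a DataFrame (hypothetical file and columns)
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# 2-3. Preprocess and build a pipeline: assemble feature columns, then a classifier
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# 4-6. Train, inspect predictions (on the training data here, for brevity), and save
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show(5)
model.write().overwrite().save("lr_model")
```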
14. Describe SLAM, mapping, localization, and navigation in robotics.
SLAM = Simultaneous Localization and Mapping.
Mapping:
Building a map of surrounding environment using LIDAR, camera, or depth sensors.
Localization:
Determining the robot’s position within the map.
Navigation:
Planning a safe path from A → B while avoiding obstacles.
SLAM enables robots to operate autonomously in unknown environments.
15. Explain Industry 4.0 and the role of robotics + data mining in smart factories.
Industry 4.0 = Fourth Industrial Revolution.
Key Technologies:
- IoT sensors
- Cloud computing
- Robotics
- AI and Data Mining
- Automation
- Predictive maintenance
Role of Robotics + Data Mining:
- Automated production lines
- Quality control through image mining
- Predict machine failures
- Optimize workflow
- Real-time analytics
- Collaborative robots (cobots)
Smart factories learn, adapt, and optimize automatically.