Lecture 17 – Final Exam Bank for Data Mining (Massive Question Set)

This Lecture 17 final question bank for Data Mining contains 100 solved MCQs, 30 short questions, and 15 long questions, all syllabus-aligned and focused on exam preparation.

SECTION A – MCQs (WITH ANSWERS)

  1. Data Mining is a step within which larger process?
    a) Data Warehousing
    b) KDD
    c) ETL
    d) OLAP
  2. The main goal of Data Mining is to:
    a) Store data
    b) Remove data
    c) Discover patterns
    d) Replicate data
  3. The KDD process begins with:
    a) Evaluation
    b) Data Cleaning
    c) Deployment
    d) Pattern recognition
  4. Structured data is stored in:
    a) Images
    b) Text files
    c) Databases
    d) Audio signals
  5. Semi-structured data example:
    a) JSON
    b) Tables
    c) Excel sheet
    d) Image
  6. Unstructured data DOES NOT include:
    a) Images
    b) Text
    c) Videos
    d) SQL tables
  7. The main challenge in KDD is:
    a) Too few tools
    b) Massive noisy data
    c) No storage
    d) Too many algorithms
  8. Data redundancy means:
    a) Missing data
    b) Repeated data
    c) Noisy data
    d) Scattered data
  9. An example of a data repository is:
    a) LIDAR
    b) Data Warehouse
    c) Drone GPS
    d) RAM
  10. Metadata describes:
    a) Data about data
    b) Images
    c) SQL queries
    d) Dashboards
  11. KDD stands for:
    a) Knowledge Discovery in Databases
    b) Knowledge Data Development
    c) Key Data Discovery
    d) Knowledge Driven Data
  12. Data integration merges data from:
    a) One source
    b) Multiple sources
    c) Only sensors
    d) Only databases
  13. OLAP is used for:
    a) Transactions
    b) Analytics
    c) Networking
    d) File transfer
  14. An example of categorical data:
    a) Age
    b) Height
    c) Gender
    d) Salary
  15. Quantitative data means:
    a) Non-numeric
    b) Numeric
    c) Images
    d) Text
  16. ETL stands for:
    a) Extract–Translate–Load
    b) Extract–Transform–Load
    c) Evaluate–Transfer–Load
    d) Extract–Transfer–Link
  17. Which is NOT a stage of KDD?
    a) Data selection
    b) Data cleaning
    c) Data loading
    d) Pattern discovery
  18. A row in a dataset is called a:
    a) Feature
    b) Attribute
    c) Record
    d) Class
  19. A column in a dataset is called a:
    a) Sample
    b) Tuple
    c) Feature
    d) Instance
  20. Data preprocessing is required because:
    a) Data is always perfect
    b) Data has issues
    c) Algorithms need no cleaning
    d) Storage is cheap
  21. Removing noise from data is called:
    a) Integration
    b) Cleaning
    c) Normalization
    d) Reduction
  22. Normalization is used to:
    a) Increase noise
    b) Scale data
    c) Add missing values
    d) Delete rows
  23. Min-Max scaling converts values to:
    a) −∞ to +∞
    b) 0 to 1
    c) 1 to 100
    d) −1 to +1
  24. Z-score normalization uses:
    a) Mean & Std
    b) Max only
    c) Min only
    d) Mode
  25. Outliers are:
    a) Normal data
    b) Extreme values
    c) Mean values
    d) Balanced data
  26. Missing values can be handled using:
    a) Deletion
    b) Imputation
    c) Both
    d) None
  27. PCA is used for:
    a) Clustering
    b) Dimensionality reduction
    c) Classification
    d) Association
  28. Binning is a:
    a) Smoothing technique
    b) Clustering technique
    c) Classification technique
    d) Prediction technique
  29. Log transform reduces:
    a) Variance
    b) Bias
    c) Noise only
    d) Mean
  30. Data reduction techniques include:
    a) Scaling
    b) Aggregation
    c) Standardization
    d) Cleaning
  31. Normalization improves:
    a) Model speed
    b) Data storage
    c) User interface
    d) None
  32. A common type of data cleaning:
    a) Noise removal
    b) Adding noise
    c) Breaking data
    d) Random deletion
  33. Smoothing is used for:
    a) Adding spikes
    b) Removing irregularities
    c) Adding labels
    d) Storing data
  34. Data transformation includes:
    a) Normalization
    b) Smoothing
    c) Discretization
    d) All options
  35. Attribute construction means:
    a) Creating new features
    b) Removing features
    c) Merging datasets
    d) Scaling data
  36. Missing values are common in:
    a) Clean datasets
    b) Real-world data
    c) Random samples
    d) Sensor errors only
  37. Sampling is used for:
    a) Making data larger
    b) Speeding up processing
    c) Removing features
    d) Cleaning labels
  38. Which is NOT a preprocessing step?
    a) Cleaning
    b) Integration
    c) Transformation
    d) Deployment
  39. Feature extraction is needed because:
    a) Models need fewer features
    b) More features always slow performance
    c) Both
    d) None
  40. KNN requires:
    a) Categorical scaling
    b) Distance-based scaling
    c) No scaling
    d) Random features
  41. Association rule mining finds:
    a) Sequential patterns
    b) Item relationships
    c) Clusters
    d) Classes
  42. Support measures:
    a) Popularity
    b) Confidence
    c) Errors
    d) Accuracy
  43. Confidence measures:
    a) Strength of implication
    b) Frequency only
    c) Distance
    d) Noise
  44. Lift > 1 means:
    a) Strong positive association
    b) Weak rule
    c) Negative correlation
    d) Independent variables
  45. Apriori uses:
    a) Top-down pruning
    b) Bottom-up candidate generation
    c) Random selection
    d) None
  46. Apriori property:
    a) All subsets of frequent itemsets must be frequent
    b) Subsets can be infrequent
    c) Items must be numeric
    d) Requires clustering
  47. FP-growth avoids:
    a) Candidate generation
    b) Tree building
    c) Support counting
    d) Item sorting
  48. Frequent pattern tree (FP-tree) stores:
    a) Transactions
    b) Patterns
    c) Noise
    d) None
  49. Association rules are used in:
    a) Market basket analysis
    b) PCA
    c) Regression
    d) SVM
  50. If Support = 0.3, it means:
    a) 30% of transactions contain itemset
    b) 30 items
    c) 3 items
    d) None
  51. Association rules follow:
    a) X → Y
    b) Y → Y
    c) X → X
    d) None
  52. Confidence formula:
    a) support(X ∪ Y) / support(X)
    b) support(X) / support(Y)
    c) support(Y) / support(X)
    d) length(X)/length(Y)
  53. Apriori suffers from:
    a) High time cost
    b) Low memory
    c) No correctness
    d) No scalability
  54. FP-Growth is faster because:
    a) Compresses dataset
    b) Ignores patterns
    c) Removes items
    d) Stores everything
  55. An example of association:
    a) If bread then butter
    b) If cat then dog
    c) If image then CNN
    d) If data then noise
  56. Market basket analysis is used by:
    a) Banks
    b) Retailers
    c) Hospitals
    d) All
  57. Confidence close to 0 means:
    a) Weak rule
    b) Strong rule
    c) No support
    d) High lift
  58. For strong rules, lift must be:
    a) > 1
    b) < 1
    c) = 1
    d) = 0
  59. Conviction measures:
    a) Correlation
    b) Surprise
    c) Support
    d) Frequency
  60. FP-growth uses:
    a) Divide and conquer
    b) Convolution
    c) Linear regression
    d) Random forests
  61. Classification is used for:
    a) Predicting a category
    b) Finding patterns
    c) Clustering data
    d) None
  62. Regression predicts:
    a) Numeric values
    b) Labels
    c) Clusters
    d) Distances
  63. The simplest classifier is:
    a) Decision tree
    b) Naive Bayes
    c) KNN
    d) SVM
  64. KNN relies on:
    a) Nearest neighbors
    b) Decision rules
    c) Probability
    d) Hyperplanes
  65. SVM finds:
    a) Maximum margin hyperplane
    b) Minimum hyperplane
    c) No separation
    d) Random splits
  66. Decision tree splits nodes using:
    a) Entropy
    b) Gini Index
    c) Information gain
    d) All
  67. Random forest is an ensemble of:
    a) Many decision trees
    b) SVMs
    c) Neural networks
    d) PCA models
  68. Clustering groups data by:
    a) Similarity
    b) Labels
    c) Rules
    d) Trees
  69. KMeans minimizes:
    a) Squared distances
    b) Lift
    c) Confidence
    d) Support
  70. KMeans requires:
    a) K clusters
    b) Labels
    c) Prediction
    d) Deep learning
  71. Hierarchical clustering builds:
    a) Dendrogram
    b) FP-tree
    c) CNN
    d) ARIMA
  72. DBSCAN identifies:
    a) Density clusters
    b) Hyperplanes
    c) Neural activations
    d) Trees
  73. Outlier detection in clustering uses:
    a) LOF
    b) SVM
    c) CNN
    d) PCA
  74. Prediction is:
    a) Future estimation
    b) Pattern discovery
    c) Noise detection
    d) Deletion
  75. ROC curve evaluates:
    a) Classification
    b) Clustering
    c) Association
    d) FP-tree
  76. A confusion matrix shows:
    a) True/false predictions
    b) Clusters
    c) Patterns
    d) Rules
  77. Overfitting occurs when model:
    a) Learns noise
    b) Performs well
    c) Ignores data
    d) Removes patterns
  78. Underfitting means:
    a) Too simple
    b) Too complex
    c) Perfect
    d) No training
  79. Ensemble learning improves:
    a) Accuracy
    b) Noise
    c) Class imbalance
    d) None
  80. Boosting reduces:
    a) Bias
    b) Variance
    c) Data
    d) Features
  81. Deep learning learns:
    a) Manual features
    b) Automatic features
    c) No features
    d) Random values
  82. CNNs are best for:
    a) Images
    b) Audio
    c) SQL queries
    d) Random data
  83. RNNs handle:
    a) Sequential data
    b) Images
    c) Noise
    d) Random points
  84. LSTM solves:
    a) Vanishing gradients
    b) Regression
    c) Clustering
    d) PCA
  85. Autoencoders perform:
    a) Dimensionality reduction
    b) Clustering
    c) Classification
    d) Regression
  86. Big Data is defined by:
    a) 5 Vs
    b) 2 Vs
    c) 1 V
    d) None
  87. Hadoop stores data using:
    a) HDFS
    b) CNN
    c) SQL only
    d) Neural memory
  88. Spark is faster than Hadoop because of:
    a) In-memory processing
    b) Disk-based mapping
    c) Random access
    d) No cluster
  89. MLlib is part of:
    a) Spark
    b) Hadoop
    c) SQL
    d) Robotics
  90. MapReduce has:
    a) Map + Reduce phases
    b) CNN layers
    c) RNN loops
    d) SQL joins
  91. Robotics uses data mining for:
    a) Perception
    b) Navigation
    c) Motion
    d) All
  92. SLAM stands for:
    a) Simultaneous Localization and Mapping
    b) Single Location Map
    c) Sensor Location Analysis Model
    d) Synthetic Learning Algorithm Model
  93. LIDAR is used for:
    a) 3D mapping
    b) Audio
    c) SQL queries
    d) Classification
  94. Drones use data mining for:
    a) Obstacle avoidance
    b) Route optimization
    c) Detection
    d) All
  95. Edge computing helps robots by:
    a) Reducing latency
    b) Increasing delay
    c) Storing old data
    d) Removing sensors
  96. Reinforcement Learning helps robots:
    a) Learn actions
    b) Learn images
    c) Process SQL
    d) Track metadata
  97. Industry 4.0 uses robotics for:
    a) Smart manufacturing
    b) Random work
    c) Storing files
    d) Emails
  98. Predictive maintenance uses:
    a) Sensor mining
    b) Email mining
    c) Web mining
    d) SQL joins
  99. Automation increases:
    a) Efficiency
    b) Cost only
    c) Errors
    d) Waste
  100. Intelligent robots combine:
    a) AI + Sensors + Data Mining
    b) Only sensors
    c) Only motors
    d) Only software

ANSWER KEY (MCQs)

1–20: B, C, B, C, A, D, B, B, B, A, A, B, B, C, B, B, C, C, C, B
21–40: B, B, B, A, B, C, B, A, A, B, A, A, B, D, A, B, B, D, C, B
41–60: B, A, A, A, B, A, A, B, A, A, A, A, A, A, A, D, A, A, B, A
61–80: A, A, B, A, A, D, A, A, A, A, A, A, A, A, A, A, A, A, A, A
81–100: B, A, A, A, A, A, A, A, A, A, D, A, A, D, A, A, A, A, A, A


SECTION B – 30 Short Questions (WITH ANSWERS)

1. Define Data Mining in one sentence.

Data Mining is the process of discovering meaningful patterns, trends, and relationships from large datasets.

2. What is the difference between KDD and Data Mining?

KDD is the entire process of extracting knowledge from data, while Data Mining is the specific step where algorithms discover patterns.

3. List three challenges of real-world data.

Real-world data is often noisy, incomplete, and inconsistent.

4. What is data cleaning?

Data cleaning is the process of correcting errors, removing noise, and handling missing or inconsistent values.

5. Define normalization.

Normalization is the process of scaling numeric values into a standardized range to improve model performance.

6. What is Min–Max scaling?

Min–Max scaling transforms values to a fixed range, typically 0 to 1, using the formula: $X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$

7. Give an example of categorical and numerical data.

Categorical: Gender (Male/Female)
Numerical: Age (25, 30, 45)

8. What is the purpose of association rule mining?

To discover relationships and co-occurrence patterns among items in transactional datasets.

9. Define support and confidence.

Support: Frequency of an itemset in the dataset.
Confidence: Strength of implication in a rule, i.e., how often Y appears when X appears.
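A quick Python illustration (a minimal sketch over a hypothetical four-transaction basket; item names are made up):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
]
def support(itemset, transactions):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)
X, Y = {"bread"}, {"butter"}
print(support(X | Y, transactions))                             # support(X ∪ Y) = 0.5
print(support(X | Y, transactions) / support(X, transactions))  # confidence(X → Y) ≈ 0.67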

10. What is Apriori property?

All subsets of a frequent itemset must also be frequent.

11. What is FP-growth?

FP-growth is a fast algorithm that mines frequent itemsets using a compact FP-tree without generating candidates.

12. Define classification.

Classification is the process of predicting categories or class labels based on input features.

13. What is the difference between classification and prediction?

Classification predicts categorical labels, while prediction (regression) predicts numeric values.

14. What is KNN used for?

KNN is used for classification or regression by comparing new samples to their closest neighbors.

15. Explain entropy in decision trees.

Entropy measures impurity or disorder in data; higher entropy means more mixed classes.

16. What is overfitting?

Overfitting occurs when a model learns noise instead of patterns and performs poorly on new data.

17. What is the purpose of clustering?

Clustering groups similar data points together without using predefined labels.

18. Define KMeans clustering.

KMeans partitions data into k clusters by assigning points to the nearest centroid and updating centroids iteratively.

19. What is hierarchical clustering?

A clustering technique that builds a tree-like structure (dendrogram) through merging or splitting clusters.

20. What is an outlier?

An outlier is a data point that significantly deviates from normal patterns.

21. Define deep learning.

Deep learning is a subset of machine learning that uses multi-layer neural networks to learn complex features automatically.

22. What is a convolutional layer?

A convolutional layer uses filters to extract spatial features from images or grid-like data.

23. What is an autoencoder?

An autoencoder is a neural network that compresses data into a lower dimension and reconstructs it to learn representations.

24. Define Big Data.

Big Data refers to extremely large and complex datasets that cannot be processed by traditional systems.

25. What is HDFS?

HDFS (Hadoop Distributed File System) is a distributed storage system that stores large data files across multiple machines.

26. What is Spark MLlib?

Spark MLlib is Apache Spark’s distributed machine-learning library for classification, clustering, regression, and feature extraction.

27. Define SLAM.

SLAM (Simultaneous Localization and Mapping) enables robots to build a map of their environment while tracking their own location.

28. What is sensor fusion?

Sensor fusion combines data from multiple sensors to improve accuracy, robustness, and perception.

29. What is reinforcement learning?

Reinforcement learning is a learning method where an agent learns optimal actions through rewards and penalties.

30. Define predictive maintenance.

Predictive maintenance uses sensor data and analytics to predict equipment failures before they occur.


SECTION C – 15 Long Questions (WITH ANSWERS)

1. Explain the full KDD process with a clean diagram.

Answer:
KDD (Knowledge Discovery in Databases) is the complete end-to-end process of transforming raw data into useful knowledge. It contains multiple stages designed to clean, prepare, transform, mine, and evaluate patterns.

Steps of KDD:
  1. Data Selection
    • Identify relevant data sources.
    • Extract the subset needed for analysis.
  2. Data Cleaning
    • Handle missing values
    • Remove noise
    • Fix inconsistencies
    • Correct errors
  3. Data Integration
    • Merge databases, files, IoT sensors, logs, or external datasets
    • Resolve schema conflicts and redundancy
  4. Data Transformation
    • Normalize data
    • Reduce dimensionality
    • Convert data into forms suitable for mining (scaling, aggregation)
  5. Data Mining
    • Apply algorithms such as classification, clustering, association rules, pattern extraction, etc.
  6. Pattern Evaluation
    • Evaluate interestingness
    • Filter valuable patterns
    • Remove irrelevant results
  7. Knowledge Presentation
    • Use visualization, dashboards, reports, and charts to deliver insights.
Diagram (Text-Based):
Data Selection → Cleaning → Integration → Transformation → Data Mining → Evaluation → Knowledge
2. Describe all data preprocessing steps with examples.

Answer:
Data preprocessing improves data quality and prepares it for mining; a short code sketch follows the steps below.

Major Steps:
  1. Data Cleaning
    • Fix missing values (mean, median, interpolation)
    • Remove noise using smoothing
    • Correct inconsistent entries
    • Example: Replace null age values with the median age.
  2. Data Integration
    • Combine multiple sources
    • Resolve duplicates
    • Merge tables (SQL JOIN)
    • Example: Merge customer database with purchase logs.
  3. Data Transformation
    • Normalization (Min–Max scaling, Z-score)
    • Aggregation (daily → weekly sales)
    • Discretization (convert age → age groups)
  4. Data Reduction
    • Dimensionality reduction (PCA)
    • Sampling (random sample to reduce size)
    • Data compression
  5. Data Discretization
    • Convert numeric data to categorical bins
    • Example: Convert income into Low, Medium, High.
  6. Feature Engineering
    • Construct new features from existing ones
    • Example: BMI = weight / height².
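A minimal sketch of several of these steps with pandas and scikit-learn (the table and column names are hypothetical):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({
    "age":    [25, None, 45, 30],
    "income": [30000, 52000, None, 41000],
})
# Cleaning: impute missing values with the column median
df = df.fillna(df.median(numeric_only=True))
# Transformation: Min-Max scale numeric columns into [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
# Discretization: bin scaled income into Low / Medium / High
df["income_band"] = pd.cut(df["income"], bins=3, labels=["Low", "Medium", "High"])
print(df)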
3. Explain normalization techniques (Min–Max, Z-score, Decimal scaling).

Answer:

1. Min–Max Normalization

Scales values between 0 and 1:

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

Use Case: Algorithms sensitive to distances (KNN, SVM).

2. Z-Score Normalization (Standardization)

Uses the mean and standard deviation:

$x' = \frac{x - \mu}{\sigma}$

Use Case: When data has outliers or unknown ranges.

3. Decimal Scaling

Divides values by a power of 10:

$x' = \frac{x}{10^j}$

Use Case: When values are large (like population data).
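The three techniques side by side in NumPy (a sketch; the sample values are arbitrary):

import numpy as np
x = np.array([120.0, 250.0, 480.0, 1000.0])
# Min-Max normalization -> values in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())
# Z-score standardization -> mean 0, standard deviation 1
zscore = (x - x.mean()) / x.std()
# Decimal scaling: divide by 10^j so that max(|x'|) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal = x / 10**j
print(minmax, zscore, decimal, sep="\n")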

4. Describe the working of the Apriori algorithm with an example.

Answer:
Apriori is a bottom-up algorithm that finds frequent itemsets in iterative, level-wise passes; a minimal Python sketch follows the worked example below.

Working Steps:
  1. Count support of each item → keep frequent items.
  2. Generate candidate pairs → count support → prune infrequent.
  3. Generate 3-itemsets from frequent pairs → repeat.
  4. Continue until no more frequent sets exist.
Example Dataset (Transactions):

TID | Items
1   | A, B, C
2   | A, C
3   | B, C
4   | A, B

If minimum support = 50%:
  • Frequent 1-itemsets: A, B, C (each appears in 3/4 transactions)
  • Frequent 2-itemsets: AB, AC, BC (each appears in 2/4 transactions → support = 50%)
  • The 3-itemset ABC appears in only 1/4 transactions (support = 25%), so it is pruned as infrequent

Apriori outputs patterns like:

  • A → C
  • B → C
  • A → B
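A minimal, unoptimized Python sketch of the level-wise procedure above, reproducing the example's result:

from itertools import combinations
transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
min_support = 0.5
def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    k, current, result = 1, [frozenset([i]) for i in items], {}
    while current:
        # count support of this level's candidates and keep the frequent ones
        level = {c: sum(c <= t for t in transactions) / len(transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        result.update(level)
        # join frequent k-itemsets into (k+1)-item candidates for the next pass
        k += 1
        current = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
    return result
for itemset, s in frequent_itemsets(transactions, min_support).items():
    print(sorted(itemset), s)   # A, B, C, AB, AC, BC survive; ABC is generated but pruned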
5. Compare Apriori vs FP-Growth.
Feature         | Apriori                      | FP-Growth
Method          | Generates candidate itemsets | No candidate generation
Speed           | Slow                         | Much faster
Memory          | High                         | Low
Data structure  | Candidate lists / hash trees | FP-tree
Database scans  | Many                         | Two
Best for        | Small datasets               | Large datasets

Conclusion: FP-Growth is more efficient and scalable than Apriori, and typically much faster.
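If the mlxtend library is available, both algorithms can be run on the same one-hot encoded transactions and should return identical frequent itemsets (a sketch, assuming mlxtend is installed):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth
transactions = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["A", "B"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(apriori(df, min_support=0.5, use_colnames=True))   # candidate-generation approach
print(fpgrowth(df, min_support=0.5, use_colnames=True))  # FP-tree approach, same output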

6. Explain classification vs prediction with real-world examples.

Classification predicts categorical labels.

Examples:

  • Spam or Not Spam
  • Disease Yes/No
  • Loan Approved/Rejected

Prediction (Regression) predicts numeric values; a small code contrast follows the examples.

Examples:

  • Predict house price
  • Predict temperature
  • Predict sales for next month
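The contrast in scikit-learn terms: the same features, but a categorical target versus a numeric one (a sketch with made-up house data):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
X = [[1200], [1500], [2000], [2400]]                    # house size (sq ft)
y_label = ["cheap", "cheap", "expensive", "expensive"]  # classification target
y_price = [100_000, 140_000, 210_000, 260_000]          # regression target
clf = DecisionTreeClassifier().fit(X, y_label)
reg = DecisionTreeRegressor().fit(X, y_price)
print(clf.predict([[1900]]))  # a category, e.g. ['expensive']
print(reg.predict([[1900]]))  # a number, e.g. [210000.]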
7. Describe Decision Trees, Gini Index, and Entropy with formulas.

Decision Trees:
A supervised model that splits data based on attribute importance to create a tree structure.

Entropy (Impurity Measure)

$\text{Entropy}(S) = -p_1\log_2(p_1) - p_2\log_2(p_2)$

Used in ID3 and C4.5.


Information Gain

$IG = \text{Entropy}(parent) - \sum \frac{|child|}{|parent|} \cdot \text{Entropy}(child)$

Gini Index

Used in CART:

$Gini = 1 - \sum p_i^2$

Lower Gini = purer node.
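Both impurity measures in a few lines of NumPy (p is a node's class-probability vector):

import numpy as np
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # skip zero-probability classes to avoid log2(0)
    return -(p * np.log2(p)).sum()
def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5 -> maximally impure two-class node
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # 0.0 0.0 -> pure node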

8. Explain KMeans clustering with step-by-step working.

Steps:

  1. Choose K cluster centers (random).
  2. Assign each point to nearest centroid.
  3. Recalculate centroids.
  4. Reassign points.
  5. Repeat until centroids stabilize.
Example:

For K=2, the algorithm splits the data into two groups based on minimum distance to the two centroids; a scikit-learn sketch follows.
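A minimal scikit-learn example on toy 2-D points (coordinates are arbitrary):

import numpy as np
from sklearn.cluster import KMeans
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # e.g. [1 1 0 0] -- two groups by distance
print(km.cluster_centers_)  # the two final centroids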

9. Describe DBSCAN and its advantages for anomaly detection.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Groups points based on density; a short scikit-learn sketch follows the advantages list.

Key Terms:
  • ε (eps): neighborhood radius
  • MinPts: minimum number of points required within an ε-neighborhood
  • Core Point: has at least MinPts points within ε
  • Border Point: lies within ε of a core point but is not itself core
  • Noise Point: neither core nor border; treated as an outlier
Advantages:
  • Detects arbitrary shaped clusters
  • Automatically detects outliers
  • Works well with noisy data
  • No need to pre-specify number of clusters
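A sketch with scikit-learn, showing how low-density points are flagged as noise (label −1); the coordinates are arbitrary:

import numpy as np
from sklearn.cluster import DBSCAN
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.0],   # dense region -> cluster 0
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.0],   # dense region -> cluster 1
              [25.0, 30.0]])                        # isolated point -> noise
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks the detected outlier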
10. Explain deep learning architecture: input layer, hidden layers, activation functions.
Input Layer:

Receives raw data (pixels, words, numbers).

Hidden Layers:

Perform transformations:

  • Convolution
  • Recurrent loops
  • Feature extraction
  • Nonlinear modeling
Activation Functions:

Allow networks to learn nonlinear patterns.

Examples:

  • ReLU
  • Sigmoid
  • Softmax
  • Tanh

Deep networks stack multiple layers to learn high-level features.
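A bare-bones forward pass in NumPy tying the three ideas together (random weights, for illustration only):

import numpy as np
rng = np.random.default_rng(0)
x = rng.random(4)                              # input layer: 4 raw features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hidden layer parameters
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # output layer parameters
h = np.maximum(0, W1 @ x + b1)                 # hidden layer with ReLU activation
z = W2 @ h + b2                                # output logits
probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax activation
print(probs)                                   # three class probabilities summing to 1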

11. Compare CNN, RNN, LSTM, and Autoencoders with use cases.
Model       | Description               | Best Use Case
CNN         | Extracts spatial features | Images, vision
RNN         | Sequential modeling       | Text, speech
LSTM        | Long-term memory          | Time-series, translation
Autoencoder | Compress & reconstruct    | Anomaly detection, compression
12. Describe Hadoop architecture: HDFS, NameNode, DataNode, replication.

HDFS: Distributed file system storing massive datasets.

NameNode:
  • Master server
  • Stores file metadata
  • Tracks file locations
DataNode:
  • Stores actual blocks
  • Handles read/write operations
Replication:

Each block stored across multiple nodes (default replication factor = 3) for fault tolerance.

13. Explain Apache Spark architecture and MLlib workflow.

Spark Architecture:

  • Driver Program: Controls the application
  • Cluster Manager: Allocates resources
  • Executors: Run tasks on worker nodes
  • RDD/DataFrames: Distributed datasets
MLlib Workflow (a code sketch follows the steps):
  1. Load data into DataFrames
  2. Clean & preprocess
  3. Build pipeline
  4. Train model (e.g., Logistic Regression)
  5. Evaluate
  6. Save model
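A PySpark sketch of this workflow; the file path and the column names f1, f2, and label are placeholders, not part of any real dataset:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)   # 1. load into a DataFrame
df = df.dropna()                                                 # 2. clean & preprocess
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(df)                 # 3-4. build pipeline + train
model.transform(df).select("label", "prediction").show()         # 5. inspect predictions
model.write().overwrite().save("lr_model")                       # 6. save model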
14. Describe SLAM, mapping, localization, and navigation in robotics.

SLAM = Simultaneous Localization and Mapping.

Mapping:

Building a map of surrounding environment using LIDAR, camera, or depth sensors.

Localization:

Determining the robot’s position within the map.

Navigation:

Planning a safe path from A → B while avoiding obstacles.

SLAM enables robots to operate autonomously in unknown environments.

15. Explain Industry 4.0 and the role of robotics + data mining in smart factories.

Industry 4.0 = Fourth Industrial Revolution.

Key Technologies:
  • IoT sensors
  • Cloud computing
  • Robotics
  • AI and Data Mining
  • Automation
  • Predictive maintenance
Role of Robotics + Data Mining:
  • Automated production lines
  • Quality control through image mining
  • Predict machine failures
  • Optimize workflow
  • Real-time analytics
  • Collaborative robots (cobots)

Smart factories learn, adapt, and optimize automatically.
