Lecture 18 – Deep Learning MCQs, Short Questions & Long Questions

Solved Deep Learning MCQs, short questions, and long questions for BSCS, NCEAC. Covers CNN, RNN, LSTM, GANs, Transformers, Autoencoders, and Optimization algorithms. Perfect for university exams.

SECTION A – MCQs (WITH ANSWERS + EXPLANATIONS)

1. Deep Learning is mainly inspired by which biological system?

A. Digestive system
B. Nervous system
C. Respiratory system
D. Immune system
Answer: B
Explanation: Neural networks mimic biological neurons in the brain.

2. Which of the following represents a “deep” neural network?

A. 1 hidden layer
B. No hidden layer
C. Multiple hidden layers
D. Only output layer
Answer: C
Explanation: Deep networks have multiple stacked layers.

3. Which function introduces non-linearity into neural networks?

A. Softmax
B. Activation function
C. Pooling
D. Regularization
Answer: B
Explanation: Activation functions allow networks to learn complex non-linear relationships.

4. Which of the following is a commonly used activation function?

A. Linear
B. ReLU
C. Average
D. Median
Answer: B
Explanation: ReLU speeds up training and helps mitigate the vanishing gradient problem.

5. The process of adjusting weights during training is called:

A. Forward pass
B. Backpropagation
C. Initialization
D. Dropout
Answer: B
Explanation: Backpropagation calculates gradients and updates weights.

6. Gradient descent aims to:

A. Maximize loss
B. Minimize loss
C. Delete weights
D. Increase learning rate
Answer: B
Explanation: GD updates model parameters to minimize the cost function.
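
To make this concrete, here is a minimal illustrative sketch (not part of the lecture) that minimizes f(w) = (w − 3)² by repeatedly stepping against the gradient:

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The gradient is f'(w) = 2*(w - 3); stepping against it shrinks the loss.

def gradient_descent(w=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # update in the direction of the negative gradient
    return w

w = gradient_descent()
print(round(w, 4))  # converges to the minimizer w = 3
```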

7. Which of the following is an adaptive optimization algorithm?

A. Adam
B. SGD
C. Naive Bayes
D. Logistic Regression
Answer: A
Explanation: Adam combines momentum and adaptive learning rates.

8. Which regularization technique randomly disables neurons?

A. L1
B. BatchNorm
C. Dropout
D. Softmax
Answer: C
Explanation: Dropout reduces overfitting by preventing co-adaptation.

9. CNNs are mainly used for which type of data?

A. Text
B. Sequential
C. Image / spatial
D. Audio
Answer: C
Explanation: CNNs excel at spatial feature extraction.

10. Which layer reduces spatial dimensions in CNNs?

A. Convolution
B. Pooling
C. Dense
D. Normalization
Answer: B
Explanation: Pooling downsamples feature maps.

11. Which network solves vanishing gradient problem?

A. RNN
B. LSTM
C. Linear NN
D. Perceptron
Answer: B
Explanation: LSTM uses gates + cell state to preserve gradients.

12. What does GRU stand for?

A. Gradient Recurrent Unit
B. Gated Recurrent Unit
C. General Recurrent Unit
D. Global ReLU Unit
Answer: B
Explanation: GRU is a simplified gated RNN unit.

13. Seq2Seq models are mainly used for:

A. Object detection
B. Translation
C. Clustering
D. Regression
Answer: B
Explanation: Seq2Seq converts one sequence to another (e.g., English → French).

14. Attention mechanism solves which problem?

A. Bias
B. Bottleneck in Seq2Seq
C. Overfitting
D. Low memory
Answer: B
Explanation: Attention removes dependence on a single context vector.

15. Transformer architecture removes dependence on:

A. CNN
B. GANs
C. RNN
D. DBN
Answer: C
Explanation: Transformers use self-attention instead of recurrence.

16. Which component provides positional information to Transformers?

A. MLP
B. Q-K-V
C. Positional Encoding
D. Softmax
Answer: C
Explanation: It encodes sequence order.

17. Autoencoders are mainly used for:

A. Classification
B. Compression & reconstruction
C. Clustering
D. Regression
Answer: B
Explanation: AE learns latent representations and reconstructs input.

18. Which Autoencoder enforces neuron-level sparsity?

A. Denoising AE
B. Sparse AE
C. Convolutional AE
D. Variational AE
Answer: B

19. RBMs are trained using:

A. SGD
B. Contrastive Divergence
C. Reinforcement learning
D. Q-learning
Answer: B
Explanation: CD-k approximates RBM likelihood gradient.

20. GANs consist of:

A. Two discriminators
B. Two generators
C. Generator & Discriminator
D. Q and K networks
Answer: C
Explanation: GAN training is adversarial between G and D.

21. GANs were introduced by:

A. Hinton
B. Bengio
C. Goodfellow
D. Vaswani
Answer: C
Explanation: Ian Goodfellow proposed GANs in 2014.

22. Latent vector in GANs is usually sampled from:

A. Uniform
B. Gaussian
C. Bernoulli
D. Laplace
Answer: B
Explanation: The noise vector z is typically drawn from a standard normal distribution N(0, I).

23. Which is a common GAN training issue?

A. Low bias
B. Mode collapse
C. Over-shuffling
D. Under-normalization
Answer: B

24. DCGAN uses which type of layers in Generator?

A. Dense
B. ConvTranspose
C. Recurrent
D. LSTM
Answer: B
Explanation: ConvTranspose upsamples from noise to images.

25. StyleGAN is famous for generating:

A. Text
B. CIFAR-10 images
C. Photo-realistic faces
D. Handwriting
Answer: C

26. Batch Normalization helps in:

A. Reducing memory
B. Stabilizing training
C. Increasing dataset size
D. Removing pooling
Answer: B
Explanation: BN normalizes activations, making training faster and more stable.

27. Which technique reduces model overfitting?

A. Increasing epochs
B. Dropout
C. Increasing layers
D. Using bigger model
Answer: B
Explanation: Dropout prevents co-adaptation by randomly disabling neurons.

28. Convolution filters slide over input using:

A. Pooling
B. Stride
C. Padding
D. Bias
Answer: B
Explanation: Stride controls how many pixels the filter moves each step.
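
The standard output-size formula makes the roles of stride and padding concrete (illustrative sketch, not from the lecture):

```python
# Output size of a convolution: floor((n + 2p - f) / s) + 1,
# where n = input size, f = filter size, p = padding, s = stride.

def conv_out(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

print(conv_out(32, 3, p=1, s=1))  # 32: 'same' padding preserves the size
print(conv_out(32, 3, p=1, s=2))  # 16: stride 2 roughly halves it
```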

29. Which network mainly uses gates to store long-term memory?

A. CNN
B. LSTM
C. MLP
D. Autoencoder
Answer: B
Explanation: LSTMs use gates to retain memory across long sequences.

30. GRU differs from LSTM by having:

A. More gates
B. Fewer gates
C. No hidden state
D. No weights
Answer: B
Explanation: GRU has 2 gates vs LSTM’s 3–4, making it simpler.

31. Which of these is a Seq2Seq application?

A. Image classification
B. Machine translation
C. Sorting
D. Clustering
Answer: B
Explanation: Seq2Seq models convert input sequences to output sequences.

32. Attention weights sum to:

A. 0
B. 1
C. Depends
D. Infinity
Answer: B
Explanation: Softmax normalizes attention weights to sum to 1.

33. The Q, K, V vectors stand for:

A. Quick, Kernel, Value
B. Query, Key, Value
C. Quantize, Keep, Verify
D. Queue, Know, Verify
Answer: B
Explanation: Q–K–V are central to self-attention computation.

34. Transformer training is faster because it:

A. Uses RNNs
B. Removes recurrence
C. Removes attention
D. Uses only pooling
Answer: B
Explanation: Without recurrence, Transformers allow full parallelism.

35. Autoencoders use:

A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Answer: B
Explanation: Autoencoders reconstruct input without labels.

36. Sparse Autoencoders enforce sparsity using:

A. Dropout
B. L1 penalty
C. Softmax
D. ReLU
Answer: B
Explanation: L1 regularization encourages sparse activations.

37. RBMs are:

A. Deterministic
B. Probabilistic
C. Linear
D. Clustering models
Answer: B
Explanation: RBMs use probability distributions to model data.

38. GAN Discriminator outputs:

A. Regression score
B. Probability (real/fake)
C. Clusters
D. Tokens
Answer: B
Explanation: Discriminator performs binary classification.

39. GAN Generator input is usually:

A. Image
B. Text
C. Random noise
D. Audio
Answer: C
Explanation: Noise vector (z) is mapped into generated samples.

40. Which type of GAN requires labels?

A. DCGAN
B. GAN
C. Conditional GAN
D. WGAN
Answer: C
Explanation: cGAN conditions both G and D on labels.

41. What is mode collapse?

A. Discriminator failing
B. Generator producing limited variety
C. Dataset shrinking
D. GPU overflow
Answer: B
Explanation: Mode collapse reduces diversity of generated samples.

42. DCGAN uses which activation in generator output?

A. ReLU
B. Tanh
C. Sigmoid
D. Softmax
Answer: B
Explanation: Tanh outputs image pixels in range [−1, 1].

43. Which loss stabilizes GAN training?

A. Cross entropy
B. LSGAN
C. Wasserstein
D. Huber
Answer: C
Explanation: WGAN loss improves stability using Earth Mover’s distance.

44. Which GAN is used for unpaired image translation?

A. DCGAN
B. CycleGAN
C. WGAN
D. StyleGAN
Answer: B
Explanation: CycleGAN performs translation without paired datasets.

45. Which optimizer is best for GANs?

A. RMSProp
B. Adam
C. Adagrad
D. SGD
Answer: B
Explanation: Adam handles unstable gradients effectively.

46. Pooling reduces:

A. Channels
B. Spatial dimensions
C. Batch size
D. Output size
Answer: B
Explanation: Pooling downsamples height and width.

47. Which CNN layer increases image size?

A. Conv
B. Transposed Conv
C. Pooling
D. Flatten
Answer: B
Explanation: ConvTranspose layers upsample input.

48. Gradient clipping helps with:

A. Overfitting
B. Exploding gradients
C. Data cleaning
D. Hardware acceleration
Answer: B
Explanation: Clipping prevents gradients from becoming excessively large.

49. Which normalization technique is used in StyleGAN?

A. BatchNorm
B. LayerNorm
C. InstanceNorm
D. AdaIN
Answer: D
Explanation: Adaptive Instance Normalization controls styles in StyleGAN.

50. Transformers outperform RNNs due to:

A. More gates
B. Parallelism
C. Higher dropout
D. More pooling
Answer: B
Explanation: Parallel training makes Transformers faster and more scalable.

51. Softmax is used to convert logits into:

A. Scores
B. Probabilities
C. Features
D. Noise
Answer: B
Explanation: Softmax normalizes logits into probability distribution.
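
A minimal softmax sketch (illustrative, not from the lecture) shows the normalization directly:

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # the largest logit gets the largest probability; values sum to 1
```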

52. MLP stands for:

A. Multi-Level Perceptron
B. Multi-Layer Perceptron
C. Multi-Loop Processing
D. Multi-Layer Pool
Answer: B
Explanation: MLP is a fully connected feedforward neural network.

53. Training/Validation split helps prevent:

A. Underfitting
B. Overfitting
C. ReLU
D. Noise
Answer: B
Explanation: Validation detects overfitting early by measuring generalization.

54. Dropout probability generally falls in range:

A. 0.1–0.5
B. 0.8–1.0
C. 1.0–1.5
D. 0.01
Answer: A
Explanation: Typical dropout values are 10–50%.
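
A sketch of "inverted" dropout, the common formulation (illustrative, not from the lecture): survivors are rescaled so the expected activation is unchanged at test time.

```python
import random

def dropout(activations, p=0.5, training=True):
    # Inverted dropout: zero each unit with probability p and scale
    # survivors by 1/(1-p), so no rescaling is needed at inference.
    if not training:
        return list(activations)
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
print(out)  # roughly half the units are zeroed; the rest are doubled
```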

55. Residual blocks are key part of:

A. VGG
B. ResNet
C. Inception v1
D. DBN
Answer: B
Explanation: Skip connections in ResNet solve vanishing gradients.

56. Which model uses a multi-path (parallel-branch) architecture?

A. LeNet
B. ResNet
C. GoogLeNet
D. AlexNet
Answer: C
Explanation: GoogLeNet uses Inception blocks with multi-path flow.

57. LSTM has how many gates?

A. 1
B. 2
C. 3
D. 4
Answer: D
Explanation: Counting the candidate (cell-update) gate along with the input, forget, and output gates gives four; many texts count only the three sigmoid gates.

58. GAN’s discriminator uses which activation at last layer?

A. ReLU
B. Leaky ReLU
C. Sigmoid
D. Tanh
Answer: C
Explanation: Sigmoid outputs probability of real/fake.

59. GAN objective is a:

A. Regression
B. Minimax game
C. Reinforcement
D. Clustering
Answer: B
Explanation: G and D play a zero-sum minimax game.

60. Which is a generative model?

A. CNN
B. RNN
C. GAN
D. SVM
Answer: C
Explanation: GANs generate new samples similar to training data.

61. Which algorithm trains RBM?

A. Q-Learning
B. Contrastive Divergence
C. Policy Gradient
D. RMSProp
Answer: B
Explanation: CD-k approximates gradient for RBM training.

62. Autoencoders minimize:

A. Classification error
B. Reconstruction error
C. BLEU score
D. F1-score
Answer: B
Explanation: AE learns to reconstruct input as closely as possible.

63. Transformer’s attention is:

A. Local
B. Temporal
C. Global
D. Random
Answer: C
Explanation: Each token attends to all other tokens.

64. Which one is NOT an unsupervised model?

A. Autoencoder
B. RBM
C. DBN
D. LSTM
Answer: D
Explanation: LSTMs are typically trained with supervised objectives, unlike autoencoders, RBMs, and DBNs.

65. Which improves GAN stability?

A. BatchNorm
B. Dropout
C. Higher LR
D. Spectral Norm
Answer: D
Explanation: Spectral normalization stabilizes discriminator training.

66. GAN’s generator uses which activation mostly internally?

A. Tanh
B. Leaky ReLU
C. Softmax
D. Sigmoid
Answer: B
Explanation: Leaky ReLU avoids dead neurons and improves gradient flow.

67. Which architecture uses multi-head attention?

A. CNN
B. RNN
C. Transformer
D. DBN
Answer: C
Explanation: Multi-head attention is the heart of Transformer models.

68. Which GAN maps noise → image?

A. Discriminator
B. Generator
C. Transformer
D. RNN
Answer: B
Explanation: Generator transforms latent noise vector into synthetic images.

69. CycleGAN requires:

A. Labeled pairs
B. Unpaired data
C. Only noise
D. Only labels
Answer: B
Explanation: CycleGAN performs unpaired image-to-image translation.

70. Which term describes the network learning noise patterns?

A. Generalization
B. Overfitting
C. Underfitting
D. Max pooling
Answer: B
Explanation: Overfitting means model memorizes training noise.

71. Which convolution reduces resolution?

A. 1×1 conv
B. 3×3 stride 2
C. ConvTranspose
D. Depthwise conv
Answer: B
Explanation: Stride 2 halves spatial dimensions.

72. Which GAN is known for interpolation in latent space?

A. DCGAN
B. WGAN
C. StyleGAN
D. Conditional GAN
Answer: C
Explanation: StyleGAN produces smooth, meaningful latent interpolations.

73. “Attention is All You Need” introduced:

A. LSTM
B. GAN
C. Transformer
D. Autoencoder
Answer: C
Explanation: Vaswani et al. proposed Transformers in 2017.

74. Which layer helps avoid covariate shift?

A. BatchNorm
B. Pooling
C. Conv
D. Flatten
Answer: A
Explanation: BN normalizes layer inputs, reducing internal covariate shift.

75. GAN’s training alternates between:

A. G → G
B. D → D
C. G → D
D. G ↔ D
Answer: D
Explanation: Generator and Discriminator are trained alternately.

76. In CNNs, padding is used to:

A. Increase channels
B. Preserve spatial dimensions
C. Reduce depth
D. Convert images to grayscale
Answer: B
Explanation: With “same” padding, the output size equals the input size after convolution.

77. Exploding gradients mainly occur in:

A. CNN
B. RNN
C. Autoencoders
D. GANs
Answer: B
Explanation: Deep RNNs often accumulate very large gradients during BPTT.

78. Which solves exploding gradients?

A. Data augmentation
B. Gradient clipping
C. Pooling
D. BatchNorm
Answer: B
Explanation: Gradients are clipped to prevent instability.
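
Clipping by global norm, the most common variant, can be sketched as follows (illustrative, not from the lecture):

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    # If the global L2 norm exceeds max_norm, rescale all gradients uniformly.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5.0
print(clipped)  # rescaled uniformly so the norm becomes 1.0
```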

79. Word embeddings convert words into:

A. One-hot
B. Dense vectors
C. Images
D. Pixel grids
Answer: B
Explanation: Embeddings capture semantic meaning in dense vector form.

80. Which model uses attention inside RNNs?

A. LSTM only
B. GRU only
C. Seq2Seq with Attention
D. DBN
Answer: C
Explanation: Attention improves Seq2Seq performance for long sequences.

81. Which of the following normalizes each feature of batch independently?

A. LayerNorm
B. InstanceNorm
C. BatchNorm
D. GroupNorm
Answer: C
Explanation: BatchNorm computes statistics over batches.

82. Which term refers to pixel-wise reconstruction loss?

A. Cross entropy
B. IoU
C. MSE
D. BLEU
Answer: C
Explanation: AE training often minimizes Mean Squared Error.

83. WGAN uses which concept to measure distance?

A. Hinge loss
B. KL divergence
C. Earth Mover’s Distance
D. JS divergence
Answer: C
Explanation: EMD stabilizes training by providing smooth gradients.

84. Training a GAN is considered:

A. Stable
B. Easy
C. Hard
D. Unsupervised but stable
Answer: C
Explanation: GANs suffer instability, mode collapse, and gradient issues.

85. Which layer controls image resolution in GAN generators?

A. Pooling
B. ConvTranspose
C. Dense
D. Softmax
Answer: B
Explanation: Transposed convolutions upsample images.

86. In Transformers, attention complexity is:

A. Linear
B. Quadratic
C. Constant
D. Cubic
Answer: B
Explanation: Self-attention scales as O(n²).

87. CNN filters learn:

A. Labels
B. Local patterns
C. Sentences
D. Audio
Answer: B
Explanation: Convolutions extract features like edges and shapes.

88. Which is used to prevent overfitting?

A. High learning rate
B. Large batch size
C. Data augmentation
D. No regularization
Answer: C
Explanation: Augmentation increases dataset diversity.

89. L2 regularization penalizes:

A. Large weights
B. Noise
C. Labels
D. Gradients
Answer: A
Explanation: L2 shrinks large weights to improve generalization.

90. Which is a universal approximation model?

A. Perceptron
B. 2-layer MLP
C. Linear model
D. GAN
Answer: B
Explanation: A feedforward network with one hidden layer and enough units can approximate any continuous function on a compact domain (universal approximation theorem).

91. Embedding dimension determines:

A. Sequence length
B. Vocabulary size
C. Feature richness
D. Output classes
Answer: C
Explanation: Bigger embedding size = more expressive word features.

92. Which term describes the training signal in GANs?

A. Reward
B. Gradient
C. Real/fake probability
D. Reconstruction
Answer: C
Explanation: D outputs real/fake which guides G.

93. Skip connections help by:

A. Reducing filters
B. Passing gradients directly
C. Removing layers
D. Increasing dropout
Answer: B
Explanation: Residual paths improve gradient flow and stability.

94. Which optimizer adapts learning rate per weight?

A. SGD
B. Momentum
C. Adam
D. None
Answer: C
Explanation: Adam combines momentum + adaptive LR.

95. Which CNN layer is used last in classification?

A. Conv
B. Pool
C. Fully Connected
D. MaxPool
Answer: C
Explanation: FC layers map features to class outputs.

96. GANs learn through:

A. Direct supervision
B. Adversarial feedback
C. Predefined labels
D. Clustering
Answer: B
Explanation: Generator learns to fool the discriminator.

97. In NLP, BiLSTM reads sequence:

A. Forward only
B. Backward only
C. Both directions
D. Randomly
Answer: C
Explanation: BiLSTM captures context from both directions.

98. The purpose of Flatten layer:

A. Increase dimensions
B. Convert 2D → 1D
C. Reduce learning rate
D. Add regularization
Answer: B


99. Which improves GAN diversity?

A. Higher dropout
B. Mini-batch discrimination
C. Lower LR
D. Remove noise
Answer: B
Explanation: It ensures generator produces diverse samples.

100. The “Generator loss” improves when:

A. D rejects fake images
B. D accepts fake images
C. G produces noise
D. Training stops
Answer: B
Explanation: G succeeds when fake images fool the discriminator.

SECTION B – 30 Short Questions (With Answers)

Q1. What is Deep Learning?

Answer:
Deep learning is a subset of machine learning that uses multi-layer neural networks to automatically learn hierarchical features from data.

Q2. Define a perceptron.

Answer:
A perceptron is the simplest neural unit that computes a weighted sum of inputs and applies an activation function.

Q3. What is an activation function?

Answer:
An activation function introduces non-linearity, allowing networks to learn complex patterns.

Q4. What is backpropagation?

Answer:
Backpropagation is an algorithm that computes gradients of loss with respect to weights and updates them using gradient descent.

Q5. What causes vanishing gradients?

Answer:
Repeated multiplication of small derivatives (e.g., from sigmoid/tanh activations) during backpropagation through many layers makes gradients extremely small, preventing effective learning in early layers.

Q6. What is the purpose of Dropout?

Answer:
Dropout randomly deactivates neurons during training to reduce overfitting.

Q7. What is a convolution in CNNs?

Answer:
It is an operation where filters slide over the input to extract spatial features.

Q8. Define pooling in CNNs.

Answer:
Pooling reduces spatial dimensions to compress features and increase invariance.

Q9. What is the main benefit of using CNNs?

Answer:
CNNs capture spatial and local patterns effectively in image data.

Q10. What is an RNN?

Answer:
Recurrent Neural Networks process sequential data by maintaining hidden states across time steps.

Q11. Why were LSTMs introduced?

Answer:
To solve vanishing and exploding gradient problems in standard RNNs.

Q12. What is a GRU?

Answer:
A GRU is a simplified LSTM with fewer gates, reducing complexity while preserving performance.

Q13. What is a Seq2Seq model used for?

Answer:
Sequence-to-Sequence models convert an input sequence into an output sequence, e.g., translation and summarization.

Q14. What is the bottleneck in Seq2Seq models?

Answer:
Compression of the entire input sequence into a single fixed-size vector.

Q15. Define Attention Mechanism.

Answer:
Attention assigns weights to input tokens based on relevance, allowing models to focus on important information.

Q16. What is Self-Attention?

Answer:
A mechanism where each token attends to all other tokens in the same sequence.

Q17. What is Multi-Head Attention?

Answer:
Parallel attention layers that allow the model to learn information from different representation subspaces.

Q18. What is Positional Encoding?

Answer:
A technique to inject sequence order information into Transformer models.

Q19. What is an Autoencoder?

Answer:
A neural network that learns compressed representations and reconstructs input from them.

Q20. What is a Sparse Autoencoder?

Answer:
An autoencoder that uses regularization to ensure only a few neurons activate.

Q21. Define RBM (Restricted Boltzmann Machine).

Answer:
A probabilistic two-layer model using visible and hidden units trained with Contrastive Divergence.

Q22. What is a Deep Belief Network (DBN)?

Answer:
A stack of RBMs trained layer-wise for unsupervised feature learning.

Q23. What is a GAN?

Answer:
A Generative Adversarial Network consists of a generator and discriminator competing in a minimax game to generate realistic data.

Q24. What is the role of the Generator in GANs?

Answer:
To generate realistic fake samples from random noise.

Q25. What is the role of the Discriminator in GANs?

Answer:
To classify inputs as real or fake.

Q26. What is Mode Collapse?

Answer:
When a GAN produces limited variety of outputs despite diverse training data.

Q27. What is DCGAN?

Answer:
A GAN variant using convolution and transposed convolution layers for stable image generation.

Q28. What is StyleGAN?

Answer:
An advanced GAN architecture enabling high-resolution face generation with style-based controls.

Q29. What is Wasserstein Loss?

Answer:
A stable GAN loss based on Earth Mover’s Distance that improves training stability.

Q30. Why are GANs difficult to train?

Answer:
Because training requires balancing two competing networks; instability arises if one becomes too strong.

SECTION C – 15 Long Questions (WITH DETAILED ANSWERS)

Q1. Explain the complete architecture and working of a Convolutional Neural Network (CNN). Discuss all major layers with diagrams and use-cases.

Answer:
A Convolutional Neural Network (CNN) is a deep learning architecture designed to process data with grid-like topology such as images. Unlike fully connected networks, CNNs exploit spatial locality through convolutional operations, significantly reducing parameters and improving learning efficiency.

A CNN architecture consists of:

1. Convolutional Layer

Applies multiple learnable filters that slide across the input, extracting local features such as edges, corners, textures, and patterns.
Mathematically:
feature = filter ⊗ input + bias.

2. Activation Layer (ReLU)

Introduces non-linearity by applying f(x) = max(0, x).
Helps CNN learn complex transformations.

3. Pooling Layer

Downsamples spatial dimensions using max or average pooling.
Benefits:

  • Reduces computation
  • Provides translational invariance
  • Helps avoid overfitting

4. Batch Normalization

Normalizes intermediate outputs, stabilizing and speeding up training.

5. Fully Connected Layer

After flattening, FC layers combine extracted features to perform final classification.

6. Softmax Layer

Outputs probability distribution over classes.

Use-Cases:

  • Image classification
  • Object detection (YOLO, Faster R-CNN)
  • Medical imaging
  • Face recognition
  • Satellite imaging

CNNs are the backbone of modern computer vision due to their ability to learn hierarchical feature representations.
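
The convolution and pooling operations above can be sketched in plain NumPy (a toy illustration with assumed shapes, not a library implementation):

```python
import numpy as np

def conv2d(img, kernel):
    # Valid convolution (really cross-correlation, as in most DL libraries).
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2(x):
    # 2x2 max pooling with stride 2: halves height and width.
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, -1.0]])          # simple horizontal edge detector
fmap = conv2d(img, edge)                # shape (6, 5)
pooled = max_pool2(fmap)                # shape (3, 2)
print(fmap.shape, pooled.shape)
```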

Q2. Describe the vanishing gradient problem and explain how LSTMs solve this issue.

Answer:
The vanishing gradient problem occurs when gradients shrink exponentially during backpropagation through time (BPTT).
In deep or recurrent networks, small gradients fail to update weights, preventing learning of long-term dependencies.

Why it happens:

  • Sigmoid/tanh activations push gradients towards zero
  • Repeated multiplication across many time steps reduces gradient magnitude

How LSTMs solve it:

LSTM networks introduce gating mechanisms:

1. Forget Gate:

Determines which information to discard.

2. Input Gate:

Controls how much new information to store.

3. Output Gate:

Decides what to send to the next state.

4. Cell State (Cₜ): The Key

The cell state acts as a memory conveyor belt, allowing gradients to pass backward with minimal decay.
The additive update rule prevents multiplicative shrinking, maintaining stable gradients.

Result:

  • Long-term dependencies are learned
  • Training remains stable
  • Vanishing and exploding gradients are greatly mitigated
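
The gating described above can be written compactly in standard notation (σ is the sigmoid, ⊙ the elementwise product, [h, x] concatenation):

```latex
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{additive cell update} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

The additive form of the cell update C_t is exactly what lets gradients flow backward with little decay.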

Q3. Explain Seq2Seq Model Architecture and how Attention improves its performance.

Answer:
A Seq2Seq model consists of:

1. Encoder:

Processes the input sequence and compresses it into a context vector.

2. Decoder:

Uses the context vector to generate output tokens one by one.

Limitation (Bottleneck):

All sequence information is compressed into a single vector → harmful for long sentences.

Attention Mechanism:

Attention allows the decoder to “look back” at all encoder states instead of relying on a single vector.
The attention module computes relevance scores between decoder hidden state and every encoder output.

Mathematically:

score = softmax(QKᵀ)
context = Σ(score × value)

Benefits:

  • Removes the fixed-vector bottleneck
  • Handles long sequences
  • Improves translation accuracy
  • Is the key building block of Transformers, where self-attention additionally enables parallel training
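
The score/context equations above can be implemented in a few lines of NumPy (an illustrative sketch with random toy matrices, not a full attention layer):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))     # 3 queries of dimension 4
K = rng.normal(size=(5, 4))     # 5 keys
V = rng.normal(size=(5, 4))     # 5 values
context, w = attention(Q, K, V)
print(context.shape)            # (3, 4): one context vector per query
print(w.sum(axis=-1))           # attention weights sum to 1 per query
```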

Q4. Explain the Transformer architecture in detail. Discuss Self-Attention, Multi-Head Attention, and Positional Encoding.

Answer:
Transformers replace recurrence with self-attention, enabling full parallelism and exceptional performance on NLP tasks.

1. Self-Attention:

Each token interacts with every other token through Q, K, and V vectors.
Equation:
attention = softmax(QKᵀ / √dₖ) V.

2. Multi-Head Attention:

Uses multiple parallel self-attention layers (heads) to learn different relationships simultaneously.
Benefits:
  • Captures multiple semantic patterns
  • Richer representation learning

3. Positional Encoding:

Since Transformers do not use recurrence, positional encoding adds sequence order using:
sin and cos functions of varying frequencies.
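
The sinusoidal scheme can be sketched directly from the original formulation (PE[pos, 2i] = sin(pos/10000^(2i/d)), PE[pos, 2i+1] = cos(...)); this is an illustrative implementation assuming an even model dimension:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # one d_model-dimensional encoding per position
```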

4. Feed Forward Network:

Applied to each token independently.

5. Residual Connections + Layer Normalization:

Prevent vanishing gradients and stabilize training.

Outcome:

Transformers provide state-of-the-art accuracy in translation, summarization, LLMs, and more.

Q5. Explain Autoencoders, their architecture, loss functions, and main applications.

Answer:
Autoencoders are unsupervised neural networks designed to reconstruct their input.

Architecture:

Encoder: Compresses input into low-dimensional latent space.
Decoder: Reconstructs input from latent vector.
Loss = ||x − x̂||² (MSE) or Binary Cross-Entropy.

Variants:

  • Denoising Autoencoder removes noise
  • Sparse Autoencoder encourages sparse activations
  • Convolutional Autoencoder image-specific reconstruction
  • Variational Autoencoder probabilistic variant

Applications:

  • Image denoising
  • Dimensionality reduction
  • Anomaly detection
  • Feature learning
  • Pretraining deep models
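
A tiny linear autoencoder trained on toy data illustrates the encoder/decoder/MSE loop (an illustrative sketch with assumed sizes, omitting biases and non-linearities for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # toy data: 200 samples, 8 features

d_in, d_latent, lr = 8, 3, 0.1
W_enc = rng.normal(scale=0.1, size=(d_in, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_in))

def mse(A, B):
    return float(np.mean((A - B) ** 2))

losses = []
for _ in range(200):
    Z = X @ W_enc                      # encoder: compress to latent space
    X_hat = Z @ W_dec                  # decoder: reconstruct the input
    err = X_hat - X
    losses.append(mse(X, X_hat))
    # Gradient directions of the reconstruction error w.r.t. each matrix
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(losses[0], "->", losses[-1])     # reconstruction error decreases
```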

Q6. Describe Restricted Boltzmann Machines (RBMs) and their training using Contrastive Divergence.

Answer:
RBM is a bipartite probabilistic model with visible and hidden units.

Energy Function:

E(v, h) = −aᵀv − bᵀh − vᵀWh.

Training using CD-k:

  1. Sample hidden units given visible units
  2. Reconstruct visible units
  3. Sample hidden units again
  4. Update weights based on difference between positive and negative phases

CD-k approximates the gradient of log-likelihood efficiently.

Uses:

  • Pretraining DBNs
  • Feature extraction
  • Collaborative filtering
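
The four CD-k steps can be sketched for a tiny binary RBM (illustrative only; biases are omitted and the data is random, so this shows the update mechanics rather than meaningful learning):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny RBM: 6 visible units, 3 hidden units, binary toy data.
n_vis, n_hid, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
V = (rng.random((20, n_vis)) < 0.5).astype(float)

for _ in range(50):
    # Positive phase: sample hidden units given the data.
    ph = sigmoid(V @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase (CD-1): reconstruct visibles, recompute hiddens.
    pv = sigmoid(h @ W.T)
    ph2 = sigmoid(pv @ W)
    # Update: difference between positive and negative associations.
    W += lr * (V.T @ ph - pv.T @ ph2) / len(V)

print(W.shape)   # weights updated by contrastive divergence
```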

Q7. What are Deep Belief Networks (DBNs) and how are they trained?

Answer:
DBNs are deep generative models composed of stacked RBMs.

Training Process:

1. Layer-wise Pretraining (Unsupervised):
Each RBM is trained independently using CD.

2. Fine-tuning (Supervised):
Whole network trained using backpropagation.

Advantages:

  • Efficient weight initialization
  • Captures hierarchical features
  • Useful for unsupervised tasks

Q8. Describe GAN architecture and the adversarial training process.

Answer:
GANs consist of:

1. Generator (G):

Takes random noise z and produces realistic samples.

2. Discriminator (D):

Classifies inputs as real or fake.

Training Process:

  • D is trained to maximize probability of correct classification
  • G is trained to fool the discriminator
  • Objective:
    min_G max_D  E_x[log D(x)] + E_z[log(1 − D(G(z)))]

Outcome:

The competition leads to improved generation quality.
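
The alternating training can be demonstrated end to end on a 1-D toy problem (an illustrative sketch with hand-derived gradients, not a practical GAN): the real data is N(4, 0.5), the generator is linear in the noise, and the discriminator is a logistic classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# Toy 1-D GAN: real data ~ N(4, 0.5); generator g(z) = a*z + b;
# discriminator D(x) = sigmoid(w*x + c). G and D are trained alternately.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for _ in range(2000):
    z = rng.normal(size=batch)
    real = rng.normal(4.0, 0.5, size=batch)
    fake = a * z + b

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step (non-saturating loss): ascend log D(fake).
    d_fake = sigmoid(w * fake + c)
    g = (1 - d_fake) * w              # d log D(fake) / d fake
    a += lr * np.mean(g * z)
    b += lr * np.mean(g)

print(round(b, 2))  # the generator's offset drifts toward the real mean (≈ 4)
```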

Q9. What is Mode Collapse? Explain reasons and solutions.

Answer:
Mode collapse occurs when G learns a small set of outputs that consistently fool D, producing low diversity.

Causes:

  • Poor training balance
  • Strong discriminator
  • Non-stable loss functions

Solutions:

  • Mini-batch discrimination
  • Feature matching
  • WGAN and WGAN-GP
  • Spectral normalization
  • Unrolled GANs

Q10. Explain DCGAN architecture in detail.

Answer:
DCGAN introduced architectural improvements for stable GAN training.

Generator:

  • Starts from random noise
  • Dense → reshape → ConvTranspose layers
  • BatchNorm + ReLU
  • Tanh output

Discriminator:

  • Conv layers + LeakyReLU
  • BatchNorm
  • Sigmoid output

Key Innovations:

  • Use of ConvTranspose for upsampling
  • Removal of pooling and fully connected hidden layers
  • Stable training at 64×64 resolution

Q11. Explain StyleGAN and the concept of style-based generation.

Answer:
StyleGAN introduced a mapping network + adaptive instance normalization (AdaIN).

Features:

  • Latent vector first passed through MLP mapping network
  • Style vectors control different layers
  • High-resolution synthesis (1024×1024)
  • Fine-grained control over hair, smile, lighting, etc.

Advantages:

  • Superior realism
  • Smooth latent interpolation
  • Disentangled control over attributes


Q12. Discuss overfitting in deep learning and methods to prevent it.

Answer:
Overfitting occurs when a model memorizes training data, performing poorly on unseen data.

Causes:

  • Too many parameters
  • Small dataset
  • Noise in training

Prevention:

  • Dropout
  • L1/L2 regularization
  • Early stopping
  • Data augmentation
  • Batch normalization
  • Transfer learning

Q13. Explain Optimization Algorithms: SGD, Momentum, RMSProp, and Adam.

Answer:

SGD:

Updates weights in direction of negative gradient.

Momentum:

Adds moving average of gradients → faster convergence.

RMSProp:

Normalizes learning rate using moving average of squared gradients.

Adam:

Combines Momentum + RMSProp → most widely used.
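
The four update rules can be compared side by side on the same 1-D quadratic f(w) = w² (an illustrative sketch with simplified scalar versions of each rule):

```python
import numpy as np

grad = lambda w: 2 * w          # gradient of f(w) = w^2, minimum at w = 0

def run(update, steps=100, w0=5.0):
    w, state = w0, {}
    for t in range(1, steps + 1):
        w = update(w, grad(w), state, t)
    return w

def sgd(w, g, s, t, lr=0.1):
    return w - lr * g

def momentum(w, g, s, t, lr=0.1, beta=0.9):
    s['v'] = beta * s.get('v', 0.0) + g          # moving average of gradients
    return w - lr * s['v']

def rmsprop(w, g, s, t, lr=0.1, beta=0.9, eps=1e-8):
    s['s'] = beta * s.get('s', 0.0) + (1 - beta) * g * g
    return w - lr * g / (np.sqrt(s['s']) + eps)  # per-weight adaptive step

def adam(w, g, s, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    s['m'] = b1 * s.get('m', 0.0) + (1 - b1) * g
    s['v'] = b2 * s.get('v', 0.0) + (1 - b2) * g * g
    m_hat = s['m'] / (1 - b1 ** t)               # bias correction
    v_hat = s['v'] / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

for name, upd in [('SGD', sgd), ('Momentum', momentum),
                  ('RMSProp', rmsprop), ('Adam', adam)]:
    print(name, abs(run(upd)))   # all four approach the minimum at w = 0
```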

Q14. Explain the role of Data Augmentation. Provide examples.

Answer:
Data augmentation artificially increases dataset diversity by applying transformations, improving generalization.

Examples:

  • Rotation
  • Flipping
  • Cropping
  • Color jittering
  • Gaussian noise

Benefits:

  • Reduces overfitting
  • Improves robustness
  • Helps small datasets
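
The listed transforms are one-liners on an array image (illustrative sketch on a random toy image; real pipelines use libraries such as torchvision or albumentations):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((4, 4))                          # toy grayscale "image"

flipped = np.fliplr(img)                          # horizontal flip
rotated = np.rot90(img)                           # 90-degree rotation
cropped = img[1:4, 1:4]                           # crop a 3x3 patch
noisy = img + rng.normal(0, 0.05, img.shape)      # additive Gaussian noise

# Each label-preserving transform yields an extra training example.
print(img.shape, flipped.shape, rotated.shape, cropped.shape, noisy.shape)
```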

Q15. Compare CNN, RNN, Autoencoders, and GANs in terms of applications and strengths.

Answer:

Model | Strengths | Applications
CNN | Spatial feature extraction | Images, detection, segmentation
RNN/LSTM/GRU | Temporal sequence learning | Translation, speech, time-series
Autoencoders | Compression + reconstruction | Denoising, anomaly detection
GANs | Realistic sample generation | Deepfakes, art, augmentation

Each architecture addresses different data modalities and problem types, making them essential in a complete deep learning curriculum.
