Introduction
Attention mechanisms dramatically changed the trajectory of deep learning. Before attention, models like RNNs, GRUs, and LSTMs struggled with long-term dependencies, losing information from early parts of a sequence. Attention solved this by giving models the ability to focus selectively on different parts of the input sequence, allowing neural networks to process long sentences, understand context better, and produce far more accurate outputs.
This lecture explains attention in depth: how it works, why it was invented, how it is used inside Seq2Seq models, and why it is the foundation of today’s most powerful architectures, including Transformers, BERT, GPT models, and modern generative AI.
Why Attention Was Needed
Before attention, Seq2Seq models suffered from the fixed context vector limitation:
- Long sentences compressed into one vector
- Model forgot early input tokens
- Translations became inaccurate
- Summaries lacked clarity
- RNNs struggled with long-range dependencies
Attention removed this bottleneck by giving the model freedom to look at all encoder hidden states instead of depending on just one.
What Is the Attention Mechanism?
The attention mechanism allows the decoder to:
- Look at every hidden state of the encoder
- Assign relevance weights to each token
- Focus on the most relevant parts when generating each output token
In short, attention means:
“Don’t memorize everything; just look where needed.”
This results in:
- Better translation
- Smoother summarization
- More accurate question answering
- Stronger alignment between input and output tokens
How Attention Works (Step-by-Step)
Let’s break it down into simple but technical steps.
Encoder Produces Hidden States
For each token in the input sequence:
h_1, h_2, h_3, …, h_n
These represent the meaning of each word in context.
Decoder Hidden State (Query)
At each output step t, the decoder produces a state:
s_t
This state acts as a query.
Calculate Alignment (Score Function)
The model computes a score between the decoder state s_t and each encoder hidden state h_i:
score(s_t, h_i)
This score indicates:
“How important is this encoder token for generating the next output?”
Softmax Normalization
All scores are converted into weights:
α_i = softmax(score(s_t, h_i))
Weights sum to 1.
These are attention weights.
Weighted Sum → Context Vector
A dynamic context vector is generated:
c_t = Σ (α_i * h_i)
This context vector changes for every output token.
Decoder Generates the Next Output
The decoder uses:
- The context vector c_t
- Previous predictions
- Its own hidden state
to generate the next token.
This process continues until the output sequence is complete.
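Putting these steps together, here is a minimal NumPy sketch of a single decoder step. The dot-product score, the shapes, and the random values are illustrative assumptions, not the exact recipe of any particular model:

```python
import numpy as np

np.random.seed(0)
n, d = 5, 8                     # number of input tokens, hidden size (illustrative)

# Encoder hidden states h_1 ... h_n and the decoder query s_t
H = np.random.randn(n, d)       # shape (n, d)
s_t = np.random.randn(d)        # shape (d,)

# 1. Score each encoder state against the decoder state (dot-product score)
scores = H @ s_t                # shape (n,)

# 2. Softmax normalization -> attention weights that sum to 1
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()     # shape (n,)

# 3. Weighted sum of encoder states -> dynamic context vector c_t
c_t = alpha @ H                 # shape (d,)

print("attention weights:", np.round(alpha, 3))
print("context vector shape:", c_t.shape)
```

In a full Seq2Seq model, c_t would be combined with the decoder state and fed through the output layer to predict the next token.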
Types of Attention Mechanisms
Attention comes in multiple forms. Below are the core types every DL student must understand.
Bahdanau Attention (Additive Attention)
Introduced in 2014, this method uses a small neural network to compute attention scores.
Formula:
score = vᵀ tanh(W₁ h_i + W₂ s_t)
Characteristics:
- Works well with RNNs, GRUs, and LSTMs
- More flexible
- Slightly heavier computationally
- Great for NLP
This was the first widely used attention method.
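As a rough sketch of the additive score above, the snippet below evaluates vᵀ tanh(W₁ h_i + W₂ s_t) for every encoder state in NumPy; the dimensions and the randomly initialized W1, W2, and v are placeholder assumptions standing in for learned parameters:

```python
import numpy as np

def additive_score(s_t, H, W1, W2, v):
    """Bahdanau-style score v^T tanh(W1 h_i + W2 s_t), computed for all encoder states at once."""
    # H: (n, d_enc), s_t: (d_dec,), W1: (d_att, d_enc), W2: (d_att, d_dec), v: (d_att,)
    return np.tanh(H @ W1.T + s_t @ W2.T) @ v   # shape (n,), one score per input token

rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 16            # illustrative sizes
H, s_t = rng.normal(size=(n, d_enc)), rng.normal(size=d_dec)
W1 = rng.normal(size=(d_att, d_enc))
W2 = rng.normal(size=(d_att, d_dec))
v = rng.normal(size=d_att)

print(additive_score(s_t, H, W1, W2, v))
```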
Luong Attention (Multiplicative Attention)
Luong attention (2015) is faster and simpler.
Scoring Methods:
- Dot: score = s_t · h_i
- General: score = s_tᵀ W h_i
Characteristics:
- Computationally efficient
- Faster at inference
- Popular in production models
- Works well in translation tasks
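A small sketch of the two Luong scoring variants in NumPy; the matrix W and all shapes are illustrative stand-ins for learned parameters:

```python
import numpy as np

def dot_score(s_t, H):
    """Luong 'dot' score s_t · h_i; encoder and decoder dimensions must match."""
    return H @ s_t                  # shape (n,)

def general_score(s_t, H, W):
    """Luong 'general' score s_t^T W h_i; the learned matrix W bridges the two dimensions."""
    return H @ (W.T @ s_t)          # shape (n,)

rng = np.random.default_rng(0)
n, d = 5, 8
H, s_t = rng.normal(size=(n, d)), rng.normal(size=d)
W = rng.normal(size=(d, d))         # illustrative random weights

print(dot_score(s_t, H))
print(general_score(s_t, H, W))
```

Because both variants are plain matrix products, they avoid the extra tanh layer of additive attention, which is why Luong attention is faster in practice.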
Self-Attention (Modern Attention)
Self-attention allows each token to attend to other tokens in the same sequence.
This is the mechanism used in:
- Transformers
- BERT
- GPT
- T5
- Vision Transformers (ViT)
Why is it powerful?
- Parallelizable
- Tracks long-range dependencies
- Learns context globally
- Eliminates RNN limitations
- Scales extremely well
Seq2Seq attention paved the way for self-attention architectures.
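To make this concrete, below is a minimal single-head self-attention sketch using the scaled dot-product score; the projection matrices Wq, Wk, Wv and the toy embeddings are assumptions for illustration only:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token in X attends to every token in X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # project the same sequence three ways
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot-product scores, shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax: each row sums to 1
    return weights @ V                                # context-aware representation of each token

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                             # toy sizes
X = rng.normal(size=(n, d_model))                     # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # shape (4, 8)
```

Transformers run many such heads in parallel (multi-head attention) and add masking, but the core computation follows this pattern.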
Soft vs Hard Attention
Soft Attention
- Differentiable
- Trained end-to-end using backpropagation
- Most widely used
- Faster
Nearly every modern model (Transformers, LSTMs with attention) uses soft attention.
Hard Attention
- Samples a single token to attend to
- Non-differentiable
- Trained using reinforcement learning
- More computationally expensive
Rarely used, but helpful in specific visual tasks.
Scoring Functions (Important for Exams)
1. Dot-Product Score
Fastest and most common.
2. Additive Score
Better for smaller models.
3. Scaled Dot-Product Score
Used in Transformers:
score = (QKᵀ) / √d_k
Scaling stabilizes gradients.
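As a quick numerical illustration of that scaling (random vectors, purely illustrative): with a large d_k the raw dot products have a large spread, so the softmax saturates near a one-hot distribution, while the scaled scores keep it smoother:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)               # one query
K = rng.normal(size=(10, d_k))         # ten keys

raw = K @ q                            # unscaled scores: spread grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)            # scaled scores: spread stays around 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print("unscaled:", np.round(softmax(raw), 3))     # tends to collapse onto one key
print("scaled:  ", np.round(softmax(scaled), 3))  # smoother weights, healthier gradients
```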
Visual Example of Attention
Suppose the input is: "He is playing football"
Translated into German: "Er spielt Fußball"
When generating Fußball, the model attends strongly to football in the input.
Attention map (conceptually):
| Input Word | Attention Weight |
|---|---|
| He | 0.01 |
| is | 0.03 |
| playing | 0.12 |
| football | 0.84 |
The decoder “knows” what to look at.
Applications of Attention Mechanisms
Attention is used in:
- Neural Machine Translation
- Speech Recognition
- Question Answering
- Summarization
- Image Captioning
- Video Understanding
- Generative AI
- Chatbots
- Medical Report Generation
- Time Series Forecasting
And all modern LLMs.
Why Attention Leads to Better Results
Attention improves results because:
- It removes fixed-size bottlenecks
- It handles long sequences better
- It dynamically focuses on relevant information
- It enhances fluency in translated sentences
- It boosts accuracy in summarization
In modern AI, attention is responsible for:
- More coherent text
- Better contextual understanding
- Improved accuracy in reasoning
Transition Toward Transformers
Transformers rely entirely on self-attention, removing RNNs completely.
Seq2Seq attention → Self-attention → Multi-head attention → Transformers → GPT → Modern AI
This evolution started from the same concepts you are learning here.
People also ask:
What is the purpose of attention in a Seq2Seq model?
To focus on relevant parts of the input sequence at every step of decoding.
How do Bahdanau and Luong attention differ?
Bahdanau uses additive scoring; Luong uses multiplicative scoring. Bahdanau is more flexible; Luong is faster.
Why do Seq2Seq models struggle without attention?
Without attention, they rely on a single context vector, causing information loss in long sequences.
Which models are built on attention?
Transformers, BERT, GPT-3.5, GPT-4, T5, ViT, and most modern generative AI models.
Is attention still important today?
Yes; it is the foundation of all state-of-the-art language models and vision transformers.




