Introduction
Attention mechanisms dramatically changed the trajectory of deep learning. Before attention, models like RNNs, GRUs, and LSTMs struggled with long-term dependencies, losing information from early parts of a sequence. Attention solved this by giving models the ability to focus selectively on different parts of the input sequence, allowing neural networks to process long sentences, understand context better, and produce far more accurate outputs.
This lecture explains attention in depth: how it works, why it was invented, how it is used inside Seq2Seq models, and why it is the foundation of today’s most powerful architectures, including Transformers, BERT, GPT models, and modern generative AI.
Why Attention Was Needed
Before attention, Seq2Seq models suffered from the fixed context vector limitation:
- Long sentences compressed into one vector
- Model forgot early input tokens
- Translations became inaccurate
- Summaries lacked clarity
- RNNs struggled with long-range dependencies
Attention removed this bottleneck by giving the model freedom to look at all encoder hidden states instead of depending on just one.
What Is the Attention Mechanism?
The attention mechanism allows the decoder to:
- Look at every hidden state of the encoder
- Assign relevance weights to each token
- Focus on the most relevant parts when generating each output token
In short, attention means:
“Don’t memorize everything; just look where needed.”
This results in:
- Better translation
- Smoother summarization
- More accurate question answering
- Stronger alignment between input and output tokens
How Attention Works (Step-by-Step)
Let’s break it down into simple but technical steps.
Encoder Produces Hidden States
For each token in the input sequence:
h_1, h_2, h_3, …, h_n
These represent the meaning of each word in context.
Decoder Hidden State (Query)
At each output step t, the decoder produces a state:
s_t
This state acts as a query.
Calculate Alignment (Score Function)
The model computes a score between the decoder state s_t and each encoder hidden state h_i:
score(s_t, h_i)
This score indicates:
“How important is this encoder token for generating the next output?”
Softmax Normalization
All scores are converted into weights:
α_i = softmax(score(s_t, h_i))
Weights sum to 1.
These are attention weights.
Weighted Sum → Context Vector
A dynamic context vector is generated:
c_t = Σ (α_i * h_i)
This context vector changes for every output token.
Decoder Generates the Next Output
The decoder uses:
- The context vector c_t
- Previous predictions
- Its own hidden state
to generate the next token.
This process continues until the output sequence is complete.
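Putting these steps together, here is a minimal NumPy sketch of a single decoder step. The dot-product score, the shapes, and the random values are illustrative assumptions, not the exact recipe of any particular model:

```python
import numpy as np

np.random.seed(0)
n, d = 5, 8                     # number of input tokens, hidden size (illustrative)

# Encoder hidden states h_1 ... h_n and the decoder query s_t
H = np.random.randn(n, d)       # shape (n, d)
s_t = np.random.randn(d)        # shape (d,)

# 1. Score each encoder state against the decoder state (dot-product score)
scores = H @ s_t                # shape (n,)

# 2. Softmax normalization -> attention weights that sum to 1
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()     # shape (n,)

# 3. Weighted sum of encoder states -> dynamic context vector c_t
c_t = alpha @ H                 # shape (d,)

print("attention weights:", np.round(alpha, 3))
print("context vector shape:", c_t.shape)
```

In a full Seq2Seq model, c_t would be combined with the decoder state and fed through the output layer to predict the next token.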
Types of Attention Mechanisms
Attention comes in multiple forms. Below are the core types every DL student must understand.
Bahdanau Attention (Additive Attention)
Introduced in 2014, this method uses a small neural network to compute attention scores.
Formula:
score = vᵀ tanh(W₁ h_i + W₂ s_t)
Characteristics:
- Works well with RNNs, GRUs, and LSTMs
- More flexible
- Slightly heavier computationally
- Great for NLP
This was the first widely used attention method.
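As a rough sketch of the additive score above, the snippet below evaluates vᵀ tanh(W₁ h_i + W₂ s_t) for every encoder state in NumPy; the dimensions and the randomly initialized W1, W2, and v are placeholder assumptions standing in for learned parameters:

```python
import numpy as np

def additive_score(s_t, H, W1, W2, v):
    """Bahdanau-style score v^T tanh(W1 h_i + W2 s_t), computed for all encoder states at once."""
    # H: (n, d_enc), s_t: (d_dec,), W1: (d_att, d_enc), W2: (d_att, d_dec), v: (d_att,)
    return np.tanh(H @ W1.T + s_t @ W2.T) @ v   # shape (n,), one score per input token

rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 16            # illustrative sizes
H, s_t = rng.normal(size=(n, d_enc)), rng.normal(size=d_dec)
W1 = rng.normal(size=(d_att, d_enc))
W2 = rng.normal(size=(d_att, d_dec))
v = rng.normal(size=d_att)

print(additive_score(s_t, H, W1, W2, v))
```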
Luong Attention (Multiplicative Attention)
Luong attention (2015) is faster and simpler.
Scoring Methods:
- Dot: score = s_t · h_i
- General: score = s_tᵀ W h_i
Characteristics:
- Computationally efficient
- Faster at inference
- Popular in production models
- Works well in translation tasks
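A small sketch of the two Luong scoring variants in NumPy; the matrix W and all shapes are illustrative stand-ins for learned parameters:

```python
import numpy as np

def dot_score(s_t, H):
    """Luong 'dot' score s_t · h_i; encoder and decoder dimensions must match."""
    return H @ s_t                  # shape (n,)

def general_score(s_t, H, W):
    """Luong 'general' score s_t^T W h_i; the learned matrix W bridges the two dimensions."""
    return H @ (W.T @ s_t)          # shape (n,)

rng = np.random.default_rng(0)
n, d = 5, 8
H, s_t = rng.normal(size=(n, d)), rng.normal(size=d)
W = rng.normal(size=(d, d))         # illustrative random weights

print(dot_score(s_t, H))
print(general_score(s_t, H, W))
```

Because both variants are plain matrix products, they avoid the extra tanh layer of additive attention, which is why Luong attention is faster in practice.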
Self-Attention (Modern Attention)
Self-attention allows each token to attend to other tokens in the same sequence.
This is the mechanism used in:
- Transformers
- BERT
- GPT
- T5
- Vision Transformers (ViT)
Why is it powerful?
- Parallelizable
- Tracks long-range dependencies
- Learns context globally
- Eliminates RNN limitations
- Scales extremely well
Seq2Seq attention paved the way for self-attention architectures.
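To make this concrete, below is a minimal single-head self-attention sketch using the scaled dot-product score; the projection matrices Wq, Wk, Wv and the toy embeddings are assumptions for illustration only:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token in X attends to every token in X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # project the same sequence three ways
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot-product scores, shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax: each row sums to 1
    return weights @ V                                # context-aware representation of each token

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                             # toy sizes
X = rng.normal(size=(n, d_model))                     # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # shape (4, 8)
```

Transformers run many such heads in parallel (multi-head attention) and add masking, but the core computation follows this pattern.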
Soft vs Hard Attention
Soft Attention
- Differentiable
- Trained end-to-end using backpropagation
- Most widely used
- Faster
Nearly every modern model (Transformers, LSTMs with attention) uses soft attention.
Hard Attention
- Samples a single token to attend to
- Non-differentiable
- Trained using reinforcement learning
- More computationally expensive
Rarely used, but helpful in specific visual tasks.
Scoring Functions (Important for Exams)
1. Dot-Product Score
Fastest and most common.
2. Additive Score
Better for smaller models.
3. Scaled Dot-Product Score
Used in Transformers:
score = (QKᵀ) / √d_k
Scaling stabilizes gradients.
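As a quick numerical illustration of that scaling (random vectors, purely illustrative): with a large d_k the raw dot products have a large spread, so the softmax saturates near a one-hot distribution, while the scaled scores keep it smoother:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)               # one query
K = rng.normal(size=(10, d_k))         # ten keys

raw = K @ q                            # unscaled scores: spread grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)            # scaled scores: spread stays around 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print("unscaled:", np.round(softmax(raw), 3))     # tends to collapse onto one key
print("scaled:  ", np.round(softmax(scaled), 3))  # smoother weights, healthier gradients
```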
Visual Example of Attention
Suppose the input is: "He is playing football"
Translated into German: "Er spielt Fußball"
When generating Fußball, the model attends strongly to football in the input.
Attention map (conceptually):
| Input Word | Attention Weight |
|---|---|
| He | 0.01 |
| is | 0.03 |
| playing | 0.12 |
| football | 0.84 |
The decoder “knows” what to look at.
Applications of Attention Mechanisms
Attention is used in:
- Neural Machine Translation
- Speech Recognition
- Question Answering
- Summarization
- Image Captioning
- Video Understanding
- Generative AI
- Chatbots
- Medical Report Generation
- Time Series Forecasting
And all modern LLMs.
Why Attention Leads to Better Results
Attention improves results because:
- It removes fixed-size bottlenecks
- It handles long sequences better
- It dynamically focuses on relevant information
- It enhances fluency in translated sentences
- It boosts accuracy in summarization
In modern AI, attention is responsible for:
- More coherent text
- Better contextual understanding
- Improved accuracy in reasoning
Transition Toward Transformers
Transformers rely entirely on self-attention, removing RNNs completely.
Seq2Seq attention → Self-attention → Multi-head attention → Transformers → GPT → Modern AI
This evolution started from the same concepts you are learning here.
People also ask:
What is the purpose of attention in a Seq2Seq model?
To focus on relevant parts of the input sequence at every step of decoding.
How do Bahdanau and Luong attention differ?
Bahdanau uses additive scoring; Luong uses multiplicative scoring. Bahdanau is more flexible; Luong is faster.
Why do Seq2Seq models struggle without attention?
Without attention, they rely on a single context vector, causing information loss in long sequences.
Which models are built on attention?
Transformers, BERT, GPT-3.5, GPT-4, T5, ViT, and most modern generative AI models.
Is attention still important today?
Yes; it is the foundation of all state-of-the-art language models and vision transformers.




