Lecture 14 – Attention Mechanisms in Deep Learning: Architecture, Types, Scoring Functions & Applications

Introduction

Attention mechanisms dramatically changed the trajectory of deep learning. Before attention, models like RNNs, GRUs, and LSTMs struggled with long-term dependencies, losing information from early parts of a sequence. Attention solved this by giving models the ability to focus selectively on different parts of the input sequence, allowing neural networks to process long sentences, understand context better, and produce far more accurate outputs.

This lecture explains attention in depth: how it works, why it was invented, how it is used inside Seq2Seq models, and why attention is the foundation of today’s most powerful architectures, including Transformers, BERT, GPT models, and modern generative AI.

Why Attention Was Needed

Before attention, Seq2Seq models suffered from the fixed context vector limitation:

  • Long sentences compressed into one vector
  • Model forgot early input tokens
  • Translations became inaccurate
  • Summaries lacked clarity
  • RNNs struggled with long-range dependencies

Attention removed this bottleneck by giving the model freedom to look at all encoder hidden states instead of depending on just one.

What Is the Attention Mechanism?

The attention mechanism allows the decoder to:

  • Look at every hidden state of the encoder
  • Assign relevance weights to each token
  • Focus on the most relevant parts when generating each output token

In short, attention means:

“Don’t memorize everything; just look where needed.”

This results in:

  • Better translation
  • Smoother summarization
  • More accurate question answering
  • Stronger alignment between input and output tokens

How Attention Works (Step-by-Step)

Let’s break it down into simple but technical steps.

Encoder Produces Hidden States

For each token in the input sequence:

h_1, h_2, h_3, …, h_n

These represent the meaning of each word in context.

Decoder Hidden State (Query)

At each output step t, the decoder produces a state:

s_t

This state acts as a query.

Calculate Alignment (Score Function)

The model computes a score between the decoder state s_t and each encoder hidden state h_i:

score(s_t, h_i)

This score indicates:
“How important is this encoder token for generating the next output?”

Softmax Normalization

All scores are converted into weights:

α_i = softmax(score(s_t, h_i))

Weights sum to 1.

These are attention weights.

Weighted Sum → Context Vector

A dynamic context vector is generated:

c_t = Σ (α_i * h_i)

This context vector changes for every output token.
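The steps above (score, softmax, weighted sum) can be sketched in a few lines of NumPy. This is a toy illustration: the sizes are arbitrary and random vectors stand in for learned encoder and decoder states, with a simple dot-product score assumed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
hidden = 4      # hidden-state size (chosen for illustration)
n = 5           # number of input tokens

H = rng.normal(size=(n, hidden))   # encoder hidden states h_1 .. h_n
s_t = rng.normal(size=hidden)      # decoder state (query) at step t

scores = H @ s_t                   # dot-product score(s_t, h_i) for each i
alpha = softmax(scores)            # attention weights alpha_i, sum to 1
c_t = alpha @ H                    # context vector: weighted sum of the h_i

print(alpha.sum())                 # ≈ 1.0
print(c_t.shape)                   # (4,)
```

Because `alpha` depends on `s_t`, recomputing it at every decoding step yields a different context vector `c_t` for every output token, exactly as described above.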


Decoder Generates the Next Output

The decoder uses:

  • Context vector c_t
  • Previous predictions
  • Its hidden state

To generate the next token.

This process continues until the output sequence is complete.

Types of Attention Mechanisms

Attention comes in multiple forms. Below are the core types every DL student must understand.

Bahdanau Attention (Additive Attention)

Introduced in 2014, this method uses a small neural network to compute attention scores.

Formula:
score = vᵀ tanh(W₁ h_i + W₂ s_t)

Characteristics:

  • Works well with RNNs, GRUs, and LSTMs
  • More flexible
  • Slightly heavier computationally
  • Great for NLP
  • The first widely used attention method
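The additive score above can be computed for all encoder positions at once. The sketch below uses random matrices in place of the learned parameters W₁, W₂, and v, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, attn_dim, n = 4, 8, 5              # sizes chosen for illustration

H = rng.normal(size=(n, hidden))           # encoder states h_i
s_t = rng.normal(size=hidden)              # decoder state
W1 = rng.normal(size=(attn_dim, hidden))   # learned projections (random here)
W2 = rng.normal(size=(attn_dim, hidden))
v = rng.normal(size=attn_dim)

# score_i = v^T tanh(W1 h_i + W2 s_t), vectorized over all i
scores = np.tanh(H @ W1.T + s_t @ W2.T) @ v

alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                       # softmax over input positions
```

The extra tanh layer is what makes this "additive": the query and key are combined through a small feed-forward network rather than a dot product, which is why it is slightly heavier computationally.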

Luong Attention (Multiplicative Attention)

Luong attention (2015) is faster and simpler.

Scoring Methods:

Dot:

score = s_t · h_i

General:

score = s_tᵀ W h_i

Characteristics:

  • Computationally efficient
  • Faster at inference
  • Popular in production models
  • Works well in translation tasks
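Both Luong scores are single matrix operations, which is why they are cheap. A minimal sketch with random stand-ins for the states and the learned matrix W:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden, n = 4, 5
H = rng.normal(size=(n, hidden))       # encoder states h_i
s_t = rng.normal(size=hidden)          # decoder state
W = rng.normal(size=(hidden, hidden))  # learned matrix for the "general" score

dot_scores = H @ s_t            # dot:     score_i = s_t . h_i
general_scores = H @ (s_t @ W)  # general: score_i = s_t^T W h_i
```

The "general" form reduces to the "dot" form when W is the identity; both avoid the tanh layer of additive attention, which is the source of the speed advantage.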

Self-Attention (Modern Attention)

Self-attention allows each token to attend to other tokens in the same sequence.

This is the mechanism used in:

  • Transformers
  • BERT
  • GPT
  • T5
  • Vision Transformers (ViT)

Why is it powerful?

  • Parallelizable
  • Tracks long dependencies
  • Learns context globally
  • Eliminates RNN limitations
  • Scales extremely well

Seq2Seq attention paved the way for self-attention architectures.
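In self-attention, the queries, keys, and values are all projections of the same sequence, so every token attends to every other token in one matrix multiply. A minimal single-head sketch, assuming random projection matrices in place of learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n, d_model, d_k = 6, 8, 8            # sequence length and dims (illustrative)

X = rng.normal(size=(n, d_model))    # one token embedding per row
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv     # all three come from the SAME sequence
A = softmax(Q @ K.T / np.sqrt(d_k))  # n x n attention matrix, rows sum to 1
out = A @ V                          # each output row mixes all value rows
```

Note there is no recurrence anywhere: every row of `out` is computed independently, which is what makes self-attention parallelizable across the whole sequence.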

Soft vs Hard Attention

Soft Attention

  • Differentiable
  • Trained end-to-end using backpropagation
  • Most widely used
  • Faster

Nearly every modern model (Transformers, LSTMs with attention) uses soft attention.

Hard Attention

  • Samples a single token to attend to
  • Non-differentiable
  • Trained using reinforcement learning
  • More computationally expensive

Rarely used, but helpful in specific visual tasks.

Reference: Bahdanau et al. (2014), “Neural Machine Translation by Jointly Learning to Align and Translate”: https://arxiv.org/abs/1409.0473

Scoring Functions (Important for Exams)

1. Dot-Product Score

Fastest and most common.

2. Additive Score

Better for smaller models.

3. Scaled Dot-Product Score

Used in Transformers:

score = (QKᵀ) / √d_k

Scaling stabilizes gradients.
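Why the √d_k matters can be checked empirically: dot products of random d_k-dimensional vectors have standard deviation about √d_k, which pushes softmax into its saturated region. A small simulation (arbitrary sizes, standard-normal vectors assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
d_k = 512
q = rng.normal(size=(5000, d_k))   # 5000 random query vectors
k = rng.normal(size=(5000, d_k))   # 5000 random key vectors

raw = (q * k).sum(axis=1)          # unscaled dot products
scaled = raw / np.sqrt(d_k)

print(raw.std())     # ≈ sqrt(512) ≈ 22.6: softmax saturates, gradients vanish
print(scaled.std())  # ≈ 1: scores stay in a range where softmax is well-behaved
```

Keeping scores near unit scale keeps the softmax gradients useful, which is what "scaling stabilizes gradients" refers to.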

Visual Example of Attention

Suppose the input is:
"He is playing football"
Translated into German:
"Er spielt Fußball"

When generating Fußball, the model attends strongly to football in the input.

Attention map (conceptually):

Input Word    Attention Weight
He            0.01
is            0.03
playing       0.12
football      0.84

The decoder “knows” what to look at.

Applications of Attention Mechanisms

Attention is used in:

  • Neural Machine Translation
  • Speech Recognition
  • Question Answering
  • Summarization
  • Image Captioning
  • Video Understanding
  • Generative AI
  • Chatbots
  • Medical Report Generation
  • Time Series Forecasting

And all modern LLMs.

Why Attention Leads to Better Results

Attention improves results because:

  • It removes fixed-size bottlenecks
  • It handles long sequences better
  • It dynamically focuses on relevant information
  • It enhances fluency in translated sentences
  • It boosts accuracy in summarization

In modern AI, attention is responsible for:

  • More coherent text
  • Better contextual understanding
  • Improved accuracy in reasoning

Transition Toward Transformers

Transformers rely entirely on self-attention, removing RNNs completely.

Seq2Seq attention → Self-attention → Multi-head attention → Transformers → GPT → Modern AI

This evolution started from the same concepts you are learning here.

People also ask:

What is the main purpose of attention in deep learning?

To focus on relevant parts of the input sequence at every step of decoding.

What is the difference between Bahdanau and Luong attention?

Bahdanau uses additive scoring; Luong uses multiplicative scoring. Bahdanau is more flexible; Luong is faster.

Why do Seq2Seq models need attention?

Without attention, they rely on a single context vector, causing information loss in long sequences.

What models use self-attention?

Transformers, BERT, GPT-3.5, GPT-4, T5, ViT, and most modern generative AI models.

Is attention still used today?

Yes. It is the foundation of all state-of-the-art language models and vision transformers.
