A complete introduction to parallel and distributed computing covering system evolution, parallel vs distributed systems, Flynn’s taxonomy (SIMD, MIMD), and key performance motivations.
Parallel and Distributed Computing is the foundation of modern high-performance systems. Whether you’re running apps on a smartphone, training AI models, streaming videos, simulating climate systems, or serving millions of web users, you’re depending on computation happening simultaneously either inside one machine (parallel) or across many machines (distributed).
This lecture builds the core understanding you need before learning OpenMP, MPI, GPU programming, and performance tuning. We’ll go deep into the evolution of computing, parallel vs distributed systems, Flynn’s taxonomy (SIMD, MIMD), and the main motivations: performance, scalability, and energy efficiency.
1) Evolution of Computing Systems
1.1 Early Era: Single Processor (Sequential Computing)
Initially, computers were designed to execute one instruction at a time. The classical model is the von Neumann architecture, where:
- Instructions and data share memory.
- A single CPU fetches and executes instructions sequentially.
- One “instruction stream” processes one “data stream.”
This approach shaped early programming:
- One CPU core
- One control flow
- One program counter
1.2 Performance Growth Through Frequency Scaling
For many years, CPUs improved primarily by:
- Increasing clock speed (MHz → GHz)
- Improving instruction pipelines
- Using caches
- Adding instruction-level parallelism (ILP), such as pipelining and superscalar execution
But this approach hit limits.
1.3 Why Clock Speed Couldn’t Keep Increasing (The Wall)
Increasing frequency increases:
- Power consumption
- Heat production
- Leakage current in transistors
This led to major barriers:
- Power wall: power becomes too high to cool economically.
- Thermal wall: chips overheat.
- Memory wall: CPU speeds grew faster than memory speeds, so the CPU stalls waiting for data.
- ILP wall: compilers and hardware can’t extract unlimited parallelism from single instruction streams.
So instead of pushing frequency, industry shifted to parallelism.
1.4 Multicore Revolution
Modern CPUs evolved into:
- Dual-core, quad-core, 8-core, 16-core, 64-core…
- Each core can run independent threads.
This created a new reality:
- Performance improvements require parallel programs, not just faster hardware.
1.5 Distributed Computing Evolution (From Clusters to Cloud)
As problems grew larger:
- One machine was not enough (memory capacity, compute power, storage).
- Multiple machines connected via networks formed:
- Clusters (co-located machines, high-speed interconnects)
- Grids (sharing resources across domains)
- Clouds (virtualized, scalable, on-demand infrastructure)
Distributed computing became essential for:
- Big data
- Global web services
- Resilience and fault tolerance
- Large-scale scientific computation
2) The Problem with Purely Sequential Computing
Even the fastest single-core machine has fundamental limitations.
2.1 Execution Time Bottleneck
A sequential program must do:
- Step 1 → Step 2 → Step 3 → Step 4
No overlap.
For large tasks (e.g., video encoding, deep learning training), sequential execution time becomes impractical.
2.2 Memory and I/O Bottlenecks
A CPU can compute quickly, but if data isn’t available:
- It stalls waiting for memory.
- Cache misses become expensive.
- Disk/network I/O delays dominate performance.
Parallel and distributed systems mitigate these bottlenecks using:
- Caches
- Multiple memory channels
- Overlapped communication and computation
- Data partitioning
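Overlapping communication and computation can be sketched in plain Python (a conceptual stand-in for the asynchronous I/O and prefetching that real systems use): while one chunk is being processed, the remaining fetches run concurrently in background threads instead of serializing fetch-then-compute. The chunk sizes and the `time.sleep` standing in for I/O latency are illustrative assumptions.

```python
import concurrent.futures
import time

def fetch_chunk(i):
    """Simulate a slow I/O read (e.g., disk or network) for chunk i."""
    time.sleep(0.01)  # stand-in for I/O latency
    return list(range(i * 4, i * 4 + 4))

def process_chunk(chunk):
    """CPU work on one chunk: sum of squares."""
    return sum(x * x for x in chunk)

# Overlap: all fetches are submitted up front and proceed in background
# threads while the main thread consumes and processes finished chunks.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch_chunk, i) for i in range(4)]
    results = [process_chunk(f.result()) for f in futures]

print(results)  # [14, 126, 366, 734]
```

With serial fetch-then-compute the four 10 ms "reads" would add up; submitting them together hides most of that latency behind the computation.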
2.3 Limits of Hardware Optimization Alone
Even with:
- Pipelining
- Branch prediction
- Out-of-order execution
A single instruction stream can’t match the throughput of multiple cores/GPUs.
3) What is Parallel Computing?
3.1 Definition
Parallel computing means solving a problem by dividing it into parts and executing those parts simultaneously using multiple processing elements.
3.2 Where Parallelism Exists
Parallelism can happen at many levels:
(A) Bit-Level Parallelism
Early CPUs increased word size (8-bit → 16-bit → 32-bit → 64-bit).
Operating on more bits per instruction means more work completed per cycle.
(B) Instruction-Level Parallelism (ILP)
Inside a CPU core:
- Pipeline stages execute overlapping instructions.
- Superscalar CPUs execute multiple instructions per cycle.
You get parallelism without changing code much, but it has limits.
(C) Data Parallelism
Same operation on many data items:
- Add two large arrays element-by-element
- Apply filter to every pixel in an image
This is ideal for SIMD and GPUs.
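A minimal data-parallel sketch, using Python threads as a stand-in for the vector lanes or GPU threads described above (the chunking scheme and worker count are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def add_arrays(a, b, workers=4):
    """Data parallelism: the SAME '+' operation is applied to every
    element pair; each worker handles one contiguous chunk."""
    n = len(a)
    out = [0] * n
    chunk = (n + workers - 1) // workers  # ceiling division

    def work(start):
        for i in range(start, min(start + chunk, n)):
            out[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(work, range(0, n, chunk))  # waits on exit of the block
    return out

print(add_arrays([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

The key property is that every chunk is independent: no worker needs another worker's result, which is exactly why this pattern maps so well onto SIMD units and GPUs.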
(D) Task Parallelism
Different tasks run concurrently:
- One thread handles user input
- One thread processes data
- One thread writes output
This is common in MIMD multicore CPUs.
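A small task-parallel sketch: three *different* functions run concurrently, in contrast with the data-parallel case where the same operation runs on different data. The three task bodies here are hypothetical placeholders:

```python
import threading
import queue

results = queue.Queue()

def read_input():      # task 1: produce data
    results.put(("input", [3, 1, 2]))

def compute_stats():   # task 2: an unrelated computation
    results.put(("stats", sum(range(10))))

def write_log():       # task 3: unrelated output work
    results.put(("log", "done"))

# Task parallelism: each thread runs a DIFFERENT task concurrently.
threads = [threading.Thread(target=f)
           for f in (read_input, compute_stats, write_log)]
for t in threads: t.start()
for t in threads: t.join()

collected = dict(results.get() for _ in range(3))
print(collected["stats"])  # 45
```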
3.3 Shared Memory Parallel Systems
Many parallel machines use shared memory:
- Threads share address space.
- Communication occurs via shared variables.
Advantages:
- Easy to share data
- Faster than networking
Challenges:
- Race conditions
- Synchronization overhead
- Cache coherence complexity
Example: A typical multicore CPU running OpenMP threads.
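The race-condition and synchronization challenges above can be sketched with a shared counter guarded by a lock (Python's `threading.Lock` standing in for an OpenMP critical section; the iteration counts are arbitrary):

```python
import threading

counter = 0
lock = threading.Lock()

def deposit(times):
    global counter
    for _ in range(times):
        # Read-modify-write on shared memory: without the lock, two
        # threads could read the same old value and one update would
        # be lost (a race condition).
        with lock:
            counter += 1

threads = [threading.Thread(target=deposit, args=(10_000,))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(counter)  # 40000 with the lock; possibly less without it
```

The lock restores correctness but adds exactly the synchronization overhead listed above: threads serialize at the critical section instead of running freely.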
4) What is Distributed Computing?
4.1 Definition
Distributed computing uses multiple independent computers connected via a network to solve a problem.
Each machine (node):
- Has its own CPU(s)
- Has its own memory
- Has its own operating system instance
- Communicates using messages
4.2 Message Passing and Communication
Since memory is not shared:
- Node A can’t directly read Node B’s memory.
- Data must be sent via:
- sockets
- RPC (remote procedure calls)
- message passing frameworks (MPI)
- distributed data engines (e.g., MapReduce model)
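The message-passing model can be sketched with two "nodes" that share nothing and communicate only through mailboxes. Threads and queues here are a single-process stand-in for MPI sends/receives or sockets; the request format is a made-up illustration:

```python
import queue
import threading

# Each "node" keeps its own private state and owns a mailbox (queue).
# The ONLY way to share data is to send a message.
mailbox_a, mailbox_b = queue.Queue(), queue.Queue()
answer = None

def node_a():
    global answer
    mailbox_b.put({"op": "square", "value": 7})  # send request to B
    answer = mailbox_a.get()                     # block until B replies
    mailbox_b.put({"op": "stop"})                # tell B to shut down

def node_b():
    while True:
        msg = mailbox_b.get()
        if msg["op"] == "stop":
            break
        mailbox_a.put(msg["value"] ** 2)         # send reply to A

threads = [threading.Thread(target=f) for f in (node_a, node_b)]
for t in threads: t.start()
for t in threads: t.join()

print("node A received:", answer)  # node A received: 49
```

Note that node A never touches node B's variables; everything crosses an explicit message boundary, which is exactly what makes latency and message count first-class costs in distributed algorithms.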
4.3 Key Properties of Distributed Systems
Distributed systems add complexity but provide huge benefits:
Concurrency
Many nodes run at the same time.
No Global Clock
Different machines have different clocks, leading to challenges in ordering events.
Partial Failures
In a distributed system:
- one node can fail while others remain active
- network links can fail
- messages can be delayed or lost
This requires fault tolerance mechanisms.
5) Parallel vs Distributed Systems
5.1 Core Difference: Memory and Communication
Parallel system (shared memory):
- Communication through memory reads/writes
- Very fast inter-core communication
Distributed system (distributed memory):
- Communication through network messages
- Higher latency, lower bandwidth than memory
5.2 Latency and Bandwidth
- Memory access latency: nanoseconds
- Network latency: microseconds to milliseconds
This makes distributed computing harder:
- Communication cost is high
- Algorithms must minimize messaging
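The gap can be made concrete with the standard linear cost model for moving one message, time = latency + size / bandwidth. The latency and bandwidth numbers below are illustrative ballpark values, not measurements:

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model for one transfer:
    time = latency + message_size / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

# Illustrative numbers: DRAM access vs. a datacenter network.
mem = transfer_time(64, 100e-9, 50e9)   # ~100 ns latency, 50 GB/s
net = transfer_time(64, 50e-6, 1.25e9)  # ~50 us latency, 10 Gbit/s

# For small messages, latency dominates: the network moves the same
# 64 bytes hundreds of times slower than memory does.
print(f"memory: {mem*1e6:.3f} us, network: {net*1e6:.3f} us")
```

This is why distributed algorithms batch work into few large messages rather than many small ones: the fixed latency term is paid per message, regardless of size.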
5.3 Fault Tolerance
Parallel computing typically assumes:
- one machine, fewer failures
Distributed computing must assume:
- failures are normal
- systems must recover automatically
5.4 Scaling
Parallel systems scale up to:
- tens/hundreds of cores per machine
Distributed systems scale to:
- thousands/millions of machines globally
5.5 Examples
Parallel example:
- 16-core CPU rendering a video frame using threads
Distributed example:
- A cloud cluster processing petabytes of logs using many machines
6) Flynn’s Taxonomy (SIMD, MIMD)
Flynn’s taxonomy classifies systems by:
- Instruction stream: how many instruction sequences are executed
- Data stream: how many data sequences are processed
6.1 SISD (Single Instruction, Single Data)
- Classic sequential machine
- One core, one instruction flow
Example:
- Simple embedded processors
- Basic single-thread execution
6.2 SIMD (Single Instruction, Multiple Data)
- One instruction applied to many data items simultaneously
- Great for vectorizable computations
Modern SIMD appears in:
- CPU vector extensions (SSE, AVX)
- GPUs executing groups of threads in lockstep (SIMT-style)
Best suited for:
- Image processing (apply filter to all pixels)
- Linear algebra
- Physics simulations
- Audio/video encoding
Limitations:
- Works best when operations are uniform across data
- Branch divergence reduces efficiency (especially on GPUs)
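The SIMD idea and its branch-divergence limitation can be sketched conceptually (this is a model of the execution style, not actual vector hardware; `simd_apply` is a hypothetical helper name):

```python
def simd_apply(op, lanes):
    """Conceptual SIMD: ONE operation is issued, and every 'lane'
    (data element) executes it in lockstep."""
    return [op(x) for x in lanes]

# One instruction ("multiply by 2"), eight data elements at once.
doubled = simd_apply(lambda x: x * 2, [1, 2, 3, 4, 5, 6, 7, 8])
print(doubled)  # [2, 4, 6, 8, 10, 12, 14, 16]

# Branch divergence: lanes failing the condition are masked off and
# sit idle while the active lanes execute, wasting lane throughput.
masked = [x * 2 if x % 2 == 0 else x for x in [1, 2, 3, 4]]
print(masked)   # [1, 4, 3, 8]
```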
6.3 MISD (Multiple Instruction, Single Data)
Rare in practice.
Sometimes appears in:
- Fault-tolerant systems (multiple computations verifying same data)
6.4 MIMD (Multiple Instruction, Multiple Data)
Most common modern architecture:
- Multiple cores execute different programs or threads
- Each can operate on different data
Examples:
- Multicore CPUs
- Multi-socket servers
- Distributed clusters (each node is independent)
MIMD is flexible, but requires:
- synchronization tools (locks, barriers)
- careful algorithm design
7) Motivation: Why Parallel and Distributed Computing?
7.1 Motivation 1: Performance
The simplest motivation: finish tasks faster.
If a computation can be split into 8 equal independent parts:
- 1 core: 80 seconds
- 8 cores (ideal): 10 seconds
But real speedup is limited by:
- communication overhead
- synchronization delays
- serial parts of code
- load imbalance
This leads to a core principle:
Parallelism is powerful, but overhead and serial code limit gains.
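This principle is formalized by Amdahl's law: if a fraction p of the program is parallelizable, speedup on n cores is at most 1 / ((1 - p) + p / n). A short sketch of the formula:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n),
    where p is the parallelizable fraction of the program."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_cores)

# Even with 95% parallel code, the 5% serial part caps speedup at 20x,
# no matter how many cores are added:
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Running this shows diminishing returns (roughly 5.9x at 8 cores, 15.4x at 64, 19.6x at 1024), which is exactly why minimizing the serial portion matters as much as adding cores.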
7.2 Motivation 2: Scalability (Handle Bigger Problems)
Some problems cannot fit on one machine due to:
- memory limits
- compute limits
- storage limits
Distributed systems scale by adding more machines:
- more memory
- more compute
- more storage
Example:
A machine learning pipeline on massive datasets needs:
- distributed storage
- distributed processing
- distributed training
7.3 Motivation 3: Energy Efficiency (Performance per Watt)
Higher performance used to mean:
- higher clock speed
- much higher power consumption
Now the goal is:
- maximize performance per watt
Using multiple slower cores can be more energy efficient than one very fast core:
- lower voltage
- less heat
- better throughput
This is critical for:
- mobile devices
- data centers (electricity cost)
- sustainability goals
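The energy argument follows from the first-order CMOS dynamic power model, P ≈ C · V² · f: running slower permits a lower supply voltage, and power falls with the *square* of voltage. The capacitance, voltage, and frequency values below are made-up illustrative numbers:

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order CMOS dynamic power model: P ~ C * V^2 * f."""
    return capacitance * voltage**2 * frequency

# Illustrative values: halving frequency typically allows a lower
# supply voltage, so power drops faster than performance.
fast_core  = dynamic_power(1.0, 1.2, 3.0e9)      # one 3 GHz core
slow_cores = 2 * dynamic_power(1.0, 0.9, 1.5e9)  # two 1.5 GHz cores

# Same aggregate 3 GHz of compute, but noticeably less total power:
print(f"one fast core: {fast_core:.2e}  two slow cores: {slow_cores:.2e}")
```

In these (hypothetical) units the two slow cores draw roughly 44% less power for the same aggregate clock throughput, which is the performance-per-watt case for multicore in a nutshell.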
8) Key Terms You Must Know (Glossary)
- Concurrency: multiple tasks in progress (not always simultaneous)
- Parallelism: tasks executed simultaneously
- Thread: lightweight execution unit within a process
- Process: program instance with separate memory space
- Shared memory: multiple threads access same memory
- Distributed memory: each node has its own memory
- Speedup: performance improvement compared to sequential execution
- Overhead: extra cost due to coordination/communication
- Synchronization: coordination to avoid incorrect memory access conflicts
Summary
Parallel and Distributed Computing evolved to overcome the limitations of single-core, sequential systems. Parallel computing uses multiple processors within one machine to execute tasks simultaneously, while distributed computing connects multiple machines over a network to work together. Flynn’s taxonomy classifies systems like SIMD and MIMD, which form the basis of modern multicore CPUs and GPUs. The main goals of these approaches are improved performance, scalability, and energy efficiency, making them essential for today’s high-performance and cloud-based applications.
Multiple Choice Questions (MCQs)
1. Which architecture executes one instruction on one data stream?
A) SIMD
B) MISD
C) SISD
D) MIMD
Answer: C) SISD
2. The main reason for shifting from single-core to multicore processors was:
A) Software limitations
B) Power and thermal constraints
C) Lack of memory
D) Network latency
Answer: B) Power and thermal constraints
3. In parallel computing, processors typically communicate through:
A) Internet
B) Message brokers
C) Shared memory
D) Satellites
Answer: C) Shared memory
4. In distributed computing, nodes communicate using:
A) Shared registers
B) Message passing
C) Cache coherence
D) CPU pipeline
Answer: B) Message passing
5. SIMD stands for:
A) Single Instruction Multiple Data
B) Sequential Instruction Multiple Data
C) Single Integrated Memory Device
D) System Integrated Multi Device
Answer: A) Single Instruction Multiple Data
6. Which architecture is most common in modern multicore systems?
A) SISD
B) SIMD only
C) MIMD
D) MISD
Answer: C) MIMD
7. GPUs are best categorized under:
A) SISD
B) SIMD
C) MISD
D) None
Answer: B) SIMD
8. Which is NOT a motivation for parallel computing?
A) Performance
B) Scalability
C) Energy efficiency
D) Reducing storage permanently
Answer: D) Reducing storage permanently
9. Distributed systems are typically:
A) Tightly coupled
B) Loosely coupled
C) Single-core
D) Cache dependent
Answer: B) Loosely coupled
10. The term “speedup” refers to:
A) Increasing clock frequency
B) Performance improvement compared to sequential execution
C) Reducing RAM
D) Increasing voltage
Answer: B) Performance improvement compared to sequential execution
11. A cluster of computers connected via a network is an example of:
A) Sequential system
B) Parallel shared memory system
C) Distributed system
D) Single-core system
Answer: C) Distributed system
12. Which of the following is rare in practical systems?
A) SISD
B) SIMD
C) MISD
D) MIMD
Answer: C) MISD
13. The “power wall” refers to:
A) Lack of RAM
B) Increasing power consumption limiting CPU speed
C) Slow networks
D) Low storage
Answer: B) Increasing power consumption limiting CPU speed
14. In distributed systems, partial failure means:
A) Entire system crashes
B) One node fails while others continue
C) CPU overheats
D) Cache misses
Answer: B) One node fails while others continue
15. Applying the same operation to all pixels of an image is an example of:
A) Task parallelism
B) Data parallelism
C) Sequential processing
D) MISD
Answer: B) Data parallelism
Answers to Short Answer Questions
1. Define parallel computing.
Parallel computing is a computing model in which multiple processors or cores execute different parts of a program simultaneously to reduce execution time and improve performance.
2. Define distributed computing.
Distributed computing is a system where multiple independent computers communicate over a network and coordinate their actions by passing messages to solve a common problem.
3. What is Flynn’s taxonomy?
Flynn’s taxonomy is a classification system for computer architectures based on the number of instruction streams and data streams. It includes four categories: SISD, SIMD, MISD, and MIMD.
4. State two differences between parallel and distributed systems.
- Parallel systems often share memory, while distributed systems have separate memory for each node.
- Parallel systems are tightly coupled within a single machine, while distributed systems are loosely coupled across multiple machines connected via a network.
5. What is scalability in computing systems?
Scalability is the ability of a computing system to handle increased workload or problem size efficiently by adding more resources such as processors, memory, or machines.
Answers to Long Answer Questions
1. Explain the evolution of computing systems from sequential to parallel and distributed systems.
Initially, computing systems followed a sequential model based on the Von Neumann architecture, where a single processor executed one instruction at a time. Performance improvements were achieved mainly by increasing clock speed and improving instruction-level parallelism. However, increasing frequency led to higher power consumption and heat generation, creating physical limitations known as the power wall and thermal wall.
To overcome these limitations, the industry shifted toward multicore processors, where multiple cores execute tasks simultaneously within the same system. This marked the beginning of widespread parallel computing. As computational demands continued to grow beyond the capacity of a single machine, distributed computing emerged. Distributed systems connect multiple independent machines through networks, allowing them to collaborate on large-scale problems. This evolution enabled higher performance, improved scalability, and better resource utilization.
2. Describe Flynn’s taxonomy in detail.
Flynn’s taxonomy classifies computer architectures based on instruction and data streams:
- SISD (Single Instruction, Single Data): A traditional sequential computer with one instruction stream operating on one data stream. Example: Single-core processor running one program.
- SIMD (Single Instruction, Multiple Data): A single instruction is applied to multiple data elements simultaneously. This architecture is efficient for data-parallel tasks such as image processing and scientific computations. GPUs are a common example.
- MISD (Multiple Instruction, Single Data): Multiple instructions operate on the same data stream. This architecture is rare and mainly used in fault-tolerant systems.
- MIMD (Multiple Instruction, Multiple Data): Multiple processors execute different instructions on different data independently. This is the most common architecture today and is used in multicore CPUs and distributed systems.
MIMD is widely used because it provides flexibility and supports both task parallelism and data parallelism.
3. Discuss the motivation for parallel and distributed computing.
The primary motivations for parallel and distributed computing are performance, scalability, and energy efficiency.
Performance: Parallel execution reduces execution time by dividing tasks among multiple processors. This improves speedup and throughput for computationally intensive applications.
Scalability: Distributed systems allow workloads to grow by adding more machines. This enables large-scale data processing, cloud computing, and big data analytics.
Energy Efficiency: Instead of increasing clock speeds, using multiple lower-frequency cores improves performance per watt. This reduces power consumption and heat generation, making systems more sustainable and cost-effective.
Modern applications such as artificial intelligence, weather forecasting, financial modeling, and cloud services depend heavily on parallel and distributed computing to meet performance and scalability requirements.
Next Lecture: Parallel Computer Architectures


