Learn about parallel computer architectures, including shared versus distributed memory, multicore and manycore systems, GPU design, and cluster and cloud computing models.
1. Introduction: Why Architecture Matters in Parallel Computing
Parallel computer architecture defines how processors, memory systems, interconnect networks, and storage components are organized to execute multiple computations simultaneously.
Architecture determines:
- How fast processors communicate
- How memory is accessed
- How scalable the system can become
- How difficult programming will be
- How energy efficient the system is
A poorly matched architecture can reduce performance even if many processors are available.
Parallel architectures are primarily categorized based on memory organization and processor organization.
2. Shared Memory Architectures
2.1 Conceptual Model
In shared memory systems, all processors access a single global address space.
This means:
- Any processor can load/store any memory location.
- Communication occurs through shared variables.
- Synchronization is required to prevent race conditions.
Programming examples:
- OpenMP
- POSIX threads
- Java multithreading
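The shared-variable model above can be sketched with Python's standard threading module (a minimal illustration, not tied to any of the APIs listed): several threads update one counter in a single address space, and a lock provides the synchronization needed to prevent a race condition.

```python
import threading

counter = 0                  # shared variable: visible to every thread
lock = threading.Lock()      # synchronization primitive to prevent races

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:           # without the lock, read-modify-write can interleave
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)               # 40000: every increment survives
```

Removing the `with lock:` line reintroduces the race: two threads can read the same old value, both add one, and one update is lost.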
2.2 Physical Organization
Although memory appears logically unified, it may be organized quite differently in the physical hardware.
Key Components:
- Multiple CPU cores
- Private L1/L2 caches
- Shared L3 cache (often)
- Main memory
- Cache coherence protocol
2.3 Cache Coherence Problem
When multiple cores cache the same memory block and one core modifies its copy, the other cores may read stale values unless the hardware coordinates the caches.
To solve this, hardware uses cache coherence protocols like:
- MESI
- MOESI
However, coherence traffic increases with core count.
This limits scalability.
2.4 Types of Shared Memory Systems
A) Uniform Memory Access (UMA)
- Equal memory access time for all processors.
- Centralized memory controller.
- Simple architecture.
Limitation:
- Memory bus becomes bottleneck.
- Not scalable beyond moderate core counts.
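The bus bottleneck can be made concrete with a back-of-the-envelope model; the bandwidth figure below is an assumption for illustration, not a measurement of any real machine. With one centralized bus, each added core shrinks every core's share.

```python
# Hypothetical UMA machine: one shared memory bus with fixed total bandwidth.
BUS_BANDWIDTH_GBS = 50.0     # assumed aggregate bus bandwidth (GB/s)

def per_core_bandwidth(cores):
    """Bandwidth available to each core when the bus is evenly shared."""
    return BUS_BANDWIDTH_GBS / cores

for cores in (4, 16, 64):
    print(cores, per_core_bandwidth(cores))   # 12.5, 3.125, 0.78125 GB/s
```

The aggregate stays fixed while demand grows linearly, which is why UMA stops scaling beyond moderate core counts.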
B) Non-Uniform Memory Access (NUMA)
Each processor has:
- Local memory (fast access)
- Remote memory (slower access)
Access time depends on memory location.
Advantages:
- Scales better than UMA
- Higher aggregate bandwidth
Challenges:
- Data placement matters
- Poor locality reduces performance
NUMA-aware programming is essential for optimization.
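Why data placement matters can be shown with a simple average-latency model. The local and remote latencies below are illustrative assumptions, not figures from any particular NUMA system.

```python
# Illustrative NUMA latency model (assumed numbers, for intuition only).
LOCAL_NS = 80.0     # assumed local-memory access latency (ns)
REMOTE_NS = 140.0   # assumed remote-memory access latency (ns)

def avg_access_ns(local_fraction):
    """Average access latency given the fraction of accesses hitting local memory."""
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

print(avg_access_ns(0.75))   # 95.0 ns: good locality
print(avg_access_ns(0.5))    # 110.0 ns: half the accesses go remote
```

Improving locality from 50% to 75% cuts average latency noticeably, which is exactly what NUMA-aware placement aims for.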
2.5 Performance Characteristics of Shared Memory
Advantages:
- Low latency communication
- Easier programming model
- Efficient fine-grained parallelism
Limitations:
- Memory contention
- False sharing
- Coherence overhead
- Scalability ceiling
Shared memory systems scale well up to dozens of cores but struggle at thousands.
3. Distributed Memory Architectures
3.1 Conceptual Model
In distributed memory systems:
- Each processor has its own private memory.
- No global shared address space.
- Communication occurs explicitly through message passing.
Programming example:
- MPI (Message Passing Interface)
3.2 System Organization
Each node contains:
- One or more processors
- Local RAM
- Local storage
- Network interface
Nodes are connected via:
- Ethernet
- InfiniBand
- High-speed interconnects
3.3 Communication Mechanisms
Data exchange involves:
- Sending messages
- Receiving messages
- Synchronizing communication
Communication cost depends on:
- Latency (startup cost)
- Bandwidth (data transfer rate)
Unlike shared memory, communication is explicit.
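The explicit send/receive pattern does not require MPI to demonstrate; a minimal sketch using Python's standard multiprocessing module shows two processes with private memory exchanging data over a channel (the function names here are illustrative, not an MPI API).

```python
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()        # explicit receive: blocks until a message arrives
    conn.send(sum(data))      # explicit send of the computed result
    conn.close()

def exchange(values):
    """Send values to a worker process and receive its reply."""
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(values)       # no shared variables: data is copied over the pipe
    result = parent.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(exchange([1, 2, 3, 4]))   # 10
```

Unlike the shared-memory examples, nothing here is implicitly visible to the other process: every byte that moves does so through an explicit message.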
3.4 Scalability Advantages
Distributed systems scale because:
- Memory grows with nodes
- Bandwidth scales with interconnect
- No global coherence overhead
This is why supercomputers use distributed memory models.
3.5 Challenges
- Complex programming
- Load balancing difficulty
- Communication overhead
- Deadlocks in message passing
Performance depends heavily on minimizing communication.
4. Multicore Systems
4.1 Definition
A multicore processor integrates multiple independent cores on a single chip.
Each core:
- Executes its own instruction stream
- Has private caches
- Shares memory hierarchy
4.2 Why Multicore Emerged
Due to:
- Power wall
- Thermal constraints
- Frequency scaling limits
Instead of increasing clock speed, manufacturers increased core count.
4.3 Architectural Features
- Shared L3 cache
- Hardware threads (SMT/Hyper-threading)
- Integrated memory controllers
- High-speed on-chip interconnect
Multicore systems excel at:
- Moderate parallel workloads
- General-purpose computing
- Server environments
5. Manycore Systems
5.1 Definition
Manycore systems contain tens to thousands of simpler cores.
Unlike multicore:
- Cores are lightweight
- Optimized for throughput
- Designed for massive parallelism
5.2 Architectural Design Philosophy
Manycore systems prioritize:
- Throughput over latency
- Energy efficiency
- Massive data parallelism
6. GPU Architecture
GPUs are specialized manycore architectures.
6.1 Execution Model
GPUs use:
- SIMT (Single Instruction Multiple Thread)
- Threads grouped into warps
- Warps executed on Streaming Multiprocessors (SMs)
6.2 Memory Hierarchy in GPU
- Registers (fastest)
- Shared memory (on-chip)
- Global memory (high latency)
- Constant memory
- Texture memory
Performance depends heavily on:
- Memory coalescing
- Minimizing global memory access
- Avoiding branch divergence
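The cost of branch divergence can be modeled in a few lines. This is a deliberately simplified sketch (assumptions: 32-thread warps, each branch side costs one step): in SIMT execution, all threads of a warp run in lockstep, so when they disagree on a branch the warp executes both paths with inactive threads masked off.

```python
WARP_SIZE = 32   # threads per warp (typical on NVIDIA GPUs)

def warp_steps(thread_takes_if):
    """Steps a warp needs to execute one if/else, under the toy cost model."""
    taken = [thread_takes_if(t) for t in range(WARP_SIZE)]
    paths = len(set(taken))   # 1 if all threads agree, 2 if they diverge
    return paths              # a divergent warp pays for both branch sides

print(warp_steps(lambda t: True))         # 1: coherent branch, no penalty
print(warp_steps(lambda t: t % 2 == 0))   # 2: odd/even split serializes both paths
```

This is why GPU code is often restructured so that threads within a warp follow the same control path.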
6.3 Why GPUs Achieve High Throughput
GPUs hide memory latency by:
- Running thousands of threads
- Switching between warps
- Maximizing arithmetic intensity
However:
- Not ideal for sequential tasks
- Poor performance for branch-heavy logic
7. Cluster Architecture
A cluster is a collection of interconnected computers functioning as a unified resource.
Characteristics:
- Distributed memory
- High-speed network
- Parallel file systems
- Job scheduling systems
Used in:
- Scientific computing
- Weather forecasting
- Molecular modeling
8. Cloud Architecture
Cloud computing builds on distributed clusters but adds:
- Virtualization
- Elastic resource provisioning
- Multi-tenancy
- Service abstraction
Cloud advantages:
- Scalability on demand
- Cost efficiency
- Fault tolerance
However, cloud also introduces:
- Network variability
- Virtualization overhead
- Distributed storage complexity
9. Comparative Architectural Analysis
| Architecture | Scalability | Latency | Programming Complexity | Use Case |
|---|---|---|---|---|
| Shared Memory | Low–Moderate | Low | Easier | Desktop, small servers |
| NUMA | Moderate | Medium | Medium | Enterprise servers |
| Distributed Memory | Very High | Higher | Complex | Supercomputers |
| GPU | Massive Throughput | High memory latency | Specialized | AI, ML, Graphics |
| Cloud | Elastic | Network dependent | High | Web-scale systems |
10. Architectural Trade-Offs
When designing parallel systems:
- Increasing cores increases contention.
- Increasing distribution increases communication cost.
- GPUs increase throughput but reduce flexibility.
- Distributed systems increase scalability but add latency.
No architecture is universally optimal.
The right architecture depends on:
- Workload type
- Data size
- Latency requirements
- Energy constraints
11. Conclusion
Parallel computer architectures define how computation scales beyond sequential limits. Shared memory systems provide low-latency communication but are limited by coherence overhead. Distributed memory architectures enable massive scalability at the cost of explicit communication complexity. Multicore processors dominate general-purpose computing, while GPUs power high-throughput data-parallel workloads. Clusters and cloud architectures extend distributed models to global-scale systems.
Understanding these architectural models allows us to analyze performance, scalability, and system design trade-offs, forming the foundation for advanced topics such as memory consistency, MPI programming, GPU optimization, and distributed system reliability.
Multiple Choice Questions
1. In a shared memory architecture, processors communicate using:
A) Network packets
B) Message queues
C) Shared variables
D) External storage
Answer: C) Shared variables
2. Which architecture provides a single global address space?
A) Distributed memory
B) Shared memory
C) Cluster-based
D) Cloud-native
Answer: B) Shared memory
3. In NUMA systems, memory access time:
A) Is always constant
B) Depends on memory location
C) Is zero
D) Is unrelated to processor location
Answer: B) Depends on memory location
4. MPI is primarily used in:
A) Shared memory systems
B) Distributed memory systems
C) Single-core systems
D) Cache-only systems
Answer: B) Distributed memory systems
5. A multicore processor typically contains:
A) One processor only
B) Multiple processors on separate machines
C) Multiple cores on a single chip
D) Unlimited memory
Answer: C) Multiple cores on a single chip
6. Manycore systems are optimized for:
A) Sequential execution
B) Massive parallel throughput
C) Disk storage
D) Low memory access
Answer: B) Massive parallel throughput
7. GPUs primarily follow which execution model?
A) SISD
B) MISD
C) SIMT
D) Sequential
Answer: C) SIMT
8. Cache coherence protocols are mainly required in:
A) Distributed memory systems
B) Shared memory systems
C) Cloud storage
D) External databases
Answer: B) Shared memory systems
9. Which of the following improves scalability in large systems?
A) Centralized memory
B) Shared bus
C) Distributed memory
D) Single-core processing
Answer: C) Distributed memory
10. In distributed systems, communication cost is mainly influenced by:
A) CPU clock speed
B) Network latency and bandwidth
C) Cache size
D) Compiler version
Answer: B) Network latency and bandwidth
11. UMA stands for:
A) Unified Memory Architecture
B) Uniform Memory Access
C) Universal Memory Allocation
D) User Memory Access
Answer: B) Uniform Memory Access
12. NUMA architecture improves scalability by:
A) Eliminating memory
B) Using separate local memory per processor
C) Removing caches
D) Using only one core
Answer: B) Using separate local memory per processor
13. A cluster typically consists of:
A) Single CPU
B) Networked independent computers
C) One GPU
D) External storage only
Answer: B) Networked independent computers
14. Cloud architecture differs from clusters mainly because of:
A) Lack of network
B) Virtualization and elasticity
C) No storage
D) No scalability
Answer: B) Virtualization and elasticity
15. False sharing occurs in:
A) Distributed systems
B) GPU-only systems
C) Shared memory systems
D) Cloud-only systems
Answer: C) Shared memory systems
Short Answer Questions
1. Define shared memory architecture.
Shared memory architecture is a parallel system where multiple processors access a common global memory space and communicate by reading and writing shared variables.
2. Define distributed memory architecture.
Distributed memory architecture is a system where each processor has its own private memory and communicates with other processors using message passing.
3. Differentiate between UMA and NUMA.
In UMA, all processors have equal memory access time. In NUMA, memory access time depends on whether the memory is local or remote to the processor.
4. What is the difference between multicore and manycore systems?
Multicore systems have a small number of powerful cores, while manycore systems have a large number of simpler cores optimized for massive parallelism.
5. What role does interconnection network play in distributed systems?
The interconnection network enables communication between nodes, and its latency and bandwidth significantly affect system performance.
Long Answer Questions
Compare Shared Memory and Distributed Memory Architectures
Shared memory and distributed memory architectures differ fundamentally in how processors access memory, communicate, and scale.
(A) Memory Organization
Shared Memory Architecture
- All processors share a single global address space.
- Any processor can directly read or write any memory location.
- Communication occurs implicitly through shared variables.
Distributed Memory Architecture
- Each processor has its own private memory.
- No global address space exists.
- Data must be explicitly exchanged via message passing.
(B) Communication Mechanism
Shared Memory
- Communication through memory loads and stores.
- Requires synchronization mechanisms such as locks and barriers.
- Cache coherence protocols (e.g., MESI) maintain consistency.
Problem:
- Cache coherence overhead increases with core count.
- False sharing can degrade performance.
Distributed Memory
- Communication via explicit message passing (e.g., MPI).
- Involves send/receive operations.
- Communication cost modeled as T(n) = α + β · n
- Where:
  - α = latency (startup cost)
  - β = transfer time per byte
  - n = message size in bytes
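The latency–bandwidth cost model can be expressed as a small helper; the constants below are illustrative assumptions (roughly a 1 µs startup and a ~10 GB/s link), not measurements.

```python
def comm_time(message_bytes, alpha_s=1e-6, beta_s_per_byte=1e-10):
    """Estimated message transfer time: alpha + beta * n.
    alpha_s: per-message startup latency in seconds (assumed)
    beta_s_per_byte: per-byte transfer time in seconds (assumed)"""
    return alpha_s + beta_s_per_byte * message_bytes

print(comm_time(8))        # ~1.0008e-06 s: small message, latency-dominated
print(comm_time(10**8))    # ~0.01 s: large message, bandwidth-dominated
```

The model explains a standard optimization: batching many small messages into one large message pays α once instead of many times.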
(C) Scalability
Shared Memory
- Limited scalability due to:
- Memory bandwidth saturation
- Coherence traffic
- Contention
- Typically scales to dozens or a few hundred cores efficiently.
Distributed Memory
- Highly scalable.
- Memory capacity grows linearly with nodes.
- No global coherence overhead.
- Used in supercomputers and large HPC clusters.
(D) Performance Characteristics
Shared memory:
- Low latency communication.
- Suitable for fine-grained parallelism.
- Performance limited by memory bandwidth.
Distributed memory:
- Higher communication latency.
- Suitable for coarse-grained parallelism.
- Scales better for large data problems.
(E) Programming Complexity
Shared memory:
- Easier programming model.
- Implicit communication.
- Risk of race conditions.
Distributed memory:
- More complex programming.
- Explicit data movement required.
- Harder debugging but better scalability.
Conclusion
Shared memory systems provide ease of programming and low latency but suffer scalability limits due to coherence overhead. Distributed memory systems require explicit communication and more complex programming but scale efficiently to thousands of nodes, making them ideal for large-scale HPC applications.
Multicore vs Manycore Architectures
Multicore and manycore architectures differ in core complexity, performance goals, and target applications.
(A) Multicore Architecture
Design Philosophy
- Few powerful cores.
- Deep pipelines.
- Large caches.
- Strong single-thread performance.
Performance Focus
- Optimized for low latency.
- Handles mixed workloads efficiently.
- Suitable for operating systems and general-purpose tasks.
Characteristics
- 2–32 cores (typical range).
- High clock speeds.
- Sophisticated branch prediction.
Application Domains
- Desktop computing
- Enterprise servers
- Web servers
- Database systems
(B) Manycore Architecture
Design Philosophy
- Hundreds to thousands of simpler cores.
- Emphasis on throughput over latency.
- Designed for massive parallelism.
GPU as Example
GPUs:
- Use SIMT execution model.
- Execute thousands of threads.
- Hide memory latency through concurrency.
Performance Characteristics
- Extremely high FLOPS.
- Optimized for data-parallel workloads.
- Lower per-core performance compared to CPUs.
Energy Efficiency
- Simpler cores consume less power.
- High FLOPS per watt.
- Widely used in modern supercomputers.
(C) Key Differences
| Feature | Multicore | Manycore |
|---|---|---|
| Core Count | Few | Hundreds/Thousands |
| Core Complexity | High | Low |
| Latency | Low | Higher |
| Throughput | Moderate | Very High |
| Best For | General tasks | Data-parallel tasks |
Conclusion
Multicore systems are optimized for general-purpose computing and low-latency tasks, while manycore systems such as GPUs are designed for massive parallel throughput and high energy efficiency, making them ideal for AI, scientific simulations, and large matrix computations.
Cluster and Cloud Architectures
Cluster and cloud architectures extend distributed computing principles to large-scale systems.
(A) Cluster Architecture
Structure
- Multiple independent computers (nodes).
- Connected via high-speed interconnect.
- Each node has private memory.
- Follows distributed memory model.
Key Components
- Compute nodes (CPU/GPU)
- High-speed network (InfiniBand, Ethernet)
- Parallel file system (Lustre)
Performance Influence
- Network latency affects communication-heavy workloads.
- Bisection bandwidth determines scalability.
- Strong scaling limited by communication overhead.
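The strong-scaling limit above can be sketched with a toy runtime model (every constant is an assumption for illustration): fixed total work divided across p nodes, plus a communication term that grows with p, so speedup eventually flattens and then declines.

```python
T_COMPUTE = 100.0        # total compute time on one node, seconds (assumed)
T_COMM_PER_NODE = 0.05   # per-node communication overhead, seconds (assumed)

def runtime(p):
    """Toy strong-scaling runtime: divided compute plus growing communication."""
    return T_COMPUTE / p + T_COMM_PER_NODE * p

def speedup(p):
    return runtime(1) / runtime(p)

for p in (1, 10, 100, 1000):
    print(p, round(speedup(p), 2))   # speedup rises, peaks, then falls
```

At small p the compute term dominates and scaling is nearly ideal; at large p the communication term takes over, which is the behavior described above.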
Clusters are widely used in:
- Scientific simulations
- Weather forecasting
- Molecular modeling
(B) Cloud Architecture
Cloud computing builds on cluster principles but introduces:
Virtualization
- Multiple virtual machines share physical hardware.
- Enables resource isolation.
- Adds virtualization overhead.
Elasticity
- Resources scale dynamically.
- Suitable for variable workloads.
- Pay-as-you-go model.
Multi-Tenancy
- Multiple users share infrastructure.
- May introduce performance variability.
(C) Interconnection Networks
Performance in clusters and clouds depends heavily on:
- Network topology
- Latency
- Bandwidth
- Congestion behavior
Poor interconnect design can eliminate parallel speedup.
(D) Scalability Comparison
Cluster:
- Fixed infrastructure.
- High performance.
- Lower variability.
Cloud:
- Elastic scaling.
- Virtualization overhead.
- Slightly less predictable performance.
Conclusion
Clusters provide high-performance distributed computing with predictable performance, while cloud architectures offer elastic scalability and resource flexibility through virtualization. However, virtualization overhead and network variability can impact performance in cloud environments compared to dedicated HPC clusters.


