Learn about parallel computer architectures, including shared versus distributed memory, multicore and manycore systems, GPU design, and cluster and cloud computing models.
1. Introduction: Why Architecture Matters in Parallel Computing
Parallel computer architecture defines how processors, memory systems, interconnect networks, and storage components are organized to execute multiple computations simultaneously.
Architecture determines:
- How fast processors communicate
- How memory is accessed
- How scalable the system can become
- How difficult programming will be
- How energy efficient the system is
A poorly matched architecture can reduce performance even if many processors are available.
Parallel architectures are primarily categorized based on memory organization and processor organization.
2. Shared Memory Architectures
2.1 Conceptual Model
In shared memory systems, all processors access a single global address space.
This means:
- Any processor can load/store any memory location.
- Communication occurs through shared variables.
- Synchronization is required to prevent race conditions.
Programming examples:
- OpenMP
- POSIX threads
- Java multithreading
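The shared-variable model above can be sketched with Python's standard threading module (a minimal illustration, not tied to any of the APIs listed): several threads update one counter in a single address space, and a lock provides the synchronization needed to prevent a race condition.

```python
import threading

counter = 0                  # shared variable: visible to every thread
lock = threading.Lock()      # synchronization primitive to prevent races

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:           # without the lock, read-modify-write can interleave
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)               # 40000: every increment survives
```

Removing the `with lock:` line reintroduces the race: two threads can read the same old value, both add one, and one update is lost.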
2.2 Physical Organization
Although memory appears logically unified, it may be organized quite differently in the physical hardware.
Key Components:
- Multiple CPU cores
- Private L1/L2 caches
- Shared L3 cache (often)
- Main memory
- Cache coherence protocol
2.3 Cache Coherence Problem
When multiple cores cache the same memory block and one core modifies its copy, the other cores may read stale values unless the hardware coordinates the caches.
To solve this, hardware uses cache coherence protocols like:
- MESI
- MOESI
However, coherence traffic increases with core count.
This limits scalability.
2.4 Types of Shared Memory Systems
A) Uniform Memory Access (UMA)
- Equal memory access time for all processors.
- Centralized memory controller.
- Simple architecture.
Limitation:
- Memory bus becomes bottleneck.
- Not scalable beyond moderate core counts.
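The bus bottleneck can be made concrete with a back-of-the-envelope model; the bandwidth figure below is an assumption for illustration, not a measurement of any real machine. With one centralized bus, each added core shrinks every core's share.

```python
# Hypothetical UMA machine: one shared memory bus with fixed total bandwidth.
BUS_BANDWIDTH_GBS = 50.0     # assumed aggregate bus bandwidth (GB/s)

def per_core_bandwidth(cores):
    """Bandwidth available to each core when the bus is evenly shared."""
    return BUS_BANDWIDTH_GBS / cores

for cores in (4, 16, 64):
    print(cores, per_core_bandwidth(cores))   # 12.5, 3.125, 0.78125 GB/s
```

The aggregate stays fixed while demand grows linearly, which is why UMA stops scaling beyond moderate core counts.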
B) Non-Uniform Memory Access (NUMA)
Each processor has:
- Local memory (fast access)
- Remote memory (slower access)
Access time depends on memory location.
Advantages:
- Scales better than UMA
- Higher aggregate bandwidth
Challenges:
- Data placement matters
- Poor locality reduces performance
NUMA-aware programming is essential for optimization.
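Why data placement matters can be shown with a simple average-latency model. The local and remote latencies below are illustrative assumptions, not figures from any particular NUMA system.

```python
# Illustrative NUMA latency model (assumed numbers, for intuition only).
LOCAL_NS = 80.0     # assumed local-memory access latency (ns)
REMOTE_NS = 140.0   # assumed remote-memory access latency (ns)

def avg_access_ns(local_fraction):
    """Average access latency given the fraction of accesses hitting local memory."""
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

print(avg_access_ns(0.75))   # 95.0 ns: good locality
print(avg_access_ns(0.5))    # 110.0 ns: half the accesses go remote
```

Improving locality from 50% to 75% cuts average latency noticeably, which is exactly what NUMA-aware placement aims for.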
2.5 Performance Characteristics of Shared Memory
Advantages:
- Low latency communication
- Easier programming model
- Efficient fine-grained parallelism
Limitations:
- Memory contention
- False sharing
- Coherence overhead
- Scalability ceiling
Shared memory systems scale well up to dozens of cores but struggle at thousands.
3. Distributed Memory Architectures
3.1 Conceptual Model
In distributed memory systems:
- Each processor has its own private memory.
- No global shared address space.
- Communication occurs explicitly through message passing.
Programming example:
- MPI (Message Passing Interface)
3.2 System Organization
Each node contains:
- One or more processors
- Local RAM
- Local storage
- Network interface
Nodes are connected via:
- Ethernet
- InfiniBand
- High-speed interconnects
3.3 Communication Mechanisms
Data exchange involves:
- Sending messages
- Receiving messages
- Synchronizing communication
Communication cost depends on:
- Latency (startup cost)
- Bandwidth (data transfer rate)
Unlike shared memory, communication is explicit.
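The explicit send/receive pattern does not require MPI to demonstrate; a minimal sketch using Python's standard multiprocessing module shows two processes with private memory exchanging data over a channel (the function names here are illustrative, not an MPI API).

```python
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()        # explicit receive: blocks until a message arrives
    conn.send(sum(data))      # explicit send of the computed result
    conn.close()

def exchange(values):
    """Send values to a worker process and receive its reply."""
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(values)       # no shared variables: data is copied over the pipe
    result = parent.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(exchange([1, 2, 3, 4]))   # 10
```

Unlike the shared-memory examples, nothing here is implicitly visible to the other process: every byte that moves does so through an explicit message.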
3.4 Scalability Advantages
Distributed systems scale because:
- Memory grows with nodes
- Bandwidth scales with interconnect
- No global coherence overhead
This is why supercomputers use distributed memory models.
3.5 Challenges
- Complex programming
- Load balancing difficulty
- Communication overhead
- Deadlocks in message passing
Performance depends heavily on minimizing communication.
4. Multicore Systems
4.1 Definition
A multicore processor integrates multiple independent cores on a single chip.
Each core:
- Executes its own instruction stream
- Has private caches
- Shares memory hierarchy
4.2 Why Multicore Emerged
Due to:
- Power wall
- Thermal constraints
- Frequency scaling limits
Instead of increasing clock speed, manufacturers increased core count.
4.3 Architectural Features
- Shared L3 cache
- Hardware threads (SMT/Hyper-threading)
- Integrated memory controllers
- High-speed on-chip interconnect
Multicore systems excel at:
- Moderate parallel workloads
- General-purpose computing
- Server environments
5. Manycore Systems
5.1 Definition
Manycore systems contain tens to thousands of simpler cores.
Unlike multicore:
- Cores are lightweight
- Optimized for throughput
- Designed for massive parallelism
5.2 Architectural Design Philosophy
Manycore systems prioritize:
- Throughput over latency
- Energy efficiency
- Massive data parallelism
6. GPU Architecture
GPUs are specialized manycore architectures.
6.1 Execution Model
GPUs use:
- SIMT (Single Instruction Multiple Thread)
- Threads grouped into warps
- Warps executed on Streaming Multiprocessors (SMs)
6.2 Memory Hierarchy in GPU
- Registers (fastest)
- Shared memory (on-chip)
- Global memory (high latency)
- Constant memory
- Texture memory
Performance depends heavily on:
- Memory coalescing
- Minimizing global memory access
- Avoiding branch divergence
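The cost of branch divergence can be modeled in a few lines. This is a deliberately simplified sketch (assumptions: 32-thread warps, each branch side costs one step): in SIMT execution, all threads of a warp run in lockstep, so when they disagree on a branch the warp executes both paths with inactive threads masked off.

```python
WARP_SIZE = 32   # threads per warp (typical on NVIDIA GPUs)

def warp_steps(thread_takes_if):
    """Steps a warp needs to execute one if/else, under the toy cost model."""
    taken = [thread_takes_if(t) for t in range(WARP_SIZE)]
    paths = len(set(taken))   # 1 if all threads agree, 2 if they diverge
    return paths              # a divergent warp pays for both branch sides

print(warp_steps(lambda t: True))         # 1: coherent branch, no penalty
print(warp_steps(lambda t: t % 2 == 0))   # 2: odd/even split serializes both paths
```

This is why GPU code is often restructured so that threads within a warp follow the same control path.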
6.3 Why GPUs Achieve High Throughput
GPUs hide memory latency by:
- Running thousands of threads
- Switching between warps
- Maximizing arithmetic intensity
However:
- Not ideal for sequential tasks
- Poor performance for branch-heavy logic
7. Cluster Architecture
A cluster is a collection of interconnected computers functioning as a unified resource.
Characteristics:
- Distributed memory
- High-speed network
- Parallel file systems
- Job scheduling systems
Used in:
- Scientific computing
- Weather forecasting
- Molecular modeling
8. Cloud Architecture
Cloud computing builds on distributed clusters but adds:
- Virtualization
- Elastic resource provisioning
- Multi-tenancy
- Service abstraction
Cloud advantages:
- Scalability on demand
- Cost efficiency
- Fault tolerance
However, cloud also introduces:
- Network variability
- Virtualization overhead
- Distributed storage complexity
9. Comparative Architectural Analysis
| Architecture | Scalability | Latency | Programming Complexity | Use Case |
|---|---|---|---|---|
| Shared Memory | Low–Moderate | Low | Easier | Desktop, small servers |
| NUMA | Moderate | Medium | Medium | Enterprise servers |
| Distributed Memory | Very High | Higher | Complex | Supercomputers |
| GPU | Massive Throughput | High memory latency | Specialized | AI, ML, Graphics |
| Cloud | Elastic | Network dependent | High | Web-scale systems |
10. Architectural Trade-Offs
When designing parallel systems:
- Increasing cores increases contention.
- Increasing distribution increases communication cost.
- GPUs increase throughput but reduce flexibility.
- Distributed systems increase scalability but add latency.
No architecture is universally optimal.
The right architecture depends on:
- Workload type
- Data size
- Latency requirements
- Energy constraints
11. Conclusion
Parallel computer architectures define how computation scales beyond sequential limits. Shared memory systems provide low-latency communication but are limited by coherence overhead. Distributed memory architectures enable massive scalability at the cost of explicit communication complexity. Multicore processors dominate general-purpose computing, while GPUs power high-throughput data-parallel workloads. Clusters and cloud architectures extend distributed models to global-scale systems.
Understanding these architectural models allows us to analyze performance, scalability, and system design trade-offs, forming the foundation for advanced topics such as memory consistency, MPI programming, GPU optimization, and distributed system reliability.
Multiple Choice Questions
1. In a shared memory architecture, processors communicate using:
A) Network packets
B) Message queues
C) Shared variables
D) External storage
Answer: C) Shared variables
2. Which architecture provides a single global address space?
A) Distributed memory
B) Shared memory
C) Cluster-based
D) Cloud-native
Answer: B) Shared memory
3. In NUMA systems, memory access time:
A) Is always constant
B) Depends on memory location
C) Is zero
D) Is unrelated to processor location
Answer: B) Depends on memory location
4. MPI is primarily used in:
A) Shared memory systems
B) Distributed memory systems
C) Single-core systems
D) Cache-only systems
Answer: B) Distributed memory systems
5. A multicore processor typically contains:
A) One processor only
B) Multiple processors on separate machines
C) Multiple cores on a single chip
D) Unlimited memory
Answer: C) Multiple cores on a single chip
6. Manycore systems are optimized for:
A) Sequential execution
B) Massive parallel throughput
C) Disk storage
D) Low memory access
Answer: B) Massive parallel throughput
7. GPUs primarily follow which execution model?
A) SISD
B) MISD
C) SIMT
D) Sequential
Answer: C) SIMT
8. Cache coherence protocols are mainly required in:
A) Distributed memory systems
B) Shared memory systems
C) Cloud storage
D) External databases
Answer: B) Shared memory systems
9. Which of the following improves scalability in large systems?
A) Centralized memory
B) Shared bus
C) Distributed memory
D) Single-core processing
Answer: C) Distributed memory
10. In distributed systems, communication cost is mainly influenced by:
A) CPU clock speed
B) Network latency and bandwidth
C) Cache size
D) Compiler version
Answer: B) Network latency and bandwidth
11. UMA stands for:
A) Unified Memory Architecture
B) Uniform Memory Access
C) Universal Memory Allocation
D) User Memory Access
Answer: B) Uniform Memory Access
12. NUMA architecture improves scalability by:
A) Eliminating memory
B) Using separate local memory per processor
C) Removing caches
D) Using only one core
Answer: B) Using separate local memory per processor
13. A cluster typically consists of:
A) Single CPU
B) Networked independent computers
C) One GPU
D) External storage only
Answer: B) Networked independent computers
14. Cloud architecture differs from clusters mainly because of:
A) Lack of network
B) Virtualization and elasticity
C) No storage
D) No scalability
Answer: B) Virtualization and elasticity
15. False sharing occurs in:
A) Distributed systems
B) GPU-only systems
C) Shared memory systems
D) Cloud-only systems
Answer: C) Shared memory systems
Short Answer Questions
1. Define shared memory architecture.
Shared memory architecture is a parallel system where multiple processors access a common global memory space and communicate by reading and writing shared variables.
2. Define distributed memory architecture.
Distributed memory architecture is a system where each processor has its own private memory and communicates with other processors using message passing.
3. Differentiate between UMA and NUMA.
In UMA, all processors have equal memory access time. In NUMA, memory access time depends on whether the memory is local or remote to the processor.
4. What is the difference between multicore and manycore systems?
Multicore systems have a small number of powerful cores, while manycore systems have a large number of simpler cores optimized for massive parallelism.
5. What role does interconnection network play in distributed systems?
The interconnection network enables communication between nodes, and its latency and bandwidth significantly affect system performance.
Long Answer Questions
Compare Shared Memory and Distributed Memory Architectures
Shared memory and distributed memory architectures differ fundamentally in how processors access memory, communicate, and scale.
(A) Memory Organization
Shared Memory Architecture
- All processors share a single global address space.
- Any processor can directly read or write any memory location.
- Communication occurs implicitly through shared variables.
Distributed Memory Architecture
- Each processor has its own private memory.
- No global address space exists.
- Data must be explicitly exchanged via message passing.
(B) Communication Mechanism
Shared Memory
- Communication through memory loads and stores.
- Requires synchronization mechanisms such as locks and barriers.
- Cache coherence protocols (e.g., MESI) maintain consistency.
Problem:
- Cache coherence overhead increases with core count.
- False sharing can degrade performance.
Distributed Memory
- Communication via explicit message passing (e.g., MPI).
- Involves send/receive operations.
- Communication cost modeled as T(n) = α + β · n
- Where:
  - α = latency (startup cost)
  - β = transfer time per byte
  - n = message size in bytes
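The latency–bandwidth cost model can be expressed as a small helper; the constants below are illustrative assumptions (roughly a 1 µs startup and a ~10 GB/s link), not measurements.

```python
def comm_time(message_bytes, alpha_s=1e-6, beta_s_per_byte=1e-10):
    """Estimated message transfer time: alpha + beta * n.
    alpha_s: per-message startup latency in seconds (assumed)
    beta_s_per_byte: per-byte transfer time in seconds (assumed)"""
    return alpha_s + beta_s_per_byte * message_bytes

print(comm_time(8))        # ~1.0008e-06 s: small message, latency-dominated
print(comm_time(10**8))    # ~0.01 s: large message, bandwidth-dominated
```

The model explains a standard optimization: batching many small messages into one large message pays α once instead of many times.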
(C) Scalability
Shared Memory
- Limited scalability due to:
- Memory bandwidth saturation
- Coherence traffic
- Contention
- Typically scales to dozens or a few hundred cores efficiently.
Distributed Memory
- Highly scalable.
- Memory capacity grows linearly with nodes.
- No global coherence overhead.
- Used in supercomputers and large HPC clusters.
(D) Performance Characteristics
Shared memory:
- Low latency communication.
- Suitable for fine-grained parallelism.
- Performance limited by memory bandwidth.
Distributed memory:
- Higher communication latency.
- Suitable for coarse-grained parallelism.
- Scales better for large data problems.
(E) Programming Complexity
Shared memory:
- Easier programming model.
- Implicit communication.
- Risk of race conditions.
Distributed memory:
- More complex programming.
- Explicit data movement required.
- Harder debugging but better scalability.
Conclusion
Shared memory systems provide ease of programming and low latency but suffer scalability limits due to coherence overhead. Distributed memory systems require explicit communication and more complex programming but scale efficiently to thousands of nodes, making them ideal for large-scale HPC applications.
Multicore vs Manycore Architectures
Multicore and manycore architectures differ in core complexity, performance goals, and target applications.
(A) Multicore Architecture
Design Philosophy
- Few powerful cores.
- Deep pipelines.
- Large caches.
- Strong single-thread performance.
Performance Focus
- Optimized for low latency.
- Handles mixed workloads efficiently.
- Suitable for operating systems and general-purpose tasks.
Characteristics
- 2–32 cores (typical range).
- High clock speeds.
- Sophisticated branch prediction.
Application Domains
- Desktop computing
- Enterprise servers
- Web servers
- Database systems
(B) Manycore Architecture
Design Philosophy
- Hundreds to thousands of simpler cores.
- Emphasis on throughput over latency.
- Designed for massive parallelism.
GPU as Example
GPUs:
- Use SIMT execution model.
- Execute thousands of threads.
- Hide memory latency through concurrency.
Performance Characteristics
- Extremely high FLOPS.
- Optimized for data-parallel workloads.
- Lower per-core performance compared to CPUs.
Energy Efficiency
- Simpler cores consume less power.
- High FLOPS per watt.
- Widely used in modern supercomputers.
(C) Key Differences
| Feature | Multicore | Manycore |
|---|---|---|
| Core Count | Few | Hundreds/Thousands |
| Core Complexity | High | Low |
| Latency | Low | Higher |
| Throughput | Moderate | Very High |
| Best For | General tasks | Data-parallel tasks |
Conclusion
Multicore systems are optimized for general-purpose computing and low-latency tasks, while manycore systems such as GPUs are designed for massive parallel throughput and high energy efficiency, making them ideal for AI, scientific simulations, and large matrix computations.
Cluster and Cloud Architectures
Cluster and cloud architectures extend distributed computing principles to large-scale systems.
(A) Cluster Architecture
Structure
- Multiple independent computers (nodes).
- Connected via high-speed interconnect.
- Each node has private memory.
- Follows distributed memory model.
Key Components
- Compute nodes (CPU/GPU)
- High-speed network (InfiniBand, Ethernet)
- Parallel file system (Lustre)
Performance Influence
- Network latency affects communication-heavy workloads.
- Bisection bandwidth determines scalability.
- Strong scaling limited by communication overhead.
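The strong-scaling limit above can be sketched with a toy runtime model (every constant is an assumption for illustration): fixed total work divided across p nodes, plus a communication term that grows with p, so speedup eventually flattens and then declines.

```python
T_COMPUTE = 100.0        # total compute time on one node, seconds (assumed)
T_COMM_PER_NODE = 0.05   # per-node communication overhead, seconds (assumed)

def runtime(p):
    """Toy strong-scaling runtime: divided compute plus growing communication."""
    return T_COMPUTE / p + T_COMM_PER_NODE * p

def speedup(p):
    return runtime(1) / runtime(p)

for p in (1, 10, 100, 1000):
    print(p, round(speedup(p), 2))   # speedup rises, peaks, then falls
```

At small p the compute term dominates and scaling is nearly ideal; at large p the communication term takes over, which is the behavior described above.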
Clusters are widely used in:
- Scientific simulations
- Weather forecasting
- Molecular modeling
(B) Cloud Architecture
Cloud computing builds on cluster principles but introduces:
Virtualization
- Multiple virtual machines share physical hardware.
- Enables resource isolation.
- Adds virtualization overhead.
Elasticity
- Resources scale dynamically.
- Suitable for variable workloads.
- Pay-as-you-go model.
Multi-Tenancy
- Multiple users share infrastructure.
- May introduce performance variability.
(C) Interconnection Networks
Performance in clusters and clouds depends heavily on:
- Network topology
- Latency
- Bandwidth
- Congestion behavior
Poor interconnect design can eliminate parallel speedup.
(D) Scalability Comparison
Cluster:
- Fixed infrastructure.
- High performance.
- Lower variability.
Cloud:
- Elastic scaling.
- Virtualization overhead.
- Slightly less predictable performance.
Conclusion
Clusters provide high-performance distributed computing with predictable performance, while cloud architectures offer elastic scalability and resource flexibility through virtualization. However, virtualization overhead and network variability can impact performance in cloud environments compared to dedicated HPC clusters.


