04 - Threads
Problem 1: Thread vs Process
Briefly explain how a thread differs from a process.
Student Solution
Answer by Justin:
A process is a currently executing program, with its own PCB, that defines the address space and general process resources. A thread exists within a process and is a sequential unit of execution within it, sharing the PCB. The only things it doesn’t share are its PC, general-purpose registers, stack, and SP.
Elaboration (by LLM):
The key distinction between processes and threads lies in resource isolation vs. resource sharing:
Process Characteristics:
- Isolated Address Space: Each process has its own complete memory layout (code, data, heap, stack)
- Independent Resources: Separate file descriptors, signal handlers, current working directory
- PCB (Process Control Block): OS maintains unique PCB for each process containing execution context
- Heavyweight: Creating a process is expensive (requires memory allocation, PCB setup)
- Context Switch Overhead: High overhead when OS switches between processes (must save/restore entire address space)
Thread Characteristics:
- Shared Address Space: All threads in a process share code, data, and heap
- Thread-Local Resources: Each thread has its own:
- PC (Program Counter) - where to execute next instruction
- SP (Stack Pointer) - points to thread’s stack
- Stack - local variables, function parameters, return addresses
- General-purpose registers - temporary values
- Shared PCB: Threads in same process share one PCB (hence shared memory management)
- Lightweight: Creating a thread is inexpensive (minimal OS allocation)
- Context Switch Overhead: Low overhead (only need to save/restore registers and stack pointer)
Memory Layout Comparison:
Process 1 has isolated memory, Process 2 has isolated memory. Single process with 2 threads shares code, data, heap but has separate stacks.
Why Threads Share Address Space:
Thread cooperation becomes simpler - threads can directly access/modify shared variables without IPC (pipes, sockets, shared memory mechanisms required for processes).
Why Each Thread Has Private Execution State:
Each thread needs independent execution flow - different PC, SP, registers enable each thread to execute different code at different points simultaneously (or appear to via time-slicing).
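A minimal pthread sketch of this split (compile with gcc -pthread; the unsynchronized shared++ is for illustration only): each thread sees the same address for the shared global but a different address for its stack-local variable.
#include <pthread.h>
#include <stdio.h>
int shared = 0; // data segment - one copy for the whole process
void* worker(void* arg) {
    int local = 0; // on this thread's private stack - one copy per thread
    local++;
    shared++; // both threads update the same variable (unsynchronized, illustration only)
    printf("local @ %p, shared @ %p\n", (void*)&local, (void*)&shared);
    return NULL;
}
int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0; // the two printed local addresses differ; the shared address is identical
}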
Problem 2: Shared vs Private Thread State
What state (information) does a thread share with other threads in the same process and what state (information) is private/specific to a thread? Be explicit in your answer.
Student Solution
Answer by Justin:
A thread shares with other threads in the same process the same resources (code, data, and heap), but what they don’t share (i.e., what is private to a thread) is the thread’s state (PC, SP, stack, general-purpose registers).
Elaboration (by LLM):
Shared Thread State (Visible to All Threads):
| Resource | Shared? | Details |
|---|---|---|
| Code Segment | Yes | All threads execute same program instructions |
| Global Variables | Yes | int g = 0; at file scope accessible to all |
| Static Variables | Yes | static int x; shared across threads |
| Heap Memory | Yes | malloc(), new allocations visible to all |
| File Descriptors | Yes | Open files/sockets shared |
| Process ID (PID) | Yes | All threads belong to same process |
Private Thread State (Unique to Each Thread):
| Resource | Private? | Details |
|---|---|---|
| Program Counter (PC) | Yes | Each thread at different instruction |
| Stack Pointer (SP) | Yes | Points to thread’s own stack |
| Stack | Yes | Local variables, function parameters, return addresses |
| Registers | Yes | EAX, EBX, ECX, … saved per thread |
| Thread ID (TID) | Yes | Unique identifier per thread |
Practical Example:
int global = 0; // SHARED - all threads see updates
void* thread_func(void* arg) {
int local = 5; // PRIVATE - each thread has own copy
global++; // SHARED - all threads see this change
return NULL; // a void* thread function must return a value
}
Critical Implication - Data Races:
Because global variables are shared, concurrent access can cause data races:
int x = 0; // SHARED
Thread1: x++; // Read x (0), increment, write (1)
Thread2: x++; // Can interleave, causing lost updates
// Without synchronization: x might be 1 instead of expected 2
Solution requires synchronization (mutexes, semaphores) to protect shared state.
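A minimal sketch of that fix, using the same shared counter: wrapping the increment in a mutex makes the read-modify-write atomic with respect to other lockers.
#include <pthread.h>
int x = 0; // SHARED
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
void* increment(void* arg) {
    pthread_mutex_lock(&m); // only one thread may enter at a time
    x++; // the lost-update interleaving above can no longer happen
    pthread_mutex_unlock(&m);
    return NULL;
}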
Problem 3: OS Resource Allocation for Threads
What resources does the OS allocate for a thread during thread creation? How do they differ from those allocated when a process is created?
Student Solution
Answer by Justin:
The OS allocates for a thread during thread creation its stack, PC, SP, and general-purpose registers in the same address space as the parent process. This differs from the resources allocated when a process is created, because when this occurs, an entirely new address space is allocated, new execution state (PC, registers, stack, and SP), and a new PCB (threads share the same PCB as the parent process).
Elaboration (by LLM):
Resources Allocated for Thread Creation:
Thread creation allocates:
- Stack (typically 1-8 MB, OS-dependent)
- Thread-Local Storage (TLS)
- PC, SP, Registers (initial values)
- TCB (Thread Control Block)
What’s NOT allocated for thread:
- New address space (reuses parent’s)
- New PCB (shares parent’s)
- New code segment
- New data/heap
Comparison: Thread Creation vs Process Creation
| Resource | Thread Creation | Process Creation |
|---|---|---|
| Address Space | Reused (shared) | NEW (isolated) |
| Stack | NEW (thread-local) | NEW (process-local) |
| Heap | Shared | NEW |
| Code Segment | Shared | NEW copy |
| PCB | SHARED | NEW |
| Memory Overhead | ~1-8 MB (stack) | ~Hundreds MB |
| Creation Time | Microseconds | Milliseconds |
Why Threads Are Lightweight:
- Shared Memory: No need to copy entire address space
- Minimal Allocation: Only stack + TCB needed
- Cheap Context Switch: Just save/restore registers + SP, not entire memory
Cost Comparison:
Create 100 processes: ~100 * 500 MB = 50 GB potential
Create 100 threads: ~100 * 2 MB = 200 MB
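The per-thread stack size is configurable, which is why the figures above are only typical defaults. A minimal sketch using the POSIX attribute API (the 256 KB figure is purely illustrative; it must be at least PTHREAD_STACK_MIN):
#include <pthread.h>
void* worker(void* arg) { return NULL; } // placeholder thread body
int main(void) {
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024); // request a 256 KB stack instead of the default
    pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr); // attr may be destroyed once the thread is created
    pthread_join(tid, NULL);
    return 0;
}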
Problem 4: Advantages of Multi-Threading
What are the advantages of multi-threading? Give an example when multi-threading is preferable to a single-threaded process.
Student Solution
Answer by Justin:
An advantage of multi-threading is that you can break a problem into smaller parts for faster execution. An example of this would be matrix multiplication. You can have numerous threads execute different parts of the problem “at the same time” for faster execution, but in a single-threaded process, you would just have one thread to handle the entire multiplication. This becomes tedious when the matrix becomes REALLY large, and will cause the process to last a very long time.
Elaboration (by LLM):
Key Advantages of Multi-Threading:
1. Parallelism (Multi-Core Systems)
On systems with multiple CPU cores, threads execute truly in parallel:
Single-threaded (1 core):
Thread: [====Task====] (must complete sequentially)
Total time: T
Multi-threaded (4 cores):
Thread1: [==Task1==]
Thread2: [==Task2==]
Thread3: [==Task3==]
Thread4: [==Task4==]
Total time: ~T/4 (speedup)
2. Responsiveness
Long-running operations don’t block entire application. Web servers handle multiple clients without blocking.
3. Shared Resources
Threads efficiently share memory without IPC overhead. Simple direct access to shared variables.
4. Resource Efficiency
Threads cheaper than processes in terms of memory and CPU overhead.
Matrix Multiplication Example:
// Single-threaded: O(N³) operations sequential
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
// Time: 100% CPU for entire duration
// Multi-threaded: Divide work
Thread1 computes C rows 0-N/4
Thread2 computes C rows N/4-N/2
Thread3 computes C rows N/2-3N/4
Thread4 computes C rows 3N/4-N
// On 4-core system: ~4x speedup
// All threads access shared arrays A, B, C
When Multi-Threading Wins:
- Computationally Intensive: Large matrix operations, image processing
- I/O Bound: One thread blocks on disk/network, others proceed
- Server Applications: Multiple clients needing simultaneous service
- Interactive Applications: UI thread stays responsive while background work happens
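A hedged pthread sketch of the row partitioning described above (names like mm_worker and NTHREADS are illustrative; assumes N is divisible by NTHREADS). Each worker owns a disjoint block of rows of C, so no locking is needed.
#include <pthread.h>
#define N 512
#define NTHREADS 4
double A[N][N], B[N][N], C[N][N]; // shared by all threads
void* mm_worker(void* arg) {
    int id = (int)(long)arg;          // worker index 0..NTHREADS-1
    int lo = id * (N / NTHREADS);     // first row this thread owns
    int hi = lo + (N / NTHREADS);     // one past the last row
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
    return NULL; // row blocks don't overlap, so no synchronization needed
}
int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, mm_worker, (void*)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}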
Performance Comparison (Hypothetical):
Matrix: 1000x1000 (1 billion operations)
Single-threaded (1 core): 10 seconds
Multi-threaded (4 cores): 2.5 seconds (4x speedup)
Multi-threaded (8 cores): 1.25 seconds (8x speedup)
Limitations:
- Speedup limited by number of available cores (Amdahl’s Law)
- Sequential parts cannot be parallelized
- Synchronization adds overhead
Problem 5: Shared and Private Thread Items
List 1 item that is shared by all threads, and 1 item that is private to each thread.
Student Solution
Answer by Justin:
One item that is shared by all threads is the address space, and one item that is private to each thread is its program counters (PC).
Elaboration (by LLM):
Shared: Address Space
The address space refers to the entire virtual memory layout of a process:
Virtual Address Space (Shared by all threads):
Code Segment (Shared)
Data Segment (Shared)
Heap (Shared, grows up)
Free Memory
Stack (Each thread has own stack here)
Kernel Space
All threads within same process can:
- Read/write to same global variables
- Access same heap allocations
- Call same functions
- Access same files/sockets
Private: Program Counter (PC)
The PC (Program Counter) is a CPU register that points to the next instruction to execute:
void func() {
int x = 5; // Line 1 - Thread1 PC might be here
x++; // Line 2 - Thread2 PC might be here simultaneously
printf("%d", x); // Line 3 - Thread3 PC might be here
}
Each thread has:
- Its own PC value: Points to different instruction
- Independent execution: Threads progress independently
- Concurrent advancement: Each thread increments PC at own pace
Why PC Must Be Private:
- Independent Execution: Threads execute asynchronously
- Different Functions: Threads may call different functions
- Loop Progress: Threads in same loop at different iterations
Example Demonstrating Private PC:
void thread_func() {
printf("Start"); // Instruction A
sleep(1); // Instruction B
printf("End"); // Instruction C
}
// Two threads:
Thread1 PC: A -> B -> (sleeping) -> C
Thread2 PC: A -> B -> C -> (already done)
// At time T:
Thread1 PC: points to sleep() call
Thread2 PC: points to printf("End")
Other Commonly Private Items:
Besides PC, each thread also has private:
- SP (Stack Pointer): Points to thread’s own stack
- Stack: Local variables, parameters, return addresses
- Registers: EAX, EBX, ECX, … (saved per thread)
- TID (Thread ID): Unique identifier
Problem 6: Purpose of pthread_detach()
What’s the purpose of pthread_detach() function?
Student Solution
Answer by Justin:
The purpose of the pthread_detach() function is to keep a thread from becoming a zombie that waits for another thread to reap it with pthread_join(); instead, the thread’s resources are cleaned up automatically when it terminates.
Elaboration (by LLM):
Thread States and Lifecycle:
Default joinable thread must be explicitly cleaned up via pthread_join(). If not joined, thread enters zombie state and resources leak.
With pthread_detach(), thread automatically frees resources upon termination without requiring explicit cleanup.
Without detach() - Default Joinable State:
pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
// Thread runs and exits
pthread_join(tid, NULL); // Must call to clean up
// If you DON'T call pthread_join():
// - Thread becomes ZOMBIE
// - Resources (TCB, stack) remain allocated
// - Memory leak if thread exits
With detach() - Automatic Cleanup:
pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
pthread_detach(tid); // Thread becomes detached
// Thread runs and exits
// Resources automatically freed - no need for join()
Comparison: Joined vs Detached Threads
| Aspect | Joinable (Default) | Detached |
|---|---|---|
| Default State | Requires pthread_join() | Auto-cleanup on exit |
| Cleanup | Manual via pthread_join() | Automatic |
| Zombie State | Yes, if not joined | No |
| Resource Leak Risk | High if forgotten | None |
| Exit Status | Can retrieve via pthread_join() | Not retrievable |
| Use Case | Wait for result | Fire-and-forget tasks |
Problem Without Detach:
void* worker(void* arg) {
printf("Worker running\n");
return NULL; // Thread exits
// <- Becomes ZOMBIE here!
}
int main() {
pthread_t tid;
for (int i = 0; i < 100; i++) {
pthread_create(&tid, NULL, worker, NULL);
// Forgot to join!
}
// 100 zombie threads, resources still allocated
return 0;
}
Solution Using Detach:
void* worker(void* arg) {
printf("Worker running\n");
return NULL; // Thread exits
// <- Automatically cleaned up!
}
int main() {
pthread_t tid;
for (int i = 0; i < 100; i++) {
pthread_create(&tid, NULL, worker, NULL);
pthread_detach(tid); // Auto cleanup when done
}
// Threads cleaned automatically
return 0;
}
When to Use Each:
Use Joinable (default):
- Need to retrieve thread’s return value
- Synchronize with thread completion
- Parent waits for child
Use Detached:
- No need for return value
- Fire-and-forget tasks (logging, cleanup)
- Server handling concurrent clients
// Server handling clients - use detached
void* handle_client(void* sock) {
// Process client request
// Automatically cleanup on exit
}
int main() {
while (accept_connection()) {
pthread_t tid;
pthread_create(&tid, NULL, handle_client, sock);
pthread_detach(tid); // Each client handled independently
}
}
Key Point:
Detached threads cannot be joined - only one method of termination/cleanup available. Choose wisely at creation time.
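A thread can also be created detached from the start via attributes, so it never passes through a joinable state. A minimal sketch using the POSIX attribute API:
#include <pthread.h>
#include <stdio.h>
void* worker(void* arg) {
    printf("fire-and-forget task\n");
    return NULL; // resources reclaimed automatically on exit
}
int main(void) {
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, worker, NULL); // born detached - no join, no detach call needed
    pthread_attr_destroy(&attr);
    pthread_exit(NULL); // end only the main thread so the detached worker can finish
}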
Problem 7: Multi-Threaded Process Termination
List 2 conditions under which a multi-threaded process terminates.
Student Solution
Answer by Justin:
Two conditions under which a multi-threaded process terminates are if all threads terminate or if one thread terminates using the exit() system call.
Elaboration (by LLM):
Condition 1: All Threads Terminate Naturally
When every thread in a process exits gracefully:
void* thread_func(void* arg) {
// Do work
return NULL; // Thread exits
}
int main() {
pthread_t t1, t2, t3;
pthread_create(&t1, NULL, thread_func, NULL);
pthread_create(&t2, NULL, thread_func, NULL);
pthread_create(&t3, NULL, thread_func, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
pthread_join(t3, NULL);
// All threads done, main thread returns
// Process terminates
}
Once the main thread returns from main(), the process exits; because all other threads were already joined, nothing is cut short.
Condition 2: Any Thread Calls exit()
If ANY thread (including main) calls exit(), the entire process terminates immediately:
void* thread_func(void* arg) {
if (some_error) {
exit(1); // ← ENTIRE PROCESS TERMINATES!
}
return NULL;
}
int main() {
pthread_t t1;
pthread_create(&t1, NULL, thread_func, NULL);
pthread_join(t1, NULL);
return 0; // Even if we reach here...
}
Key Differences:
| Trigger | Result | Other Threads |
|---|---|---|
| All threads exit normally | Process terminates | All cleaned up gracefully |
| One thread calls exit() | Process terminates immediately | Other threads terminated abruptly, no cleanup |
| Main thread returns | Process terminates | Remaining threads are killed (returning from main implies exit()) |
Important: In Linux, if main() returns but other threads are still running, those threads are killed when the process terminates.
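If the main thread should finish while other threads keep running, it can call pthread_exit() instead of returning. A minimal sketch:
#include <pthread.h>
#include <stdio.h>
void* worker(void* arg) {
    printf("worker still running after main ended\n");
    return NULL;
}
int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_exit(NULL); // ends only the main thread; the process
                        // stays alive until worker terminates
}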
Comparison with exit() vs return:
// In main:
return 0; // Equivalent to exit(0): terminates the ENTIRE process
exit(0); // Exit ENTIRE process immediately
pthread_exit(NULL); // Exit only the main thread; remaining threads keep running
// In worker thread:
return NULL; // Exit thread only (process continues)
exit(0); // Exit ENTIRE process (all threads die!)
pthread_exit(NULL); // Exit thread only (equivalent to returning from the thread function)
Best Practice:
Use pthread_exit() in threads instead of calling exit() to avoid terminating the entire process:
void* thread_func(void* arg) {
if (error) {
pthread_exit(NULL); // Exit just this thread
}
return NULL; // Equivalent to pthread_exit(NULL)
}
Problem 8: Concurrency vs Parallelism
Define concurrency and parallelism and briefly explain the difference between the two.
Student Solution
Answer by Justin:
Concurrency is when a system can handle the execution of multiple tasks by interleaving their execution. Parallelism is when a system can handle the execution of multiple tasks simultaneously. The difference is that concurrency doesn’t execute the tasks at the “same time”, but rather appears to execute them at the same time through rapid context switching. Parallelism is accomplished by using multi-core CPUs, where tasks are scheduled on different cores in order to execute at the same time.
Elaboration (by LLM):
Concurrency: Illusion of Simultaneity
Concurrency means multiple tasks make progress by rapidly switching back and forth:
Single CPU Core (Time flows left to right):
Task1: [==code==] yields
Task2: [==code==] yields
Task3: [==code==] yields
Task1: [==code==] yields
User perceives: All three running "simultaneously"
Reality: Only one at a time, switching very quickly
How Concurrency Works:
- Task1 runs for time quantum (e.g., 10ms)
- OS switches (context switch) to Task2
- Task2 runs for 10ms
- OS switches to Task3
- Repeat…
At 1000 switches per second, appears instantaneous to user.
Parallelism: True Simultaneity
Parallelism means tasks execute at the exact same moment on different cores:
Multi-Core CPU (4 cores):
Core 1: Task1 [==code==]
Core 2: Task2 [==code==]
Core 3: Task3 [==code==]
Core 4: Task4 [==code==]
Reality: All four executing simultaneously
Key Differences:
| Aspect | Concurrency | Parallelism |
|---|---|---|
| Execution | Time-sliced interleaving | Simultaneous on multiple cores |
| Hardware Required | Single core (or any number) | Multiple cores |
| Appearance | Tasks seem to run together | Tasks actually run together |
| Context Switches | Many | Few or none |
| Example | Web server handling 100 clients on 1 CPU | 100 clients on 100 cores |
Visual Comparison:
Concurrent (1 core):
Time: 0ms 10ms 20ms 30ms
| | | |
T1: [====] [====] [====]
T2: [====] [====] [====]
T3: [====] [====] [====]
Parallel (3 cores):
Time: 0ms 10ms
| |
Core1: T1 [========]
Core2: T2 [========]
Core3: T3 [========]
(each task needs ~10ms of CPU, so three cores finish in one third of the wall-clock time)
Real-World Analogy:
Concurrency: One cashier at a store switching between customers (T1, T2, T3). Serves a bit of each, appears to work on all simultaneously.
Parallelism: Three cashiers, one per customer. All working at exact same time.
Can You Have Both?
Yes! Multi-core system with many threads:
Core 1: T1 [===T2====T3===]
Core 2: T4 [===T5====T6===]
Core 3: T7 [===T8====T9===]
Core 4: T10[====T11==T12==]
Threads on same core: Concurrent (interleaved)
Threads on different cores: Parallel (simultaneous)
Implications:
- Concurrency benefits: Responsiveness, server scalability, better resource utilization
- Parallelism benefits: True speedup for CPU-intensive tasks (actual execution time reduction)
- Concurrency alone: No speedup for pure CPU work, just better responsiveness
- Parallelism requires concurrency: Multi-threaded code needed to utilize multiple cores
Problem 9: Concurrency and Hardware
If you have a concurrent system, i.e., multi-tasking or multi-threading OS, does the underlying hardware have to have multiple CPU cores? Why or why not?
Student Solution
Answer by Justin:
The underlying hardware does not need to have multiple CPU cores. This is because tasks are able to be interleaved with one another, so a multi-core CPU is not required as those are for parallelism.
Elaboration (by LLM):
Concurrency ≠ Multi-Core
Concurrency is a software/OS concept, not a hardware requirement. The OS can create the illusion of simultaneity through rapid context switching on a single core.
Single-Core Concurrent System:
Single CPU Core, Multiple Threads:
Time: 0-10ms 10-20ms 20-30ms 30-40ms
┌───────┐┌───────┐┌───────┐┌───────┐
Thread 1 │ RUN ││READY ││ RUN ││READY │
Thread 2 │ READY ││ RUN ││READY ││ RUN │
Thread 3 │ READY ││READY ││ RUN ││ RUN │
└───────┘└───────┘└───────┘└───────┘
OS scheduler rapidly switches threads, creating concurrency.
Examples of Single-Core Concurrency:
- Web Server on 1 CPU:
- Accept connection from client A (10ms)
- Block waiting for client A’s data
- Switch to client B’s thread (10ms)
- Switch to client C’s thread (10ms)
- Switch back to client A (data arrived)
- Result: Handle 100+ clients with 1 core
- Interactive Applications:
- Main UI thread processes input (user sees immediate response)
- Background thread does heavy computation
- OS switches between them
How Single-Core Concurrency Works:
void thread1_func() {
printf("Thread 1 start\n");
sleep(1); // ← OS switches to another thread here
printf("Thread 1 end\n");
}
void thread2_func() {
printf("Thread 2 executing\n");
}
// On single-core system:
// Thread1 prints "start", blocks on sleep
// OS says: "OK, switch to thread2"
// Thread2 runs, thread1 still sleeping
// Sleep timer expires, thread1 ready again
// OS switches back, thread1 prints "end"
Blocking Operations Enable Concurrency:
I/O operations naturally create concurrency opportunities:
Thread1: [do work] --BLOCK(disk read)-- [continue]
Thread2: [do work] --BLOCK(network)-- [continue]
Thread3: [do work] [continue]
While T1 blocked on disk, T2 and T3 can run!
While T2 blocked on network, T1 and T3 can run!
CPU-Bound vs I/O-Bound:
I/O-Bound (Concurrency beneficial on 1 core):
┌─────┐ ┌─────┐ ┌─────┐
│WORK ├─►│BLOCK├─►│WORK ├─ Low CPU usage, threads can interleave
└─────┘ └─────┘ └─────┘
CPU-Bound (No benefit from single-core concurrency):
┌────────────────────────────────┐
│WORK (no blocking) for 1 second ├─ Single thread will use 100% CPU
└────────────────────────────────┘
Single-Core Concurrent Scheduling:
OS kernel implements scheduler:
struct thread {
int thread_id;
int* stack;
int pc; // Program counter
int registers[8]; // CPU state
int state; // RUNNING, READY, BLOCKED
};
void context_switch(struct thread* from, struct thread* to) { // illustrative pseudocode
// Save 'from' thread's context
save_registers(from->registers);
from->pc = current_pc();
// Load 'to' thread's context
load_registers(to->registers);
jump_to(to->pc);
}
Why Single-Core Concurrency Works So Well:
- I/O Wait Time: Threads block waiting for I/O, freeing CPU for other threads
- Responsiveness: Even small tasks get CPU time regularly
- Fair Scheduling: Each thread gets turns at the CPU
- Resource Utilization: CPU not idle while threads wait for I/O
Conclusion:
Multiple CPU cores are not required for concurrency. They’re required for parallelism (true simultaneous execution). Single-core systems can be fully concurrent through OS-level context switching.
Problem 10: Parallelism and Hardware
To achieve parallelism, does the underlying hardware have to have multiple CPU cores? Why or why not?
Student Solution
Answer by Justin:
The underlying hardware must have multiple CPU cores. This is because if there aren’t multiple CPU cores, the tasks will end up being interleaved with one another, which is concurrency. For tasks to execute simultaneously, we need multiple cores so that each task can execute on its own core. This is not possible on a single-core CPU.
Elaboration (by LLM):
Parallelism Requires Multiple Cores
Parallelism is the execution of multiple tasks at the exact same moment in time. This is only possible with multiple CPU cores.
Why Single Core Can’t Achieve Parallelism:
Single Core Timeline:
┌──────┬──────┬──────┬──────┬──────┐
│Task1 │Task2 │Task3 │Task1 │Task2 │ ← Only one active at any instant
└──────┴──────┴──────┴──────┴──────┘
Time
At any point in time, only ONE task executing
Multi-Core System (True Parallelism):
Core 1: ┌─────────────────────────────────┐
│ Task1 executing │
└─────────────────────────────────┘
Core 2: ┌─────────────────────────────────┐
│ Task2 executing │
└─────────────────────────────────┘
Core 3: ┌─────────────────────────────────┐
│ Task3 executing │
└─────────────────────────────────┘
└─ At time T, all three are ACTUALLY running
Proof: Single Core Can’t Parallelize CPU-Bound Work
// Matrix multiplication: Pure CPU work, no I/O blocking
Single-threaded:
Time: 0 ............................ 10 seconds
[====== Matrix Math =======]
Multi-threaded on 1 core (concurrent but not parallel):
Time: 0 ............................ 10 seconds
[T1][T2][T3][T1][T2][T3]... (interleaved, same total time!)
Result: Still 10 seconds (no speedup)
Multi-threaded on 4 cores (parallel):
Time: 0 ...... 2.5 seconds
[T1][T2][T3][T4]
Result: ~2.5 seconds (4x speedup!)Sequential vs Parallel:
Sequential Code (1 CPU):
Task1: [==========]
Task2: [==========]
Total: 20 seconds
Concurrent Code (1 CPU):
Task1: [===][===][===]
Task2: [===][===][===]
Total: 20 seconds (same! just interleaved)
Parallel Code (2 CPUs):
Task1: [==========] }
Task2: [==========] } Both at same time
Total: 10 seconds
Amdahl’s Law: The Reality Check
Speedup formula with N cores:
Speedup = 1 / (S + P/N)
Where:
S = Sequential portion (cannot be parallelized)
P = Parallel portion
N = Number of cores
Example: 10% sequential, 90% parallel work
1 core: Speedup = 1 / (0.1 + 0.9/1) = 1.0x (baseline)
2 cores: Speedup = 1 / (0.1 + 0.9/2) = 1.82x
4 cores: Speedup = 1 / (0.1 + 0.9/4) = 3.08x
∞ cores: Speedup = 1 / 0.1 = 10x (limited by sequential part)
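The same formula in a few lines of C, reproducing the numbers above:
#include <stdio.h>
// Amdahl's Law: speedup with n cores, given sequential fraction s
double speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
int main(void) {
    int cores[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; i++)
        printf("%2d cores: %.2fx\n", cores[i], speedup(0.1, cores[i]));
    return 0;
}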
Hardware Context:
Single Core CPU:
┌────────────────────┐
│ ALU │ ← One arithmetic/logic unit
│ Registers │ ← One set per core
│ Cache L1/L2 │
└────────────────────┘
Multi-Core CPU:
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Core1: ALU │ Core2: ALU │ Core3: ALU │ Core4: ALU │
│ Regs Regs │ Regs Regs │ Regs Regs │ Regs Regs │
│ L1/L2 Cache │ L1/L2 Cache │ L1/L2 Cache │ L1/L2 Cache │
└──────────────┴──────────────┴──────────────┴──────────────┘
↓ ↓ ↓ ↓
Task1 Task2 Task3 Task4
(Parallel execution)
Modern CPUs:
- 2-core: Limited parallelism
- 4-core: Good for light parallelism
- 8-core: Excellent for workstation tasks
- 16-core: Data center / server grade
- 128-core: Supercomputing
Examples Requiring Multiple Cores:
- Video Encoding:
- Independent frames or segments can be encoded in parallel
- 4 cores = ~4x speedup
- Scientific Simulation:
- Grid computation (weather, physics)
- 8 cores = ~8x speedup
- Machine Learning:
- Matrix operations trivially parallel
- 64 cores / GPUs = 64x+ speedup
Key Takeaway:
For pure CPU-bound work with no blocking:
- Concurrency on 1 core = No speedup, just interleaving
- Parallelism on N cores = ~N-fold speedup (minus overhead)
Problem 11: Multi-Threading in Single-Core Machine
Give one reason why you will want to implement a multi-threaded application in a single-core machine.
Student Solution
Answer by Justin:
You may want to implement a multi-threaded application on a single-core machine so that parts of a problem can make progress while other parts wait, for example overlapping I/O waits or keeping a UI responsive, even though there is no raw CPU speedup on one core.
Elaboration (by LLM):
Concurrency Benefits on Single-Core:
While single-core threading can’t speed up pure CPU work, it provides critical benefits through concurrency:
1. Responsiveness (Primary Benefit)
// Bad: Single-threaded, blocking I/O
int main() {
printf("Downloading file...");
download_file(); // Blocks for 5 seconds
printf("Done!\n"); // User waits 5 seconds
// UI frozen during download
}
// Good: Multi-threaded, non-blocking
struct {
int download_progress;
} state;
void* download_thread(void* arg) {
download_file(); // Blocks for 5 seconds
state.download_progress = 100;
return NULL;
}
int main() {
pthread_t tid;
pthread_create(&tid, NULL, download_thread, NULL);
// Main thread continues immediately
while (state.download_progress < 100) {
printf(".\r"); // Show progress
usleep(100000);
}
// User sees responsive UI despite 5-second operation
}
2. I/O-Bound Operations (Waiting Time Reuse)
Sequential I/O (single-threaded):
┌─────────┐ ┌─────────┐ ┌─────────┐
│Read 1 │ │Read 2 │ │Read 3 │
│(1 sec) │ │(1 sec) │ │(1 sec) │
└─────────┘ └─────────┘ └─────────┘
Total: 3 seconds
Concurrent I/O (multi-threaded on 1 core):
Thread1: [Read1..... ]
(blocked waiting for disk)
Thread2: [Read2..... ]
(blocked)
Thread3: [Read3..... ]
Total: ~1.1 seconds (much faster!)
Why? While Thread1 blocked on disk I/O, Thread2 and Thread3 can start their I/O
3. Structured Parallelism
Even without execution speed improvement, multi-threading models the problem better:
// Web Server on 1 core
void* handle_client(void* socket) {
// Each client gets own thread context
// Easy to understand: "this thread handles this client"
read_request(); // Blocks, thread sleeps
process_request();
send_response(); // Blocks, thread sleeps
// While blocked, OS runs other client threads
}
// Single-threaded equivalent would need manual state machine
// (very hard to understand)
4. Practical Single-Core Multi-Threaded Uses
Example: Background Tasks
// Main thread: Responsive UI
while (true) {
display_ui();
handle_user_input();
}
// Background thread: Long operation (doesn't freeze UI)
static void* background_work(void* arg) {
expensive_computation();
return NULL;
}
// Start background task
pthread_t bg;
pthread_create(&bg, NULL, background_work, NULL);
// Main thread continues; UI stays responsive
Example: Server Handling Clients
// Single-threaded (sequential - very slow):
while (true) {
socket s = accept_connection(); // Wait for client A
process_request(s); // Process for 1 second
send_response(s); // Send for 0.5 seconds
close(s);
// Next client must wait 1.5 seconds
}
// Multi-threaded (concurrent - responsive):
while (true) {
socket s = accept_connection();
pthread_create(..., handle_client, s); // sketch: spawn one handler thread per client
// Accept next client immediately!
}
// Each client's thread blocks on I/O
// While one blocked, others can run
The Critical Insight:
I/O Operations Often Involve Waiting:
• Network read: Thread blocks for 100ms
• Disk read: Thread blocks for 10ms
• User input: Thread blocks indefinitely
While one thread waits, another can make progress!
Speedup Comparison:
Task: 3 network requests, each 1 second
Sequential: [Network1][Network2][Network3] = 3 seconds
Concurrent(1 core):
[Network1 ---- blocked ---]
[Network2 ---- blocked ---]
[Network3 ---- blocked ---]
= ~1 second
Bottom Line:
Multi-threading on single-core is valuable for:
- Responsiveness: Long operations don’t freeze the UI
- I/O Efficiency: Many threads can wait for I/O simultaneously
- Code Clarity: Model problem structure naturally
- Resource Utilization: CPU not idle while threads wait
It’s NOT for raw execution speed on CPU-intensive work.
Problem 12: Many-to-One Thread Mapping
Consider an OS where the threads are implemented by a user-level thread library and all threads of a process are mapped to just one kernel thread (many-to-one mapping). Consider a process that has two threads: T1 and T2, and T1 is running.
True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.
Student Solution
Answer by Justin:
False.
When T1 blocks while executing an I/O system call, the OS cannot switch to T2 and start running it! This is because in a many-to-one mapping, if a thread is blocked while executing an I/O system call, all other threads are also blocked.
Elaboration (by LLM):
Many-to-One Thread Mapping Architecture:
User-Level Threads: Kernel Thread:
T1 ┐
T2 ├─────────────→ Kernel Thread 1
T3 │
T4 ┘
All user-level threads mapped to single kernel thread
The Problem: Blocking at Kernel Level
// T1 makes system call
T1: read(fd, buffer, 1024); // ← System call blocks
At kernel level:
┌─────────────────────────────────┐
│ Kernel Thread 1 BLOCKED │ ← Waiting for I/O
│ (waiting for disk/network) │
└─────────────────────────────────┘
Meanwhile in user space:
T1: BLOCKED (part of kernel thread)
T2: RUNNABLE (but stuck - kernel thread blocked!)
T3: RUNNABLE (but stuck - kernel thread blocked!)
T4: RUNNABLE (but stuck - kernel thread blocked!)
Why Can’t T2 Run?
The scheduler only sees one kernel thread:
OS Scheduler:
┌─────────────────────────────────────┐
│ Process A: │
│ Kernel Thread 1: BLOCKED (I/O) │
│ │
│ Process B: │
│ Kernel Thread 1: RUNNABLE │
└─────────────────────────────────────┘
Scheduler decision: Run Process B's thread
Result: T2 still doesn't run! All user-level threads stuck
Timeline Example:
Time 0-100ms:
T1 calls read() system call
├─ Kernel thread transitions to BLOCKED
├─ T2, T3, T4 all BLOCKED (indirectly)
└─ Disk I/O in progress...
Time 100-150ms:
OS schedules other processes
T2, T3, T4 still cannot run
Even though they're READY
Time 150ms:
Disk I/O completes
Kernel thread becomes READY
Scheduler runs kernel thread again
├─ T1 resumes from read()
└─ T2, T3, T4 can only run after T1 yields
The Critical Issue: Blocking at Wrong Level
User Library (Green Threads):
┌─────────────────────────────────┐
│ T1, T2, T3, T4 │ ← User-level scheduler
│ (User-level thread scheduler) │ ← Can manage these
└─────────────────────────────────┘
Kernel Level:
┌─────────────────────────────────┐
│ Only sees: 1 kernel thread │ ← OS sees ONE thread
│ (user library's scheduler info) │ ← OS unaware of 4 user threads
│ │
│ When kernel thread BLOCKS, │
│ user library loses control! │
└─────────────────────────────────┘
Visual Comparison:
Wanted (what user library tried to do):
T1: [=====I/O BLOCK=====]
T2: [====run====] ← Switch while T1 blocked
T3: [====run====]
What Actually Happens:
Kernel_Thread: [=====BLOCKED=====][T1 continues][T2][T3]
↑ T2, T3 stuck!
Why Many-to-One Fails:
- Blocking is at kernel level - OS scheduler doesn’t know about user threads
- No visibility - Kernel sees only one thread, doesn’t know others exist
- Lost control - User-level scheduler can’t run when kernel thread blocked
- Cascade failure - One I/O call blocks entire process
Code Example:
// Many-to-one mapping:
// User library: Green threads (T1, T2, T3)
// Kernel: Single kernel thread
void* thread1(void* arg) {
printf("T1 start\n");
read(fd, buffer, 1024); // ← Blocks at kernel level!
// All other threads blocked here!
printf("T1 end\n");
}
void* thread2(void* arg) {
printf("T2 running\n"); // ← Never reaches here while T1 blocked
}
int main() {
start_thread(thread1);
start_thread(thread2);
// When T1 calls read():
// ├─ Kernel thread blocked
// ├─ T2 cannot run (stuck with T1)
// └─ Entire process appears frozen
}
Historical Context:
Many-to-one mapping (used, for example, in early Solaris green threads and early Java virtual machines) suffered from this problem. Any blocking I/O in one thread froze the entire process.
Solution: One-to-One or Many-to-Many Mapping (see Problem 13)
Problem 13: One-to-One Thread Mapping
Consider an OS where each user-level thread has a corresponding kernel-level thread (one-to-one mapping) as in Linux. Consider a process that has two threads, T1 and T2 and T1 is running.
True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.
Student Solution
Answer by Justin:
True.
When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. This is because in a one-to-one mapping, each user-level thread is mapped to a kernel-level thread. This means that if one thread gets blocked, all other threads will not be affected, allowing the OS to switch to them and start running!
Elaboration (by LLM):
One-to-One Thread Mapping Architecture:
User-Level Threads: Kernel Threads:
T1 ──────────────────→ Kernel Thread 1
T2 ──────────────────→ Kernel Thread 2
T3 ──────────────────→ Kernel Thread 3
T4 ──────────────────→ Kernel Thread 4
Each user thread has dedicated kernel thread
The Solution: Independent Kernel Threads
// T1 makes system call
T1: read(fd, buffer, 1024); // ← System call blocks
At kernel level:
Kernel Thread 1: BLOCKED (waiting for I/O)
Kernel Thread 2: RUNNABLE (independent!)
Kernel Thread 3: RUNNABLE (independent!)
Kernel Thread 4: RUNNABLE (independent!)
OS Scheduler sees 4 independent kernel threads
Decision: Run Kernel Thread 2 (for T2)
Result: T2 RUNS while T1 blocked!
Why T2 CAN Run:
OS scheduler has visibility into all kernel threads:
OS Scheduler:
┌─────────────────────────────────────┐
│ Process A: │
│ Kernel Thread 1: BLOCKED (I/O) │
│ Kernel Thread 2: RUNNABLE ← Can run this!
│ Kernel Thread 3: RUNNABLE ← Or this!
│ Kernel Thread 4: RUNNABLE ← Or this!
│ │
│ Process B: │
│ Kernel Thread 1: RUNNABLE │
└─────────────────────────────────────┘
Scheduler: "K-Thread 1 blocked, run K-Thread 2"
Result: T2 executes immediately
Timeline Example:
Time 0-50ms:
T1 calls read() system call
├─ Kernel Thread 1 transitions to BLOCKED
└─ Other kernel threads remain RUNNABLE
Time 50-51ms: ← CRITICAL DIFFERENCE
OS scheduler notices K-Thread 1 blocked
Scheduler selects K-Thread 2 (for T2)
├─ K-Thread 2 scheduled on CPU
├─ T2 RUNS immediately
└─ T1 still waiting for I/O
Time 100ms:
Disk I/O completes for T1
├─ K-Thread 1 becomes READY
├─ T1 can resume
└─ Both T1 and T2 can now run (on different cores or time-sliced)
Visual Comparison: One-to-One vs Many-to-One
MANY-TO-ONE (Problem 12):
Kernel_Thread: [====BLOCKED====][T1 continues][T2][T3]
↑ T2, T3 stuck!
ONE-TO-ONE (Problem 13):
K_Thread 1: [====BLOCKED====] [K_Thread 1 resumes]
K_Thread 2: [====T2 runs====]
K_Thread 3: [====T3 runs====]
K_Thread 4: [====T4 runs====]
↑ All threads progressing concurrently!
Why One-to-One Works:
- OS Visibility: Kernel sees each thread independently
- Independent Blocking: One thread’s I/O doesn’t affect others
- Scheduler Control: OS can schedule any ready kernel thread
- True Concurrency: Multiple threads actually progressing
Code Example:
// One-to-one mapping (Linux pthreads, modern systems):
// Each pthread_t has dedicated kernel thread
void* thread1(void* arg) {
printf("T1 start\n");
read(fd, buffer, 1024); // ← K-Thread 1 blocks
// K-Thread 2, 3, 4 can still run!
printf("T1 end\n");
}
void* thread2(void* arg) {
printf("T2 running\n"); // ← Executes while T1 blocked!
}
int main() {
pthread_t t1, t2;
pthread_create(&t1, NULL, thread1, NULL);
pthread_create(&t2, NULL, thread2, NULL);
// When T1 calls read():
// ├─ K-Thread 1 blocked
// ├─ Scheduler switches to K-Thread 2
// ├─ T2 runs immediately
// └─ Both threads progressing concurrently
pthread_join(t1, NULL);
pthread_join(t2, NULL);
}
Real-World Impact:
Web Server Example: Handling 100 clients
Many-to-One (Problem 12):
One client's network I/O blocks entire server
Result: Server can't handle other clients
Throughput: 1 client effectively at a time
One-to-One (Problem 13):
Client A's I/O blocked → Server handles Client B
Client B's I/O blocked → Server handles Client C
Result: Server handles many clients concurrently
Throughput: ~100 clients simultaneously
Linux Implementation:
Linux uses one-to-one mapping:
- Each pthread_t is a lightweight process (LWP)
- Each LWP has a dedicated kernel thread
- Kernel scheduler sees and manages all LWPs independently
- Result: True POSIX thread semantics
Trade-offs:
One-to-One Advantages:
✅ True concurrency (one thread's I/O doesn't block others)
✅ Better utilization (CPU not idle while threads wait)
✅ Simpler implementation (kernel does scheduling)
✅ Standard behavior (POSIX pthreads)
One-to-One Disadvantages:
❌ More memory (each thread needs kernel resources)
❌ More context switches (more kernel threads to manage)
❌ Limited threads (can create fewer threads than many-to-one)
Conclusion:
One-to-one mapping solves the blocking problem by giving each user thread its own kernel thread. The OS scheduler can now make independent decisions about which threads to run, enabling true concurrency even on single-core systems.
Problem 14: Fork with Multi-Threaded Process
Consider a process with 4 threads in Linux. When a thread invokes fork() system call to create a new process, how many threads will the new process have? Briefly explain.
Student Solution
Answer by Justin:
When a thread invokes the fork() system call to create a new process, the new process will have one thread. This is because this newly forked process will not clone the threads of the previous process.
Elaboration (by LLM):
Why Fork Only Copies One Thread:
When fork() is called, the OS creates a completely new process with a copy of the address space, but it does NOT copy all the threads. Only the thread that called fork() is replicated in the child process.
Process Before Fork:
Parent Process (4 threads):
┌─────────────────────────────────────┐
│ Thread 1 │
│ Thread 2 (calls fork()) ← HERE │
│ Thread 3 │
│ Thread 4 │
└─────────────────────────────────────┘
After Fork:
Parent Process (still 4 threads): Child Process (1 thread):
┌─────────────────────────────────────┐ ┌──────────────────────┐
│ Thread 1 │ │ Thread 2 (copy) │
│ Thread 2 (original) │ │ (isolated from T1,3,4)
│ Thread 3 │ │ │
│ Thread 4 │ │ │
└─────────────────────────────────────┘ └──────────────────────┘
Code Example:
void* thread_func(void* arg) {
printf("This is thread %d\n", (int)arg);
sleep(2); // Long operation
return NULL;
}
int main() {
pthread_t t1, t2, t3;
// Create 3 additional threads
pthread_create(&t1, NULL, thread_func, (void*)1);
pthread_create(&t2, NULL, thread_func, (void*)2);
pthread_create(&t3, NULL, thread_func, (void*)3);
sleep(1);
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("Child process has how many threads? Only 1!\n");
printf("Threads 1, 2, 3 don't exist in child\n");
exit(0);
} else {
// Parent process
printf("Parent still has 4 threads\n");
pthread_join(t1, NULL);
pthread_join(t2, NULL);
pthread_join(t3, NULL);
wait(NULL); // Wait for child
}
}
Why This Design?
- Thread State Complexity: Other threads might hold mutexes, be in critical sections, or have inconsistent state
- Resource Clarity: Child process gets clean, isolated resources
- Deadlock Prevention: Avoids situations where child inherits held locks from parent
Problem Scenario:
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void* thread_with_lock(void* arg) {
pthread_mutex_lock(&lock);
// Do some work
sleep(10);
pthread_mutex_unlock(&lock);
}
int main() {
pthread_t tid;
pthread_create(&tid, NULL, thread_with_lock, NULL);
sleep(1); // Let thread acquire lock
fork();
// In child: lock is HELD but thread doesn't exist!
// Any attempt to lock() = DEADLOCK
}
Memory Layout:
Parent Memory (before fork): Child Memory (after fork):
┌──────────────────────────┐ ┌──────────────────────────┐
│ Code Segment (shared) │ ──────► │ Code Segment (copy) │
│ Global Data │ ──────► │ Global Data (copy) │
│ Heap │ ──────► │ Heap (copy) │
│ Stack (4 stacks): │ │ Stack (1 stack): │
│ T1 stack │ │ T2 stack (copy) │
│ T2 stack (main) │ ──────► │ │
│ T3 stack │ │ (T1, T3, T4 stacks not │
│ T4 stack │ │ copied) │
└──────────────────────────┘ └──────────────────────────┘
Key Point:
fork() creates a snapshot of the calling thread’s state; other threads are NOT copied. If you need a multi-threaded child, the child must create its own threads after fork() (or fork and then exec a fresh program).
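For the held-lock hazard shown above, POSIX provides pthread_atfork() to re-establish lock invariants across fork(). A minimal sketch of the documented pattern (handler names are illustrative):
#include <pthread.h>
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void prepare(void) { pthread_mutex_lock(&lock); }   // runs in the parent just before fork
static void parent(void)  { pthread_mutex_unlock(&lock); } // runs in the parent after fork
static void child(void)   { pthread_mutex_unlock(&lock); } // runs in the child after fork
int main(void) {
    pthread_atfork(prepare, parent, child); // register once; applies to every subsequent fork()
    /* ... create threads and call fork() as needed ... */
    return 0;
}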
Problem 15: Exec with Multi-Threaded Process
Consider a process with 4 threads in Linux. When a thread invokes exec() system call to load a new executable on top of the current process, how many threads will the new process have? Briefly explain.
Student Solution
Answer by Justin:
When a thread invokes the exec() system call to load a new executable on top of the current process, the process will have just one thread, which executes the program passed to the system call; the previous program image and all other threads are destroyed.
Elaboration (by LLM):
Exec Replaces Entire Process Image
When any thread calls exec(), the entire process image is replaced - all threads are terminated, and a single new thread starts executing the new program.
Process Before Exec:
Process (4 threads):
┌─────────────────────────────────────┐
│ Thread 1 │
│ Thread 2 (calls exec()) ← HERE │
│ Thread 3 │
│ Thread 4 │
└─────────────────────────────────────┘
Address space: Old program binary
After Exec:
Process (1 thread):
┌─────────────────────────────────────┐
│ Main Thread (NEW PROGRAM) │
│ (Threads 1,2,3,4 ALL TERMINATED) │
│ │
│ │
└─────────────────────────────────────┘
Address space: NEW program binary
Old threads: DESTROYED
Code Example:
void* thread_func(void* arg) {
int thread_num = (int)(long)arg; // cast via long to match pointer width
printf("Thread %d started\n", thread_num);
if (thread_num == 2) {
sleep(1);
printf("Thread 2 calling exec...\n");
execl("/bin/ls", "ls", "-l", NULL);
// Code here NEVER executes (exec replaces process)
printf("This line never prints\n");
} else {
sleep(10); // Simulate work
printf("Thread %d still running\n", thread_num);
}
return NULL;
}
int main() {
pthread_t t1, t2, t3;
pthread_create(&t1, NULL, thread_func, (void*)1);
pthread_create(&t2, NULL, thread_func, (void*)2);
pthread_create(&t3, NULL, thread_func, (void*)3);
pthread_join(t1, NULL); // Will never return (process replaced!)
pthread_join(t2, NULL);
pthread_join(t3, NULL);
}
Output:
Thread 1 started
Thread 2 started
Thread 3 started
Thread 2 calling exec...
[Directory listing from ls command]
(Process terminates)
Note: Threads 1 and 3 never print their completion messages
Key Differences: Fork vs Exec vs Thread Creation
| Operation | Parent Process | New Process/Thread | Other Threads |
|---|---|---|---|
| fork() | Continues running | Created with 1 thread | Stay in parent |
| exec() | REPLACED (terminates) | Created with 1 thread | ALL TERMINATED |
| pthread_create() | Continues running | New thread in SAME process | Continue in same process |
Why Exec Kills All Threads:
- Program Replacement: Entire memory layout changes
- Resource Cleanup: All thread-local storage, stacks invalidated
- State Corruption: Old thread stacks no longer accessible
- No Recovery: Can’t resume old threads with new binary
Visualization:
Memory Before Exec:
┌─────────────────────────┐
│ Old Program Code │
│ Old Global Variables │
│ Heap │
│ Stacks (4 threads): │
│ T1 stack ┐ │
│ T2 stack ├─ Being used │
│ T3 stack │ │
│ T4 stack ┘ │
└─────────────────────────┘
After exec("/new/program", ...):
┌─────────────────────────┐
│ NEW Program Code │ ← Completely different!
│ NEW Global Variables │
│ NEW Heap (empty) │
│ Main Thread Stack │ ← Only one stack
│ │
│ │
│ │
└─────────────────────────┘
Old stacks: DESTROYED
Resource Management:
When exec() is called:
- File descriptors remain OPEN (inherit to new process)
- Mutex locks vanish with the old image (a PROCESS-SHARED mutex in shared memory, however, stays locked with no owner to unlock it)
- Thread-local storage is DESTROYED
- Memory mappings are REPLACED
Example - Resource Leak:
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int main() {
pthread_mutex_lock(&lock);
execl("/bin/echo", "echo", "hello", NULL);
// Lock is never unlocked - harmless here, since exec destroys the old image
// (but a process-shared mutex in shared memory would stay locked with no owner)
pthread_mutex_unlock(&lock); // Never reached
}
Safe Pattern:
If you need to exec() from a multi-threaded process:
int main() {
pid_t pid = fork();
if (pid == 0) {
// Child process - safe to exec
// Other threads only existed in parent
execl("/new/program", "args", NULL);
} else {
// Parent continues with all threads
wait(NULL); // Wait for child
}
}
Conclusion:
exec() is a process replacement, not just a program loader. All threads (except the calling thread, which gets replaced) are terminated instantly. This is why exec is usually preceded by fork() - fork creates a new process, then exec replaces only that new process’s image.
Problem 16: Thread Execution Output
What’s the output of the following code when run in a Unix system? Assume you have 2 functions:
(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread
(2) JoinThread(int threadId) is used to wait for the termination of a thread.
int g = 0;
int tids[2];
void T(int arg) {
if (arg == 3) {
g++;
printf("[MainThread] g: %d\n", g);
}
if (arg == 1) {
g++;
printf("[Thread1] g: %d\n", g);
}
if (arg == 0) {
JoinThread(tids[1]);
g++;
printf("[Thread0] g: %d\n", str, g);
}
}
main() {
tids[1] = CreateThread(T, 1);
tids[0] = CreateThread(T, 0);
JoinThread(tids[0]);
T(3);
}
Student Solution
Answer by Justin:
[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3
Elaboration (by LLM):
Execution Trace:
This problem demonstrates thread synchronization and shared state. Let’s trace the execution step-by-step:
Timeline:
Time 0: Main executes
tids[1] = CreateThread(T, 1); // Creates Thread1
tids[0] = CreateThread(T, 0); // Creates Thread0
JoinThread(tids[0]); // Main blocks waiting for Thread0
Time 1: Thread1 starts (runs T(1))
Prints: "[Thread1] g: 1"
g becomes 1
Thread1 terminates
Time 2: Thread0 starts (runs T(0))
Calls: JoinThread(tids[1]); // Waits for Thread1 (already done!)
Returns immediately
g++; (g becomes 2)
Prints: "[Thread0] g: 2"
Thread0 terminates
Time 3: Main's JoinThread(tids[0]) returns
T(3) is called
g++; (g becomes 3)
Prints: "[MainThread] g: 3"
Key Synchronization Points:
Main Thread Thread0 Thread1
| | |
CreateThread(T, 1) ─────────────────────────────────► T(1)
| |
CreateThread(T, 0) ───────────────► T(0) |
| | g++
JoinThread(tids[0]) {BLOCKS} | printf g:1
| JoinThread(1) |
| (waiting) | (waits for T1) |
| | (done)|
| (returns) |
| g++ |
| printf g:2 |
| | (done) |
|◄────────── (returns) ◄─ |
|
T(3)
g++
printf g:3
Shared State Evolution:
Initial: g = 0
After T(1): g = 1 [Thread1] g: 1
After T(0): g = 2 [Thread0] g: 2
After T(3): g = 3 [MainThread] g: 3
Why This Order?
- Thread0 waits for Thread1 via JoinThread(tids[1]), so Thread1 increments and prints first
- Main waits for Thread0 via JoinThread(tids[0]), so Thread0 goes second
- Main runs T(3) last
Could the Order Differ?
No. Although thread scheduling is nondeterministic in general, the joins pin the order down here:
- Thread0 blocks in JoinThread(tids[1]) before touching g, so Thread1 always increments first
- Main blocks in JoinThread(tids[0]) before calling T(3), so Thread0 always increments second
- Main stores tids[1] before Thread0 is created, so Thread0 never reads an unset thread id
Even if Thread0 is scheduled before Thread1, it immediately waits, so the output
[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3
is deterministic.
Important Detail:
The JoinThread(tids[1]) in Thread0 is crucial. It forces Thread0 to wait for Thread1 to complete before continuing. This ensures:
- Thread1 increments g first
- Thread0 increments g second
- Main increments g third
Memory Visibility:
All threads share the same g variable:
int g = 0; // Global - shared by all threads
Changes by one thread are immediately visible to all others (memory is shared in a multi-threaded process).
Mutex-Free Synchronization:
This example uses JoinThread() for synchronization rather than mutexes. The join() forces a specific execution order rather than protecting data.
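A runnable pthread translation of the exercise’s pseudo-API (pthread_create/pthread_join in place of CreateThread/JoinThread, with the stray str argument dropped):
#include <pthread.h>
#include <stdio.h>
int g = 0;
pthread_t tids[2];
void* T(void* arg) {
    long id = (long)arg;
    if (id == 3) { g++; printf("[MainThread] g: %d\n", g); }
    if (id == 1) { g++; printf("[Thread1] g: %d\n", g); }
    if (id == 0) {
        pthread_join(tids[1], NULL); // wait for Thread1 before touching g
        g++;
        printf("[Thread0] g: %d\n", g);
    }
    return NULL;
}
int main(void) {
    pthread_create(&tids[1], NULL, T, (void*)(long)1); // tids[1] is set before Thread0 exists
    pthread_create(&tids[0], NULL, T, (void*)(long)0);
    pthread_join(tids[0], NULL);
    T((void*)(long)3); // main runs the arg==3 branch itself
    return 0;
}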
Problem 17: Thread Execution Output (Variant)
What’s the output of the following code when run in a Unix system? Assume you have 2 functions:
(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread
(2) JoinThread(int threadId) is used to wait for the termination of a thread.
int g = 0;
int tids[2];
void T(int arg) {
if (arg == 3) {
g++;
printf("[MainThread] g: %d\n", g);
}
if (arg == 1) {
g++;
printf("[Thread1] g: %d\n", g);
}
if (arg == 0) {
JoinThread(tids[1]);
g++;
printf("[Thread0] g: %d\n", str, g);
}
}
main() {
tids[1] = CreateThread(T, 1);
tids[0] = CreateThread(T, 0);
JoinThread(tids[0]);
T(3);
}
LLM Solution
[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3
Elaboration (by LLM):
Same Code, Same Output
This problem is identical to Problem 16 (same code, same expected output). The output depends on thread scheduling and join synchronization.
Why This Output?
The output is determined by the JoinThread() calls which enforce execution order:
main: CreateThread(T, 1) → starts Thread1
main: CreateThread(T, 0) → starts Thread0
main: JoinThread(tids[0]) → blocks waiting for Thread0
Thread1: runs T(1)
g++ → g = 1
printf "[Thread1] g: 1"
returns
Thread0: runs T(0)
JoinThread(tids[1]) → waits for Thread1 (already done)
returns immediately
g++ → g = 2
printf "[Thread0] g: 2"
returns
main: JoinThread(tids[0]) returns
T(3)
g++ → g = 3
printf "[MainThread] g: 3"
Guaranteed Order:
The synchronization guarantees this specific order:
- Thread1 before Thread0: Thread0 calls JoinThread(tids[1]), forcing it to wait
- Thread0 before Main: Main calls JoinThread(tids[0]), forcing it to wait
- Sequential increment: g = 1 → 2 → 3
Comparison with Problem 16:
Problems 16 and 17 have identical code and identical output. This variant exists to emphasize that:
- Thread creation order doesn’t guarantee execution order
- Synchronization primitives (join) enforce ordering
- Data races avoided through proper synchronization
Problem 18: Parallel Summation
Consider parallel summation: You are given an array A[0..N-1], and you are asked to compute the sum of the elements of this array in parallel as follows: The main thread will create 4 threads, each thread will compute the sum of N/4 elements of the array and terminate. The main thread will then wait for the termination of the worker threads, compute the overall sum, print it on the screen and terminate. Assume you have 2 functions:
(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread
(2) JoinThread(int threadId) is used to wait for the termination of a thread. Assume that array A is global and N is divisible by 4.
#include <stdio.h>
#define N ...
int A[N]; // Array to be summed
Student Solution
Answer by Justin:
#include <stdio.h>
#define N ...
int A[N];
int partialSum[4];
void sum(int id) {
int s = 0;
int start = id * (N/4);
int end = start + (N/4);
for (int i = start; i < end; i++) {
s += A[i];
}
partialSum[id] = s;
}
int main() {
int threads[4];
for (int i = 0; i < 4; i++) {
threads[i] = CreateThread(sum, i);
}
for (int i = 0; i < 4; i++) {
JoinThread(threads[i]);
}
int s = 0;
for (int i = 0; i < 4; i++) {
s += partialSum[i];
}
printf("The sum is %d", s);
exit(0);
}
Elaboration (by LLM):
Parallel Summation Strategy:
This solution divides the array into 4 equal parts and assigns each part to a separate thread. Each thread independently computes a partial sum, then the main thread combines the results.
Thread Work Distribution:
Array A[0..N-1]:
T1 computes: A[0] + A[1] + ... + A[N/4-1] → partialSum[0]
T2 computes: A[N/4] + A[N/4+1] + ... + A[N/2-1] → partialSum[1]
T3 computes: A[N/2] + ... + A[3N/4-1] → partialSum[2]
T4 computes: A[3N/4] + ... + A[N-1] → partialSum[3]
Main combines: partialSum[0] + partialSum[1] + partialSum[2] + partialSum[3]
Memory Layout:
Shared Memory (visible to all threads):
┌────────────────────────────────────┐
│ int A[N] (shared) │
│ int partialSum[4] (shared) │
│ int threads[4] (main's stack) │
└────────────────────────────────────┘
Thread 0 Private: Thread 1 Private: Thread 2 Private: Thread 3 Private:
┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ id = 0 │ │ id = 1 │ │ id = 2 │ │ id = 3 │
│ s = (local sum)│ │ s = (local sum)│ │ s = (local sum)│ │ s = (local sum)│
│ start = 0 │ │ start = N/4 │ │ start = N/2 │ │ start = 3N/4 │
│ end = N/4 │ │ end = N/2 │ │ end = 3N/4 │ │ end = N │
└────────────────┘ └────────────────┘ └────────────────┘ └────────────────┘
Execution Timeline:
Time 0:
Main: CreateThread(sum, 0) → T1 starts
Main: CreateThread(sum, 1) → T2 starts
Main: CreateThread(sum, 2) → T3 starts
Main: CreateThread(sum, 3) → T4 starts
Main: JoinThread(threads[0])... (waits for all)
Time 1-T:
T1: sum = A[0] + A[1] + ... + A[N/4-1] (uses local 's' variable)
T2: sum = A[N/4] + A[N/4+1] + ... + A[N/2-1] (uses local 's' variable)
T3: sum = A[N/2] + ... + A[3N/4-1] (uses local 's' variable)
T4: sum = A[3N/4] + ... + A[N-1] (uses local 's' variable)
Time T+1:
T1: partialSum[0] = s; (write to shared memory)
T2: partialSum[1] = s;
T3: partialSum[2] = s;
T4: partialSum[3] = s;
(All threads terminate)
Time T+2:
Main: (JoinThread calls return)
Main: s = 0
Main: s += partialSum[0] + partialSum[1] + partialSum[2] + partialSum[3]
Main: printf("The sum is %d", s)
Key Design Patterns:
1. No Synchronization Needed on Reads
Each thread works on a different part of the array:
Thread0: elements [0, N/4)
Thread1: elements [N/4, N/2)
Thread2: elements [N/2, 3N/4)
Thread3: elements [3N/4, N)
No overlap → No race conditions on reading A[]
2. Single Write Per Thread
Each thread writes to a different partialSum element:
Thread0 writes only to partialSum[0]
Thread1 writes only to partialSum[1]
Thread2 writes only to partialSum[2]
Thread3 writes only to partialSum[3]No contention → No mutex needed
3. Synchronization via Join
Main thread waits for all workers:
for (int i = 0; i < 4; i++) {
JoinThread(threads[i]); // Wait for thread to finish
}
// Now safe to read all partialSum values
Performance Analysis:
Without parallelism:
Sequential: O(N) operations on single thread
Time: N operations
With 4 threads:
Parallel: Each thread does N/4 operations
Time: ~N/4 + overhead (near-linear speedup)
Speedup: ~4x on 4-core system
Amdahl’s Law Application:
Sequential portion (main thread combining): ~4 additions
Parallel portion (thread work): ~N-4 additions
Speedup = 1 / (epsilon + (1-epsilon)/4) where epsilon ≈ 4/N
For large N: Speedup ≈ 4x
For small N: Speedup less than 4x
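Worked example: for N = 1,000,000, epsilon = 4/1,000,000 = 0.000004, so
Speedup = 1 / (0.000004 + 0.999996/4) ≈ 1 / 0.250003 ≈ 3.99995
The serial combine step is negligible at this scale, which is why the 4x figure above is a good approximation for large N.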
Potential Issues and Improvements:
Issue 1: False Sharing
If partialSum array elements are on same cache line:
┌─────────────────────────────────────────────────────────────────┐
│ Cache Line (64 bytes)                                           │
│ [partialSum[0]] [partialSum[1]] [partialSum[2]] [partialSum[3]] │
└─────────────────────────────────────────────────────────────────┘
All threads writing to same cache line → contention
Solution:
// Pad array to avoid false sharing
typedef struct {
int sum;
char padding[64 - sizeof(int)]; // Align to cache line
} aligned_sum;
aligned_sum partialSum[4];
Issue 2: Thread Creation Overhead
Creating 4 threads has overhead. For very small N, sequential might be faster.
Break-even: Usually N > 10,000 elements
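Returning to Issue 1 for a moment: if C11 is available, the manual padding above can be replaced by an alignment specifier. A minimal sketch, still assuming a 64-byte cache line:
#include <stdalign.h>

/* C11 alternative to manual padding: each element starts on its
   own 64-byte cache line, so sizeof(aligned_sum) == 64. */
typedef struct {
    alignas(64) int sum;
} aligned_sum;

aligned_sum partialSum[4];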
Complete Example with Different Array Values:
// Example: Array with 1000000 elements
#define N 1000000
int A[N];
// Initialize array
for (int i = 0; i < N; i++) {
A[i] = i + 1; // Sum = 1+2+...+1000000 = 500000500000 (note: exceeds 32-bit int range; a runnable version would use long long accumulators)
}
// Thread 0 computes: 1+2+...+250000 = 31,250,125,000
// Thread 1 computes: 250001+...+500000 = 93,750,125,000
// Thread 2 computes: 500001+...+750000 = 156,250,125,000
// Thread 3 computes: 750001+...+1000000 = 218,750,125,000
// Total: 500,000,500,000
Problem 19: Multi-Threaded Minimum Element
Write a multi-threaded program to find minimum element of an array A[0..N-1] with two threads. The first thread should find the minimum of A[0..N/2] and the second thread should find the minimum of A[N/2+1..N-1]. Each thread should start immediately upon being created and each must execute the same function to find the minimum. After threads have been terminated, the minimum element of the array should be printed by the main thread. Assume that array A is global and N is divisible by 2.
Correction to the problem statement:
- The first thread should find the minimum of A[0..N/2-1] and the second thread should find the minimum of A[N/2..N-1]
#include <stdio.h>
#define N ...
int A[N]; // Array for which the minimum is to be computed
Student Solution
Answer by Justin:
#include <stdio.h>
#define N ...
int A[N];
int smallest[2];
void findMin(int id) {
int start = id * (N/2);
int end = start + (N/2);
int small = A[start];
for (int i = start + 1; i < end; i++) {
if (A[i] < small) {
small = A[i];
}
}
smallest[id] = small;
}
int main() {
int threads[2];
for (int i = 0; i < 2; i++) {
threads[i] = CreateThread(findMin, i);
}
for (int i = 0; i < 2; i++) {
JoinThread(threads[i]);
}
if (smallest[0] < smallest[1]) {
printf("The smallest number is %d", smallest[0]);
} else {
printf("The smallest number is %d", smallest[1]);
}
exit(0);
}
Elaboration (by LLM):
Parallel Minimum Finding:
This solution divides the array into 2 halves and finds the minimum in each half using separate threads, then combines the results.
Thread Work Distribution:
Array A[0..N-1]:
Thread 0:
start = 0 * (N/2) = 0
end = 0 + (N/2) = N/2
Searches: A[0], A[1], ..., A[N/2-1]
Result: smallest[0] = min(A[0..N/2-1])
Thread 1:
start = 1 * (N/2) = N/2
end = N/2 + (N/2) = N
Searches: A[N/2], A[N/2+1], ..., A[N-1]
Result: smallest[1] = min(A[N/2..N-1])
Memory Diagram:
Shared Memory:
┌──────────────────────────────────────────────┐
│ A[0] A[1] ... A[N/2-1]   A[N/2] ... A[N-1]   │
│ └─────────┬──────────┘   └────────┬───────┘  │
│     Thread 0's range       Thread 1's range  │
│                                              │
│ smallest[0]   smallest[1]   (shared results) │
└──────────────────────────────────────────────┘
Algorithm Trace:
Example: A = [7, 2, 9, 1, 5, 3, 8, 4] (N=8)
Thread 0 execution (id=0):
start = 0, end = 4
small = A[0] = 7
i=1: A[1]=2 < 7 → small = 2
i=2: A[2]=9 > 2 → small = 2
i=3: A[3]=1 < 2 → small = 1
Result: smallest[0] = 1
Thread 1 execution (id=1):
start = 4, end = 8
small = A[4] = 5
i=5: A[5]=3 < 5 → small = 3
i=6: A[6]=8 > 3 → small = 3
i=7: A[7]=4 > 3 → small = 3
Result: smallest[1] = 3
Main combines:
smallest[0]=1, smallest[1]=3
1 < 3 → Print "The smallest number is 1"
Execution Timeline:
Main: CreateThread(findMin, 0) → Thread 0 starts
CreateThread(findMin, 1) → Thread 1 starts
JoinThread(threads[0]) → Wait for Thread 0
Thread 0: Iterates through A[0..N/2-1]
Compares each with local 'small'
Writes result to smallest[0]
Terminates
Thread 1: (Runs concurrently or sequentially)
Iterates through A[N/2..N-1]
Compares each with local 'small'
Writes result to smallest[1]
Terminates
Main: JoinThread(threads[1]) → Wait for Thread 1
Compare smallest[0] vs smallest[1]
Print result
Key Design Patterns:
1. Independent Search Ranges
No data race:
Thread 0 searches: [0, N/2) - disjoint ranges
Thread 1 searches: [N/2, N) - no overlap
2. Separate Result Storage
No contention:
smallest[0] ← only written by Thread 0
smallest[1] ← only written by Thread 1
3. Sequential Result Combination
Main thread safely reads after join:
JoinThread(threads[0]); // Ensure Thread 0 done writing
JoinThread(threads[1]); // Ensure Thread 1 done writing
// Now safe to compare smallest[0] and smallest[1]
Handling Edge Cases:
Case 1: Both halves identical
if (smallest[0] < smallest[1]) {
printf(...smallest[0]);
} else {
printf(...smallest[1]); // Handles equality
}
Case 2: Negative numbers
Algorithm works the same:
Array: [-5, 3, -8, 1]
T0: smallest[0] = -5 (min of -5 and 3)
T1: smallest[1] = -8 (finds -8 < 1)
Final: -8 is smallest
Performance Analysis:
Sequential minimum search:
Time: N-1 comparisons
Complexity: O(N)
Parallel with 2 threads:
Each thread: (N/2)-1 comparisons
Time: ~(N/2) + merge overhead ≈ N/2
Speedup: ~2x on dual-core
Generalized to k threads:
Time: ~(N/k) + (k-1) comparisons for a sequential final merge (or ~log k with a tree reduction)
Speedup: ~k on k-core system (for large N)
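To make the k-thread generalization concrete, here is a sketch using the problem set's hypothetical CreateThread/JoinThread API. K = 8 is an arbitrary example value, and K is assumed to divide N.
#define K 8                          /* number of worker threads (example) */
int smallest[K];

void findMin(int id) {
    int start = id * (N / K);
    int end = start + (N / K);       /* assumes K divides N */
    int small = A[start];
    for (int i = start + 1; i < end; i++)
        if (A[i] < small)
            small = A[i];
    smallest[id] = small;
}

int main() {
    int threads[K];
    for (int i = 0; i < K; i++)
        threads[i] = CreateThread(findMin, i);
    for (int i = 0; i < K; i++)
        JoinThread(threads[i]);
    int min = smallest[0];
    for (int i = 1; i < K; i++)      /* sequential merge: K-1 comparisons */
        if (smallest[i] < min)
            min = smallest[i];
    printf("The smallest number is %d", min);
    exit(0);
}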
Comparison with Summation (Problem 18):
| Aspect | Summation (P18) | Minimum (P19) |
|---|---|---|
| Threads | 4 | 2 |
| Parallel Ops | Independent adds | Independent compares |
| Merge Method | Simple addition | min() comparison |
| Complexity | O(N) → O(N/4) | O(N) → O(N/2) |
| Scalability | O(k) threads | O(k) threads |
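The table suggests a unifying view: both programs are instances of one parallel-reduce pattern. The sketch below makes that shared kernel explicit; the op function pointer and the combine_* helpers are illustrative names, not part of the original problems, and the CreateThread/JoinThread plumbing and final merge stay as in the solutions above.
#define K 4
int partial[K];
int (*op)(int, int);                 /* reduction operator, set before creating threads */

int combine_sum(int a, int b) { return a + b; }
int combine_min(int a, int b) { return a < b ? a : b; }

void reduceSegment(int id) {         /* the kernel every worker runs */
    int start = id * (N / K);
    int end = start + (N / K);
    int acc = A[start];
    for (int i = start + 1; i < end; i++)
        acc = op(acc, A[i]);
    partial[id] = acc;
}
With op = combine_sum this is Problem 18's kernel (starting from A[start] instead of 0 is also correct for sums); with op = combine_min it is Problem 19's.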
Potential Issues:
Issue 1: Load Balance
Unlike a search with early exit, minimum-finding does the same amount of work no matter where the minimum lies:
A = [1, 2, 3, 4, 5, 6, 7, 8]
T0: Scans its entire half, finds 1
T1: Scans its entire half, finds 5
Both threads perform (N/2)-1 comparisons, so the work is inherently balanced; any difference in finish times comes from scheduling, and the main thread simply waits for the slower one at JoinThread (the synchronization point).
Issue 2: Very Small Array
N = 10:
Sequential: 9 comparisons
Parallel (2 threads):
- Thread creation overhead
- Each thread: 4 comparisons
- Total time likely greater than sequential!
Break-even point usually N > 1,000
Issue 3: Initialization Bug
Current code assumes A[start] exists:
int small = A[start]; // OK only if start < N
Safe version (INT_MAX, the identity element for min, requires <limits.h>):
#include <limits.h>
int small = INT_MAX;
for (int i = start; i < end; i++) {
if (A[i] < small) {
small = A[i];
}
}
Problem 20: Parallel Key Search
You are given an unsorted array A[0..N-1] and a key, and you are asked to find the first index where key is stored in parallel as follows: The main thread will create 4 threads, each thread will search N/4 of the elements and terminate. The main thread will then wait for the termination of the worker threads, compute the overall search result, print it on the screen and terminate. If the key does not exist in the array, then print “Key does not exist in the array”. Assume you have 2 functions:
(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread
(2) JoinThread(int threadId) is used to wait for the termination of a thread. Assume that array A and key are both global and N is divisible by 4.
#include <stdio.h>
#define N ...
int A[N]; // Array to be searched
int key; // Key to be searched
Student Solution
Answer by Justin:
#include <stdio.h>
#define N ...
int A[N];
int key;
int index[4];
void findKey(int id) {
int start = id * (N/4);
int end = start + (N/4);
for (int i = start; i < end; i++) {
if (A[i] == key) {
index[id] = i;
return;
}
}
}
int main() {
for (int i = 0; i < 4; i++) {
index[i] = N;
}
int threads[4];
for (int i = 0; i < 4; i++) {
threads[i] = CreateThread(findKey, i);
}
for (int i = 0; i < 4; i++) {
JoinThread(threads[i]);
}
int smallIndex = N;
for (int i = 0; i < 4; i++) {
if (index[i] < smallIndex && index[i] != N) {
smallIndex = index[i];
}
}
if (smallIndex != N) {
printf("The key was found at index: %d", smallIndex);
}
else {
printf("The key does not exist!");
}
exit(0);
}
Elaboration (by LLM):
Parallel Key Search Strategy:
This solution performs a parallel linear search by dividing the array into 4 segments. Each thread searches its segment for the key and records the index if found. The main thread then finds the earliest index.
Thread Work Distribution:
Array A[0..N-1]:
Thread 0: start=0, end=N/4 searches A[0..N/4-1]
Thread 1: start=N/4, end=N/2 searches A[N/4..N/2-1]
Thread 2: start=N/2, end=3N/4 searches A[N/2..3N/4-1]
Thread 3: start=3N/4, end=N searches A[3N/4..N-1]
If key found by:
Thread 0 at position 5 → index[0] = 5
Thread 1 not found → index[1] = N (sentinel)
Thread 2 at position 200 → index[2] = 200
Thread 3 not found → index[3] = N (sentinel)
Result array: index = [5, N, 200, N]
Final answer: 5 (first occurrence)
Key Design Pattern - Sentinel Value:
// Initialize all to N (sentinel = not found)
for (int i = 0; i < 4; i++) {
index[i] = N;
}
// Each thread updates only if found
if (A[i] == key) {
index[id] = i; // Write position of first match in segment
return; // EARLY EXIT - stop searching this segment
}
// After all threads done, find smallest non-sentinel index
int smallIndex = N;
for (int i = 0; i < 4; i++) {
if (index[i] < smallIndex && index[i] != N) {
smallIndex = index[i]; // Update only if valid index
}
}
Algorithm Trace:
Example: A = [3, 7, 2, 9, 5, 1, 8, 4], key = 5, N = 8
Initial: index = [8, 8, 8, 8]
Thread 0 (start=0, end=2):
i=0: A[0]=3 ≠ 5
i=1: A[1]=7 ≠ 5
No match found, index[0] remains 8
Thread 1 (start=2, end=4):
i=2: A[2]=2 ≠ 5
i=3: A[3]=9 ≠ 5
No match found, index[1] remains 8
Thread 2 (start=4, end=6):
i=4: A[4]=5 == 5 ← MATCH!
index[2] = 4
return (stop searching)
Thread 3 (start=6, end=8):
i=6: A[6]=8 ≠ 5
i=7: A[7]=4 ≠ 5
No match found, index[3] remains 8
Result: index = [8, 8, 4, 8]
Main combines:
smallIndex = 8
i=0: index[0]=8 → 8 < 8? NO
i=1: index[1]=8 → 8 < 8? NO
i=2: index[2]=4 → 4 < 8 && 4 ≠ 8? YES → smallIndex = 4
i=3: index[3]=8 → 8 < 4? NO (sentinel)
smallIndex = 4 ≠ 8 → Print "The key was found at index: 4"
Execution Timeline:
Main:
Initialize index[4] = [N, N, N, N]
CreateThread(findKey, 0) → Thread 0 starts
CreateThread(findKey, 1) → Thread 1 starts
CreateThread(findKey, 2) → Thread 2 starts
CreateThread(findKey, 3) → Thread 3 starts
JoinThread(threads[0]) → Wait for all to finish
Threads (concurrent):
T0: Linear search [0, N/4)
T1: Linear search [N/4, N/2)
T2: Linear search [N/2, 3N/4) ← Finds key, sets index[2], returns
T3: Linear search [3N/4, N)
Main (after join):
Find minimum index from [index[0], index[1], index[2], index[3]]
Print result or "not found"
Important Implementation Details:
1. Early Exit with Return
if (A[i] == key) {
index[id] = i; // Record position
return; // EXIT IMMEDIATELY
}
// Don't continue searching once found
This is critical for performance - don't keep scanning the segment once the key has been found.
2. Sentinel Value (N)
// N is a sentinel because:
for (int i = 0; i < 4; i++) {
index[i] = N; // Initialize to "not found"
}
// Later...
if (index[i] != N) { // Check if thread found something
// Valid index, consider it
}
3. Finding Earliest Index
// Multiple threads might find the key in different segments
// We want the EARLIEST (smallest index)
int smallIndex = N;
for (int i = 0; i < 4; i++) {
if (index[i] < smallIndex && index[i] != N) {
smallIndex = index[i];
}
}
The operand order matters: writing smallIndex < index[i] instead can never be true, because smallIndex starts at N and every index[i] ≤ N, so smallIndex would stay N and the key would always be reported as missing.
Performance Analysis:
Best Case: Key in first element
Sequential: 1 comparison
Parallel (4 threads): 1 comparison + overhead
Time: ~O(1)
Worst Case: Key not found or at end
Sequential: N comparisons
Parallel (4 threads): N/4 + overhead
Time: ~O(N/4)
Speedup: ~4x
Average Case: Key uniformly distributed
Expected position: N/2
Sequential: N/2 comparisons
Parallel: the thread whose segment holds the key stops after ~N/8 comparisons, but the other threads still scan their full N/4 segments
Wall time: ~N/4 → Speedup: ~2x (the shared stop flag below improves this)
Comparison with Sequential Search:
Sequential Linear Search:
for (int i = 0; i < N; i++) {
if (A[i] == key) {
printf("Found at %d", i);
return;
}
}
printf("Not found");
Parallel Approach:
✅ Speedup in worst case (key not found)
✅ Speedup in average case
✅ Only slight overhead if key found early
❌ Thread creation/management overhead
❌ Only beneficial for large N
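The break-even thresholds quoted in these problems (N > 10,000, N > 1,000) are machine-dependent; one way to check them is to time each variant. A minimal sketch using POSIX clock_gettime (CLOCK_MONOTONIC is standard; the loop body is just a stand-in for the search under test):
#include <stdio.h>
#include <time.h>

/* Milliseconds between two timespecs. */
static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    struct timespec t0, t1;
    volatile long sink = 0;               /* keeps the loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 10000000L; i++)  /* stand-in for the code under test */
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("took %.3f ms\n", elapsed_ms(t0, t1));
    return 0;
}
Run the sequential and threaded versions under the same harness and compare across several values of N.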
Edge Cases:
Case 1: Key not in array
index = [N, N, N, N]
smallIndex = N
Output: "The key does not exist!"
Case 2: Multiple occurrences
A = [1, 5, 2, 5, 3, 5, 4, 5], key = 5
T0: searches [0,2) → finds 5 at index 1 → index[0] = 1, return
T1: searches [2,4) → finds 5 at index 3 → index[1] = 3, return
T2: searches [4,6) → finds 5 at index 5 → index[2] = 5, return
T3: searches [6,8) → finds 5 at index 7 → index[3] = 7, return
index = [1, 3, 5, 7]
smallIndex = 1 (earliest occurrence)
Output: "The key was found at index: 1"
Case 3: Key at boundaries
A = [5, 1, 2, 3, 4, 6, 7, 8], key = 5
T0: searches [0,2) → finds 5 at index 0 → index[0] = 0
T1, T2, T3: don't find 5
index = [0, N, N, N]
smallIndex = 0
Output: "The key was found at index: 0"
Optimization Opportunities:
1. Lock-Free Update
Instead of recording into an array, use an atomic compare-and-swap (shown here with the GCC __sync builtin; a sketch):
volatile int earliest = N;
void findKey(int id) {
int start = id * (N/4);
int end = start + (N/4);
for (int i = start; i < end; i++) {
if (A[i] == key) {
// Atomically keep the SMALLEST index seen so far.
// Note: a single CAS against N would only record whichever
// thread committed first, not the earliest index, so we
// retry until our index is no longer an improvement.
int cur = earliest;
while (i < cur && !__sync_bool_compare_and_swap(&earliest, cur, i)) {
cur = earliest; // another thread updated it; re-check
}
return;
}
}
}
2. Break Early Across All Threads
Use a shared flag to stop the other threads once one of them finds the key:
volatile int found = 0;
void findKey(int id) {
int start = id * (N/4);
int end = start + (N/4);
for (int i = start; i < end; i++) {
if (found) return; // exit if another thread already found it
if (A[i] == key) {
index[id] = i;
found = 1; // signal the others
return;
}
}
}
Caveat: this early exit changes the semantics. If a thread in a later segment sets found first, a thread in an earlier segment may quit before reaching an earlier occurrence, so the program returns some occurrence of the key, not necessarily the first one.