04 - Threads

Problem 1: Thread vs Process

Briefly explain how a thread differs from a process.

Student Solution

Answer by Justin:

A process is a currently executing program, with its own PCB, that defines the address space and general process resources. A thread exists within a process and is a sequential unit of execution within a process, sharing the PCB. The only things it doesn’t share are its PC, general-purpose registers, stack, and SP.

Elaboration (by LLM):

The key distinction between processes and threads lies in resource isolation vs. resource sharing:

Process Characteristics:

  • Isolated Address Space: Each process has its own complete memory layout (code, data, heap, stack)
  • Independent Resources: Separate file descriptors, signal handlers, current working directory
  • PCB (Process Control Block): OS maintains unique PCB for each process containing execution context
  • Heavyweight: Creating a process is expensive (requires memory allocation, PCB setup)
  • Context Switch Overhead: High overhead when OS switches between processes (must save/restore entire address space)

Thread Characteristics:

  • Shared Address Space: All threads in a process share code, data, and heap
  • Thread-Local Resources: Each thread has its own:
    • PC (Program Counter) - where to execute next instruction
    • SP (Stack Pointer) - points to thread’s stack
    • Stack - local variables, function parameters, return addresses
    • General-purpose registers - temporary values
  • Shared PCB: Threads in same process share one PCB (hence shared memory management)
  • Lightweight: Creating a thread is inexpensive (minimal OS allocation)
  • Context Switch Overhead: Low overhead (only need to save/restore registers and stack pointer)

Memory Layout Comparison:

Process 1 has isolated memory, Process 2 has isolated memory. Single process with 2 threads shares code, data, heap but has separate stacks.

Why Threads Share Address Space:

Thread cooperation becomes simpler - threads can directly access/modify shared variables without IPC (pipes, sockets, shared memory mechanisms required for processes).
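
As a minimal sketch of this difference (assuming a POSIX system with pthreads; the worker/result names are illustrative), a worker thread can publish a value through a plain global that the main thread reads after joining, with no pipe or socket involved:

#include <pthread.h>
#include <stdio.h>

int result = 0;   // shared: both threads see the same variable

static void* worker(void* arg) {
    result = 42;  // direct write to shared memory - no IPC needed
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);          // join also orders the memory accesses
    printf("result = %d\n", result);  // reads the worker's write directly
    return 0;
}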

Why Each Thread Has Private Execution State:

Each thread needs independent execution flow - different PC, SP, registers enable each thread to execute different code at different points simultaneously (or appear to via time-slicing).

Problem 2: Shared vs Private Thread State

What state (information) does a thread share with other threads in the same process and what state (information) is private/specific to a thread? Be explicit in your answer.

Student Solution

Answer by Justin:

A thread shares with other threads in the same process the same resources (code, data, and heap), but what they don’t share (i.e., what is private to a thread) is the thread’s state (PC, SP, stack, general-purpose registers).

Elaboration (by LLM):

Shared Thread State (Visible to All Threads):

Resource | Shared? | Details
Code Segment | Yes | All threads execute the same program instructions
Global Variables | Yes | int g = 0; at file scope is accessible to all
Static Variables | Yes | static int x; is shared across threads
Heap Memory | Yes | malloc()/new allocations are visible to all
File Descriptors | Yes | Open files/sockets are shared
Process ID (PID) | Yes | All threads belong to the same process

Private Thread State (Unique to Each Thread):

Resource | Private? | Details
Program Counter (PC) | Yes | Each thread is at a different instruction
Stack Pointer (SP) | Yes | Points to the thread’s own stack
Stack | Yes | Local variables, function parameters, return addresses
Registers | Yes | EAX, EBX, ECX, … saved per thread
Thread ID (TID) | Yes | Unique identifier per thread

Practical Example:

int global = 0;       // SHARED - all threads see updates
 
void* thread_func(void* arg) {
    int local = 5;    // PRIVATE - each thread has own copy
    global++;         // SHARED - all threads see this change
    return NULL;
}

Critical Implication - Data Races:

Because global variables are shared, concurrent access can cause data races:

int x = 0;  // SHARED
 
Thread1: x++;  // Read x (0), increment, write (1)
Thread2: x++;  // Can interleave, causing lost updates
 
// Without synchronization: x might be 1 instead of expected 2

Solution requires synchronization (mutexes, semaphores) to protect shared state.
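
A minimal sketch of such protection, assuming POSIX mutexes (the inc helper name is illustrative): both threads perform the increment inside a critical section, so the final value is always 2:

#include <pthread.h>
#include <stdio.h>

int x = 0;                                      // SHARED
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void* inc(void* arg) {
    pthread_mutex_lock(&m);    // enter critical section
    x++;                       // read-modify-write now appears atomic
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc, NULL);
    pthread_create(&t2, NULL, inc, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);     // always 2 with the mutex
    return 0;
}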

Problem 3: OS Resource Allocation for Threads

What resources does the OS allocate for a thread during thread creation? How do they differ from those allocated when a process is created?

Student Solution

Answer by Justin:

The OS allocates for a thread during thread creation its stack, PC, SP, and general-purpose registers in the same address space as the parent process. This differs from the resources allocated when a process is created, because when this occurs, an entirely new address space is allocated, new execution state (PC, registers, stack, and SP), and a new PCB (threads share the same PCB as the parent process).

Elaboration (by LLM):

Resources Allocated for Thread Creation:

Thread creation allocates:

  • Stack (typically 1-8 MB, OS-dependent)
  • Thread-Local Storage (TLS)
  • PC, SP, Registers (initial values)
  • TCB (Thread Control Block)

What’s NOT allocated for thread:

  • New address space (reuses parent’s)
  • New PCB (shares parent’s)
  • New code segment
  • New data/heap

Comparison: Thread Creation vs Process Creation

Resource | Thread Creation | Process Creation
Address Space | Reused (shared) | NEW (isolated)
Stack | NEW (thread-local) | NEW (process-local)
Heap | Shared | NEW
Code Segment | Shared | NEW copy
PCB | SHARED | NEW
Memory Overhead | ~1-8 MB (stack) | Potentially hundreds of MB (copy-on-write mitigates this)
Creation Time | Microseconds | Milliseconds

Why Threads Are Lightweight:

  1. Shared Memory: No need to copy entire address space
  2. Minimal Allocation: Only stack + TCB needed
  3. Cheap Context Switch: Just save/restore registers + SP, not entire memory

Cost Comparison (illustrative figures, ignoring copy-on-write):

Create 100 processes: ~100 * 500 MB = 50 GB potential
Create 100 threads:   ~100 * 2 MB  = 200 MB
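
The default per-thread stack size is OS-dependent, but POSIX thread attributes let you request a specific stack size at creation. A hedged sketch (the 256 KB figure is an arbitrary assumption):

#include <pthread.h>
#include <stdio.h>
#include <limits.h>   /* PTHREAD_STACK_MIN */

static void* worker(void* arg) {
    char local[1024];   /* lives on this thread's newly allocated stack */
    local[0] = 'x';
    printf("worker stack near %p\n", (void*)local);
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    size_t size = 256 * 1024;   /* assumed: request a 256 KB stack */

    pthread_attr_init(&attr);
    if (size >= PTHREAD_STACK_MIN)
        pthread_attr_setstacksize(&attr, size);

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}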

Problem 4: Advantages of Multi-Threading

What are the advantages of multi-threading? Give an example when multi-threading is preferable to a single-threaded process.

Student Solution

Answer by Justin:

An advantage of multi-threading is that you can break a problem into smaller parts for faster execution. An example of this would be matrix multiplication. You can have numerous threads execute different parts of the problem “at the same time” for faster execution, but in a single-threaded process, you would just have one thread to handle the entire multiplication. This becomes tedious when the matrix becomes REALLY large, and will cause the process to last a very long time.

Elaboration (by LLM):

Key Advantages of Multi-Threading:

1. Parallelism (Multi-Core Systems)

On systems with multiple CPU cores, threads execute truly in parallel:

Single-threaded (1 core):
Thread: [====Task====] (must complete sequentially)
Total time: T

Multi-threaded (4 cores):
Thread1: [==Task1==]
Thread2: [==Task2==]
Thread3: [==Task3==]
Thread4: [==Task4==]
Total time: ~T/4 (speedup)

2. Responsiveness

Long-running operations don’t block entire application. Web servers handle multiple clients without blocking.

3. Shared Resources

Threads efficiently share memory without IPC overhead. Simple direct access to shared variables.

4. Resource Efficiency

Threads cheaper than processes in terms of memory and CPU overhead.

Matrix Multiplication Example:

// Single-threaded: O(N³) operations sequential
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
// Time: 100% CPU for entire duration
 
// Multi-threaded: Divide work
Thread1 computes C rows 0-N/4
Thread2 computes C rows N/4-N/2
Thread3 computes C rows N/2-3N/4
Thread4 computes C rows 3N/4-N
 
// On 4-core system: ~4x speedup
// All threads access shared arrays A, B, C
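
A runnable pthreads sketch of the row partitioning described above (N, NTHREADS, and mul_rows are illustrative names; N is assumed divisible by NTHREADS):

#include <pthread.h>
#include <stdio.h>

#define N 4            /* small N so the example runs quickly */
#define NTHREADS 4     /* assumes N is divisible by NTHREADS */

static double A[N][N], B[N][N], C[N][N];

static void* mul_rows(void* arg) {
    long id = (long)arg;                 /* worker index 0..NTHREADS-1 */
    int first = id * (N / NTHREADS);
    int last  = first + (N / NTHREADS);  /* each worker owns a band of rows */
    for (int i = first; i < last; i++)
        for (int j = 0; j < N; j++) {
            double s = 0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];  /* A and B are only read: no locking */
            C[i][j] = s;                 /* workers write disjoint rows of C */
        }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = (i == j);          /* identity, so C should equal A */
        }

    pthread_t tids[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tids[t], NULL, mul_rows, (void*)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tids[t], NULL);

    printf("C[0][0] = %.1f (expected 1.0)\n", C[0][0]);
    return 0;
}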

When Multi-Threading Wins:

  1. Computationally Intensive: Large matrix operations, image processing
  2. I/O Bound: One thread blocks on disk/network, others proceed
  3. Server Applications: Multiple clients needing simultaneous service
  4. Interactive Applications: UI thread stays responsive while background work happens

Performance Comparison (Hypothetical):

Matrix: 1000x1000 (1 billion operations)

Single-threaded (1 core):     10 seconds
Multi-threaded (4 cores):     2.5 seconds  (4x speedup)
Multi-threaded (8 cores):     1.25 seconds (8x speedup)

Limitations:

  • Speedup limited by number of available cores (Amdahl’s Law)
  • Sequential parts cannot be parallelized
  • Synchronization adds overhead

Problem 5: Shared and Private Thread Items

List 1 item that is shared by all threads, and 1 item that is private to each thread.

Student Solution

Answer by Justin:

One item that is shared by all threads is the address space, and one item that is private to each thread is its program counter (PC).

Elaboration (by LLM):

Shared: Address Space

The address space refers to the entire virtual memory layout of a process:

Virtual Address Space (Shared by all threads):
Code Segment      (Shared)
Data Segment      (Shared)
Heap              (Shared, grows up)
Free Memory
Stack             (Each thread has own stack here)
Kernel Space

All threads within same process can:

  • Read/write to same global variables
  • Access same heap allocations
  • Call same functions
  • Access same files/sockets

Private: Program Counter (PC)

The PC (Program Counter) is a CPU register that points to the next instruction to execute:

void func() {
    int x = 5;     // Line 1 - Thread1 PC might be here
    x++;           // Line 2 - Thread2 PC might be here simultaneously
    printf("%d", x); // Line 3 - Thread3 PC might be here
}

Each thread has:

  • Its own PC value: Points to different instruction
  • Independent execution: Threads progress independently
  • Concurrent advancement: Each thread increments PC at own pace

Why PC Must Be Private:

  1. Independent Execution: Threads execute asynchronously
  2. Different Functions: Threads may call different functions
  3. Loop Progress: Threads in same loop at different iterations

Example Demonstrating Private PC:

void thread_func() {
    printf("Start");  // Instruction A
    sleep(1);        // Instruction B
    printf("End");   // Instruction C
}
 
// Two threads:
Thread1 PC: A -> B -> (sleeping) -> C
Thread2 PC: A -> B -> C -> (already done)
 
// At time T:
Thread1 PC: points to sleep() call
Thread2 PC: points to printf("End")

Other Commonly Private Items:

Besides PC, each thread also has private:

  • SP (Stack Pointer): Points to thread’s own stack
  • Stack: Local variables, parameters, return addresses
  • Registers: EAX, EBX, ECX, … (saved per thread)
  • TID (Thread ID): Unique identifier (see the sketch below)
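
A minimal sketch showing two of these private items at once, assuming pthreads (note that printing pthread_t as an integer is not strictly portable):

#include <pthread.h>
#include <stdio.h>

static void* show_private_state(void* arg) {
    int local = 0;   /* lives on this thread's own stack */
    printf("TID %lu, stack variable at %p\n",
           (unsigned long)pthread_self(), (void*)&local);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, show_private_state, NULL);
    pthread_create(&t2, NULL, show_private_state, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;   /* the two lines print different TIDs and stack addresses */
}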

Problem 6: Purpose of pthread_detach()

What’s the purpose of pthread_detach() function?

Student Solution

Answer by Justin:

The purpose of the pthread_detach() function is to ensure the thread won’t go into a zombie state waiting for another thread to reap it with pthread_join(); instead, the thread’s resources are cleaned up automatically when it terminates.

Elaboration (by LLM):

Thread States and Lifecycle:

Default joinable thread must be explicitly cleaned up via pthread_join(). If not joined, thread enters zombie state and resources leak.

With pthread_detach(), thread automatically frees resources upon termination without requiring explicit cleanup.

Without detach() - Default Joinable State:

pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
// Thread runs and exits
pthread_join(tid, NULL);  // Must call to clean up
 
// If you DON'T call pthread_join():
// - Thread becomes ZOMBIE
// - Resources (TCB, stack) remain allocated
// - Memory leak if thread exits

With detach() - Automatic Cleanup:

pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
pthread_detach(tid);  // Thread becomes detached
// Thread runs and exits
// Resources automatically freed - no need for join()

Comparison: Joined vs Detached Threads

Aspect | Joinable (Default) | Detached
Default State | Requires pthread_join() | Auto-cleanup on exit
Cleanup | Manual via pthread_join() | Automatic
Zombie State | Yes, if not joined | No
Resource Leak Risk | High if forgotten | None
Exit Status | Can retrieve via pthread_join() | Not retrievable
Use Case | Wait for result | Fire-and-forget tasks

Problem Without Detach:

void* worker(void* arg) {
    printf("Worker running\n");
    return NULL;  // Thread exits
    // <- Becomes ZOMBIE here!
}
 
int main() {
    pthread_t tid;
 
    for (int i = 0; i < 100; i++) {
        pthread_create(&tid, NULL, worker, NULL);
        // Forgot to join!
    }
    // 100 zombie threads, resources still allocated
    return 0;
}

Solution Using Detach:

void* worker(void* arg) {
    printf("Worker running\n");
    return NULL;  // Thread exits
    // <- Automatically cleaned up!
}
 
int main() {
    pthread_t tid;
 
    for (int i = 0; i < 100; i++) {
        pthread_create(&tid, NULL, worker, NULL);
        pthread_detach(tid);  // Auto cleanup when done
    }
    // Threads cleaned automatically
    return 0;
}

When to Use Each:

Use Joinable (default):

  • Need to retrieve thread’s return value
  • Synchronize with thread completion
  • Parent waits for child

Use Detached:

  • No need for return value
  • Fire-and-forget tasks (logging, cleanup)
  • Server handling concurrent clients (sketch below)

// Server handling clients - use detached
void* handle_client(void* sock) {
    // Process client request
    // Automatic cleanup on exit (thread is detached)
    return NULL;
}
 
int main() {
    while (1) {
        void* sock = accept_connection();  // pseudocode: wait for a client
        pthread_t tid;
        pthread_create(&tid, NULL, handle_client, sock);
        pthread_detach(tid);  // Each client handled independently
    }
}

Key Point:

Detached threads cannot be joined; once a thread is detached, automatic cleanup is the only option, so the choice must be made at creation time.
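
POSIX also lets you make that choice directly at creation, starting the thread already detached via an attribute object. A minimal sketch:

#include <pthread.h>
#include <stdio.h>

static void* worker(void* arg) {
    printf("fire-and-forget task\n");
    return NULL;   /* resources reclaimed automatically: thread is detached */
}

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    pthread_create(&tid, &attr, worker, NULL);  /* born detached: joining is not allowed */
    pthread_attr_destroy(&attr);

    pthread_exit(NULL);  /* let main exit without killing the worker */
}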

Problem 7: Multi-Threaded Process Termination

List 2 conditions under which a multi-threaded process terminates.

Student Solution

Answer by Justin:

Two conditions under which a multi-threaded process terminates are if all threads terminate or if one thread terminates using the exit() system call.

Elaboration (by LLM):

Condition 1: All Threads Terminate Naturally

When every thread in a process exits gracefully:

void* thread_func(void* arg) {
    // Do work
    return NULL;  // Thread exits
}
 
int main() {
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, thread_func, NULL);
    pthread_create(&t2, NULL, thread_func, NULL);
    pthread_create(&t3, NULL, thread_func, NULL);
 
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    // All threads done, main thread returns
    // Process terminates
}

Once the main thread returns from main(), it’s the last thread, so the process ends.

Condition 2: Any Thread Calls exit()

If ANY thread (including main) calls exit(), the entire process terminates immediately:

void* thread_func(void* arg) {
    if (some_error) {
        exit(1);  // ← ENTIRE PROCESS TERMINATES!
    }
    return NULL;
}
 
int main() {
    pthread_t t1;
    pthread_create(&t1, NULL, thread_func, NULL);
    pthread_join(t1, NULL);
    return 0;  // Even if we reach here...
}

Key Differences:

Trigger | Result | Other Threads
All threads exit normally | Process terminates | All cleaned up gracefully
One thread calls exit() | Process terminates immediately | Terminated abruptly, no cleanup
Main thread returns from main() | Process terminates (return invokes exit()) | Terminated immediately

Important: In Linux, returning from main() invokes exit(), so any other threads still running are killed immediately when the process terminates.

Comparison with exit() vs return:

// In main:
return 0;            // Equivalent to exit(0): terminates the ENTIRE process
exit(0);             // Exit ENTIRE process immediately
pthread_exit(NULL);  // Exit only the main thread; process lives until others finish
 
// In worker thread:
return NULL;         // Exit thread only (process continues)
exit(0);             // Exit ENTIRE process (all threads die!)
pthread_exit(NULL);  // Exit thread only (equivalent to return from the start routine)

Best Practice:

Use pthread_exit() in threads instead of calling exit() to avoid terminating the entire process:

void* thread_func(void* arg) {
    if (error) {
        pthread_exit(NULL);  // Exit just this thread
    }
    return NULL;  // Equivalent to pthread_exit(NULL)
}
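
A related detail worth knowing: if the main thread calls pthread_exit() instead of returning, the process stays alive until the remaining threads finish. A minimal sketch:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void* worker(void* arg) {
    sleep(1);                     /* still running after main exits */
    printf("worker finished\n");  /* this line does print */
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_exit(NULL);  /* main thread exits; process waits for the worker */
    /* returning 0 here instead would call exit() and kill the worker */
}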

Problem 8: Concurrency vs Parallelism

Define concurrency and parallelism and briefly explain the difference between the two.

Student Solution

Answer by Justin:

Concurrency is when a system can handle the execution of multiple tasks by interleaving their execution. Parallelism is when a system can execute multiple tasks simultaneously. The difference is that concurrency doesn’t execute the tasks at the “same time”, but rather appears to execute them at the same time through rapid context switching. Parallelism is accomplished by using multi-core CPUs, where tasks are scheduled on different cores in order to execute at the same time.

Elaboration (by LLM):

Concurrency: Illusion of Simultaneity

Concurrency means multiple tasks make progress by rapidly switching back and forth:

Single CPU Core (Time flows left to right):

Task1: [==code==] yields
Task2:           [==code==] yields
Task3:                      [==code==] yields
Task1:                                   [==code==] yields

User perceives: All three running "simultaneously"
Reality: Only one at a time, switching very quickly

How Concurrency Works:

  1. Task1 runs for time quantum (e.g., 10ms)
  2. OS switches (context switch) to Task2
  3. Task2 runs for 10ms
  4. OS switches to Task3
  5. Repeat…

At hundreds or thousands of switches per second, the interleaving appears instantaneous to the user.

Parallelism: True Simultaneity

Parallelism means tasks execute at the exact same moment on different cores:

Multi-Core CPU (4 cores):

Core 1: Task1 [==code==]
Core 2: Task2 [==code==]
Core 3: Task3 [==code==]
Core 4: Task4 [==code==]

Reality: All four executing simultaneously

Key Differences:

Aspect | Concurrency | Parallelism
Execution | Time-sliced interleaving | Simultaneous on multiple cores
Hardware Required | Single core (or any number) | Multiple cores
Appearance | Tasks seem to run together | Tasks actually run together
Context Switches | Many | Few or none
Example | Web server handling 100 clients on 1 CPU | 100 clients on 100 cores

Visual Comparison:

Concurrent (1 core):

Time: 0ms        10ms       20ms       30ms
      |           |          |          |
T1:   [====]      [====]     [====]
T2:        [====]      [====]     [====]
T3:             [====]      [====]     [====]

Parallel (3 cores):

Time: 0ms                           30ms
      |                              |
Core1: T1 [==========================]
Core2: T2 [==========================]
Core3: T3 [==========================]

Real-World Analogy:

Concurrency: One cashier at a store switching between customers (T1, T2, T3). Serves a bit of each, appears to work on all simultaneously.

Parallelism: Three cashiers, one per customer. All working at exact same time.

Can You Have Both?

Yes! Multi-core system with many threads:

Core 1: T1 [===T2====T3===]
Core 2: T4 [===T5====T6===]
Core 3: T7 [===T8====T9===]
Core 4: T10[====T11==T12==]

Threads on same core: Concurrent (interleaved)
Threads on different cores: Parallel (simultaneous)

Implications:

  • Concurrency benefits: Responsiveness, server scalability, better resource utilization
  • Parallelism benefits: True speedup for CPU-intensive tasks (actual execution time reduction)
  • Concurrency alone: No speedup for pure CPU work, just better responsiveness
  • Parallelism requires concurrency: Multi-threaded code needed to utilize multiple cores

Problem 9: Concurrency and Hardware

If you have a concurrent system, i.e., multi-tasking or multi-threading OS, does the underlying hardware have to have multiple CPU cores? Why or why not?

Student Solution

Answer by Justin:

The underlying hardware does not need to have multiple CPU cores. This is because tasks are able to be interleaved with one another, so a multi-core CPU is not required as those are for parallelism.

Elaboration (by LLM):

Concurrency ≠ Multi-Core

Concurrency is a software/OS concept, not a hardware requirement. The OS can create the illusion of simultaneity through rapid context switching on a single core.

Single-Core Concurrent System:

Single CPU Core, Multiple Threads:

Time:      0-10ms    10-20ms   20-30ms   30-40ms
          ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
Thread 1  │ RUN   │ │ READY │ │ READY │ │ RUN   │
Thread 2  │ READY │ │ RUN   │ │ READY │ │ READY │
Thread 3  │ READY │ │ READY │ │ RUN   │ │ READY │
          └───────┘ └───────┘ └───────┘ └───────┘

OS scheduler rapidly switches threads, creating concurrency.

Examples of Single-Core Concurrency:

  1. Web Server on 1 CPU:

    • Accept connection from client A (10ms)
    • Block waiting for client A’s data
    • Switch to client B’s thread (10ms)
    • Switch to client C’s thread (10ms)
    • Switch back to client A (data arrived)
    • Result: Handle 100+ clients with 1 core
  2. Interactive Applications:

    • Main UI thread processes input (user sees immediate response)
    • Background thread does heavy computation
    • OS switches between them

How Single-Core Concurrency Works:

void thread1_func() {
    printf("Thread 1 start\n");
    sleep(1);          // ← OS switches to another thread here
    printf("Thread 1 end\n");
}
 
void thread2_func() {
    printf("Thread 2 executing\n");
}
 
// On single-core system:
// Thread1 prints "start", blocks on sleep
// OS says: "OK, switch to thread2"
// Thread2 runs, thread1 still sleeping
// Sleep timer expires, thread1 ready again
// OS switches back, thread1 prints "end"

Blocking Operations Enable Concurrency:

I/O operations naturally create concurrency opportunities:

Thread1: [do work] --BLOCK(disk read)-- [continue]
Thread2:           [do work] --BLOCK(network)-- [continue]
Thread3:                     [do work] [continue]

While T1 blocked on disk, T2 and T3 can run!
While T2 blocked on network, T1 and T3 can run!

CPU-Bound vs I/O-Bound:

I/O-Bound (Concurrency beneficial on 1 core):
┌─────┐  ┌─────┐  ┌─────┐
│WORK ├─►│BLOCK├─►│WORK ├─ Low CPU usage, threads can interleave
└─────┘  └─────┘  └─────┘

CPU-Bound (No benefit from single-core concurrency):
┌────────────────────────────────┐
│WORK (no blocking) for 1 second ├─ Single thread will use 100% CPU
└────────────────────────────────┘

Single-Core Concurrent Scheduling:

OS kernel implements scheduler:

// Simplified pseudocode of the kernel's per-thread state and switch
typedef struct thread {
    int   thread_id;
    char* stack;
    void* pc;            // Program counter
    long  registers[8];  // CPU state
    int   state;         // RUNNING, READY, BLOCKED
} thread;
 
void context_switch(thread* from, thread* to) {
    // Save 'from' thread's context
    save_registers(from->registers);
    from->pc = current_pc();
 
    // Load 'to' thread's context
    load_registers(to->registers);
    jump_to(to->pc);
}

Why Single-Core Concurrency Works So Well:

  1. I/O Wait Time: Threads block waiting for I/O, freeing CPU for other threads
  2. Responsiveness: Even small tasks get CPU time regularly
  3. Fair Scheduling: Each thread gets turns at the CPU
  4. Resource Utilization: CPU not idle while threads wait for I/O

Conclusion:

Multiple CPU cores are not required for concurrency. They’re required for parallelism (true simultaneous execution). Single-core systems can be fully concurrent through OS-level context switching.

Problem 10: Parallelism and Hardware

To achieve parallelism, does the underlying hardware have to have multiple CPU cores? Why or why not?

Student Solution

Answer by Justin:

The underlying hardware must have multiple CPU cores. This is because if there aren’t multiple CPU cores, the tasks will end up being interleaved, which is concurrency. For tasks to execute simultaneously, we need multiple cores so that we can execute each task on their own core. This is not possible on a single core CPU.

Elaboration (by LLM):

Parallelism Requires Multiple Cores

Parallelism is the execution of multiple tasks at the exact same moment in time. This is only possible with multiple CPU cores.

Why Single Core Can’t Achieve Parallelism:

Single Core Timeline:
┌──────┬──────┬──────┬──────┬──────┐
│Task1 │Task2 │Task3 │Task1 │Task2 │  ← Only one active at any instant
└──────┴──────┴──────┴──────┴──────┘
Time

At any point in time, only ONE task executing

Multi-Core System (True Parallelism):

Core 1: ┌─────────────────────────────────┐
        │ Task1 executing                 │
        └─────────────────────────────────┘

Core 2: ┌─────────────────────────────────┐
        │ Task2 executing                 │
        └─────────────────────────────────┘

Core 3: ┌─────────────────────────────────┐
        │ Task3 executing                 │
        └─────────────────────────────────┘
        └─ At time T, all three are ACTUALLY running

Proof: Single Core Can’t Parallelize CPU-Bound Work

// Matrix multiplication: Pure CPU work, no I/O blocking
 
Single-threaded:
Time: 0 ............................ 10 seconds
      [====== Matrix Math =======]
 
Multi-threaded on 1 core (concurrent but not parallel):
Time: 0 ............................ 10 seconds
      [T1][T2][T3][T1][T2][T3]... (interleaved, same total time!)
      Result: Still 10 seconds (no speedup)
 
Multi-threaded on 4 cores (parallel):
Time: 0 ...... 2.5 seconds
      [T1][T2][T3][T4]
      Result: ~2.5 seconds (4x speedup!)

Sequential vs Parallel:

Sequential Code (1 CPU):

Task1: [==========]
Task2:            [==========]
Total: 20 seconds

Concurrent Code (1 CPU):

Task1: [===][===][===]
Task2:     [===][===][===]
Total: 20 seconds (same! just interleaved)

Parallel Code (2 CPUs):

Task1: [==========]    }
Task2: [==========]    } Both at same time
Total: 10 seconds

Amdahl’s Law: The Reality Check

Speedup formula with N cores:

Speedup = 1 / (S + P/N)

Where:
S = Sequential portion (cannot be parallelized)
P = Parallel portion
N = Number of cores

Example: 10% sequential, 90% parallel work

1 core:  Speedup = 1 / (0.1 + 0.9/1) = 1.0x (baseline)
2 cores: Speedup = 1 / (0.1 + 0.9/2) = 1.82x
4 cores: Speedup = 1 / (0.1 + 0.9/4) = 3.08x
∞ cores: Speedup = 1 / 0.1 = 10x (limited by sequential part)
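
A small sketch that evaluates this formula for several core counts (the 10% sequential fraction mirrors the example above; the huge last entry approximates infinitely many cores):

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (S + P/N), with P = 1 - S */
static double speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void) {
    double s = 0.10;                       /* sequential fraction */
    int cores[] = {1, 2, 4, 8, 1000000};   /* last entry approximates N -> infinity */
    for (int i = 0; i < 5; i++)
        printf("%7d cores: %.2fx\n", cores[i], speedup(s, cores[i]));
    return 0;   /* tends toward 1/S = 10x as N grows */
}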

Hardware Context:

Single Core CPU:

┌────────────────────┐
│  ALU               │  ← One arithmetic/logic unit
│  Registers         │  ← One set per core
│  Cache L1/L2       │
└────────────────────┘

Multi-Core CPU:

┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Core1: ALU   │ Core2: ALU   │ Core3: ALU   │ Core4: ALU   │
│ Regs Regs    │ Regs Regs    │ Regs Regs    │ Regs Regs    │
│ L1/L2 Cache  │ L1/L2 Cache  │ L1/L2 Cache  │ L1/L2 Cache  │
└──────────────┴──────────────┴──────────────┴──────────────┘
     ↓              ↓              ↓              ↓
  Task1         Task2          Task3          Task4
  (Parallel execution)

Modern CPUs:

  • 2-core: Limited parallelism
  • 4-core: Good for light parallelism
  • 8-core: Excellent for workstation tasks
  • 16-core: Data center / server grade
  • 128-core: Supercomputing

Examples Requiring Multiple Cores:

  1. Video Encoding:

    • Independent frames/slices can be encoded in parallel
    • 4 cores = ~4x speedup
  2. Scientific Simulation:

    • Grid computation (weather, physics)
    • 8 cores = ~8x speedup
  3. Machine Learning:

    • Matrix operations trivially parallel
    • 64 cores / GPUs = 64x+ speedup

Key Takeaway:

For pure CPU-bound work with no blocking:

  • Concurrency on 1 core = No speedup, just interleaving
  • Parallelism on N cores = ~N-fold speedup (minus overhead)

Problem 11: Multi-Threading in Single-Core Machine

Give one reason why you will want to implement a multi-threaded application in a single-core machine.

Student Solution

Answer by Justin:

You may want to implement a multi-threaded application on a single-core machine to break up parts of a problem into smaller parts for faster execution.

Elaboration (by LLM):

Concurrency Benefits on Single-Core:

While single-core threading can’t speed up pure CPU work, it provides critical benefits through concurrency:

1. Responsiveness (Primary Benefit)

// Bad: Single-threaded, blocking I/O
int main() {
    printf("Downloading file...");
    download_file();        // Blocks for 5 seconds
    printf("Done!\n");      // User waits 5 seconds
    // UI frozen during download
}
 
// Good: Multi-threaded, non-blocking
struct {
    int download_progress;
} state;
 
void* download_thread(void* arg) {
    download_file();        // Blocks for 5 seconds
    state.download_progress = 100;
    return NULL;
}
 
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, download_thread, NULL);
 
    // Main thread continues immediately
    // NOTE: polling a plain shared int is itself a data race; real code
    // would use an atomic flag or a mutex to read download_progress
    while (state.download_progress < 100) {
        printf(".\r");  // Show progress
        usleep(100000);
    }
 
    // User sees responsive UI despite 5-second operation
}

2. I/O-Bound Operations (Waiting Time Reuse)

Sequential I/O (single-threaded):
┌─────────┐  ┌─────────┐  ┌─────────┐
│Read 1   │  │Read 2   │  │Read 3   │
│(1 sec)  │  │(1 sec)  │  │(1 sec)  │
└─────────┘  └─────────┘  └─────────┘
Total: 3 seconds

Concurrent I/O (multi-threaded on 1 core):
Thread1: [Read1.....    ]
                        (blocked waiting for disk)
Thread2:           [Read2.....    ]
                                   (blocked)
Thread3:                       [Read3.....    ]
Total: ~1.1 seconds (much faster!)

Why? While Thread1 blocked on disk I/O, Thread2 and Thread3 can start their I/O

3. Structured Parallelism

Even without execution speed improvement, multi-threading models the problem better:

// Web Server on 1 core
void* handle_client(void* socket) {
    // Each client gets own thread context
    // Easy to understand: "this thread handles this client"
    read_request();   // Blocks, thread sleeps
    process_request();
    send_response();  // Blocks, thread sleeps
    // While blocked, OS runs other client threads
}
 
// Single-threaded equivalent would need manual state machine
// (very hard to understand)

4. Practical Single-Core Multi-Threaded Uses

Example: Background Tasks

// Main thread: Responsive UI
while (true) {
    display_ui();
    handle_user_input();
}
 
// Background thread: Long operation (doesn't freeze UI)
static void* background_work(void* arg) {
    expensive_computation();
    return NULL;
}
 
// Start background task
pthread_t bg;
pthread_create(&bg, NULL, background_work, NULL);
// Main thread continues responsive

Example: Server Handling Clients

// Single-threaded (sequential - very slow):
while (true) {
    socket s = accept_connection();  // Wait for client A
    process_request(s);              // Process for 1 second
    send_response(s);                // Send for 0.5 seconds
    close(s);
    // Next client must wait 1.5 seconds
}
 
// Multi-threaded (concurrent - responsive):
while (true) {
    socket s = accept_connection();
    pthread_create(new thread to handle s);
    // Accept next client immediately!
}
 
// Each client's thread blocks on I/O
// While one blocked, others can run

The Critical Insight:

I/O Operations Often Involve Waiting:
• Network read:    Thread blocks for 100ms
• Disk read:       Thread blocks for 10ms
• User input:      Thread blocks indefinitely

While one thread waits, another can make progress!

Speedup Comparison:

Task: 3 network requests, each 1 second

Sequential:    [Network1][Network2][Network3] = 3 seconds

Concurrent(1 core):
[Network1 ---- blocked ---]
          [Network2 ---- blocked ---]
                    [Network3 ---- blocked ---]
                                              = ~1 second

Bottom Line:

Multi-threading on single-core is valuable for:

  • Responsiveness: Long operations don’t freeze the UI
  • I/O Efficiency: Many threads can wait for I/O simultaneously
  • Code Clarity: Model problem structure naturally
  • Resource Utilization: CPU not idle while threads wait

It’s NOT for raw execution speed on CPU-intensive work.

Problem 12: Many-to-One Thread Mapping

Consider an OS where the threads are implemented by a user-level thread library and all threads of a process are mapped to just one kernel thread (many-to-one mapping). Consider a process that has two threads: T1 and T2, and T1 is running.

True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.

Student Solution

Answer by Justin:

False.

When T1 blocks while executing an I/O system call, the OS cannot switch to T2 and start running it! This is because in a many-to-one mapping, if a thread is blocked while executing an I/O system call, all other threads are also blocked.

Elaboration (by LLM):

Many-to-One Thread Mapping Architecture:

User-Level Threads:        Kernel Thread:
T1 ┐
T2 ├─────────────→ Kernel Thread 1
T3 │
T4 ┘

All user-level threads mapped to single kernel thread

The Problem: Blocking at Kernel Level

// T1 makes system call
T1: read(fd, buffer, 1024);  // ← System call blocks
 
At kernel level:
┌─────────────────────────────────┐
│ Kernel Thread 1 BLOCKED         │  ← Waiting for I/O
│ (waiting for disk/network)      │
└─────────────────────────────────┘
 
Meanwhile in user space:
T1: BLOCKED (part of kernel thread)
T2: RUNNABLE (but stuck - kernel thread blocked!)
T3: RUNNABLE (but stuck - kernel thread blocked!)
T4: RUNNABLE (but stuck - kernel thread blocked!)

Why Can’t T2 Run?

The scheduler only sees one kernel thread:

OS Scheduler:
┌─────────────────────────────────────┐
│ Process A:                          │
│   Kernel Thread 1: BLOCKED (I/O)    │
│                                     │
│ Process B:                          │
│   Kernel Thread 1: RUNNABLE         │
└─────────────────────────────────────┘

Scheduler decision: Run Process B's thread
Result: T2 still doesn't run! All user-level threads stuck

Timeline Example:

Time 0-100ms:
  T1 calls read() system call
  ├─ Kernel thread transitions to BLOCKED
  ├─ T2, T3, T4 all BLOCKED (indirectly)
  └─ Disk I/O in progress...

Time 100-150ms:
  OS schedules other processes
  T2, T3, T4 still cannot run
  Even though they're READY

Time 150ms:
  Disk I/O completes
  Kernel thread becomes READY
  Scheduler runs kernel thread again
  ├─ T1 resumes from read()
  └─ T2, T3, T4 can only run after T1 yields

The Critical Issue: Blocking at Wrong Level

User Library (Green Threads):
┌─────────────────────────────────┐
│ T1, T2, T3, T4                  │  ← User-level scheduler
│ (User-level thread scheduler)   │  ← Can manage these
└─────────────────────────────────┘

Kernel Level:
┌─────────────────────────────────┐
│ Only sees: 1 kernel thread      │  ← OS sees ONE thread
│ (user library's scheduler info) │  ← OS unaware of 4 user threads
│                                 │
│ When kernel thread BLOCKS,      │
│ user library loses control!     │
└─────────────────────────────────┘

Visual Comparison:

Wanted (what the user-level library tried to do):
T1: [=====I/O BLOCK=====]
T2:      [====run====]        ← switch while T1 blocked
T3:                   [====run====]

What Actually Happens:
Kernel_Thread: [=====BLOCKED=====][T1 continues][T2][T3]
               ↑ T2, T3 stuck until the kernel thread unblocks!

Why Many-to-One Fails:

  1. Blocking is at kernel level - OS scheduler doesn’t know about user threads
  2. No visibility - Kernel sees only one thread, doesn’t know others exist
  3. Lost control - User-level scheduler can’t run when kernel thread blocked
  4. Cascade failure - One I/O call blocks entire process

Code Example:

// Many-to-one mapping:
// User library: Green threads (T1, T2, T3)
// Kernel: Single kernel thread
 
void* thread1(void* arg) {
    printf("T1 start\n");
    read(fd, buffer, 1024);  // ← Blocks at kernel level!
    // All other threads blocked here!
    printf("T1 end\n");
}
 
void* thread2(void* arg) {
    printf("T2 running\n");  // ← Never reaches here while T1 blocked
}
 
int main() {
    start_thread(thread1);
    start_thread(thread2);
    // When T1 calls read():
    // ├─ Kernel thread blocked
    // ├─ T2 cannot run (stuck with T1)
    // └─ Entire process appears frozen
}

Historical Context:

Many-to-one mapping (used by early user-level “green threads” libraries, for example in early Java runtimes) suffered from this problem: any blocking I/O in one thread froze the entire process.

Solution: One-to-One or Many-to-Many Mapping (see Problem 13)

Problem 13: One-to-One Thread Mapping

Consider an OS where each user-level thread has a corresponding kernel-level thread (one-to-one mapping), as in Linux. Consider a process that has two threads, T1 and T2, and T1 is running.

True/False: When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. Justify your answer.

Student Solution

Answer by Justin:

True.

When T1 blocks while executing an I/O system call, the OS can switch to T2 and start running it. This is because in a one-to-one mapping, each user-level thread is mapped to a kernel-level thread. This means that if one thread gets blocked, all other threads will not be affected, allowing the OS to switch to them and start running!

Elaboration (by LLM):

One-to-One Thread Mapping Architecture:

User-Level Threads:     Kernel Threads:
T1 ──────────────────→ Kernel Thread 1
T2 ──────────────────→ Kernel Thread 2
T3 ──────────────────→ Kernel Thread 3
T4 ──────────────────→ Kernel Thread 4

Each user thread has dedicated kernel thread

The Solution: Independent Kernel Threads

// T1 makes system call
T1: read(fd, buffer, 1024);  // ← System call blocks
 
At kernel level:
Kernel Thread 1: BLOCKED (waiting for I/O)
Kernel Thread 2: RUNNABLE (independent!)
Kernel Thread 3: RUNNABLE (independent!)
Kernel Thread 4: RUNNABLE (independent!)
 
OS Scheduler sees 4 independent kernel threads
Decision: Run Kernel Thread 2 (for T2)
Result: T2 RUNS while T1 blocked!

Why T2 CAN Run:

OS scheduler has visibility into all kernel threads:

OS Scheduler:
┌─────────────────────────────────────┐
│ Process A:                          │
│   Kernel Thread 1: BLOCKED (I/O)    │
│   Kernel Thread 2: RUNNABLE    ← Can run this!
│   Kernel Thread 3: RUNNABLE    ← Or this!
│   Kernel Thread 4: RUNNABLE    ← Or this!
│                                     │
│ Process B:                          │
│   Kernel Thread 1: RUNNABLE         │
└─────────────────────────────────────┘

Scheduler: "K-Thread 1 blocked, run K-Thread 2"
Result: T2 executes immediately

Timeline Example:

Time 0-50ms:
  T1 calls read() system call
  ├─ Kernel Thread 1 transitions to BLOCKED
  └─ Other kernel threads remain RUNNABLE

Time 50-51ms:  ← CRITICAL DIFFERENCE
  OS scheduler notices K-Thread 1 blocked
  Scheduler selects K-Thread 2 (for T2)
  ├─ K-Thread 2 scheduled on CPU
  ├─ T2 RUNS immediately
  └─ T1 still waiting for I/O

Time 100ms:
  Disk I/O completes for T1
  ├─ K-Thread 1 becomes READY
  ├─ T1 can resume
  └─ Both T1 and T2 can now run (on different cores or time-sliced)

Visual Comparison: One-to-One vs Many-to-One

MANY-TO-ONE (Problem 12):
Kernel_Thread: [====BLOCKED====][T1 continues][T2][T3]
               ↑ T2, T3 stuck!

ONE-TO-ONE (Problem 13):
K_Thread 1: [====BLOCKED====]            [K_Thread 1 resumes]
K_Thread 2:      [====T2 runs====]
K_Thread 3:           [====T3 runs====]
K_Thread 4:                [====T4 runs====]
           ↑ All threads progressing concurrently!

Why One-to-One Works:

  1. OS Visibility: Kernel sees each thread independently
  2. Independent Blocking: One thread’s I/O doesn’t affect others
  3. Scheduler Control: OS can schedule any ready kernel thread
  4. True Concurrency: Multiple threads actually progressing

Code Example:

// One-to-one mapping (Linux pthreads, modern systems):
// Each pthread_t has dedicated kernel thread
 
void* thread1(void* arg) {
    printf("T1 start\n");
    read(fd, buffer, 1024);  // ← K-Thread 1 blocks
    // K-Thread 2, 3, 4 can still run!
    printf("T1 end\n");
}
 
void* thread2(void* arg) {
    printf("T2 running\n");  // ← Executes while T1 blocked!
}
 
int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
 
    // When T1 calls read():
    // ├─ K-Thread 1 blocked
    // ├─ Scheduler switches to K-Thread 2
    // ├─ T2 runs immediately
    // └─ Both threads progressing concurrently
 
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

Real-World Impact:

Web Server Example: Handling 100 clients

Many-to-One (Problem 12):
  One client's network I/O blocks entire server
  Result: Server can't handle other clients
  Throughput: 1 client effectively at a time

One-to-One (Problem 13):
  Client A's I/O blocked → Server handles Client B
  Client B's I/O blocked → Server handles Client C
  Result: Server handles many clients concurrently
  Throughput: ~100 clients simultaneously

Linux Implementation:

Linux uses one-to-one mapping:

  • Each pthread is backed by its own kernel task, created with the clone() system call
  • These kernel-visible threads are sometimes called lightweight processes (LWPs)
  • The kernel scheduler sees and manages all of them independently
  • Result: true POSIX thread semantics (see the sketch below)
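
A Linux-specific sketch that makes this visible (SYS_gettid returns the id of the kernel task backing the calling thread; this is not portable beyond Linux):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

static void* worker(void* arg) {
    /* gettid() returns the kernel task id backing this thread */
    printf("pid=%d tid=%ld\n", (int)getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* same pid on both lines, different tids: one kernel thread per pthread */
    return 0;
}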

Trade-offs:

One-to-One Advantages:
✅ True concurrency (one thread's I/O doesn't block others)
✅ Better utilization (CPU not idle while threads wait)
✅ Simpler implementation (kernel does scheduling)
✅ Standard behavior (POSIX pthreads)

One-to-One Disadvantages:
❌ More memory (each thread needs kernel resources)
❌ More context switches (more kernel threads to manage)
❌ Limited threads (can create fewer threads than many-to-one)

Conclusion:

One-to-one mapping solves the blocking problem by giving each user thread its own kernel thread. The OS scheduler can now make independent decisions about which threads to run, enabling true concurrency even on single-core systems.

Problem 14: Fork with Multi-Threaded Process

Consider a process with 4 threads in Linux. When a thread invokes fork() system call to create a new process, how many threads will the new process have? Briefly explain.

Student Solution

Answer by Justin:

When a thread invokes the fork() system call to create a new process, the new process will have one thread. This is because this newly forked process will not clone the threads of the previous process.

Elaboration (by LLM):

Why Fork Only Copies One Thread:

When fork() is called, the OS creates a completely new process with a copy of the address space, but it does NOT copy all the threads. Only the thread that called fork() is replicated in the child process.

Process Before Fork:

Parent Process (4 threads):
┌─────────────────────────────────────┐
│ Thread 1                            │
│ Thread 2 (calls fork())  ← HERE     │
│ Thread 3                            │
│ Thread 4                            │
└─────────────────────────────────────┘

After Fork:

Parent Process (still 4 threads):        Child Process (1 thread):
┌─────────────────────────────────────┐  ┌──────────────────────┐
│ Thread 1                            │  │ Thread 2 (copy)      │
│ Thread 2 (original)                 │  │ (isolated from       │
│ Thread 3                            │  │  T1, T3, T4)         │
│ Thread 4                            │  └──────────────────────┘
└─────────────────────────────────────┘

Code Example:

void* thread_func(void* arg) {
    printf("This is thread %ld\n", (long)arg);
    sleep(2);  // Long operation
    return NULL;
}
 
int main() {
    pthread_t t1, t2, t3;
 
    // Create 3 additional threads
    pthread_create(&t1, NULL, thread_func, (void*)1);
    pthread_create(&t2, NULL, thread_func, (void*)2);
    pthread_create(&t3, NULL, thread_func, (void*)3);
 
    sleep(1);
 
    pid_t pid = fork();
 
    if (pid == 0) {
        // Child process
        printf("Child process has how many threads? Only 1!\n");
        printf("Threads 1, 2, 3 don't exist in child\n");
        exit(0);
    } else {
        // Parent process
        printf("Parent still has 4 threads\n");
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        wait(NULL);  // Wait for child
    }
}

Why This Design?

  1. Thread State Complexity: Other threads might hold mutexes, be in critical sections, or have inconsistent state
  2. Resource Clarity: Child process gets clean, isolated resources
  3. Deadlock Prevention: Avoids situations where child inherits held locks from parent

Problem Scenario:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 
void* thread_with_lock(void* arg) {
    pthread_mutex_lock(&lock);
    // Do some work
    sleep(10);
    pthread_mutex_unlock(&lock);
    return NULL;
}
 
int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, thread_with_lock, NULL);
    sleep(1);  // Let thread acquire lock
 
    fork();
    // In child: lock is HELD but thread doesn't exist!
    // Any attempt to lock() = DEADLOCK
}
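
POSIX offers pthread_atfork() for exactly this hazard: registered handlers take the locks before fork() and release them on both sides afterward, so the child never inherits a lock in a held state. A minimal sketch (the prepare/release names are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void prepare(void) { pthread_mutex_lock(&lock); }    /* runs just before fork */
static void release(void) { pthread_mutex_unlock(&lock); }  /* runs after, both sides */

int main(void) {
    pthread_atfork(prepare, release, release);

    pid_t pid = fork();  /* the lock is held across fork, then released on both sides */
    if (pid == 0) {
        pthread_mutex_lock(&lock);  /* safe: the child's copy is unlocked */
        printf("child acquired the lock\n");
        pthread_mutex_unlock(&lock);
        _exit(0);
    }
    wait(NULL);
    return 0;
}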

Memory Layout:

Parent Memory (before fork):          Child Memory (after fork):
┌──────────────────────────┐          ┌──────────────────────────┐
│ Code Segment (shared)    │  ──────► │ Code Segment (copy)      │
│ Global Data              │  ──────► │ Global Data (copy)       │
│ Heap                     │  ──────► │ Heap (copy)              │
│ Stack (4 stacks):        │          │ Stack (1 stack):         │
│  T1 stack                │          │  T2 stack (copy)         │
│  T2 stack (main)         │  ──────► │                          │
│  T3 stack                │          │  (T1, T3, T4 stacks not  │
│  T4 stack                │          │   copied)                │
└──────────────────────────┘          └──────────────────────────┘

Key Point:

fork() creates a snapshot of the calling thread’s state. Other threads are NOT copied. If you need a multi-threaded child, re-create the needed threads in the child after fork(), or start a fresh program with posix_spawn()/exec().

Problem 15: Exec with Multi-Threaded Process

Consider a process with 4 threads in Linux. When a thread invokes exec() system call to load a new executable on top of the current process, how many threads will the new process have? Briefly explain.

Student Solution

Answer by Justin:

When a thread invokes the exec() system call to load a new executable on top of the current process, this new process will have just one thread that executes what was passed in the system call, and the previous process and all other threads will be terminated.

Elaboration (by LLM):

Exec Replaces Entire Process Image

When any thread calls exec(), the entire process image is replaced - all threads are terminated, and a single new thread starts executing the new program.

Process Before Exec:

Process (4 threads):
┌─────────────────────────────────────┐
│ Thread 1                            │
│ Thread 2 (calls exec())   ← HERE    │
│ Thread 3                            │
│ Thread 4                            │
└─────────────────────────────────────┘
Address space: Old program binary

After Exec:

Process (1 thread):
┌─────────────────────────────────────┐
│ Main Thread (NEW PROGRAM)           │
│ (Threads 1,2,3,4 ALL TERMINATED)    │
│                                     │
│                                     │
└─────────────────────────────────────┘
Address space: NEW program binary
Old threads: DESTROYED

Code Example:

void* thread_func(void* arg) {
    long thread_num = (long)arg;
    printf("Thread %ld started\n", thread_num);
 
    if (thread_num == 2) {
        sleep(1);
        printf("Thread 2 calling exec...\n");
        execl("/bin/ls", "ls", "-l", (char*)NULL);
        // Code here NEVER executes (exec replaces the process image)
        printf("This line never prints\n");
    } else {
        sleep(10);  // Simulate work
        printf("Thread %ld still running\n", thread_num);
    }
    return NULL;
}
 
int main() {
    pthread_t t1, t2, t3;
 
    pthread_create(&t1, NULL, thread_func, (void*)1);
    pthread_create(&t2, NULL, thread_func, (void*)2);
    pthread_create(&t3, NULL, thread_func, (void*)3);
 
    pthread_join(t1, NULL);  // Will never return (process replaced!)
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
}

Output:

Thread 1 started
Thread 2 started
Thread 3 started
Thread 2 calling exec...
[Directory listing from ls command]
(Process terminates)

Note: Threads 1 and 3 never print their completion messages

Key Differences: Fork vs Exec vs Thread Creation

Operation | Parent Process | New Process/Thread | Other Threads
fork() | Continues running | Created with 1 thread | Stay in parent
exec() | REPLACED (terminates) | New image with 1 thread | ALL TERMINATED
pthread_create() | Continues running | New thread in SAME process | Continue in same process

Why Exec Kills All Threads:

  1. Program Replacement: Entire memory layout changes
  2. Resource Cleanup: All thread-local storage, stacks invalidated
  3. State Corruption: Old thread stacks no longer accessible
  4. No Recovery: Can’t resume old threads with new binary

Visualization:

Memory Before Exec:
┌─────────────────────────┐
│ Old Program Code        │
│ Old Global Variables    │
│ Heap                    │
│ Stacks (4 threads):     │
│  T1 stack ┐             │
│  T2 stack ├─ Being used │
│  T3 stack │             │
│  T4 stack ┘             │
└─────────────────────────┘

After exec("/new/program", ...):
┌─────────────────────────┐
│ NEW Program Code        │  ← Completely different!
│ NEW Global Variables    │
│ NEW Heap (empty)        │
│ Main Thread Stack       │  ← Only one stack
│                         │
│                         │
│                         │
└─────────────────────────┘
Old stacks: DESTROYED

Resource Management:

When exec() is called:

  • File descriptors remain OPEN unless marked close-on-exec (FD_CLOEXEC)
  • Ordinary mutexes vanish with the old image; a process-shared mutex left locked can deadlock other processes waiting on it
  • Thread-local storage is DESTROYED
  • Memory mappings are REPLACED

Example - Abandoned Lock:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
 
int main() {
    pthread_mutex_lock(&lock);
 
    execl("/bin/echo", "echo", "hello", (char*)NULL);
    // The lock is never unlocked by this image.
    // For a process-shared mutex (in shared memory), other
    // processes waiting on it would now DEADLOCK.
 
    pthread_mutex_unlock(&lock);  // Never reached
}

Safe Pattern:

If you need to exec() from a multi-threaded process:

int main() {
    pid_t pid = fork();
 
    if (pid == 0) {
        // Child process - safe to exec
        // Other threads only existed in parent
        execl("/new/program", "args", NULL);
    } else {
        // Parent continues with all threads
        wait(NULL);  // Wait for child
    }
}

Conclusion:

exec() is a process replacement, not just a program loader. All threads (except the calling thread, which gets replaced) are terminated instantly. This is why exec is usually preceded by fork() - fork creates a new process, then exec replaces only that new process’s image.

Problem 16: Thread Execution Output

What’s the output of the following code when run in a Unix system? Assume you have 2 functions:

(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread

(2) JoinThread(int threadId) is used to wait for the termination of a thread.

int g = 0;
int tids[2];
 
void T(int arg) {
    if (arg == 3) {
        g++;
        printf("[MainThread] g: %d\n", g);
    }
    if (arg == 1) {
        g++;
        printf("[Thread1] g: %d\n", g);
    }
    if (arg == 0) {
        JoinThread(tids[1]);
        g++;
        printf("[Thread0] g: %d\n", str, g);
    }
}
 
int main() {
    tids[1] = CreateThread(T, 1);
    tids[0] = CreateThread(T, 0);
    JoinThread(tids[0]);
    T(3);
}
Student Solution

Answer by Justin:

[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3

Elaboration (by LLM):

Execution Trace:

This problem demonstrates thread synchronization and shared state. Let’s trace the execution step-by-step:

Timeline:

Time 0: Main executes
        tids[1] = CreateThread(T, 1);  // Creates Thread1
        tids[0] = CreateThread(T, 0);  // Creates Thread0
        JoinThread(tids[0]);           // Main blocks waiting for Thread0

Time 1: Thread1 starts (runs T(1))
        g becomes 1
        Prints: "[Thread1] g: 1"
        Thread1 terminates

Time 2: Thread0 starts (runs T(0))
        Calls: JoinThread(tids[1]);    // Waits for Thread1 (already done!)
        Returns immediately
        g++; (g becomes 2)
        Prints: "[Thread0] g: 2"
        Thread0 terminates

Time 3: Main's JoinThread(tids[0]) returns
        T(3) is called
        g++; (g becomes 3)
        Prints: "[MainThread] g: 3"

Key Synchronization Points:

             Main Thread              Thread0           Thread1
                 |                       |                |
    CreateThread(T, 1) ─────────────────────────────────► T(1)
                 |                                         |
    CreateThread(T, 0) ───────────────► T(0)             |
                 |                       |                g++
    JoinThread(tids[0]) {BLOCKS}        |            printf g:1
                 |                   JoinThread(1)       |
                 | (waiting)             | (waits for T1) |
                 |                       |          (done)|
                 |                   (returns)           |
                 |                       g++              |
                 |                   printf g:2           |
                 |                       | (done)         |
                 |◄────────── (returns) ◄─                |
                 |
             T(3)
             g++
          printf g:3

Shared State Evolution:

Initial: g = 0

After T(1):  g = 1   [Thread1] g: 1
After T(0):  g = 2   [Thread0] g: 2
After T(3):  g = 3   [MainThread] g: 3

Why This Order?

  1. Thread1 increments first: even if Thread0 is scheduled earlier, it blocks in JoinThread(tids[1]) until Thread1 finishes
  2. Thread0 waits for Thread1 via JoinThread(tids[1])
  3. Main waits for Thread0 via JoinThread(tids[0])
  4. Main runs T(3) last

Is the Order Guaranteed?

Yes. Thread scheduling by itself is non-deterministic, but the two JoinThread() calls impose a total order on the updates to g:

  • Thread0 cannot increment g until Thread1 has terminated
  • Main cannot increment g until Thread0 has terminated

Even if Thread0 gets the CPU before Thread1, its first action is JoinThread(tids[1]), so it immediately blocks. The printed sequence is therefore always Thread1, Thread0, MainThread.

Important Detail:

The JoinThread(tids[1]) in Thread0 is crucial. It forces Thread0 to wait for Thread1 to complete before continuing. This ensures:

  • Thread1 increments g first
  • Thread0 increments g second
  • Main increments g third

Memory Visibility:

All threads share the same g variable:

int g = 0;  // Global - shared by all threads

Changes by one thread become visible to the others; here the JoinThread() calls additionally guarantee that each increment is observed before the next one happens.

Mutex-Free Synchronization:

This example uses JoinThread() for synchronization rather than mutexes. The join() forces a specific execution order rather than protecting data.
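
For readers who want to run the trace, a pthreads translation of the exercise is sketched below (CreateThread/JoinThread are mapped onto pthread_create/pthread_join, and the T(3) branch is inlined in main; the intptr_t casts are just an implementation detail of passing small integers as void*):

#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

int g = 0;
pthread_t tids[2];

static void* T(void* arg) {
    intptr_t n = (intptr_t)arg;
    if (n == 1) {
        g++;
        printf("[Thread1] g: %d\n", g);
    }
    if (n == 0) {
        pthread_join(tids[1], NULL);   /* wait for Thread1 first */
        g++;
        printf("[Thread0] g: %d\n", g);
    }
    return NULL;
}

int main(void) {
    pthread_create(&tids[1], NULL, T, (void*)(intptr_t)1);
    pthread_create(&tids[0], NULL, T, (void*)(intptr_t)0);
    pthread_join(tids[0], NULL);
    g++;
    printf("[MainThread] g: %d\n", g);   /* the arg==3 branch, inlined */
    return 0;
}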

Problem 17: Thread Execution Output (Variant)

What’s the output of the following code when run in a Unix system? Assume you have 2 functions:

(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread

(2) JoinThread(int threadId) is used to wait for the termination of a thread.

int g = 0;
int tids[2];
 
void T(int arg) {
    if (arg == 3) {
        g++;
        printf("[MainThread] g: %d\n", g);
    }
    if (arg == 1) {
        g++;
        printf("[Thread1] g: %d\n", g);
    }
    if (arg == 0) {
        JoinThread(tids[1]);
        g++;
        printf("[Thread0] g: %d\n", str, g);
    }
}
 
int main() {
    tids[1] = CreateThread(T, 1);
    tids[0] = CreateThread(T, 0);
    JoinThread(tids[0]);
    T(3);
}
LLM Solution
[Thread1] g: 1
[Thread0] g: 2
[MainThread] g: 3

Elaboration (by LLM):

Same Code, Same Output

This problem is identical to Problem 16 (same code, same expected output). The output is fully determined by the JoinThread() synchronization rather than by scheduling.

Why This Output?

The output is determined by the JoinThread() calls which enforce execution order:

main: CreateThread(T, 1) → starts Thread1
main: CreateThread(T, 0) → starts Thread0
main: JoinThread(tids[0]) → blocks waiting for Thread0

Thread1: runs T(1)
         g++ → g = 1
         printf "[Thread1] g: 1"
         returns

Thread0: runs T(0)
         JoinThread(tids[1]) → waits for Thread1 (already done)
         returns immediately
         g++ → g = 2
         printf "[Thread0] g: 2"
         returns

main: JoinThread(tids[0]) returns
      T(3)
      g++ → g = 3
      printf "[MainThread] g: 3"

Guaranteed Order:

The synchronization guarantees this specific order:

  1. Thread1 before Thread0: Thread0 calls JoinThread(tids[1]) forcing it to wait
  2. Thread0 before Main: Main calls JoinThread(tids[0]) forcing it to wait
  3. Sequential increment: g = 1 → 2 → 3

Comparison with Problem 16:

Problems 16 and 17 have identical code and identical output. This variant exists to emphasize that:

  • Thread creation order doesn’t guarantee execution order
  • Synchronization primitives (join) enforce ordering
  • Data races are avoided through proper synchronization

Problem 18: Parallel Summation

Consider parallel summation: You are given an array A[0..N-1], and you are asked to compute the sum of the elements of this array in parallel as follows: The main thread will create 4 threads, each thread will compute the sum of N/4 elements of the array and terminate. The main thread will then wait for the termination of the worker threads, compute the overall sum, print it on the screen and terminate. Assume you have 2 functions:

(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread

(2) JoinThread(int threadId) is used to wait for the termination of a thread. Assume that array A is global and N is divisible by 4.

#include <stdio.h>
#define N ...
 
int A[N]; // Array to be summed
Student Solution

Answer by Justin:

#include <stdio.h>
#define N ...
 
int A[N];
int partialSum[4];
 
void sum(int id) {
    int s = 0;
    int start = id * (N/4);
    int end = start + (N/4);
 
    for (int i = start; i < end; i++) {
        s += A[i];
    }
 
    partialSum[id] = s;
}
 
int main() {
    int threads[4];
    for (int i = 0; i < 4; i++) {
        threads[i] = CreateThread(sum, i);
    }
 
    for (int i = 0; i < 4; i++) {
        JoinThread(threads[i]);
    }
 
    int s = 0;
    for (int i = 0; i < 4; i++) {
        s += partialSum[i];
    }
 
    printf("The sum is %d", s);
    exit(0);
}

Elaboration (by LLM):

Parallel Summation Strategy:

This solution divides the array into 4 equal parts and assigns each part to a separate thread. Each thread independently computes a partial sum, then the main thread combines the results.

Thread Work Distribution:

Array A[0..N-1]:

T0 computes:  A[0] + A[1] + ... + A[N/4-1]        → partialSum[0]
T1 computes:  A[N/4] + A[N/4+1] + ... + A[N/2-1]  → partialSum[1]
T2 computes:  A[N/2] + ... + A[3N/4-1]            → partialSum[2]
T3 computes:  A[3N/4] + ... + A[N-1]              → partialSum[3]

Main combines: partialSum[0] + partialSum[1] + partialSum[2] + partialSum[3]

Memory Layout:

Shared Memory (visible to all threads):
┌────────────────────────────────────┐
│ int A[N]           (shared)        │
│ int partialSum[4]  (shared)        │
│ int threads[4]     (main's stack)  │
└────────────────────────────────────┘

Thread 0 Private:        Thread 1 Private:      Thread 2 Private:      Thread 3 Private:
┌────────────────┐      ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│ id = 0         │      │ id = 1         │    │ id = 2         │    │ id = 3         │
│ s = (local sum)│      │ s = (local sum)│    │ s = (local sum)│    │ s = (local sum)│
│ start = 0      │      │ start = N/4    │    │ start = N/2    │    │ start = 3N/4   │
│ end = N/4      │      │ end = N/2      │    │ end = 3N/4     │    │ end = N        │
└────────────────┘      └────────────────┘    └────────────────┘    └────────────────┘

Execution Timeline:

Time 0:
  Main: CreateThread(sum, 0) → T0 starts
  Main: CreateThread(sum, 1) → T1 starts
  Main: CreateThread(sum, 2) → T2 starts
  Main: CreateThread(sum, 3) → T3 starts
  Main: JoinThread(threads[0])... (waits for all)

Time 1-T:
  T0: s = A[0] + A[1] + ... + A[N/4-1]          (local 's' accumulator)
  T1: s = A[N/4] + A[N/4+1] + ... + A[N/2-1]    (local 's' accumulator)
  T2: s = A[N/2] + ... + A[3N/4-1]              (local 's' accumulator)
  T3: s = A[3N/4] + ... + A[N-1]                (local 's' accumulator)

Time T+1:
  T0: partialSum[0] = s; (write to shared memory)
  T1: partialSum[1] = s;
  T2: partialSum[2] = s;
  T3: partialSum[3] = s;
  (All threads terminate)

Time T+2:
  Main: (JoinThread calls return)
  Main: s = 0
  Main: s += partialSum[0] + partialSum[1] + partialSum[2] + partialSum[3]
  Main: printf("The sum is %d", s)

Key Design Patterns:

1. No Synchronization Needed on Reads

Each thread works on a different part of the array:

Thread0: elements [0, N/4)
Thread1: elements [N/4, N/2)
Thread2: elements [N/2, 3N/4)
Thread3: elements [3N/4, N)

No overlap → No race conditions on reading A[]

2. Single Write Per Thread

Each thread writes to a different partialSum element:

Thread0 writes only to partialSum[0]
Thread1 writes only to partialSum[1]
Thread2 writes only to partialSum[2]
Thread3 writes only to partialSum[3]

No contention → No mutex needed

3. Synchronization via Join

Main thread waits for all workers:

for (int i = 0; i < 4; i++) {
    JoinThread(threads[i]);  // Wait for thread to finish
}
// Now safe to read all partialSum values
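
For reference, the same create-all/join-all pattern as a compilable POSIX-threads program (a sketch: NUM, WORKERS, sum_worker, and the demo data are illustrative, and long long anticipates sums that would overflow int):

#include <stdio.h>
#include <pthread.h>

#define NUM 1000000              /* demo size; any multiple of WORKERS */
#define WORKERS 4

int A[NUM];
long long partialSum[WORKERS];   /* partial sums can exceed int range */

void *sum_worker(void *arg) {
    int id = (int)(long)arg;     /* worker id smuggled through void* */
    long long s = 0;
    for (int i = id * (NUM / WORKERS); i < (id + 1) * (NUM / WORKERS); i++)
        s += A[i];
    partialSum[id] = s;          /* each worker writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t threads[WORKERS];
    for (int i = 0; i < NUM; i++)
        A[i] = i + 1;            /* sum should be NUM*(NUM+1)/2 */

    for (long i = 0; i < WORKERS; i++)
        pthread_create(&threads[i], NULL, sum_worker, (void *)i);

    long long s = 0;
    for (int i = 0; i < WORKERS; i++) {
        pthread_join(threads[i], NULL);  /* read a slot only after its writer is done */
        s += partialSum[i];
    }
    printf("The sum is %lld\n", s);
    return 0;
}

Build with cc -pthread; the join loop doubles as the synchronization point after which partialSum is safe to read.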

Performance Analysis:

Without parallelism:

Sequential: O(N) operations on single thread
Time: N operations

With 4 threads:

Parallel: Each thread does N/4 operations
Time: ~N/4 + overhead (near-linear speedup)
Speedup: ~4x on 4-core system

Amdahl’s Law Application:

Sequential portion (main thread combining): ~4 additions
Parallel portion (thread work): ~N-4 additions

Speedup = 1 / (epsilon + (1-epsilon)/4)  where epsilon ≈ 4/N

For large N: Speedup ≈ 4x
For small N: Speedup less than 4x
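
Plugging concrete numbers into the formula above, with N = 1,000,000 the serial fraction is epsilon = 4/N = 0.000004:

Speedup = 1 / (0.000004 + 0.999996/4) ≈ 1 / 0.250003 ≈ 3.99995

so the merge cost is negligible and the full 4x is effectively reached; only for very small arrays does the serial portion bite.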

Potential Issues and Improvements:

Issue 1: False Sharing

If partialSum array elements are on same cache line:

┌─────────────────────────────────────────────┐
│ Cache Line (64 bytes)                      │
│ [partialSum[0]] [partialSum[1]] [partialSum[2]] [partialSum[3]] │
└─────────────────────────────────────────────┘

All threads writing to same cache line → contention

Solution:

// Pad array to avoid false sharing
typedef struct {
    int sum;
    char padding[64 - sizeof(int)];  // Align to cache line
} aligned_sum;
 
aligned_sum partialSum[4];
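
On C11 compilers the same layout can be requested with an alignment specifier instead of hand-counted padding (a sketch; alignas is the standard spelling from <stdalign.h>):

#include <stdalign.h>

// Each element is forced onto its own 64-byte cache line
typedef struct {
    alignas(64) int sum;
} aligned_sum;

aligned_sum partialSum[4];   /* sizeof(aligned_sum) == 64 */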

Issue 2: Thread Creation Overhead

Creating 4 threads has overhead. For very small N, sequential might be faster.

Break-even: Usually N > 10,000 elements

Complete Example with Different Array Values:

// Example: Array with 1000000 elements
#define N 1000000
int A[N];
 
// Initialize array
for (int i = 0; i < N; i++) {
    A[i] = i + 1;  // Sum = 1+2+...+1000000 = 500000500000
}
 
// NOTE: the totals below exceed the range of a 32-bit int, so for this
// data s and partialSum would need to be declared long long.
// Thread 0 computes: 1+2+...+250000       = 31,250,125,000
// Thread 1 computes: 250001+...+500000    = 93,750,125,000
// Thread 2 computes: 500001+...+750000    = 156,250,125,000
// Thread 3 computes: 750001+...+1000000   = 218,750,125,000
 
// Total: 500,000,500,000

Problem 19: Multi-Threaded Minimum Element

Write a multi-threaded program to find minimum element of an array A[0..N-1] with two threads. The first thread should find the minimum of A[0..N/2] and the second thread should find the minimum of A[N/2+1..N-1]. Each thread should start immediately upon being created and each must execute the same function to find the minimum. After threads have been terminated, the minimum element of the array should be printed by the main thread. Assume that array A is global and N is divisible by 2.

LLM suggested to:

Fix part of the above text to:

  • The first thread should find the minimum of A[0..N/2-1] and the second thread should find the minimum of A[N/2..N-1]
#include <stdio.h>
#define N ...
 
int A[N]; // Array for which the minimum is to be computed
Student Solution

Answer by Justin:

#include <stdio.h>
#define N ...
 
int A[N];
int smallest[2];
 
void findMin(int id) {
    int start = id * (N/2);
    int end = start + (N/2);
    int small = A[start];
 
    for (int i = start + 1; i < end; i++) {
        if (A[i] < small) {
            small = A[i];
        }
    }
 
    smallest[id] = small;
}
 
int main() {
    int threads[2];
    for (int i = 0; i < 2; i++) {
        threads[i] = CreateThread(findMin, i);
    }
 
    for (int i = 0; i < 2; i++) {
        JoinThread(threads[i]);
    }
 
    if (smallest[0] < smallest[1]) {
        printf("The smallest number is %d", smallest[0]);
    } else {
        printf("The smallest number is %d", smallest[1]);
    }
 
    exit(0);
}

Elaboration (by LLM):

Parallel Minimum Finding:

This solution divides the array into 2 halves and finds the minimum in each half using separate threads, then combines the results.

Thread Work Distribution:

Array A[0..N-1]:

Thread 0:
  start = 0 * (N/2) = 0
  end = 0 + (N/2) = N/2
  Searches: A[0], A[1], ..., A[N/2-1]
  Result: smallest[0] = min(A[0..N/2-1])

Thread 1:
  start = 1 * (N/2) = N/2
  end = N/2 + (N/2) = N
  Searches: A[N/2], A[N/2+1], ..., A[N-1]
  Result: smallest[1] = min(A[N/2..N-1])

Memory Diagram:

Shared Memory:
┌──────────────────────────────────────┐
│ A[0]  A[1]  ... A[N/2-1]  A[N/2] ... A[N-1]
│ └─────────┬──────────┘   └──────┬─────┘
│    Thread 0's range       Thread 1's range
│
│ smallest[0]  smallest[1]  (shared results)
└──────────────────────────────────────┘

Algorithm Trace:

Example: A = [7, 2, 9, 1, 5, 3, 8, 4]  (N=8)

Thread 0 execution (id=0):
  start = 0, end = 4
  small = A[0] = 7
  i=1: A[1]=2 < 7  → small = 2
  i=2: A[2]=9 > 2  → small = 2
  i=3: A[3]=1 < 2  → small = 1
  Result: smallest[0] = 1

Thread 1 execution (id=1):
  start = 4, end = 8
  small = A[4] = 5
  i=5: A[5]=3 < 5  → small = 3
  i=6: A[6]=8 > 3  → small = 3
  i=7: A[7]=4 > 3  → small = 3
  Result: smallest[1] = 3

Main combines:
  smallest[0]=1, smallest[1]=3
  1 < 3 → Print "The smallest number is 1"

Execution Timeline:

Main:  CreateThread(findMin, 0) → Thread 0 starts
       CreateThread(findMin, 1) → Thread 1 starts
       JoinThread(threads[0])   → Wait for Thread 0

Thread 0:  Iterates through A[0..N/2-1]
           Compares each with local 'small'
           Writes result to smallest[0]
           Terminates

Thread 1:  (Runs concurrently or sequentially)
           Iterates through A[N/2..N-1]
           Compares each with local 'small'
           Writes result to smallest[1]
           Terminates

Main:  JoinThread(threads[1])   → Wait for Thread 1
       Compare smallest[0] vs smallest[1]
       Print result

Key Design Patterns:

1. Independent Search Ranges

No data race:

Thread 0 searches: [0, N/2)      - disjoint ranges
Thread 1 searches: [N/2, N)      - no overlap

2. Separate Result Storage

No contention:

smallest[0]  ← only written by Thread 0
smallest[1]  ← only written by Thread 1

3. Sequential Result Combination

Main thread safely reads after join:

JoinThread(threads[0]);  // Ensure Thread 0 done writing
JoinThread(threads[1]);  // Ensure Thread 1 done writing
// Now safe to compare smallest[0] and smallest[1]

Handling Edge Cases:

Case 1: Both halves identical

if (smallest[0] < smallest[1]) {
    printf(...smallest[0]);
} else {
    printf(...smallest[1]);  // Handles equality
}

Case 2: Negative numbers

Algorithm works the same:

Array: [-5, 3, -8, 1]
T0: smallest[0] = -8  (finds -8 < -5)
T1: smallest[1] = 1   (finds 1 < 3)
Final: -8 is smallest

Performance Analysis:

Sequential minimum search:

Time: N-1 comparisons
Complexity: O(N)

Parallel with 2 threads:

Each thread: (N/2)-1 comparisons
Time: ~(N/2) + merge overhead  ≈ N/2
Speedup: ~2x on dual-core

Generalized to k threads:

Time: ~(N/k) + (k-1) comparisons for the sequential final merge
Speedup: ~k on a k-core system (for large N)
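
A sketch of that k-thread generalization with POSIX threads (K, min_worker, and the demo data are made up for illustration; assumes N divisible by K):

#include <stdio.h>
#include <limits.h>
#include <pthread.h>

#define N 1024
#define K 8                      /* number of worker threads */

int A[N];
int partialMin[K];

void *min_worker(void *arg) {
    int id = (int)(long)arg;
    int small = INT_MAX;                     /* safe starting point */
    for (int i = id * (N / K); i < (id + 1) * (N / K); i++)
        if (A[i] < small)
            small = A[i];
    partialMin[id] = small;
    return NULL;
}

int main(void) {
    pthread_t t[K];
    for (int i = 0; i < N; i++)
        A[i] = N - i;                        /* demo data: minimum is 1 */

    for (long i = 0; i < K; i++)
        pthread_create(&t[i], NULL, min_worker, (void *)i);

    int best = INT_MAX;
    for (int i = 0; i < K; i++) {            /* sequential (k-1)-comparison merge */
        pthread_join(t[i], NULL);
        if (partialMin[i] < best)
            best = partialMin[i];
    }
    printf("The smallest number is %d\n", best);
    return 0;
}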

Comparison with Summation (Problem 18):

Aspect         Summation (P18)     Minimum (P19)
Threads        4                   2
Parallel Ops   Independent adds    Independent compares
Merge Method   Simple addition     min() comparison
Complexity     O(N) → O(N/4)       O(N) → O(N/2)
Scalability    O(k) threads        O(k) threads

Potential Issues:

Issue 1: Unbalanced Data?

Unlike a search with early exit, minimum-finding always scans everything: each thread performs exactly (N/2)-1 comparisons no matter where the minimum sits.

A = [1, 2, 3, 4, 5, 6, 7, 8]
T0: scans A[0..3], finds 1
T1: scans A[4..7], finds 5
Both threads do the same number of comparisons

So the two halves are inherently balanced here; work imbalance only becomes a concern for early-exit problems such as the key search in the next problem. Main still waits at JoinThread for both threads before combining.

Issue 2: Very Small Array

N = 10:
  Sequential: 9 comparisons
  Parallel (2 threads):
    - Thread creation overhead
    - Each thread: 4 comparisons
    - Total time likely greater than sequential!

Break-even point usually N > 1,000

Issue 3: Initialization Bug

Current code assumes A[start] exists:

int small = A[start];  // OK only if start < N

Safe version (INT_MAX requires #include <limits.h>):

int small = INT_MAX;
for (int i = start; i < end; i++) {
    if (A[i] < small) {
        small = A[i];
    }
}

Problem 20: Parallel Key Search

You are given an unsorted array A[0..N-1] and a key, and you are asked to find the first index where key is stored in parallel as follows: The main thread will create 4 threads, each thread will search N/4 of the elements and terminate. The main thread will then wait for the termination of the worker threads, compute the overall search result, print it on the screen and terminate. If the key does not exist in the array, then print “Key does not exist in the array”. Assume you have 2 functions:

(1) CreateThread(F, arg) creates a thread that starts running at the function F and takes the argument arg and returns the id of the created thread

(2) JoinThread(int threadId) is used to wait for the termination of a thread. Assume that array A and key are both global and N is divisible by 4.

#include <stdio.h>
#define N ...
int A[N]; // Array to be searched
int key;  // Key to be searched
Student Solution

Answer by Justin:

#include <stdio.h>
#define N ...
 
int A[N];
int key;
int index[4];
 
void findKey(int id) {
    int start = id * (N/4);
    int end = start + (N/4);
 
    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            index[id] = i;
            return;
        }
    }
}
 
int main() {
    for (int i = 0; i < 4; i++) {
        index[i] = N;
    }
 
    int threads[4];
    for (int i = 0; i < 4; i++) {
        threads[i] = CreateThread(findKey, i);
    }
 
    for (int i = 0; i < 4; i++) {
        JoinThread(threads[i]);
    }
 
    int smallIndex = N;
    for (int i = 0; i < 4; i++) {
        if (smallIndex < index[i] && index[i] != N) {
            smallIndex = index[i];
        }
    }
 
    if (smallIndex != N) {
        printf("The key was found at index: %d", smallIndex);
    }
    else {
        printf("The key does not exist!");
    }
 
    exit(0);
}

Elaboration (by LLM):

Parallel Key Search Strategy:

This solution performs a parallel linear search by dividing the array into 4 segments. Each thread searches its segment for the key and records the index if found. The main thread then finds the earliest index.

Thread Work Distribution:

Array A[0..N-1]:

Thread 0: start=0,     end=N/4     searches A[0..N/4-1]
Thread 1: start=N/4,   end=N/2     searches A[N/4..N/2-1]
Thread 2: start=N/2,   end=3N/4    searches A[N/2..3N/4-1]
Thread 3: start=3N/4,  end=N       searches A[3N/4..N-1]

If key found by:
  Thread 0 at position 5   → index[0] = 5
  Thread 1 not found       → index[1] = N (sentinel)
  Thread 2 at position 200 → index[2] = 200
  Thread 3 not found       → index[3] = N (sentinel)

Result array: index = [5, N, 200, N]
Final answer: 5 (first occurrence)

Key Design Pattern - Sentinel Value:

// Initialize all to N (sentinel = not found)
for (int i = 0; i < 4; i++) {
    index[i] = N;
}
 
// Each thread updates only if found
if (A[i] == key) {
    index[id] = i;  // Write position of first match in segment
    return;         // EARLY EXIT - stop searching this segment
}
 
// After all threads are done, find the smallest non-sentinel index
// (condition shown corrected; the student's answer reverses it - see below)
int smallIndex = N;
for (int i = 0; i < 4; i++) {
    if (index[i] < smallIndex && index[i] != N) {
        smallIndex = index[i];
    }
}

Algorithm Trace:

Example: A = [3, 7, 2, 9, 5, 1, 8, 4], key = 5, N = 8

Initial: index = [8, 8, 8, 8]

Thread 0 (start=0, end=2):
  i=0: A[0]=3 ≠ 5
  i=1: A[1]=7 ≠ 5
  No match found, index[0] remains 8

Thread 1 (start=2, end=4):
  i=2: A[2]=2 ≠ 5
  i=3: A[3]=9 ≠ 5
  No match found, index[1] remains 8

Thread 2 (start=4, end=6):
  i=4: A[4]=5 == 5  ← MATCH!
  index[2] = 4
  return (stop searching)

Thread 3 (start=6, end=8):
  i=6: A[6]=8 ≠ 5
  i=7: A[7]=4 ≠ 5
  No match found, index[3] remains 8

Result: index = [8, 8, 4, 8]

Main combines (with the corrected condition index[i] < smallIndex):
  smallIndex = 8
  i=0: index[0]=8 → 8 < 8? NO
  i=1: index[1]=8 → 8 < 8? NO
  i=2: index[2]=4 → 4 < 8 && 4 ≠ 8? YES → smallIndex = 4
  i=3: index[3]=8 → 8 < 4? NO

  smallIndex = 4 ≠ 8 → Print "The key was found at index: 4"

Execution Timeline:

Main:
  Initialize index[4] = [N, N, N, N]
  CreateThread(findKey, 0) → Thread 0 starts
  CreateThread(findKey, 1) → Thread 1 starts
  CreateThread(findKey, 2) → Thread 2 starts
  CreateThread(findKey, 3) → Thread 3 starts
  JoinThread(threads[0]) → Wait for all to finish

Threads (concurrent):
  T0: Linear search [0, N/4)
  T1: Linear search [N/4, N/2)
  T2: Linear search [N/2, 3N/4)  ← Finds key, sets index[2], returns
  T3: Linear search [3N/4, N)

Main (after join):
  Find minimum index from [index[0], index[1], index[2], index[3]]
  Print result or "not found"

Important Implementation Details:

1. Early Exit with Return

if (A[i] == key) {
    index[id] = i;    // Record position
    return;           // EXIT IMMEDIATELY
}
// Don't continue searching once found

This matters for both performance and correctness: there is no point scanning the rest of the segment, and returning at the first match guarantees index[id] holds the earliest occurrence within the segment.

2. Sentinel Value (N)

// N is a sentinel because:
for (int i = 0; i < 4; i++) {
    index[i] = N;  // Initialize to "not found"
}
 
// Later...
if (index[i] != N) {  // Check if thread found something
    // Valid index, consider it
}

3. Finding Earliest Index

// Multiple threads might find the key in different segments
// We want the EARLIEST (smallest index)
// Student's version:
int smallIndex = N;
for (int i = 0; i < 4; i++) {
    if (smallIndex < index[i] && index[i] != N) {
        smallIndex = index[i];
    }
}

Wait, there’s a bug here! Since smallIndex starts at N and every index[i] ≤ N, the condition smallIndex < index[i] can never be true, so this version always prints "The key does not exist!". The comparison is reversed; it should be:

if (index[i] < smallIndex && index[i] != N) {
    smallIndex = index[i];
}

This correctly keeps the smallest valid index (and the index[i] != N check then becomes redundant, since any recorded index is below N).
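
Putting the corrected pieces together, a compilable POSIX-threads sketch of the whole search (the demo data matches the trace above; index_ is an illustrative name chosen to avoid clashing with the historical index() from <strings.h>):

#include <stdio.h>
#include <pthread.h>

#define N 8                      /* demo size; any multiple of 4 */

int A[N] = {3, 7, 2, 9, 5, 1, 8, 4};
int key = 5;
int index_[4];

void *findKey(void *arg) {
    int id = (int)(long)arg;
    for (int i = id * (N/4); i < (id + 1) * (N/4); i++) {
        if (A[i] == key) {
            index_[id] = i;      /* first match within this segment */
            return NULL;
        }
    }
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        index_[i] = N;           /* sentinel: not found */

    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, findKey, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);

    int smallIndex = N;
    for (int i = 0; i < 4; i++)               /* corrected condition */
        if (index_[i] < smallIndex)
            smallIndex = index_[i];

    if (smallIndex != N)
        printf("The key was found at index: %d\n", smallIndex);
    else
        printf("Key does not exist in the array\n");
    return 0;
}

Run on the demo data it prints index 4, matching the trace.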

Performance Analysis:

Best Case: Key in first element

Sequential: 1 comparison
Parallel (4 threads): 1 comparison + overhead
Time: ~O(1)

Worst Case: Key not found or at end

Sequential: N comparisons
Parallel (4 threads): N/4 + overhead
Time: ~O(N/4)
Speedup: ~4x

Average Case: Key uniformly distributed

Expected position: N/2
Sequential: N/2 comparisons
Parallel: ~N/8 comparisons per thread
Speedup: ~4x

Comparison with Sequential Search:

Sequential Linear Search:
  for (int i = 0; i < N; i++) {
      if (A[i] == key) {
          printf("Found at %d", i);
          return;
      }
  }
  printf("Not found");

Parallel Approach:
  ✅ Speedup in worst case (key not found)
  ✅ Speedup in average case
  ✅ Only slight overhead if key found early
  ❌ Thread creation/management overhead
  ❌ Only beneficial for large N

Edge Cases:

Case 1: Key not in array

index = [N, N, N, N]
smallIndex = N
Output: "The key does not exist!"

Case 2: Multiple occurrences

A = [1, 5, 2, 5, 3, 5, 4, 5], key = 5

T0: searches [0,2) → finds 5 at index 1 → index[0] = 1, return
T1: searches [2,4) → finds 5 at index 3 → index[1] = 3, return
T2: searches [4,6) → finds 5 at index 5 → index[2] = 5, return
T3: searches [6,8) → finds 5 at index 7 → index[3] = 7, return

index = [1, 3, 5, 7]
smallIndex = 1 (earliest occurrence)
Output: "The key was found at index: 1"

Case 3: Key at boundaries

A = [5, 1, 2, 3, 4, 6, 7, 8], key = 5

T0: searches [0,2) → finds 5 at index 0 → index[0] = 0
T1, T2, T3: don't find 5

index = [0, N, N, N]
smallIndex = 0
Output: "The key was found at index: 0"

Optimization Opportunities:

1. Lock-Free Update

Instead of recording results in an array, the threads can publish into one shared variable with an atomic compare-and-swap. Note that a single CAS against the sentinel would only record whichever thread happens to finish first, so keeping the minimum requires a retry loop (sketch using the GCC __sync builtin):

volatile int earliest = N;
 
void findKey(int id) {
    int start = id * (N/4);
    int end = start + (N/4);
 
    for (int i = start; i < end; i++) {
        if (A[i] == key) {
            // Publish i only if it is smaller than the current value;
            // retry if another thread updated earliest in the meantime
            int old = earliest;
            while (i < old &&
                   !__sync_bool_compare_and_swap(&earliest, old, i)) {
                old = earliest;
            }
            return;
        }
    }
}
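
With C11 <stdatomic.h> the same min-publish loop is a bit cleaner, because a failed compare-exchange refreshes the expected value for you (a sketch; publish_index is an illustrative helper):

#include <stdatomic.h>

atomic_int earliest;   /* set to N in main before the threads start */

/* Publish candidate index i, keeping the smallest value seen so far */
void publish_index(int i) {
    int old = atomic_load(&earliest);
    while (i < old &&
           !atomic_compare_exchange_weak(&earliest, &old, i)) {
        /* CAS failed: old now holds the latest value; loop retries */
    }
}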

2. Break Early Across All Threads

Use a shared flag so the other threads can stop once the key has been found (sketch; note that volatile is not a real synchronization primitive in C11 - production code would use atomics):

volatile int found = 0;
 
void findKey(int id) {
    int start = id * (N/4);
    int end = start + (N/4);
 
    for (int i = start; i < end; i++) {
        if (found) return;        // another thread already found the key
        if (A[i] == key) {
            index[id] = i;
            found = 1;            // signal the other threads
            return;
        }
    }
}

Caveat: this finds some occurrence, not necessarily the first. To preserve first-index semantics, a thread may only stop early when a lower-id thread (an earlier segment) has reported a hit; otherwise an earlier match could be skipped.