Quiz 2 Study Guide

Quiz 2 is closed book, closed note, and no electronic devices are allowed. The quiz will focus on the material in chapters 6-9, 26-28, and 36-45, but is cumulative and also covers the material from quiz 1.

CPU Mechanisms (Limited Direct Execution)

The Core Problem: CPU Virtualization

The main goal is to create the illusion of many running programs, even with only one (or a few) CPUs. The OS does this through time sharing—rapidly switching between processes.

Key Challenges:

  • Performance: How to run programs fast, without much OS overhead.
  • Control: How to let programs run, but stop them from taking over the machine or accessing things they shouldn’t.

The Solution: Limited Direct Execution

The basic idea is to run the program directly on the CPU for speed, but with hardware-enforced limits to maintain OS control.


Problem #1: Restricted Operations (like I/O)

If a program runs directly, how can it safely perform privileged operations (like reading a file) without compromising the system?

Solution: Processor Modes & System Calls

  1. Processor Modes: The CPU supports two modes:

    • User Mode: A restricted mode for running applications.
    • Kernel Mode: A privileged mode for running the OS.
  2. System Calls (Syscalls):

    • A user-mode program can’t do I/O itself. It must ask the OS by executing a special trap instruction.
    • This trap saves the program’s state, switches the CPU to kernel mode, and jumps to a specific “handler” inside the OS.
    • The OS (now in kernel mode) performs the requested operation (e.g., reads the file).
    • When finished, the OS executes a return-from-trap instruction, which restores the program’s state, switches back to user mode, and continues the program.
  3. Trap Table: The OS sets up a “trap table” at boot time to tell the hardware exactly where the OS handlers are. This prevents a program from “trapping” to a random, malicious location.


Problem #2: Switching Between Processes

If a program is running directly, how does the OS get control back to switch to another process (especially if the program is in an infinite loop)?

Solution: The Timer Interrupt (Non-Cooperative Control)

  • The OS can’t rely on programs to “cooperate” and give up the CPU.
  • Instead, the OS programs a hardware timer to fire an interrupt every few milliseconds.
  • When the interrupt fires:
    1. The CPU automatically stops the current program (no matter what it’s doing).
    2. It saves the program’s state, switches to kernel mode, and jumps to the OS’s “interrupt handler.”
  • The OS now has control again. It is free to run a different process.

The Mechanism: Context Switch

This is the low-level action of switching from one process (A) to another (B).

  1. The OS decides to stop A and run B (e.g., after a timer interrupt).
  2. It saves all of A’s register values (its current “context”) to memory.
  3. It restores all of B’s register values (its previously saved “context”) from memory.
  4. It executes a return-from-trap, which resumes B right where it left off.
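The four steps above can be sketched as a toy simulation (the register names and the process-table layout are made up for illustration, not a real ISA):

```python
# Toy context switch: save the running process's "registers" into its
# process table entry, then restore the next process's saved context.
# Register names (pc, sp, rax) are illustrative only.

def context_switch(cpu, proc_table, curr, next_proc):
    proc_table[curr] = dict(cpu)        # step 2: save A's context to memory
    cpu.clear()
    cpu.update(proc_table[next_proc])   # step 3: restore B's context
    return next_proc                    # step 4: return-from-trap resumes B

cpu = {"pc": 0x400, "sp": 0x7FF0, "rax": 1}                 # A is running
proc_table = {"B": {"pc": 0x800, "sp": 0x6FF0, "rax": 42}}  # B's saved context
running = context_switch(cpu, proc_table, "A", "B")
```

After the switch, the CPU "registers" hold B's saved values, and A's context sits in the process table waiting for its next turn.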

CPU Scheduling

The Core Problem: How to Schedule?

Given a set of ready processes, which one should the OS run first? This decision is made by the CPU scheduler. The “how” is the scheduling policy.


Scheduling Metrics (How we measure success)

We need metrics to compare different scheduling policies.

  • Turnaround Time: The total time from when a job arrives to when it completes.

    • Turnaround Time = Completion Time - Arrival Time
    • This is a performance metric. Lower is better.
  • Response Time: The time from when a job arrives to the first time it gets to run.

    • Response Time = First Run Time - Arrival Time
    • This is an interactivity metric. Lower is better.
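The two formulas as a quick sketch (the times are arbitrary example values):

```python
def turnaround_time(arrival, completion):
    # Turnaround Time = Completion Time - Arrival Time
    return completion - arrival

def response_time(arrival, first_run):
    # Response Time = First Run Time - Arrival Time
    return first_run - arrival

# A job arrives at t=10, first runs at t=15, and completes at t=40:
t = turnaround_time(10, 40)   # 30
r = response_time(10, 15)     # 5
```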

Basic Scheduling Policies

1. First In, First Out (FIFO)

  • How it works: The simplest policy. Processes are run in the order they arrive (like a line at a grocery store).
  • The Problem: Suffers from the convoy effect. If a long job (a “heavyweight”) arrives first, many short jobs (a “convoy”) get stuck waiting behind it, leading to terrible average turnaround time.

2. Shortest Job First (SJF)

  • How it works: Runs the job with the shortest known run time first, then the next shortest, and so on.
  • Benefit: Solves the convoy effect. This is optimal for average turnaround time (assuming all jobs arrive at the same time).
  • The Problem: What if a short job arrives just after a long job has already started? The new short job still has to wait.
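The convoy effect in numbers, assuming all jobs arrive at t=0 and run to completion (the job lengths are made-up example values):

```python
def avg_turnaround(run_times):
    # All jobs arrive at t=0 and run to completion in list order,
    # so each job's turnaround equals its completion time.
    t = total = 0
    for rt in run_times:
        t += rt
        total += t
    return total / len(run_times)

jobs = [100, 10, 10]                 # a heavyweight, then the convoy
fifo = avg_turnaround(jobs)          # (100 + 110 + 120) / 3 = 110.0
sjf = avg_turnaround(sorted(jobs))   # (10 + 20 + 120) / 3 = 50.0
```

Simply running the short jobs first cuts average turnaround time by more than half.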

3. Shortest Time-to-Completion First (STCF)

  • How it works: This is the preemptive version of SJF.
  • When a new job arrives, the scheduler checks its total run time against the remaining run time of the current job.
  • If the new job is shorter, the scheduler preempts (stops) the current job and runs the new, shorter one.
  • Benefit: This is optimal for average turnaround time, even when jobs arrive at different times.

The Problem with SJF/STCF

SJF and STCF are great for turnaround time, but they are bad for response time. If three jobs arrive at the same time, the third job has to wait for the first two to finish completely before it even runs once.

4. Round Robin (RR)

  • How it works: The “fair” scheduler. Instead of running jobs to completion, RR runs each job for a small time slice (or “quantum”) and then switches to the next job in the queue. It cycles through all ready processes.
  • Benefit: Provides excellent response time, making the system feel interactive.
  • The Trade-off:
    • Time Slice Length: The length of the time slice is critical.
      • Too Short: The cost of context switching (saving and restoring registers) will dominate, and no real work gets done.
      • Too Long: The scheduler starts to behave just like FIFO, and response time gets worse.
    • Turnaround Time: RR is one of the worst policies for turnaround time, because it “stretches out” every job by making it wait for all other jobs to get their turn.
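A minimal RR simulator, assuming all jobs arrive at t=0 (job lengths and quantum are illustrative), showing the response-time/turnaround trade-off:

```python
from collections import deque

def rr_metrics(run_times, quantum=1):
    # All jobs arrive at t=0; returns (avg response time, avg turnaround).
    q = deque(range(len(run_times)))
    remaining = list(run_times)
    first_run = [None] * len(run_times)
    completion = [0] * len(run_times)
    t = 0
    while q:
        i = q.popleft()
        if first_run[i] is None:
            first_run[i] = t
        run = min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            q.append(i)          # not done: back of the line
        else:
            completion[i] = t
    n = len(run_times)
    return sum(first_run) / n, sum(completion) / n

resp, turn = rr_metrics([5, 5, 5])   # response 1.0, turnaround 14.0
# SJF on the same jobs: response (0+5+10)/3 = 5.0, turnaround (5+10+15)/3 = 10.0
```

RR wins on response time (1.0 vs. 5.0) but loses on turnaround (14.0 vs. 10.0), because every job is stretched out while the others take their turns.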

Scheduling Trade-off

There is an inherent conflict between the two metrics:

  • Policies that optimize turnaround time (like SJF/STCF) are “unfair” and bad for response time.
  • Policies that optimize response time (like RR) are “fair” and bad for turnaround time.

Other Key Concepts

  • Incorporating I/O: When a process starts an I/O operation (e.g., reading from disk), it is blocked and doesn’t use the CPU. The scheduler should run another job during this time. When the I/O finishes, the process becomes “ready” again.
  • The “Oracle” Problem: Policies like SJF and STCF require that the OS knows the exact length of every job in advance. This is unrealistic. The next chapter discusses how to solve this by predicting the future based on past behavior.

The Multi-Level Feedback Queue (MLFQ)

The Goal: Scheduling Without an Oracle

We need a scheduler that can:

  1. Optimize for short response time (for interactive jobs, like Round Robin does).
  2. Optimize for short turnaround time (for batch jobs, like SJF does).

The problem is that the OS doesn’t know a job’s type or length in advance.

The Solution: The MLFQ learns from past behavior to predict a job’s type. It automatically gives high priority to interactive jobs and low priority to long-running, CPU-bound jobs.


How MLFQ Works: Queues and Rules

MLFQ uses multiple priority queues (e.g., Q2=High, Q1=Medium, Q0=Low). A job can only be in one queue at a time.

The Basic Rules

  • Rule 1: If Priority(A) > Priority(B), run A. (Higher priority runs first).
  • Rule 2: If Priority(A) == Priority(B), run A & B in Round Robin. (Equal priority jobs share the CPU).

The “Feedback” Part (Changing Priority)

  • Rule 3: When a new job enters, place it in the highest-priority queue. (We assume it’s short/interactive).
  • Rule 4: If a job uses its entire time slice, it’s a long-running (CPU-bound) job. Demote it (move it to a lower-priority queue).
  • Rule 5: If a job gives up the CPU before its time slice ends (e.g., for I/O), it’s an interactive job. It stays at the same priority level.
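A minimal sketch of the priority bookkeeping behind Rules 3-5 (three queues and the job names are arbitrary choices for illustration):

```python
# Queues: 2 = high, 1 = medium, 0 = low.
TOP, BOTTOM = 2, 0

def admit(priority, job):
    priority[job] = TOP                             # Rule 3: new jobs start on top

def used_full_slice(priority, job):
    priority[job] = max(priority[job] - 1, BOTTOM)  # Rule 4: CPU-bound -> demote

def gave_up_cpu_early(priority, job):
    pass                                            # Rule 5: interactive -> stays put

prio = {}
admit(prio, "batch")
used_full_slice(prio, "batch")      # burned its whole slice: drops to queue 1
admit(prio, "editor")
gave_up_cpu_early(prio, "editor")   # yielded for I/O: stays in queue 2
```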

Problems with the Basic Rules

The simple rules above lead to two major problems:

  1. Starvation: If new interactive jobs keep arriving, the long-running jobs in the low-priority queues might never get to run.
  2. Gaming the Scheduler: A “smart” process could run for 99% of its time slice, then issue a fake, trivial I/O call. This “tricks” the scheduler (Rule 5) into keeping its high priority, allowing it to monopolize the CPU.

Solutions & Refined Rules

We fix these problems by adding/changing a few rules:

Solution for Starvation: The Priority Boost

  • New Rule: After some fixed time period (S), move all jobs in the system to the highest-priority queue.
  • Benefit: This ensures that “starved” jobs get a chance to run. It also allows the scheduler to adapt if a job’s behavior changes (e.g., a long job becomes interactive).

Solution for Gaming: Better Accounting

  • Revised Rule 4: Instead of just demoting a job after one time slice, the scheduler tracks the total CPU time a job has used at its current priority level.
  • Once a job’s total time budget at a level is used up, it is demoted, regardless of how many times it issued I/O.
  • Benefit: This prevents the “99% trick” because the trivial I/O no longer “resets” the job’s priority.

Tuning

An administrator must configure the MLFQ scheduler:

  • How many queues?
  • What is the time slice length for each queue? (Often, high-priority queues have short slices for fast response, and low-priority queues have long slices for efficiency.)
  • How often (S) should the priority boost occur?

Proportional-Share Scheduling

1. Core Concept: Proportional Share

The main goal of this type of scheduler is different from previous ones. Instead of optimizing for turnaround time or response time, a proportional-share scheduler aims to guarantee that each job gets a certain percentage of the CPU’s time.


2. Lottery Scheduling

This is the primary example of proportional-share scheduling.

  • Basic Idea: Tickets are used to represent a process’s share of the CPU. The percentage of tickets a process holds is the share of the CPU it should receive.
  • Mechanism: At each time slice, the scheduler holds a “lottery” by picking a random “winning ticket” number. The process that holds that ticket (or is in that ticket’s range) gets to run for that slice.
  • Result: This is a probabilistic approach. It is not perfectly fair over short periods but will converge to the correct proportions over the long run.
  • Advantages:
    • It is very simple to implement (just needs a random number generator).
    • It is “stateless” and handles new jobs arriving very easily (just add their tickets to the total pool).
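The lottery mechanism in a few lines (the ticket counts and random seed are arbitrary; with a 75/25 ticket split, the win counts should converge toward 75% over many draws):

```python
import random

def lottery_pick(tickets, rng):
    # tickets: {job: ticket count}. Draw a winning ticket number, then walk
    # each job's ticket range until we find the holder.
    winner = rng.randrange(sum(tickets.values()))
    for job, count in tickets.items():
        if winner < count:
            return job
        winner -= count

rng = random.Random(0)          # seeded so the demo is repeatable
tickets = {"A": 75, "B": 25}
wins = {"A": 0, "B": 0}
for _ in range(10_000):
    wins[lottery_pick(tickets, rng)] += 1
# wins["A"] / 10_000 converges toward 0.75 over the long run
```

Note the stateless quality: admitting a new job just means adding its tickets to the pool before the next drawing.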

Key Ticket Mechanisms

  • Ticket Transfer: A process can temporarily give its tickets to another. This is very useful in client/server situations, where a client can transfer its tickets to a server to boost the server’s priority while it works on the client’s request.
  • Ticket Inflation: A process can temporarily “print” more tickets for itself. This is only useful in a trusted environment where processes cooperate.
  • Ticket Currency: Allows a user (or group) to allocate tickets to their own jobs in their own “currency,” which the OS then automatically converts to a global ticket value.

3. Stride Scheduling

This is the deterministic (non-random) alternative to lottery scheduling.

  • How it works:
    1. Each job has a stride value, which is inversely proportional to its number of tickets (e.g., Stride = 10000 / tickets). A job with more tickets has a smaller stride.
    2. Each job also has a pass value, which starts at 0 and tracks its progress.
    3. The Rule: The scheduler always picks the job with the lowest pass value to run next.
    4. When a job runs, its pass value is increased by its stride.
  • Result: This method achieves perfect proportions deterministically. The job with the smallest stride (most tickets) will have its pass value increase the slowest, so it will be picked most often.
  • Trade-off: Stride scheduling is perfectly fair, but it’s difficult to handle new jobs. When a new job enters, it’s not clear what its pass value should be set to (setting it to 0 would let it monopolize the CPU). Lottery scheduling’s stateless nature handles this much more easily.
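The stride rules, sketched directly (ticket counts are example values; 10,000 is a conventional large constant for computing strides):

```python
def stride_schedule(tickets, slices, big=10_000):
    stride = {j: big // t for j, t in tickets.items()}  # inverse of tickets
    pass_val = {j: 0 for j in tickets}
    runs = {j: 0 for j in tickets}
    for _ in range(slices):
        j = min(pass_val, key=pass_val.get)  # rule: lowest pass value runs
        runs[j] += 1
        pass_val[j] += stride[j]             # advance its pass by its stride
    return runs

runs = stride_schedule({"A": 100, "B": 50, "C": 250}, slices=400)
# Run counts track the 100:50:250 ticket ratio: roughly A=100, B=50, C=250
```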

4. The Linux Completely Fair Scheduler (CFS)

CFS is a real-world, highly efficient proportional-share scheduler used by the Linux kernel.

  • Core Concept: Virtual Runtime (vruntime)

    • This is the central idea. Every process has a vruntime counter.
    • The Rule: CFS always runs the process with the lowest vruntime.
    • As a process runs, its vruntime increases. This ensures the scheduler always picks the process that has had the least amount of CPU time.
  • Priority (Niceness)

    • CFS uses the standard nice value (from -20 to +19).
    • This nice value is mapped to a weight. A higher priority (negative nice value) gets a larger weight.
    • The weight determines how fast vruntime increases. A high-priority job’s vruntime increases slower, allowing it to run for longer before it’s no longer the “lowest.”
  • Data Structure: Red-Black Tree

    • CFS does not use a simple list. It stores all runnable processes in a red-black tree (a self-balancing binary search tree).
    • Processes are ordered in the tree by their vruntime.
    • This makes it extremely efficient (O(log N)) to find the next job (it’s always the leftmost node) and to re-insert a process after it runs.
  • Time Slices & Latency

    • CFS does not have a fixed time slice. It has a target latency (e.g., sched_latency = 48ms) and tries to run all processes within that period.
    • The time slice is dynamic, based on the number of processes.
    • It also has a min_granularity (e.g., 6ms) to set a minimum time slice, which prevents the system from wasting all its time on context switching if there are too many processes.
  • Handling Sleeping Processes (I/O)

    • A process that was sleeping (e.g., waiting for I/O) will have a very low vruntime. If it ran, it would starve other jobs.
    • The Solution: When a sleeping process wakes up, its vruntime is set to the minimum vruntime currently in the tree. This ensures it gets to run quickly (great for interactivity) but doesn’t get an unfair advantage.
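Two of the CFS calculations above as a sketch (the sched_latency and min_granularity values come from the text; the nice-0 weight of 1024 matches Linux, but these functions are simplifications, not the kernel’s code):

```python
def time_slice_ms(n_runnable, sched_latency=48, min_granularity=6):
    # Divide the target latency among runnable processes, but never go
    # below the minimum granularity (all values in ms).
    return max(sched_latency / n_runnable, min_granularity)

def vruntime_delta(runtime_ms, weight, nice0_weight=1024):
    # Higher weight (higher priority) makes vruntime grow more slowly,
    # so the job stays "lowest" longer and gets more CPU.
    return runtime_ms * nice0_weight / weight

slice_4 = time_slice_ms(4)        # 48 / 4 = 12.0 ms each
slice_16 = time_slice_ms(16)      # 48 / 16 = 3 ms -> clamped up to 6 ms
fast = vruntime_delta(10, 2048)   # high priority: vruntime grows by only 5.0
slow = vruntime_delta(10, 512)    # low priority: vruntime grows by 20.0
```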

Concurrency

1. What is a Thread?

  • A thread is a single point of execution within a process.
  • A “normal” process is single-threaded (it has one PC). A multi-threaded program has multiple points of execution (multiple PCs).
  • Threads are like separate processes, except they share the same address space. This means they can access the same data (code and heap).

2. Thread vs. Process

  • What threads share: Address space (Code, Heap/Global Data).
  • What is private to each thread:
    • Program Counter (PC): Each thread executes independently.
    • Registers: Each thread has its own private set of registers.
    • Stack: Each thread gets its own stack, used for its local variables and function calls. This is why a multi-threaded address space looks different, with multiple stacks.
  • Context Switch: Switching between threads (a “thread context switch”) is similar to a process switch: the OS must save T1’s registers (to its TCB) and restore T2’s registers.
  • Key Difference: A thread switch is faster because the OS does not need to switch the address space (i.e., no need to change page tables).

3. Why Use Threads?

  1. Parallelism: On a multiprocessor system, you can assign different threads to different CPUs to do work in parallel, speeding up the program.
  2. Overlap I/O and Computation: One thread can issue a slow I/O request (like reading from disk) and block. While it waits, other threads in the same process can continue to run, doing useful computation.

4. The Core Problem: Concurrency

  • The Problem: When threads share data, things get complicated.
  • Race Condition: The program’s result becomes indeterminate (it changes from run to run) because the final outcome depends on the uncontrolled scheduling of which thread runs when.
  • Critical Section: A piece of code that accesses a shared variable (e.g., counter = counter + 1;).
  • Why it fails: This simple increment is not one instruction. It’s three:
    1. load value from memory to register
    2. add 1 to register
    3. store value from register to memory
  • The failure scenario: If Thread 1 is interrupted after loading the value but before storing it, Thread 2 can run, load the same old value, and store its result. When Thread 1 runs again, it will overwrite Thread 2’s work. This is a “lost update.”
  • The Goal: We need mutual exclusion to ensure only one thread enters a critical section at a time. We want the update to be atomic (all-or-nothing).
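The failure scenario can be replayed deterministically by splitting the increment into its three instructions and interleaving them by hand (generators stand in for threads here; a real race depends on timer interrupts):

```python
# Deterministic replay of the lost update: each increment is three steps
# (load, add, store), and a generator lets us interleave them manually.
counter = 0

def increment():
    global counter
    reg = counter        # 1. load value from memory into a register
    yield
    reg = reg + 1        # 2. add 1 to the register
    yield
    counter = reg        # 3. store the register back to memory
    yield

t1, t2 = increment(), increment()
next(t1)                 # T1 loads 0, then a "timer interrupt" hits
for _ in t2:             # T2 runs start to finish: counter becomes 1
    pass
for _ in t1:             # T1 resumes with its stale register and stores 1
    pass
# counter is 1, not 2 -- T2's update was lost
```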

Locks

  • API
    • pthread_mutex_t lock;
    • pthread_mutex_init(&lock, NULL)
    • pthread_mutex_lock(&lock)
    • pthread_mutex_unlock(&lock)
  • Spinlocks
    • test-and-set
    • compare-and-swap
  • Evaluating locks
    • mutual exclusion
    • fairness
    • performance
  • Sleeping instead of spinning
    • park and unpark
  • Two-phase locks
    • Hybrid approach, first try a spinlock, then sleep
  • Ticket locks
    • fetch-and-add
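A sketch of the ticket-lock idea (the spin loop is modeled as a predicate rather than actually spinning, and next() on a counter stands in for the atomic fetch-and-add instruction):

```python
import itertools

class TicketLock:
    """Sketch of a ticket lock: take a ticket with fetch-and-add, spin
    until now_serving matches it. Guarantees FIFO fairness."""
    def __init__(self):
        self._next_ticket = itertools.count()
        self._now_serving = 0

    def take_ticket(self):
        return next(self._next_ticket)       # fetch-and-add on "next ticket"

    def holds(self, ticket):
        return self._now_serving == ticket   # spin while this is False

    def release(self):
        self._now_serving += 1               # hand off to the next ticket

lock = TicketLock()
a = lock.take_ticket()           # thread A arrives first: ticket 0
b = lock.take_ticket()           # thread B arrives second: ticket 1
a_enters = lock.holds(a)         # True: A's ticket is being served
b_spins = not lock.holds(b)      # True: B must spin on its ticket
lock.release()                   # A unlocks
b_enters = lock.holds(b)         # True: strict FIFO handoff to B
```

Unlike test-and-set, where any spinning thread might grab the lock next, tickets hand the lock over in arrival order.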

I/O Devices

Crux: How to integrate I/O into systems? What are the general mechanisms? How can we make them efficient?

  • Hierarchical nature of system busses (i.e., why are some busses faster than others?)
  • The structure of a canonical I/O device
    • Register Interface: (status, command and data)
    • Internals: Micro-controller, Memory, and hardware specific chips
  • Programmed I/O (PIO)
  • Direct Memory Access (DMA)
  • Privileged I/O instructions (e.g., in and out) vs. memory mapped I/O
  • The abstraction of device drivers
  • A basic understanding of how an IDE disk driver works
    • Wait for drive to be ready (how?)
    • Write parameters to command registers
    • Start the I/O
    • Data transfer for writes
    • Handle interrupts
    • Error handling (how?)

Hard Disk Drives

Crux: How to store and access data on disk? How do modern hard-disk drives store data? What is the interface? How is the data actually laid out and accessed? How does disk scheduling improve performance?

  • Sector interface (address space) of a hard drive
    • Atomic read/write
  • Geometry and Anatomy
    • Platter, spindle, track, cylinder, disk head, and disk arm
  • Timing of read/writes
    • Seek Time, Rotational Delay, Data Transfer
  • Sequential vs. Random performance
  • Disk Scheduling
    • SSTF: Shortest Seek Time First
    • Elevator (SCAN or C-SCAN)
    • SPTF: Shortest Positioning Time First
  • I/O Merging
  • Work-conserving vs. non-work-conserving
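A back-of-the-envelope model of the timing components above (the drive parameters are hypothetical, not from the text):

```python
def avg_io_ms(seek_ms, rpm, transfer_mb_s, request_kb):
    # Average random I/O = seek time + half a rotation + transfer time.
    half_rotation_ms = (60_000 / rpm) / 2
    transfer_ms = request_kb / 1024 / transfer_mb_s * 1000
    return seek_ms + half_rotation_ms + transfer_ms

# Hypothetical 7200 RPM drive: 4 ms average seek, 100 MB/s transfer rate.
t = avg_io_ms(4, 7200, 100, 4)      # ~4 + 4.17 + 0.04 ~= 8.2 ms per 4 KB
rand_mb_s = 4 / 1024 / (t / 1000)   # random throughput: well under 1 MB/s
```

Seek and rotation dominate small random requests, which is why sequential access (pure transfer) is orders of magnitude faster.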

RAID

Crux: How to make a large, fast, and reliable disk. How can we make a large, fast, and reliable storage system? What are the key techniques? What are trade-offs between different approaches?

  • Benefits of RAID: Performance, Capacity, and Reliability
  • Interface: a linear array of blocks (just like a regular disk)
  • Fault Model: fail-stop
  • RAID Evaluation
    • Useful capacity available to clients
    • How many disk faults can the given design tolerate
    • Random and sequential throughput
  • Chunk Sizes
  • Raid-0 Striping
  • Raid-1 Mirroring
  • Raid-4 Parity Disk
    • Small write problem
  • Raid-5 Rotating Parity
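Parity and reconstruction in miniature (the stripe contents are arbitrary bytes chosen for illustration):

```python
from functools import reduce

def parity(blocks):
    # RAID-4/5 parity: bytewise XOR across the blocks in a stripe.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving, parity_block):
    # XORing the parity with the surviving blocks recovers the lost one.
    return parity(surviving + [parity_block])

stripe = [b"\x01\x02", b"\x0f\xf0", b"\xaa\x55"]    # three data disks
p = parity(stripe)                                  # parity disk contents
recovered = reconstruct([stripe[0], stripe[2]], p)  # disk 1 failed
# recovered == stripe[1]; with N disks, useful capacity is (N-1)/N
```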

Files and Directories

Crux: How to manage a persistent device. How should the OS manage a persistent device? What are the APIs? What are the important aspects of the implementation?

  • The abstraction of a file
    • A linear array of bytes
    • It has a low-level name (inode)
    • OS does not know (or care) about the internal structure of the file
  • Directories
    • A special type of file
    • Contains a mapping of human readable names (foo.txt) to inode number (869432)
    • directory hierarchy starts at the root directory (/)
    • directories contain files or other directories (called sub-directories)
  • File System Interface
    • Creating Files (open)
      • Takes a number of flags (e.g., O_CREAT, O_WRONLY, O_TRUNC)
      • Returns an integer file descriptor, the handle to the file
      • Read, Write, Execute permissions
      • Owner, Group, and Everyone permissions
    • Reading and writing files (read and write)
      • May not read or write all the bytes you specify
      • Must look at return value
      • Keeps track of where in the file you have read/written
      • lseek to change this file position index
    • Forcing writes (fsync)
      • For performance reasons, the OS often delays writes
      • This forces a write to disk for the given file descriptor
    • Renaming files (rename)
      • Usually implemented atomically
    • Getting information about files (stat)
      • Metadata about the file (inode number, size, permissions, etc.)
      • Stored in the inode of a file
    • The difference between hard links and soft links
    • Removing files (unlink)
      • inode has a reference count to hard links to the file
    • Making directories (mkdir)
      • Meaning of . and ..
    • Mounting a file system (mount)
      • A file system must be mounted before it can be used
      • Need to specify where on the directory tree the filesystem “mounts” to
      • Masks any existing files and directories at the mount point (until the filesystem is unmounted)
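The “must look at return value” point for read, sketched as a loop (this uses a temp file; os.read may legally return fewer bytes than requested, so a careful caller keeps reading):

```python
import os
import tempfile

def read_exactly(fd, n):
    # read() may return fewer bytes than asked, so loop on the return
    # value until n bytes arrive or EOF is hit.
    chunks = []
    while n > 0:
        chunk = os.read(fd, n)
        if not chunk:          # EOF before n bytes arrived
            break
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.lseek(fd, 0, os.SEEK_SET)   # move the file position back to the start
data = read_exactly(fd, 5)     # b"hello"; position is now 5
os.close(fd)
os.unlink(path)
```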

File System Implementation (VSFS Very Simple File System)

Crux: How to implement a simple file system? How can we build a simple file system? What structures are needed on the disk? What do they need to track? How are they accessed?

  • VSFS Overall Organization
    • Data Blocks
    • Inodes
      • Metadata
      • Direct pointers, indirect pointers, double indirect pointers
    • Inode Bitmap
    • Data block bitmap
    • Superblock
  • Reading and writing files
    • Operations involved in reading and writing a file
  • Overall Performance
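A worked example of what the pointer scheme implies for maximum file size (the 4 KB blocks, 4-byte pointers, and 12 direct pointers are assumed ext2-like values, not fixed by the text):

```python
def max_file_bytes(block_size=4096, ptr_size=4, n_direct=12):
    # Assumed geometry: 12 direct pointers, one single-indirect and one
    # double-indirect pointer (ext2-like; values are illustrative).
    ptrs_per_block = block_size // ptr_size              # 1024
    n_blocks = n_direct + ptrs_per_block + ptrs_per_block ** 2
    return n_blocks * block_size

size = max_file_bytes()   # (12 + 1024 + 1024**2) * 4096 bytes, ~4 GB
```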

The Fast File System (FFS)

Crux: How to organize on-disk data to improve performance? How can we organize file system data structures so as to improve performance? What types of allocation policies do we need on top of those data structures? How do we make the file system “disk aware”?

  • What motivated the design of the FFS?
  • Cylinder/Block group organizing structure
    • Each cylinder group has its own superblock, inode bitmap, datablock bitmap, inodes, and data blocks
  • Policies to allocate files and directories
    • Keep related stuff together
    • Keep unrelated stuff far apart
  • Directory placement: find a cylinder group with a low number of allocated directories and a high number of free inodes
    • How did FFS handle allocation of large files?
      • Direct pointers kept in cylinder group
      • Indirect blocks are placed in another cylinder group
  • How did FFS handle internal fragmentation?
    • Sub-blocks

FSCK and Journaling

Crux: How to update the disk despite crashes? The system may crash or lose power between any two writes, and thus the on-disk state may only partially get updated. After the crash, the system boots and wishes to mount the file system again (in order to access files and such). Given that crashes can occur at arbitrary points in time, how do we ensure the file system keeps the on-disk image in a reasonable state?

  • Crash Scenarios
    • Need to update the Data block, Inode, Data block bitmap, must happen atomically
    • If only one or two of those writes happen, which combinations leave the filesystem in an inconsistent state?
  • FSCK
    • When is it run, what does it do? What does it check?
      • Superblock
      • Free blocks
      • Inode state
      • Inode links
      • Duplicates
      • Bad blocks
      • Directory checks
    • Problems with FSCK?
      • Too slow
  • Journaling (Write-Ahead Logging)
    • New on-disk structure: the journal (with its own journal superblock); an update proceeds as:
      1. Data write
      2. Journal metadata write
      3. Journal commit
      4. Checkpoint metadata
      5. Free transaction in journal superblock
    • Block Reuse
      • Directories are considered metadata
      • Problems with deleting and reusing a directory block
      • Replay could overwrite new data block
      • Revoke record

Log-structured File Systems

Crux: How to make all writes sequential? How can a file system transform all writes into sequential writes? For reads, this task is impossible, as the desired block to be read may be anywhere on disk. For writes, however, the file system always has a choice, and it is exactly this choice we hope to exploit.

  • Motivation
    • System memories are growing
    • There is a large gap between random I/O performance and sequential I/O performance
    • Existing file systems perform poorly on many common workloads
    • File systems are not RAID-Aware
      • Small-write problem for RAID-4 and RAID-5
  • New structure: segment
    • Never update in place
    • Buffer writes into segment sized chunks
  • How do you find inodes?
    • Inode map: a mapping of inode numbers to the on-disk locations of the inodes
    • Also not updated in place
    • How do you find the inode map? Checkpoint region stored at a fixed location on disk
      • Usually buffered in memory
  • Garbage Collection
    • When blocks are updated, the old blocks must be reclaimed by the system
    • Determining block liveness
    • Segment summary block records the inode number and offset for each data block
    • Compare to inode map to see if they are the same
  • Which blocks to clean and when to clean them?
    • Hot (contents frequently updated) and cold segments (not frequently updated)
    • Clean cold segments sooner

Flash-based SSDs

Crux: How to build a flash-based SSD? How can we handle the expensive nature of erasing? How can we build a device that lasts a long time, given that repeated overwrites will wear the device out? Will the march of progress in technology ever cease? Or cease to amaze?

  • SSD organization
    • Banks: the largest grouping within a flash chip; contains many blocks
    • Blocks: typically sized 128 KB to 256 KB – erasing is done in block sized chunks, contains many pages
    • Pages: typical size 512 bytes to 4 KB, smallest read/write unit
  • Flash operations
    • Read a page
    • Erase a block
    • Program a page
  • Performance and reliability
    • No moving parts, less power, less heat than HDDs
    • Much better random I/O performance
    • Flash pages wear out after too many erase cycles (100,000 to 1,000,000 times)
    • Program disturbs: writing a page may flip the bits in neighboring pages
      • Best to write data in page order
    • Cost about 10x more than HDD
  • Flash Translation Layer (FTL)
    • Takes read/write requests on logical blocks and turns them into the low-level read, erase, and program requests on physical blocks and physical pages
  • Log-structured FTL
    • For a write to a logical block, the device appends the write to the next free spot in the currently-being-written-to block
    • Need a mapping table (both in memory for performance and on disk for persistence)
    • Garbage collection
      1. Find a block with one or more garbage pages
      2. Read in the live pages from that block
      3. Write out live pages to the log
      4. Erase the entire block to reclaim it for use
    • Mapping Table Size
      • It can grow large for big SSDs
      • Block based mapping, one pointer per block of the device
      • Now logical blocks within a block must be kept together
      • Performance problems for small writes
    • Hybrid Mapping
      • Log table: per-page mappings for log blocks
      • Data table: per-block mappings for data blocks
      • Check in log table before data table
      • Writes occur in the log blocks
      • Switch log blocks to Data Blocks in the data table
      • Easy for a switch merge, OK for a partial merge, and painful for a full merge
    • Page Mapping Plus Caching
      • Go with simpler page mapping approach but only keep a subset of popular mappings in memory
      • Rest stored on flash
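The log-structured FTL write path in miniature (a toy page-mapped FTL: sizes and logical page numbers are arbitrary, and garbage collection is reduced to tracking which physical pages are live):

```python
class LogFTL:
    """Toy page-mapped, log-structured FTL: logical writes append to the
    log, the mapping table redirects reads, and overwritten physical
    pages become garbage awaiting collection."""
    def __init__(self, n_pages):
        self.flash = [None] * n_pages   # physical pages
        self.mapping = {}               # logical page -> physical page
        self.head = 0                   # next free spot in the log
        self.live = set()               # physical pages holding live data

    def write(self, logical, data):
        old = self.mapping.get(logical)
        if old is not None:
            self.live.discard(old)      # old copy is now garbage
        self.flash[self.head] = data    # program the next free page
        self.mapping[logical] = self.head
        self.live.add(self.head)
        self.head += 1

    def read(self, logical):
        return self.flash[self.mapping[logical]]

ftl = LogFTL(8)
ftl.write(100, b"a1")
ftl.write(101, b"b1")
ftl.write(100, b"a2")   # overwrite is appended to the log, not in place
# read(100) returns b"a2"; physical page 0 is garbage, pages 1-2 are live
```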