Virtual Memory
Background & Motivation
- The segmented/paged memory principle
- The hierarchical memory (cache) principle
These are the two key observations that motivate the
development of virtual memory. We just finished looking at
paged and segmented memories.
Common Features of Paging & Segmentation
- Memory references are dynamically translated into physical addresses at
run time
- A process may be located in different regions at different times
- A process may be broken up into discontiguous pieces
- Special hardware speeds up the translation of logical to
physical addresses, to make this feasible
Observation:
Not all pieces of a process need to be loaded into main memory during execution
Once we have paid the price for the extra hardware to do dynamic
address translation (originally to solve the problem of storage
management), we discover that we can do more with it.
Advantages of Breaking up a Process
- More processes may be maintained in main memory
- Only load in some of the pieces of each process
- With more processes in main memory, it is more likely a process
will be in the Ready state at any particular time
- Reduces CPU idle time, increases throughput
- A process may be larger than all of main memory
- Allows executing larger programs, on larger datasets
This is another case of the hierarchical memory concept, which we
saw previously applied to memory cache. This is a very useful
concept, which has several applications in operating systems.
Hierarchical Storage, Memory, Buffers, Caches
- memory cache(s)
- virtual memory
- translation lookaside buffer (TLB)
- disk drive internal cache
- filesystem buffer cache
We hope that by the end of this course you will be able
to answer:
- What are each of these? What is its purpose?
- How are all these things similar?
- How is each different from the others?
You should have seen memory caches before. We will now look at
virtual memory and the TLB. The other two will come up later, when
we look at I/O and file systems.
If you have not read Chapter 1 of Stallings' book (including
Appendix 1A) carefully before, now is a good time to go back and read
it. In particular, make certain you have read and understood the
sections on The Memory Hierarchy, Cache Memory, and I/O Communication
Techniques. All of what we are studying now is an elaboration on the
themes introduced there.
As we have observed before, there are a few basic ideas in computing,
which people apply over and over in slightly different ways. One of these
ideas is that of a cache. A cache is a special kind of buffer that is
used to reduce the average-case cost of accessing a comparatively slow
storage medium. Main memory access time is much slower than CPU
instruction cycle times. Disk storage access time is MUCH slower than
main memory access times.
Because all of the items on the list above are based on the
cache principle, you will see many similarities between them.
The similarities will make it easier for you to understand them,
but you must take care not to forget that they are all distinct.
Similarities: Cache Principles/Assumptions
- References to storage/memory are not random
- In particular, references exhibit a pattern of
locality in space and time
- We can guess a set of locations that,
with high probability, will include most of the next few references
- Keeping the contents of those locations in faster storage/memory
will save time
- The overhead of this mechanism will not exceed the payoff, on
the average
Note that all of the above is based on probabilities, and
long term, average case, behavior. The fundamental assumption is that
of locality of reference.
Recall from Chapter 1 the two forms of locality:
- temporal: locations referenced recently are more likely to
be referenced again soon
- spatial: locations at addresses near recently referenced locations
are more likely to be referenced again soon
Similarities: Cache/VM Performance Issues
- Total size of cache
- Size of blocks/units within cache (granularity)
- Replacement policy
The design and performance issues for the various applications of
the cache principle are similar. Generally, a larger total cache size
results in a higher probability of hits, which is good. Generally,
the granularity must be tuned to fit the system. If the granularity
is too large or too small, the miss ratio will go up. In all cases,
a critical decision is the replacement policy, i.e., how
to decide which data should be kicked out of the cache when we need
to bring in some new data.
We will see there are also important differences in each
application of the cache principle. Some of these have to do with the
relative speeds of the devices, the relative costs of devices, the
degree of success we can expect in predicting the locality of
references, and the relative cost of bad guesses. For example, the
faster the replacement decision must be made, the simpler the policy
must be, since the performance improvement due to a better hit ratio
would be eaten up by the overhead of making the decisions.
Execution of a Program with Virtual Memory
- OS brings a few pieces of the program into main memory when the process starts execution
- Resident set = portion of process that is in main memory
- A trap/interrupt is generated when an address is needed that is not in the resident set
(This trap is called a page/segment fault)
- OS puts the process in the Blocked state
- Piece of process that contains the logical address is brought into main memory
- OS initiates a disk read request
- Another process is dispatched to run until the disk I/O completes
- An interrupt is issued when disk I/O completes
- OS puts the process in the Ready state
Virtual versus Real Memory
- Real memory = Main memory
- Virtual memory
- memory perceived by the user/program
- may be distributed between main memory and secondary memory (disk,
also called backing store)
- allows for effective multiprogramming and relieves the user of tight
constraints of main memory
Virtuality
The term virtual is used a lot in computing.
- virtual memory, virtual address
- virtual processor
- virtual machine
- virtual registers
- virtual time (see below)
- virtual clock
- virtual reality
Do you know what these all mean?
Can you think of any more?
You should be able to generalize the meaning. What do these all
have in common?
By now, the term virtual has migrated from
the technical jargon of computer scientists to common speech, through
the popular media, so most people already understand the
difference between a virtual thing and a real thing.
At a deeper philosophical level, it may be argued that none of us
knows what is real and what is virtual. However, the implementor of a
virtual system knows effectively what is virtual, meaning what she/he
implements, and what is real, meaning what he/she uses to implement
it.
How can Virtual Memory Work?
- Accessing disk storage is very much slower than main memory
e.g., milliseconds vs. nanoseconds (1,000,000 : 1)
- Must not need to go to disk very often
- Many instructions must be executed between virtual memory faults
- Otherwise, the processor will spend most of its time waiting for swapping I/O
- the latter is called thrashing
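As a rough back-of-the-envelope check (the numbers here are assumptions for
illustration, not measurements: 100 ns per memory access and 10 ms to service
a page fault), even one fault per million references adds about 10% to the
effective access time; at one fault per ten thousand references, the fault
time dominates and we are thrashing.

    #include <stdio.h>

    int main(void) {
        double mem_ns   = 100.0;     /* assumed main memory access time (ns)    */
        double fault_ns = 10.0e6;    /* assumed page fault service time (10 ms) */
        double p        = 1.0e-6;    /* assumed fault rate: one per 10^6 refs   */
        double eat = (1.0 - p) * mem_ns + p * (fault_ns + mem_ns);
        printf("effective access time = %.1f ns\n", eat);   /* about 110 ns */
        return 0;
    }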
Principle of Locality
People observed:
- Program and data references within a process tend to cluster
in certain areas of memory for significant periods of time
- Only a few pieces of a process will be needed over a given period of time
- Possible to make intelligent guesses about which pieces will be needed in the future
These all suggested that virtual memory might work efficiently.
They guessed correctly.
Historical Note: This work was done in the late 1960s, and pioneered by IBM.
What Factors Affect Locality of Reference?
Positive influences:
- modest-sized subprograms & loops
- use of local variables
- modest-sized arrays
Negative influences:
- hash tables
- huge arrays
- dynamic storage allocation
- large linked structures
- unstructured code (jumps around a lot)
- very tiny subprograms (calls jump around a lot)
- multithreading
The effectiveness of virtual memory seems to have decreased
in recent years. One factor is the lower cost of main memory.
When we can afford to put several gigabytes of main memory on
a machine, the utility of having a larger virtual address space
than physical address space is reduced.
Another factor seems to be reduced locality of reference.
How has the adoption of OOP and object-oriented languages
affected locality of reference?
Which of the above causes of locality are temporal and
which are spatial?
Do you understand how each of the negative influences
listed above reduces locality of reference?
Support Needed for Virtual Memory
- Hardware must support paging and/or segmentation
- OS must manage movement of
pages and/or segments to and from secondary storage
Paging
- Each process has its own page table
- Each page table entry contains:
- resident bit: is the page in main memory?
- frame number of the corresponding page in main memory, if resident
- modified bit: has the page been altered since it was last loaded into main memory?
If not, it does not have to be written
to the disk when it is swapped out, so long as the original copy is still
out there
- other control bits, e.g. for protection
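As a concrete (but hypothetical) illustration, a page table entry might be laid
out as a 32-bit word, like the C bit-field sketch below; the field widths are
assumptions, since real formats are defined by the hardware.

    #include <stdint.h>

    struct pte {                      /* one page table entry, packed into 32 bits */
        uint32_t frame      : 20;     /* frame number, meaningful only if resident */
        uint32_t resident   : 1;      /* is the page in main memory?               */
        uint32_t modified   : 1;      /* altered since it was last loaded?         */
        uint32_t referenced : 1;      /* use bit, for replacement policies         */
        uint32_t protect    : 3;      /* protection/control bits                   */
        uint32_t unused     : 6;
    };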
Page Table Entries
Address Translation in a Paging System
Handling Large Page Tables
- The entire page table may take up too much main memory
- Example:
with 4-Kbyte pages (2^12 bytes/page),
a 2^32-byte virtual address space occupies 2^20 pages
with 4-byte page table entries,
the page table occupies 2^22 bytes (2^10 pages)
- Page tables can also be stored in virtual memory
- When a process is running, only part of its page table is in main memory
- Only root page table needs to be always resident
- Example: the root page table (2^10 entries of 4 bytes each) occupies 4 Kbytes, i.e., one page
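Under those same assumptions (4-Kbyte pages, 4-byte entries, 2^10 entries per
table), a 32-bit virtual address splits into a 10-bit root-table index, a
10-bit second-level index, and a 12-bit offset. The sketch below shows only
the splitting, not the actual table walk.

    #include <stdint.h>

    #define OFFSET_BITS 12                      /* 4-Kbyte pages          */
    #define INDEX_BITS  10                      /* 2^10 entries per table */

    static uint32_t root_index(uint32_t va)  { return va >> (OFFSET_BITS + INDEX_BITS); }
    static uint32_t table_index(uint32_t va) { return (va >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static uint32_t page_offset(uint32_t va) { return va & ((1u << OFFSET_BITS) - 1); }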
Two-Level Scheme for 32-bit Address
Translation Lookaside Buffer (TLB)
- Each virtual memory reference can cause two physical memory accesses
- one to fetch the page table entry
- one to fetch the data
- To overcome this problem a high-speed cache is set up for page
table entries
- called the TLB - Translation Lookaside Buffer
- Contains page table entries that have been most recently used
- Performs associative mapping
- Functions like a memory cache
Operation of TLB
- Given a virtual address, processor examines the TLB
- If page table entry is present (a hit), the frame number is
retrieved and the real address is formed
- If page table entry is not found in the TLB (a miss), the page
number is used to index the process page table
- First checks if page is already in main memory
- If not in main memory a page fault is issued,
and the OS brings the page into main memory
- The TLB is updated to include the new page entry
How is the TLB distinguished functionally from the memory cache?
How do they operate together?
What effects limit the practical size of the TLB?
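The lookup sequence above can be sketched in C-like pseudocode. This is only
an illustration of the logic, not real MMU code: tlb_lookup(),
page_table_lookup(), tlb_insert(), and page_fault() are hypothetical helpers
standing in for hardware and OS mechanisms, and the pte structure and
OFFSET_BITS are from the earlier sketches.

    uint32_t translate(uint32_t va) {
        uint32_t vpn = va >> OFFSET_BITS;          /* virtual page number          */
        struct pte *e = tlb_lookup(vpn);           /* fast associative search      */
        if (e == NULL) {                           /* TLB miss                     */
            e = page_table_lookup(vpn);            /* costs an extra memory access */
            if (!e->resident)
                page_fault(va);                    /* trap to OS; process blocks   */
            tlb_insert(vpn, e);                    /* update the TLB               */
        }
        return (e->frame << OFFSET_BITS) | (va & ((1u << OFFSET_BITS) - 1));
    }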
Operation of TLB as Flowchart
Use of a Translation Lookaside Buffer
TLB Fault Processing
The figure shows the timing and causal relationships between the
events and agents involved in a TLB fault. It starts when the CPU
issues a memory fetch operation (possibly for data or for an
instruction) that refers to an address whose page frame number is not
in the TLB. The TLB fetches the corresponding page table entry from
memory. It then translates the virtual address to a physical address
and fetches the operand originally requested by the CPU.
The fetch operation ends up taking two memory cycles. In order for
this not to slow the execution down appreciably, most memory
references must be TLB hits.
Relation of TLB to Memory Cache
Page Fault Processing
The figure shows the timing and causal relationships between
some of the events and agents involved in a page fault. It starts
when the CPU issues a memory reference for an address that is not
in the TLB.
In order to focus on the interesting events, the figure leaves out
many details, among them the entire role of
the memory cache(s), and the details of the disk subsystem.
The dashed arrows allude to some of these details that we
cannot ignore entirely,
i.e., the many interactions between the executing OS software
and the memory that are needed to handle the page fault, bring the
page in from the disk, and update the page table.
The OS actions start with the execution of an interrupt (or trap)
handler, the page fault (PF) handler. The OS marks the page-faulted process
as being blocked, and initiates a disk read
operation to fetch the page into main memory. The PF handler may do
some of this work directly, or it may wake up a pager process and let
that process do the work. (What are the arguments for each of these
two alternatives? Is there a middle ground?)
At some point the PF handler will call the dispatcher,
so that some other process can use the CPU while the disk is busy.
When the disk subsystem (what is included in the disk subsystem?)
has brought the page into memory, it interrupts the CPU again. The
disk I/O interrupt handler, or a pager process that it wakes up,
updates the page table in memory, and marks the page-faulted process
as being ready.
Eventually, the dispatcher will resume execution of the
page-faulted process. It will reissue the instruction that
caused the page fault. This time the TLB will succeed in
fetching the page table entry from memory, and the operation
of the TLB will proceed as in the preceding figure, i.e.,
there will be a TLB fault, the page table entry will be
fetched, the TLB will be updated, and finally the CPU's
original memory reference will be completed.
After all this, the system is back at the point of the original TLB
fault, which led to the page fault. It is clear that the OS does
quite a bit of work to handle each page fault. It follows that we
cannot afford to have page faults very often.
In particular, it would be disastrous if the OS code for handling a
page fault itself were to cause a page fault! If it did, the system
could get into an infinite cycle of page faults. For this reason, the
code that handles page faults is kept permanently in main memory.
(However, it is possible that some of the processing by the
OS for a page fault may generate further page faults. In particular,
this may be the case if the page table itself is paged.)
Page Size Performance Trade-Offs
- Smaller page size => less internal fragmentation
- Smaller page size => fewer page faults
- Larger page size  => more disk I/O efficiency
- Larger page size  => smaller page tables
                    => fewer pages per process
                    => fewer TLB misses, fewer secondary page faults
Graph (a) shows the general effect of increasing the page
size. I think this is a little bit misleading, since it is not clear
from the figure what assumptions are being made about the
number of page frames assigned to the process. The explanation
for the increase on the left edge of the chart is that as the
page size increases the probability of having the entire "working set"
in memory goes down. However, that presumes a limited amount
of real memory is available. On the other hand, the reason given
for the decrease in page fault rate on the right end of the chart
is that the whole process fits into one page. That seems to
presume there is enough real memory to hold the entire process.
Graph (b) shows how, as the number of frames allocated to
the process grows (for any page size), the page fault rate
decreases, until the whole process is in memory.
Example Page Sizes
Atlas                    512 48-bit words
Honeywell-Multics        1024 36-bit words
IBM 370/XA and 370/ESA   4 Kbytes
VAX family               512 bytes
IBM AS/400               512 bytes
DEC Alpha                8 Kbytes
MIPS                     4 Kbytes to 16 Mbytes
UltraSPARC               8 Kbytes to 4 Mbytes
Pentium                  4 Kbytes to 4 Mbytes
PowerPC                  4 Kbytes
Page sizes are chosen based on experimentation, with
performance monitoring.
Variable Page Sizes?
- Typical sizes have grown from 512 to 8K bytes -- Why?
- Variable page size may help
- can make more effective use of the TLB
- large pages for program instructions
- small pages for thread stacks
- MIPS R4000, Alpha, UltraSPARC, Pentium support multiple page sizes
- Most OS's still support only one page size
Segmentation
- Segment sizes may be unequal, dynamic
- Advantages:
- Simplifies handling of growing data structures
- Allows programs to be altered and recompiled independently
- Lends itself to sharing data among processes
- Lends itself to protection
Each Segment Table Entry Contains
- resident bit: is the segment in main memory?
- modified bit: has segment been modified since it was loaded into main memory?
- starting address of corresponding segment in main memory, if resident
- length of the segment
- other control bits, e.g. for protection
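As with page table entries, a segment table entry can be pictured as a small
C structure; the layout below is a hypothetical sketch, not any particular
machine's format.

    #include <stdint.h>

    struct ste {                       /* one segment table entry (hypothetical) */
        uint32_t base;                 /* starting physical address, if resident */
        uint32_t length;               /* segment length, for bounds checking    */
        uint32_t resident : 1;         /* is the segment in main memory?         */
        uint32_t modified : 1;         /* altered since it was loaded?           */
        uint32_t protect  : 3;         /* protection/control bits                */
    };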
Segment Table Entries
Combined Paging and Segmentation
- Paging is transparent to the programmer
- Paging eliminates external fragmentation
- Segmentation is visible to the programmer
- Segmentation allows for growing data structures, modularity,
and support for sharing and protection
- Each segment is broken into fixed-size pages
Combined Segmentation and Paging
Example: Typical Protection Relationships Between Segments
Other Protection Relationships
- Process cannot execute from its own data and stack segments
- Process can read its own PCB
- Process cannot modify its own PCB (except in supervisor mode)
What are the benefits of each of these restrictions?
Implementation Issues for Paged Virtual Memory
- Fetch Policy
- Placement Policy
- Replacement Policy
- Resident Set Management
Fetch Policy
- Determines when a page should be brought into memory
- Demand paging only brings pages into main memory when a reference
is made to a location on the page
- Many page faults when a process is first started
- Prepaging brings in more pages than needed at the moment
- More efficient to bring in pages that reside contiguously on the disk
- Hard to predict future usage, except for specially designed (realtime)
application architectures
Placement Policy
- Where should segment/page be placed?
- Usually not important for systems with paging
- Might be important for NUMA (Non-Uniform Memory Access) machines
Replacement Policy
- Which page is replaced?
- Page removed should be the page least likely to be referenced in the
near future
- Most policies predict the future behavior on the basis of past behavior
Frame Locking
- Associate a lock bit with each frame
- If frame is locked, it may not be replaced
- Candidates for locking:
- Kernel of the operating system
- OS's control data structures
- I/O buffers
What are the reasons for locking each of the above?
The Unix API for this feature is the functions
mlock(),
munlock(),
mlockall(), and
munlockall(). Take a look at the man-pages
for these.
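A minimal usage sketch follows (error handling abbreviated; locking normally
requires appropriate privileges or a sufficient RLIMIT_MEMLOCK).

    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t len = 4096;
        char *buf = malloc(len);
        if (buf == NULL) return 1;
        if (mlock(buf, len) != 0) {        /* pin the buffer in main memory  */
            perror("mlock");
            return 1;
        }
        /* ... use buf for latency-critical or I/O work ... */
        munlock(buf, len);                 /* allow it to be paged out again */
        free(buf);
        return 0;
    }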
Basic Replacement Policies
- Optimal policy
- Least Recently Used (LRU)
- First-In First-Out (FIFO)
- Clock
- Page Buffering
Optimal Policy
- Selects for replacement that page for which the time to the next
reference is the longest
- Impossible to have perfect knowledge of future events
- Useful for comparison
Least Recently Used (LRU)
- Replaces the page that has not been referenced for the longest time
- By the principle of locality, this should be the page least likely to be
referenced in the near future
- Each page could be tagged with the time of last reference. This would
require a great deal of overhead.
Can you explain why implementing LRU is not feasible?
That is, where is the "overhead" mentioned above and how do we
know it is too large for us to want to implement LRU?
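A sketch of what exact LRU would require makes the overhead visible: a
timestamp update on every memory reference, plus a search at replacement time.
The frame count and data structures below are assumptions for illustration.

    #include <stdint.h>

    #define NFRAMES 64                     /* assumed resident set size          */
    static uint64_t last_used[NFRAMES];    /* "time" of last reference per frame */
    static uint64_t now;

    void on_reference(int frame) {         /* must run on EVERY memory reference */
        last_used[frame] = ++now;
    }

    int lru_victim(void) {                 /* scan for the oldest frame          */
        int victim = 0;
        for (int i = 1; i < NFRAMES; i++)
            if (last_used[i] < last_used[victim])
                victim = i;
        return victim;
    }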
First-in, first-out (FIFO)
- Treats page frames allocated to a process as a circular buffer
- Pages are removed in round-robin style
- Simplest replacement policy to implement
- Page that has been in memory the longest is replaced
- These pages may be needed again very soon
Clock Policy
- Additional bit called a use bit
- When a page is first loaded in memory, the use bit is set to 0
- When the page is referenced, the use bit is set to 1
- When it is time to replace a page, the first frame encountered with the use bit set to 0 is replaced.
- During the search for replacement, each use bit previously set to 1 is changed to 0
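A minimal sketch of the clock scan over a fixed set of frames follows. The
frame count and the use[] array are assumptions; in a real system the use bit
lives in the page table entry and is set by hardware.

    #define NFRAMES 64                     /* assumed resident set size           */
    static int use[NFRAMES];               /* use bits; set to 1 when referenced  */
    static int hand;                       /* the circulating clock pointer       */

    int clock_victim(void) {
        for (;;) {
            if (use[hand] == 0) {          /* first frame found with use bit 0   */
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            use[hand] = 0;                 /* give the frame a second chance     */
            hand = (hand + 1) % NFRAMES;
        }
    }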
Example of Clock Policy Operation
Using The Modified-Bit
Modified?   Used?   Priority (1 = replace first)
   0          0        1
   0          1        2
   1          0        3
   1          1        4
The clock algorithm can be extended to make use of a modified bit
along with the use bit. This gives a slightly closer approximation to
the LRU behavior, and potentially cuts in half the amount of disk traffic.
Where is the difference in the disk traffic?
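One way to express the ranking in the table is a class number per frame, lower
meaning a better victim; modified[] here is a hypothetical per-frame copy of
the modified bits, and use[] is as in the clock sketch above.

    static int modified[NFRAMES];          /* hypothetical per-frame modified bits */

    int victim_class(int frame) {          /* lower class = preferred victim       */
        return 2 * modified[frame] + use[frame];
    }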
Replacement Algorithms
The figure shows approximately how the algorithms compare
in performance on a simulated page-reference string.
Page Buffering
- Works with some other basic replacement policy
- Replaced page is added to one of two lists:
- Free page list, if the page has not been modified
- Modified page list, otherwise
- With luck, a page may be still in memory when it is next needed
- When frames are needed they are taken first from the free page list
How can we afford to leave pages in memory, if we needed to replace them?
Only makes sense if the OS reserves some of main memory for the page buffer.
Makes more sense if each process has a set limit on its resident set size (see local
replacement, below).
Resident Set Management
- Size
- Fixed allocation
- gives a process a fixed number of pages within which to execute
- when a page fault occurs, one of the pages of that process must be replaced
- Variable allocation
- number of pages allocated to a process varies over the lifetime of the process
- Scope
- Local - choose from the same process
- Global - choose from all processes
Variable Allocation, Global Scope
- Easiest to implement
- Adopted by many operating systems
- Operating system keeps list of free frames
- Free frame is added to resident set of process when a page fault occurs
- If no free frame, replaces one from another process
Variable Allocation, Local Scope
- When new process added, allocate number of page frames based
on application type, program request, or other criteria
- When page fault occurs, select page from among the resident set
of the process that suffers the fault
- Reevaluate allocation from time to time
Working Set Concept
- Definition: The set of pages that have been referenced
in the last D virtual time units
(depends on window size D)
- Observation: The working set size changes, but stays stable over
some fairly long intervals of time
(consequence of Principle of Locality)
What is meant by virtual time here?
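A small sketch follows, treating virtual time as a count of the process's own
memory references (an assumption consistent with the usual definition) and
using a made-up reference string with an assumed window of DELTA = 4
references; the working set at time t is just the set of distinct pages
referenced in the window ending at t.

    #include <stdio.h>

    #define DELTA 4                                   /* assumed window size      */

    int main(void) {
        int refs[] = {1, 2, 1, 3, 2, 4, 4, 4, 5, 5};  /* made-up reference string */
        int n = sizeof refs / sizeof refs[0];
        for (int t = 0; t < n; t++) {                 /* virtual time = ref count */
            int seen[16] = {0}, size = 0;
            for (int k = t; k >= 0 && k > t - DELTA; k--)
                if (!seen[refs[k]]) { seen[refs[k]] = 1; size++; }
            printf("t=%d  working set size=%d\n", t, size);
        }
        return 0;
    }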
Resident Set Size Strategy Based on Working Set Model
Naive idea:
- Monitor the working set of each process
- Periodically remove from the resident set the pages that are not in the working set
- A process may execute only if its working set fits in main memory
Problems:
- Working set changes over time
- Keeping track of the actual working set is impractical
- How do we determine the window size?
Observation:
- When working set is not in memory, page fault rate goes up
- If the page fault frequency (PFF) is within some range, the working set is in memory
Page Fault Frequency (PFF) Algorithm
- Use bit per page, set when page is accessed (by hardware)
- Keep track of time interval between page faults
- If interval is too short, increase working set size
- If interval is too long, discard all pages with use bit 0, and shrink the resident set
How do we determine the thresholds?
Why does this not perform well during shifts to new locality?
How does the VSWS (variable-interval sampled working set) policy address this problem?
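A rough sketch of the PFF decision made at each page fault follows. The
threshold F, the measure of virtual time, and the helpers grow_resident_set(),
discard_unused_pages(), and clear_all_use_bits() are all assumptions for
illustration.

    void on_page_fault(unsigned long vtime) {   /* vtime = references so far     */
        static unsigned long last_fault;
        if (vtime - last_fault < F) {           /* faults too close together:    */
            grow_resident_set();                /*   add a frame to this process */
        } else {                                /* faults far apart:             */
            discard_unused_pages();             /*   drop pages with use bit 0   */
        }
        clear_all_use_bits();                   /* start a fresh observation     */
        last_fault = vtime;
    }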
Cleaning Policy
- Determines when modified pages are written out
- Demand cleaning
- a page is written out only when it has been selected for replacement
- Precleaning
- pages are written out in batches
- Best approach uses page buffering
- Replaced pages are placed in two lists
- Pages in the modified list are periodically written out in batches
- Pages in the unmodified list are either reclaimed if referenced again or
lost when their frames are assigned to other pages
Load Control
- Determines the number of processes that will be resident in main memory
- With too few processes, there will be many occasions when all processes
are blocked and much time will be spent in swapping
- Too many processes will lead to thrashing
- Some Policies
- Working set (implicit)
- Page fault frequency
- Multiprogramming level = number of processes in memory
- L = S criterion: mean time between page faults = mean time to
process page fault
- 50% criterion: keep paging device utilization at 50%
- Clock scanning rate threshold: if too high, reduce MP level
Process Suspension
- Determines which process to suspend (roll out)
- Lowest priority process
- Faulting process
- does not have its working set in main memory so it will be blocked anyway
- Last process activated
- likely to have its working set resident
- Process with smallest resident set
- requires the least future effort to reload
- Largest process
- obtains the most free frames
- Process with the largest remaining execution window
UNIX and Solaris Memory Management Data Structures
- Page table: table per process, entry per page
- Disk block descriptor: entry per disk block in swap space
- Page frame data table: one table, entry per real memory page
- Swap-use table: table per swap device
SVR4 Unix Memory Management Table Structures
UNIX and Solaris Memory Management
- Page Replacement
- refinement of the clock policy
- Kernel Memory Allocator
- most blocks are smaller than a typical page size
- a "lazy" Buddy System is used
What is meant by a "lazy" Buddy System? What is the intended
advantage over the normal Buddy System?
Linux Memory Management
See the separate notes on Linux memory management; they are kept in
their own file in order to provide more detail.
Windows 2000 Memory Management
- Resident Set Management = variable allocation, local scope
- resident sets allowed to grow when main memory is plentiful
- resident sets reduced when main memory is scarce
- Virtual address space split in two:
- 2 Gbytes of shared system space (default)
- 2 Gbytes of per-process space (default)
- Three sets of page frames
- Available - not currently used by any process
- Reserved - set aside for a process but not yet committed (not yet in use)
- Committed - allocated to a process & swap space also set aside
Windows 2000 Memory Management
© 2002, 2005
T. P. Baker &
Florida State University.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means without written permission.