Virtual Memory
Background & Motivation
- The segmented/paged memory principle
- The hierarchical memory (cache) principle
These are the two key observations that motivate the
development of virtual memory. We just finished looking at
paged and segmented memories.
Common Features of Paging & Segmentation
- Memory references are dynamically translated into physical addresses at
run time
- A process may be located in different regions at different times
- A process may be broken up into discontiguous pieces
- Special hardware speeds up the translation of logical to
physical addresses, to make this feasible
Observation:
Not all pieces of a process need to be loaded into main memory during execution
Once we have paid the price for the extra hardware to do dynamic
address translation (originally to solve the problem of storage
management), we discover that we can do more with it.
Advantages of Breaking up a Process
- More processes may be maintained in main memory
- Only load in some of the pieces of each process
- With more processes in main memory, it is more likely a process
will be in the Ready state at any particular time
- Reduces CPU idle time, increases throughput
- A process may be larger than all of main memory
- Allows executing larger programs, on larger datasets
This is another case of the hierarchical memory concept, which we
saw previously applied to memory cache. This is a very useful
concept, which has several applications in operating systems.
Hierarchical Storage, Memory, Buffers, Caches
- memory cache(s)
- virtual memory
- translation lookaside buffer (TLB)
- disk drive internal cache
- filesystem buffer cache
We hope that by the end of this course you will be able
to answer:
- What are each of these? What is its purpose?
- How are all these things similar?
- How is each different from the others?
You should have seen memory caches before. We will now look at
virtual memory and the TLB. The other two will come up later, when
we look at I/O and file systems.
If you have not read Chapter 1 of Stallings' book (including
Appendix 1A) carefully before, now is a good time to go back and read
it. In particular, make certain you have read and understood the
sections on The Memory Hierarchy, Cache Memory, and I/O Communication
Techniques. All of what we are studying now is an elaboration on the
themes introduced there.
As we have observed before, there are a few basic ideas in computing,
which people apply over and over in slightly different ways. One of these
ideas is that of a cache. A cache is a special kind of buffer that is
used to reduce the average-case cost of accessing a comparatively slow
storage medium. Main memory access time is much slower than CPU
instruction cycle times. Disk storage access time is MUCH slower than
main memory access times.
Because all of the items on the list above are based on the
cache principle, you will see many similarities between them.
The similarities will make it easier for you to understand them,
but you must take care not to forget that they are all distinct.
Similarities: Cache Principles/Assumptions
- References to storage/memory are not random
- In particular, references exhibit a pattern of
locality in space and time
- We can guess a set of locations that,
with high probability, will include most of the next few references
- Keeping the contents of those locations in faster storage/memory
will save time
- The overhead of this mechanism will not exceed the payoff, on
the average
Note that all of the above is based on probabilities, and
long term, average case, behavior. The fundamental assumption is that
of locality of reference.
Recall from Chapter 1 the two forms of locality:
- temporal: locations referenced recently are more likely to
be referenced again soon
- spatial: locations at addresses near recently referenced locations
are more likely to be referenced again soon
Similarities: Cache/VM Performance Issues
- Total size of cache
- Size of blocks/units within cache (granularity)
- Replacement policy
The design and performance issues for the various applications of
the cache principle are similar. Generally, a larger total cache size
results in a higher probability of hits, which is good. Generally,
the granularity must be tuned to fit the system. If the granularity
is too large or too small, the miss ratio will go up. In all cases,
a critical decision is the replacement policy, i.e., how
to decide which data should be kicked out of the cache when we need
to bring in some new data.
We will see there are also important differences in each
application of the cache principle. Some of these have to do with the
relative speeds of the devices, the relative costs of devices, the
degree of success we can expect in predicting the locality of
references, and the relative cost of bad guesses. For example, the
faster the replacement decision must be made, the simpler the policy
must be, since the performance improvement due to a better hit ratio
would be eaten up by the overhead of making the decisions.
Execution of a Program with Virtual Memory
- OS brings a few pieces of the program into main memory when the process starts execution
- Resident set = portion of process that is in main memory
- A trap/interrupt is generated when an address is needed that is not in the resident set
(This trap is called a page/segment fault)
- OS puts the process in the Blocked state
- Piece of process that contains the logical address is brought into main memory
- OS initiates a disk read request
- Another process is dispatched to run until the disk I/O completes
- An interrupt is issued when disk I/O completes
- OS puts the process in the Ready state
Virtual versus Real Memory
- Real memory = Main memory
- Virtual memory
- memory perceived by the user/program
- may be distributed between main memory and secondary memory (disk,
also called backing store)
- allows for effective multiprogramming and relieves the user of tight
constraints of main memory
Virtuality
The term virtual is used a lot in computing.
- virtual memory, virtual address
- virtual processor
- virtual machine
- virtual registers
- virtual time (see below)
- virtual clock
- virtual reality
Do you know what these all mean?
Can you think of any more?
You should be able to generalize the meaning. What do these all
have in common?
By now, the term virtual has migrated from
the technical jargon of computer scientists to common speech, through
the popular media, so most people already understand the
difference between a virtual thing and a real thing.
At a deeper philosophical level, it may be argued that none of us
knows what is real and what is virtual. However, the implementor of a
virtual system knows effectively what is virtual, meaning what she/he
implements, and what is real, meaning what he/she uses to implement
it.
How can Virtual Memory Work?
- Accessing disk storage is very much slower than main memory
e.g., milliseconds vs. nanoseconds (1,000,000 : 1)
- Must not need to go to disk very often
- Many instructions must be executed between virtual memory faults
- Otherwise, the processor will spend most of its time waiting for swapping I/O
- the latter is called thrashing
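As a rough back-of-the-envelope check (the numbers here are assumptions for
illustration, not measurements: 100 ns per memory access and 10 ms to service
a page fault), even one fault per million references adds about 10% to the
effective access time; at one fault per ten thousand references, the fault
time dominates and we are thrashing.

    #include <stdio.h>

    int main(void) {
        double mem_ns   = 100.0;     /* assumed main memory access time (ns)    */
        double fault_ns = 10.0e6;    /* assumed page fault service time (10 ms) */
        double p        = 1.0e-6;    /* assumed fault rate: one per 10^6 refs   */
        double eat = (1.0 - p) * mem_ns + p * (fault_ns + mem_ns);
        printf("effective access time = %.1f ns\n", eat);   /* about 110 ns */
        return 0;
    }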
Principle of Locality
People observed:
- Program and data references within a process tend to cluster
in certain areas of memory for significant periods of time
- Only a few pieces of a process will be needed over a given period of time
- Possible to make intelligent guesses about which pieces will be needed in the future
These all suggested that virtual memory might work efficiently.
They guessed correctly.
Historical Note: This work was done in the late 1960s, and pioneered by IBM.
What Factors Affect Locality of Reference?
Positive influences:
- modest-sized subprograms & loops
- use of local variables
- modest-sized arrays
Negative influences:
- hash tables
- huge arrays
- dynamic storage allocation
- large linked structures
- unstructured code (jumps around a lot)
- very tiny subprograms (calls jump around a lot)
- multithreading
The effectiveness of virtual memory seems to have decreased
in recent years. One factor is the lower cost of main memory.
When we can afford to put several gigabytes of main memory on
a machine, the utility of having a larger virtual address space
than physical address space is reduced.
Another factor seems to be reduced locality of reference.
How has the adoption of OOP and object-oriented languages
affected locality of reference?
Which of the above causes of locality are temporal and
which are spatial?
Do you understand how each of the negative influences
listed above reduces locality of reference?
Support Needed for Virtual Memory
- Hardware must support paging and/or segmentation
- OS must manage movement of
pages and/or segments to and from secondary storage
Paging
- Each process has its own page table
- Each page table entry contains:
- resident bit: is the page in main memory?
- frame number of the corresponding page in main memory, if resident
- modified bit: has the page been altered since it was last loaded into main memory?
If not, it does not have to be written
to the disk when it is swapped out, so long as the original copy is still
out there
- other control bits, e.g. for protection
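As a concrete (but hypothetical) illustration, a page table entry might be laid
out as a 32-bit word, like the C bit-field sketch below; the field widths are
assumptions, since real formats are defined by the hardware.

    #include <stdint.h>

    struct pte {                      /* one page table entry, packed into 32 bits */
        uint32_t frame      : 20;     /* frame number, meaningful only if resident */
        uint32_t resident   : 1;      /* is the page in main memory?               */
        uint32_t modified   : 1;      /* altered since it was last loaded?         */
        uint32_t referenced : 1;      /* use bit, for replacement policies         */
        uint32_t protect    : 3;      /* protection/control bits                   */
        uint32_t unused     : 6;
    };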
Page Table Entries
Address Translation in a Paging System
Handling Large Page Tables
- The entire page table may take up too much main memory
- Example:
with 4-Kbyte pages (2^12 bytes/page),
a 2^32-byte virtual address space occupies 2^20 pages
with 4-byte page table entries,
the page table occupies 2^22 bytes (2^10 pages)
- Page tables can also be stored in virtual memory
- When a process is running, only part of its page table is in main memory
- Only root page table needs to be always resident
- Example: the root page table (2^10 entries of 4 bytes each) occupies 4 Kbytes, i.e., one page
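Under those same assumptions (4-Kbyte pages, 4-byte entries, 2^10 entries per
table), a 32-bit virtual address splits into a 10-bit root-table index, a
10-bit second-level index, and a 12-bit offset. The sketch below shows only
the splitting, not the actual table walk.

    #include <stdint.h>

    #define OFFSET_BITS 12                      /* 4-Kbyte pages          */
    #define INDEX_BITS  10                      /* 2^10 entries per table */

    static uint32_t root_index(uint32_t va)  { return va >> (OFFSET_BITS + INDEX_BITS); }
    static uint32_t table_index(uint32_t va) { return (va >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    static uint32_t page_offset(uint32_t va) { return va & ((1u << OFFSET_BITS) - 1); }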
Two-Level Scheme for 32-bit Address
Translation Lookaside Buffer (TLB)
- Each virtual memory reference can cause two physical memory accesses
- one to fetch the page table entry
- one to fetch the data
- To overcome this problem a high-speed cache is set up for page
table entries
- called the TLB - Translation Lookaside Buffer
- Contains page table entries that have been most recently used
- Performs associative mapping
- Functions like a memory cache
Operation of TLB
- Given a virtual address, processor examines the TLB
- If page table entry is present (a hit), the frame number is
retrieved and the real address is formed
- If page table entry is not found in the TLB (a miss), the page
number is used to index the process page table
- First checks if page is already in main memory
- If not in main memory a page fault is issued,
and the OS brings the page into main memory
- The TLB is updated to include the new page entry
How is the TLB distinguished functionally from the memory cache?
How do they operate together?
What effects limit the practical size of the TLB?
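The lookup sequence above can be sketched in C-like pseudocode. This is only
an illustration of the logic, not real MMU code: tlb_lookup(),
page_table_lookup(), tlb_insert(), and page_fault() are hypothetical helpers
standing in for hardware and OS mechanisms, and the pte structure and
OFFSET_BITS are from the earlier sketches.

    uint32_t translate(uint32_t va) {
        uint32_t vpn = va >> OFFSET_BITS;          /* virtual page number          */
        struct pte *e = tlb_lookup(vpn);           /* fast associative search      */
        if (e == NULL) {                           /* TLB miss                     */
            e = page_table_lookup(vpn);            /* costs an extra memory access */
            if (!e->resident)
                page_fault(va);                    /* trap to OS; process blocks   */
            tlb_insert(vpn, e);                    /* update the TLB               */
        }
        return (e->frame << OFFSET_BITS) | (va & ((1u << OFFSET_BITS) - 1));
    }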
Operation of TLB as Flowchart
Use of a Translation Lookaside Buffer
TLB Fault Processing
The figure shows the timing and causal relationships between the
events and agents involved in a TLB fault. It starts when the CPU
issues a memory fetch operation (possibly for data or for an
instruction) that refers to an address whose page frame number is not
in the TLB. The TLB fetches the corresponding page table entry from
memory. It then translates the virtual address to a physical address
and fetches the operand originally requested by the CPU.
The fetch operation ends up taking two memory cycles. In order for
this not to slow the execution down appreciably, most memory
references must be TLB hits.
Relation of TLB to Memory Cache
Page Fault Processing
The figure shows the timing and causal relationships between
some of the events and agents involved in a page fault. It starts
when the CPU issues a memory reference for an address that is not
in the TLB.
In order to focus on the interesting events, the figure leaves out
many details, among them the entire role of
the memory cache(s), and the details of the disk subsystem.
The dashed arrows allude to some of these details that we
cannot ignore entirely,
i.e., the many interactions between the executing OS software
and the memory that are needed to handle the page fault, bring the
page in from the disk, and update the page table.
The OS actions start with the execution of an interrupt (or trap)
handler, the page fault (PF) handler. The OS marks the page-faulted process
as being blocked, and initiates a disk read
operation to fetch the page into main memory. The PF handler may do
some of this work directly, or it may wake up a pager process and let
that process do the work. (What are the arguments for each of these
two alternatives? Is there a middle ground?)
At some point the PF handler will call the dispatcher,
so that some other process can use the CPU while the disk is busy.
When the disk subsystem (what is included in the disk subsystem?)
has brought the page into memory, it interrupts the CPU again. The
disk I/O interrupt handler, or a pager process that it wakes up,
updates the page table in memory, and marks the page-faulted process
as being ready.
Eventually, the dispatcher will resume execution of the
page-faulted process. It will reissue the instruction that
caused the page fault. This time the TLB will succeed in
fetching the page table entry from memory, and the operation
of the TLB will proceed as in the preceding figure, i.e.,
there will be a TLB fault, the page table entry will be
fetched, the TLB will be updated, and finally the CPU's
original memory reference will be completed.
After all this, the system is back at the point of the original TLB
fault, which led to the page fault. It is clear that the OS does
quite a bit of work to handle each page fault. It follows that we
cannot afford to have page faults very often.
In particular, it would be disastrous if the OS code for handling a
page fault itself were to cause a page fault! If it did, the system
could get into an infinite cycle of page faults. For this reason, the
code that handles page faults is kept permanently in main memory.
(However, it is possible that some of the processing by the
OS for a page fault may generate further page faults. In particular,
this may be the case if the page table itself is paged.)
Page Size Performance Trade-Offs
- Smaller page size => less internal fragmentation
- Smaller page size => fewer page faults
- Larger page size  => more disk I/O efficiency
- Larger page size  => smaller page tables
                    => fewer pages per process
                    => fewer TLB misses, fewer secondary page faults
Graph (a) shows the general effect of increasing the page
size. I think this is a little bit misleading, since it is not clear
from the figure what assumptions are being made about the
number of page frames assigned to the process. The explanation
for the increase on the left edge of the chart is that as the
page size increases the probability of having the entire "working set"
in memory goes down. However, that presumes a limited amount
of real memory is available. On the other hand, the reason given
for the decrease in page fault rate on the right end of the chart
is that the whole process fits into one page. That seems to
presume there is enough real memory to hold the entire process.
Graph (b) shows how, as the number of frames allocated to
the process grows (for any page size), the page fault rate
decreases, until the whole process is in memory.
Example Page Sizes
Atlas                    512 48-bit words
Honeywell-Multics        1024 36-bit words
IBM 370/XA and 370/ESA   4 Kbytes
VAX family               512 bytes
IBM AS/400               512 bytes
DEC Alpha                8 Kbytes
MIPS                     4 Kbytes to 16 Mbytes
UltraSPARC               8 Kbytes to 4 Mbytes
Pentium                  4 Kbytes to 4 Mbytes
PowerPC                  4 Kbytes
Page sizes are chosen based on experimentation, with
performance monitoring.
Variable Page Sizes?
- Typical sizes have grown from 512 to 8K bytes -- Why?
- Variable page size may help
- can make more effective use of the TLB
- large pages for program instructions
- small pages for thread stacks
- MIPS R4000, Alpha, UltraSPARC, Pentium support multiple page sizes
- Most OS's still support only one page size
Segmentation
- Segment sizes may be unequal, dynamic
- Advantages:
- Simplifies handling of growing data structures
- Allows programs to be altered and recompiled independently
- Lends itself to sharing data among processes
- Lends itself to protection
Each Segment Table Entry Contains
- resident bit: is the segment in main memory?
- modified bit: has segment been modified since it was loaded into main memory?
- starting address of corresponding segment in main memory, if resident
- length of the segment
- other control bits, e.g. for protection
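As with page table entries, a segment table entry can be pictured as a small
C structure; the layout below is a hypothetical sketch, not any particular
machine's format.

    #include <stdint.h>

    struct ste {                       /* one segment table entry (hypothetical) */
        uint32_t base;                 /* starting physical address, if resident */
        uint32_t length;               /* segment length, for bounds checking    */
        uint32_t resident : 1;         /* is the segment in main memory?         */
        uint32_t modified : 1;         /* altered since it was loaded?           */
        uint32_t protect  : 3;         /* protection/control bits                */
    };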
Segment Table Entries
Combined Paging and Segmentation
- Paging is transparent to the programmer
- Paging eliminates external fragmentation
- Segmentation is visible to the programmer
- Segmentation allows for growing data structures, modularity,
and support for sharing and protection
- Each segment is broken into fixed-size pages
Combined Segmentation and Paging
Example: Typical Protection Relationships Between Segments
Other Protection Relationships
- Process cannot execute from its own data and stack segments
- Process can read its own PCB
- Process cannot modify its own PCB (except in supervisor mode)
What are the benefits of each of these restrictions?
Implementation Issues for Paged Virtual Memory
- Fetch Policy
- Placement Policy
- Replacement Policy
- Resident Set Management
Fetch Policy
- Determines when a page should be brought into memory
- Demand paging only brings pages into main memory when a reference
is made to a location on the page
- Many page faults when a process is first started
- Prepaging brings in more pages than needed at the moment
- More efficient to bring in pages that reside contiguously on the disk
- Hard to predict future usage, except for specially designed (realtime)
application architectures
Placement Policy
- Where should segment/page be placed?
- Usually not important for systems with paging
- Might be important for NUMA (Non-Uniform Memory Access) machines
Replacement Policy
- Which page is replaced?
- Page removed should be the page least likely to be referenced in the
near future
- Most policies predict the future behavior on the basis of past behavior
Frame Locking
- Associate a lock bit with each frame
- If frame is locked, it may not be replaced
- Candidates for locking:
- Kernel of the operating system
- OS's control data structures
- I/O buffers
What are the reasons for locking each of the above?
The Unix API for this feature is the functions
mlock(),
munlock(),
mlockall(), and
munlockall(). Take a look at the man-pages
for these.
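A minimal usage sketch follows (error handling abbreviated; locking normally
requires appropriate privileges or a sufficient RLIMIT_MEMLOCK).

    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t len = 4096;
        char *buf = malloc(len);
        if (buf == NULL) return 1;
        if (mlock(buf, len) != 0) {        /* pin the buffer in main memory  */
            perror("mlock");
            return 1;
        }
        /* ... use buf for latency-critical or I/O work ... */
        munlock(buf, len);                 /* allow it to be paged out again */
        free(buf);
        return 0;
    }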
Basic Replacement Policies
- Optimal policy
- Least Recently Used (LRU)
- First-In First-Out (FIFO)
- Clock
- Page Buffering
Optimal Policy
- Selects for replacement that page for which the time to the next
reference is the longest
- Impossible to have perfect knowledge of future events
- Useful for comparison
Least Recently Used (LRU)
- Replaces the page that has not been referenced for the longest time
- By the principle of locality, this should be the page least likely to be
referenced in the near future
- Each page could be tagged with the time of last reference. This would
require a great deal of overhead.
Can you explain why implementing LRU is not feasible?
That is, where is the "overhead" mentioned above and how do we
know it is too large for us to want to implement LRU?
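A sketch of what exact LRU would require makes the overhead visible: a
timestamp update on every memory reference, plus a search at replacement time.
The frame count and data structures below are assumptions for illustration.

    #include <stdint.h>

    #define NFRAMES 64                     /* assumed resident set size          */
    static uint64_t last_used[NFRAMES];    /* "time" of last reference per frame */
    static uint64_t now;

    void on_reference(int frame) {         /* must run on EVERY memory reference */
        last_used[frame] = ++now;
    }

    int lru_victim(void) {                 /* scan for the oldest frame          */
        int victim = 0;
        for (int i = 1; i < NFRAMES; i++)
            if (last_used[i] < last_used[victim])
                victim = i;
        return victim;
    }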
First-in, first-out (FIFO)
- Treats page frames allocated to a process as a circular buffer
- Pages are removed in round-robin style
- Simplest replacement policy to implement
- Page that has been in memory the longest is replaced
- These pages may be needed again very soon
Clock Policy
- Additional bit called a use bit
- When a page is first loaded in memory, the use bit is set to 0
- When the page is referenced, the use bit is set to 1
- When it is time to replace a page, the first frame encountered with the use bit set to 0 is replaced.
- During the search for replacement, each use bit previously set to 1 is changed to 0
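A minimal sketch of the clock scan over a fixed set of frames follows. The
frame count and the use[] array are assumptions; in a real system the use bit
lives in the page table entry and is set by hardware.

    #define NFRAMES 64                     /* assumed resident set size           */
    static int use[NFRAMES];               /* use bits; set to 1 when referenced  */
    static int hand;                       /* the circulating clock pointer       */

    int clock_victim(void) {
        for (;;) {
            if (use[hand] == 0) {          /* first frame found with use bit 0   */
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            use[hand] = 0;                 /* give the frame a second chance     */
            hand = (hand + 1) % NFRAMES;
        }
    }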
Example of Clock Policy Operation
Using The Modified-Bit
Modified?   Used?   Priority (1 = replace first)
   0          0        1
   0          1        2
   1          0        3
   1          1        4
The clock algorithm can be extended to make use of a modified bit
along with the use bit. This gives a slightly closer approximation to
the LRU behavior, and potentially cuts in half the amount of disk traffic.
Where is the difference in the disk traffic?
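One way to express the ranking in the table is a class number per frame, lower
meaning a better victim; modified[] here is a hypothetical per-frame copy of
the modified bits, and use[] is as in the clock sketch above.

    static int modified[NFRAMES];          /* hypothetical per-frame modified bits */

    int victim_class(int frame) {          /* lower class = preferred victim       */
        return 2 * modified[frame] + use[frame];
    }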
Replacement Algorithms
The figure shows approximately how the algorithms compare
in performance on a simulated page-reference string.
Page Buffering
- Works with some other basic replacement policy
- Replaced page is added to one of two lists:
- Free page list, if the page has not been modified
- Modified page list, otherwise
- With luck, a page may be still in memory when it is next needed
- When frames are needed they are taken first from the free page list
How can we afford to leave pages in memory, if we needed to replace them?
Only makes sense if the OS reserves some of main memory for the page buffer.
Makes more sense if each process has a set limit on its resident set size (see local
replacement, below).
Resident Set Management
- Size
- Fixed allocation
- gives a process a fixed number of pages within which to execute
- when a page fault occurs, one of the pages of that process must be replaced
- Variable allocation
- number of pages allocated to a process varies over the lifetime of the process
- Scope
- Local - choose from the same process
- Global - choose from all processes
Variable Allocation, Global Scope
- Easiest to implement
- Adopted by many operating systems
- Operating system keeps list of free frames
- Free frame is added to resident set of process when a page fault occurs
- If no free frame, replaces one from another process
Variable Allocation, Local Scope
- When new process added, allocate number of page frames based
on application type, program request, or other criteria
- When page fault occurs, select page from among the resident set
of the process that suffers the fault
- Reevaluate allocation from time to time
Working Set Concept
- Definition: The set of pages that have been referenced
in the last D virtual time units
(depends on window size D)
- Observation: The working set size changes, but stays stable over
some fairly long intervals of time
(consequence of Principle of Locality)
What is meant by virtual time here?
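A small sketch follows, treating virtual time as a count of the process's own
memory references (an assumption consistent with the usual definition) and
using a made-up reference string with an assumed window of DELTA = 4
references; the working set at time t is just the set of distinct pages
referenced in the window ending at t.

    #include <stdio.h>

    #define DELTA 4                                   /* assumed window size      */

    int main(void) {
        int refs[] = {1, 2, 1, 3, 2, 4, 4, 4, 5, 5};  /* made-up reference string */
        int n = sizeof refs / sizeof refs[0];
        for (int t = 0; t < n; t++) {                 /* virtual time = ref count */
            int seen[16] = {0}, size = 0;
            for (int k = t; k >= 0 && k > t - DELTA; k--)
                if (!seen[refs[k]]) { seen[refs[k]] = 1; size++; }
            printf("t=%d  working set size=%d\n", t, size);
        }
        return 0;
    }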
Resident Set Size Strategy Based on Working Set Model
Naive idea:
- Monitor the working set of each process
- Periodically remove from the resident set the pages that are not in the working set
- A process may execute only if its working set fits in main memory
Problems:
- Working set changes over time
- Keeping track of the actual working set is impractical
- How do we determine the window size?
Observation:
- When working set is not in memory, page fault rate goes up
- If the page fault frequency (PFF) is within some range, the working set is in memory
Page Fault Frequency (PFF) Algorithm
- Use bit per page, set when page is accessed (by hardware)
- Keep track of time interval between page faults
- If interval is too short, increase working set size
- If interval is too long, discard all pages with use bit 0, and shrink the resident set
How do we determine the thresholds?
Why does this not perform well during shifts to new locality?
How does the VSWS (variable-interval sampled working set) policy address this problem?
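A rough sketch of the PFF decision made at each page fault follows. The
threshold F, the measure of virtual time, and the helpers grow_resident_set(),
discard_unused_pages(), and clear_all_use_bits() are all assumptions for
illustration.

    void on_page_fault(unsigned long vtime) {   /* vtime = references so far     */
        static unsigned long last_fault;
        if (vtime - last_fault < F) {           /* faults too close together:    */
            grow_resident_set();                /*   add a frame to this process */
        } else {                                /* faults far apart:             */
            discard_unused_pages();             /*   drop pages with use bit 0   */
        }
        clear_all_use_bits();                   /* start a fresh observation     */
        last_fault = vtime;
    }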
Cleaning Policy
- Determines when modified pages are written out
- Demand cleaning
- a page is written out only when it has been selected for replacement
- Precleaning
- pages are written out in batches
- Best approach uses page buffering
- Replaced pages are placed in two lists
- Pages in the modified list are periodically written out in batches
- Pages in the unmodified list are either reclaimed if referenced again or
lost when their frames are assigned to other pages
Load Control
- Determines the number of processes that will be resident in main memory
- With too few processes, there will be many occasions when all processes
are blocked and much time will be spent in swapping
- Too many processes will lead to thrashing
- Some Policies
- Working set (implicit)
- Page fault frequency
- Multiprogramming level = number of processes in memory
- L = S criterion: mean time between page faults = mean time to
process page fault
- 50% criterion: keep paging device utilization at 50%
- Clock scanning rate threshold: if too high, reduce MP level
Process Suspension
- Determines which process to suspend (roll out)
- Lowest priority process
- Faulting process
- does not have its working set in main memory so it will be blocked anyway
- Last process activated
- likely to have its working set resident
- Process with smallest resident set
- requires the least future effort to reload
- Largest process
- obtains the most free frames
- Process with the largest remaining execution window
UNIX and Solaris Memory Management Data Structures
- Page table: table per process, entry per page
- Disk block descriptor: entry per disk block in swap space
- Page frame data table: one table, entry per real memory page
- Swap-use table: table per swap device
SVR4 Unix Memory Management Table Structures
UNIX and Solaris Memory Management
- Page Replacement
- refinement of the clock policy
- Kernel Memory Allocator
- most blocks are smaller than a typical page size
- a "lazy" Buddy System is used
What is meant by a "lazy" Buddy System? What is the intended
advantage over the normal Buddy System?
Linux Memory Management
See the separate notes on Linux memory management; they are kept in
their own file in order to provide more detail.
Windows 2000 Memory Management
- Resident Set Management = variable allocation, local scope
- resident sets allowed to grow when main memory is plentiful
- resident sets reduced when main memory is scarce
- Virtual address space split in two:
- 2 Gbytes of shared system space (default)
- 2 Gbytes of per-process space (default)
- Three sets of page frames
- Available - not currently used by any process
- Reserved - set aside for a process but not yet committed (not yet in use)
- Committed - allocated to a process & swap space also set aside
Windows 2000 Memory Management
© 2002, 2005
T. P. Baker &
Florida State University.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means without written permission.