Page cache: implement page eviction for user-mapped pages #2097


Open · wants to merge 14 commits into base: master

Conversation

francescolavra
Member

This change set:

  • creates a generic infrastructure for robust handling of low-memory and out-of-memory conditions in the kernel, integrating allocation functions with "memory cleaners" so that when an allocation request fails, memory can be reclaimed from different kernel subsystems (optionally waiting for peripheral I/O or other CPUs to do the necessary work) in order to satisfy the allocation request
  • removes boilerplate code (therefore decreasing kernel code size) for memory allocations done at boot time that are required for starting the kernel (including klibs if any) and the user program, by using the MEM_NOFAIL flag in calls to the mem_alloc() function
  • enhances the VirtIO balloon driver by adding support for deflating the balloon synchronously (i.e. suspending the current context in order to wait for the device to acknowledge the deflate operation): this allows the driver to release memory when an allocation fails and waiting is permitted, and makes it more robust in out-of-memory conditions because it can operate on stack-allocated memory instead of requiring heap allocations
  • fixes some issues in the TLB shootdown implementation by adding a synchronous shootdown mechanism that invokes a CPU "rendezvous" in order to synchronize all CPUs when a page table entry is modified
  • enhances the "memory cleaner" implementation for the page cache by adding support for evicting user-mapped pages, which allows releasing memory during low-memory or out-of-memory conditions: when invoked, the page cache memory cleaner scans both shared and private mappings, and evicts "old" pages, i.e. pages that have not been accessed recently (in out-of-memory conditions, the cleaner is more aggressive and evicts even recently accessed pages)

In addition, there are fixes to miscellaneous issues that have been identified when testing the kernel under different low-memory and out-of-memory conditions.

In a future commit, the `kvirt` struct field will be used also for
anonymous mappings.
When the update_timer() function is called during the runloop, if
it returns a non-zero timestamp, the returned value must be used to
trigger an interrupt on the local timer, otherwise the relevant
timer event is missed. If the async queues are serviced after the
update_timer() call and servicing an async operation does not
return to the runloop, the timestamp returned by update_timer() is
lost and the local timer is not armed, thereby missing the relevant
timer event.
Fix this issue by moving all servicing of async queues to the
beginning of the runloop, before the update_timer() call, using a
loop that runs until no operations are dequeued from any async
queue.
Add a new memory cleaner type: a "waiting" cleaner, i.e. a cleaner
that can optionally wait (for work to be done by peripherals or
other CPUs) when releasing memory. Waiting cleaners are invoked
with a set of flags that specify whether they are allowed to wait
and whether their invocation is triggered by an out-of-memory
condition.
Add a mem_alloc() function, that takes a set of allocation flags,
and optionally invokes memory cleaners (via a new mem_clean()
function) if the allocation fails.
Move initialization of memory cleaning data structures to
init_kernel_heaps() so that memory cleaning API functions (and thus
mem_alloc()) can be called before kernel_runtime_init().
Change the mm_service() function to a new mem_service() function
that is meant to be called only from a background task and returns
whether memory has been cleaned during its invocation.
Change the pagecache code to call mem_clean() when page allocation
fails.
Change the anonymous page fault handler to call mem_alloc() when
allocating an anonymous page.
These changes are intended to create a generic infrastructure for
robust handling of low-memory and out-of-memory conditions.
For memory allocations done at boot time that are required for
starting the kernel (including klibs if any) and the user program,
replace calls to the allocate() macro with calls to the mem_alloc()
function (with the MEM_NOFAIL flag set): this allows removing
relevant assert() statements and therefore decreases code size. Use
the additional MEM_NOWAIT flag for allocations done before (or
during) kernel_runtime_init(), because at that time the
functionality to suspend a context or wait for other CPUs is not in
place yet.
Do the same change also for allocations that are not supposed to
fail in user programs (unit and runtime tests).
When creating a new context, instead of asserting that memory
allocation is successful, use the mem_alloc() function (with the
appropriate flags) and handle any allocation errors gracefully:
this avoids kernel crashes caused by the get_process_context()
function (which can be called as a result of syscall execution)
when the kernel runs out of memory.
If an allocation fails because of insufficient memory,
allocate_region() should return INVALID_ADDRESS (which should be
handled properly by the calling code) instead of crashing the
kernel.
This function is called from a syscall context after instantiating
a contextual closure that resumes the context upon file I/O
completion. If this function suspends the current context while
waiting for pages to be fetched from the cache, the contextual
closure can be invoked during context suspension, which corrupts
the stack of the suspended context and causes a kernel crash when
the context is resumed. Fix this issue by using an asynchronous
task (whose context can be safely suspended) to fetch the pages.
The memory cleaner for the virtio balloon device deflates the
balloon when receiving a request to release memory.
To follow the virtio specifications, instantiate the cleaner only
when the "deflate_on_oom" feature is negotiated, and when the
cleaner is instantiated, deflate the balloon only when the system
is running out of memory.
In addition, add support for deflating the balloon synchronously
(i.e. suspending the current context to wait for the device to
acknowledge the deflate operation): this allows the cleaner to be
used to release memory when an allocation fails and waiting is
permitted, and makes the cleaner more robust in out-of-memory
conditions because it can operate on stack-allocated memory
instead of requiring heap allocations.
Make the deflate function more robust against allocation failures,
and retry a deflate operation via a kernel timer if not all
requested pages could be deflated (e.g. due to allocation
failures).
When the user program is faulted-in on-demand, relevant pages are
put in the page cache; if the program code modifies this memory
(e.g. when writing to static variables), those changes are made to
the page cache, and even though they are not persisted to disk
(i.e. the executable file in the filesystem is not modified), they
appear when the executable file is read via standard syscalls, as
exemplified by the following code:
```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

static int static_var;

int main(void)
{
    void *b1 = malloc(100000);
    int fd = open("executable_file", O_RDONLY);
    int size = read(fd, b1, 100000);
    static_var++;
    void *b2 = malloc(100000);
    lseek(fd, 0, SEEK_SET);
    size = read(fd, b2, size);
    printf("size %d memcmp %d\n", size, memcmp(b1, b2, size));
    return 0;
}
```
When the above code is run, the memcmp() function returns a
non-zero value, which indicates that the assignment to the static
variable is reflected in the in-cache memory backed by the
executable file.
Fix the above issue by adding a `private` boolean parameter to the
pagecache_get_page() and pagecache_get_page_if_filled() functions:
when the argument value is true, "steal" the page from the cache so
that it becomes "owned" by the user program; this allows any writes
to page memory by the program to not be reflected in any
file-backed pages, while avoiding the overhead of the copy-on-write
mechanism.
This makes on-demand retrieval of file-backed pages more robust in
low-memory conditions, e.g. it allows purging existing pages from
the cache in order to be able to create a new file-backed page to be accessed
by the user program (similarly to what is done when creating a new
page for anonymous mappings).
The unmap_and_free_phys() function uses a range_handler to
deallocate physical memory pages as they are encountered during
page table traversal, and then calls page_invalidate_sync() to
perform a TLB shootdown. Between physical memory deallocation and
PTE invalidation, another CPU can potentially re-allocate the page
just freed and use it for other purposes while it is still
referenced in the TLB as associated to the mapping being torn down;
this can cause memory corruption issues. The same issue affects
the pagecache_node_unmap_pages() function.
A separate but similar issue affects the pagecache code that scans
shared file-backed mappings: when it finds a dirty page, it clears
the dirty flag, and then invokes a TLB shootdown; between the
clearing of the dirty flag and the PTE invalidation, another CPU
can potentially write to the mapped page and fail to re-set the
dirty flag in the PTE (because this flag may already be set in
the local TLB), causing the write to fail to be persisted on disk.

Fix the first issue by ensuring that a physical memory page is
deallocated only after TLB shootdown is performed. Try to do the
shootdown asynchronously (i.e. avoid waiting for the other CPUs to
do the invalidation) if possible, and resort to a synchronous
shootdown on memory allocation failure (the
unmap_and_free_phys_sync() and pagecache_node_unmap_pages_sync()
functions operate on stack memory only). Amend the TLB shootdown
code to implement a CPU "rendezvous" when a synchronous shootdown
is requested: the requesting CPU sends an IPI, waits for the other
CPUs to join the rendezvous, executes the completion closure
associated to the flush entry, then releases the other CPUs, which
can then leave the rendezvous.

Fix the second issue by invoking a synchronous TLB shootdown and
clearing the dirty flag during the CPU rendezvous.

Opportunistically amend the pagecache node data structure to
handle both shared and private mappings (in preparation for a
future commit which will implement scanning of private mappings),
using an embedded struct rangemap instead of a pointer to a
separately allocated struct, and adding proper locking.
Page cache pages that are mapped by the user process and faulted-in
on-demand are never evicted from the cache until they are unmapped;
this includes all the read-only sections of the user program ELF
(which are never unmapped).
To allow releasing memory from the above pages during low-memory or
out-of-memory conditions, enhance the memory cleaner implementation
so that it scans both shared and private mappings, and evicts "old"
pages, i.e. pages that have not been accessed recently (in OOM
conditions, be more aggressive by evicting even recently accessed
pages). To do the eviction safely on SMP machines, invoke a
synchronous (rendezvous-based) TLB shootdown.
This optimizes memory utilization by the page cache, e.g. it allows
more memory to be released to the host OS via a virtio balloon
device.

This change can cause a SIGBUS signal to be delivered to the user
process when a page cannot be faulted-in during the "ruby_alloc"
end-to-end test; the Ruby program handles this signal by dumping
process state information and then aborting (i.e. raising a SIGABRT
signal); therefore, to make this test pass, add "6" to the list of
expected exit codes for this test.
@francescolavra francescolavra force-pushed the feature/pagecache-drain branch from e4ae114 to 1f18e49 Compare April 15, 2025 13:18