Page cache: implement page eviction for user-mapped pages #2097


Open · wants to merge 14 commits into base: master

Conversation

francescolavra
Member

This change set:

  • creates a generic infrastructure for robust handling of low-memory and out-of-memory conditions in the kernel, integrating allocation functions with "memory cleaners" so that when an allocation request fails, memory can be reclaimed from different kernel subsystems (optionally waiting for peripheral I/O or other CPUs to do the necessary work) in order to satisfy the allocation request
  • removes boilerplate code (therefore decreasing kernel code size) for memory allocations done at boot time that are required for starting the kernel (including klibs if any) and the user program, by using the MEM_NOFAIL flag in calls to the mem_alloc() function
  • enhances the VirtIO balloon driver by adding support for deflating the balloon synchronously (i.e. suspending the current context in order to wait for the device to acknowledge the deflate operation): this allows the driver to release memory when an allocation fails and waiting is permitted, and makes it more robust in out-of-memory conditions because it can operate on stack-allocated memory instead of requiring heap allocations
  • fixes some issues in the TLB shootdown implementation by adding a synchronous shootdown mechanism that invokes a CPU "rendezvous" in order to synchronize all CPUs when a page table entry is modified
  • enhances the "memory cleaner" implementation for the page cache by adding support for evicting user-mapped pages, which allows releasing memory during low-memory or out-of-memory conditions: when invoked, the page cache memory cleaner scans both shared and private mappings, and evicts "old" pages, i.e. pages that have not been accessed recently (in out-of-memory conditions, the cleaner is more aggressive and evicts even recently accessed pages)

In addition, there are fixes to miscellaneous issues that have been identified when testing the kernel under different low-memory and out-of-memory conditions.

In a future commit, the `kvirt` struct field will be used also for
anonymous mappings.
When the update_timer() function is called during the runloop, if
it returns a non-zero timestamp, the returned value must be used to
trigger an interrupt on the local timer, otherwise the relevant
timer event is missed. If the async queues are serviced after the
update_timer() call and servicing an async operation does not
return to the runloop, the timestamp returned by update_timer() is
lost and the local timer is not armed, thereby missing the relevant
timer event.
Fix this issue by moving all servicing of async queues to the
beginning of the runloop, before the update_timer() call, using a
loop that runs until no operations are dequeued from any async
queue.
Add a new memory cleaner type: a "waiting" cleaner, i.e. a cleaner
that can optionally wait (for work to be done by peripherals or
other CPUs) when releasing memory. Waiting cleaners are invoked
with a set of flags that specify whether they are allowed to wait
and whether their invocation is triggered by an out-of-memory
condition.
Add a mem_alloc() function, that takes a set of allocation flags,
and optionally invokes memory cleaners (via a new mem_clean()
function) if the allocation fails.
Move initialization of memory cleaning data structures to
init_kernel_heaps() so that memory cleaning API functions (and thus
mem_alloc()) can be called before kernel_runtime_init().
Change the mm_service() function to a new mem_service() function
that is meant to be called only from a background task and returns
whether memory has been cleaned during its invocation.
Change the pagecache code to call mem_clean() when page allocation
fails.
Change the anonymous page fault handler to call mem_alloc() when
allocating an anonymous page.
These changes are intended to create a generic infrastructure for
robust handling of low-memory and out-of-memory conditions.
For memory allocations done at boot time that are required for
starting the kernel (including klibs if any) and the user program,
replace calls to the allocate() macro with calls to the mem_alloc()
function (with the MEM_NOFAIL flag set): this allows removing
relevant assert() statements and therefore decreases code size. Use
the additional MEM_NOWAIT flag for allocations done before (or
during) kernel_runtime_init(), because at that time the
functionality to suspend a context or wait for other CPUs is not in
place yet.
Do the same change also for allocations that are not supposed to
fail in user programs (unit and runtime tests).
When creating a new context, instead of asserting that memory
allocation is successful, use the mem_alloc() function (with the
appropriate flags) and handle any allocation errors gracefully:
this avoids kernel crashes caused by the get_process_context()
function (which can be called as a result of syscall execution)
when the kernel runs out of memory.
If an allocation fails because of insufficient memory,
allocate_region() should return INVALID_ADDRESS (which should be
handled properly by the calling code) instead of crashing the
kernel.
This function is called from a syscall context after instantiating
a contextual closure that resumes the context upon file I/O
completion. If this function suspends the current context while
waiting for pages to be fetched from the cache, the contextual
closure can be invoked during context suspension, which corrupts
the stack of the suspended context and causes a kernel crash when
the context is resumed. Fix this issue by using an asynchronous
task (whose context can be safely suspended) to fetch the pages.
The memory cleaner for the virtio balloon device deflates the
balloon when receiving a request to release memory.
To follow the virtio specifications, instantiate the cleaner only
when the "deflate_on_oom" feature is negotiated, and when the
cleaner is instantiated, deflate the balloon only when the system
is running out of memory.
In addition, add support for deflating the balloon synchronously
(i.e. suspending the current context to wait for the device to
acknowledge the deflate operation): this allows the cleaner to be
used to release memory when an allocation fails and waiting is
permitted, and makes the cleaner more robust in out-of-memory
conditions because it can operate on stack-allocated memory
instead of requiring heap allocations.
Make the deflate function more robust against allocation failures,
and retry a deflate operation via a kernel timer if not all
requested pages could be deflated (e.g. due to allocation
failures).
When the user program is faulted-in on-demand, relevant pages are
put in the page cache; if the program code modifies this memory
(e.g. when writing to static variables), those changes are made to
the page cache, and even though they are not persisted to disk
(i.e. the executable file in the filesystem is not modified), they
appear when the executable file is read via standard syscalls, as
exemplified by the following code:
```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

static int static_var;

int main(void)
{
    void *b1 = malloc(100000);
    int fd = open("executable_file", O_RDONLY);
    int size = read(fd, b1, 100000);
    static_var++;
    void *b2 = malloc(100000);
    lseek(fd, 0, SEEK_SET);
    size = read(fd, b2, size);
    printf("size %d memcmp %d\n", size, memcmp(b1, b2, size));
    return 0;
}
```
When the above code is run, the memcmp() function returns a
non-zero value, which indicates that the assignment to the static
variable is reflected in the in-cache memory backed by the
executable file.
Fix the above issue by adding a `private` boolean parameter to the
pagecache_get_page() and pagecache_get_page_if_filled() functions:
when the argument value is true, "steal" the page from the cache so
that it becomes "owned" by the user program; this allows any writes
to page memory by the program to not be reflected in any
file-backed pages, while avoiding the overhead of the copy-on-write
mechanism.
This makes on-demand retrieval of file-backed pages more robust in
low-memory conditions, e.g. it allows purging existing pages from
the cache in order to be able to create a new file-backed page to be accessed
by the user program (similarly to what is done when creating a new
page for anonymous mappings).
The unmap_and_free_phys() function uses a range_handler to
deallocate physical memory pages as they are encountered during
page table traversal, and then calls page_invalidate_sync() to
perform a TLB shootdown. Between physical memory deallocation and
PTE invalidation, another CPU can potentially re-allocate the page
just freed and use it for other purposes while it is still
referenced in the TLB as associated to the mapping being torn down;
this can cause memory corruption issues. The same issue affects
the pagecache_node_unmap_pages() function.
A separate but similar issue affects the pagecache code that scans
shared file-backed mappings: when it finds a dirty page, it clears
the dirty flag, and then invokes a TLB shootdown; between the
clearing of the dirty flag and the PTE invalidation, another CPU
can potentially write to the mapped page and fail to re-set the
dirty flag in the PTE (because this flag may already be set in
the local TLB), causing the write to fail to be persisted on disk.

Fix the first issue by ensuring that a physical memory page is
deallocated only after TLB shootdown is performed. Try to do the
shootdown asynchronously (i.e. avoid waiting for the other CPUs to
do the invalidation) if possible, and resort to a synchronous
shootdown on memory allocation failure (the
unmap_and_free_phys_sync() and pagecache_node_unmap_pages_sync()
functions operate on stack memory only). Amend the TLB shootdown
code to implement a CPU "rendezvous" when a synchronous shootdown
is requested: the requesting CPU sends an IPI, waits for the other
CPUs to join the rendezvous, executes the completion closure
associated to the flush entry, then releases the other CPUs, which
can then leave the rendezvous.

Fix the second issue by invoking a synchronous TLB shootdown and
clearing the dirty flag during the CPU rendezvous.

Opportunistically amend the pagecache node data structure to
handle both shared and private mappings (in preparation for a
future commit which will implement scanning of private mappings),
using an embedded struct rangemap instead of a pointer to a
separately allocated struct, and adding proper locking.
Page cache pages that are mapped by the user process and faulted-in
on-demand are never evicted from the cache until they are unmapped;
this includes all the read-only sections of the user program ELF
(which are never unmapped).
To allow releasing memory from the above pages during low-memory or
out-of-memory conditions, enhance the memory cleaner implementation
so that it scans both shared and private mappings, and evicts "old"
pages, i.e. pages that have not been accessed recently (in OOM
conditions, be more aggressive by evicting even recently accessed
pages). To do the eviction safely on SMP machines, invoke a
synchronous (rendezvous-based) TLB shootdown.
This optimizes memory utilization by the page cache, e.g. it allows
more memory to be released to the host OS via a virtio balloon
device.

This change can cause a SIGBUS signal to be delivered to the user
process when a page cannot be faulted-in during the "ruby_alloc"
end-to-end test; the Ruby program handles this signal by dumping
process state information and then aborting (i.e. raising a SIGABRT
signal); therefore, to make this test pass, add "6" to the list of
expected exit codes for this test.
@francescolavra francescolavra force-pushed the feature/pagecache-drain branch from e4ae114 to 1f18e49 Compare April 15, 2025 13:18