Page cache: implement page eviction for user-mapped pages #2097
Open: francescolavra wants to merge 14 commits into master from feature/pagecache-drain
In a future commit, the `kvirt` struct field will also be used for anonymous mappings.
When the update_timer() function is called during the runloop and returns a non-zero timestamp, the returned value must be used to trigger an interrupt on the local timer; otherwise the relevant timer event is missed. If the async queues are serviced after the update_timer() call and an async operation does not return, the timestamp returned by update_timer() is lost and the local timer is not armed, so the relevant timer event is missed. Fix this issue by moving all servicing of async queues to the beginning of the runloop, before the update_timer() call, using a loop that runs until no operations are dequeued from any async queue.
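The ordering described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the kernel's code: the queue type, `enqueue_op()`, `dequeue_and_run()`, `runloop_pass()`, `armed_deadline` and the demo operations are all hypothetical names, and assigning `armed_deadline` merely stands in for arming the local timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the kernel source. */
#define NQUEUES 2
#define QCAP    8

typedef void (*op_fn)(void);

struct op_queue {
    op_fn ops[QCAP];
    size_t head, tail;              /* head == tail means the queue is empty */
};

static struct op_queue queues[NQUEUES];
static uint64_t armed_deadline;     /* 0 means the local timer is not armed */
static int ops_run;

static bool enqueue_op(struct op_queue *q, op_fn fn)
{
    assert(fn != NULL);
    if (q->tail - q->head == QCAP)
        return false;               /* queue full */
    q->ops[q->tail++ % QCAP] = fn;
    return true;
}

static bool dequeue_and_run(struct op_queue *q)
{
    if (q->head == q->tail)
        return false;
    op_fn fn = q->ops[q->head++ % QCAP];
    fn();
    return true;
}

/* One runloop pass: drain every queue first, then compute and arm the timer. */
static void runloop_pass(uint64_t (*update_timer_fn)(void))
{
    bool progress;
    do {
        progress = false;
        for (size_t i = 0; i < NQUEUES; i++)
            while (dequeue_and_run(&queues[i]))
                progress = true;
    } while (progress);             /* an operation may have queued more work */

    uint64_t deadline = update_timer_fn();
    if (deadline != 0)
        armed_deadline = deadline;  /* stands in for arming the local timer */
}

/* Demo operations: one plain, one that queues further work on queue 0. */
static void count_op(void) { ops_run++; }
static void requeue_op(void) { ops_run++; enqueue_op(&queues[0], count_op); }
static uint64_t fixed_deadline(void) { return 1000; }
```

Queuing `requeue_op` on the second queue and calling `runloop_pass(fixed_deadline)` runs both operations before the deadline is computed, so no queued operation can run between the timer update and the arming of the timer.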
Add a new memory cleaner type: a "waiting" cleaner, i.e. a cleaner that can optionally wait (for work to be done by peripherals or other CPUs) when releasing memory. Waiting cleaners are invoked with a set of flags that specify whether they are allowed to wait and whether their invocation is triggered by an out-of-memory condition. Add a mem_alloc() function that takes a set of allocation flags and optionally invokes memory cleaners (via a new mem_clean() function) if the allocation fails. Move initialization of memory cleaning data structures to init_kernel_heaps() so that memory cleaning API functions (and thus mem_alloc()) can be called before kernel_runtime_init(). Change the mm_service() function to a new mem_service() function that is meant to be called only from a background task and returns whether memory has been cleaned during its invocation. Change the pagecache code to call mem_clean() when page allocation fails. Change the anonymous page fault handler to call mem_alloc() when allocating an anonymous page. These changes are intended to create a generic infrastructure for robust handling of low-memory and out-of-memory conditions.
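The allocate, clean, retry flow can be illustrated with a toy model. The actual signatures of mem_alloc() and mem_clean() are not shown in this description, so every name below (`sketch_alloc()`, `sketch_clean()`, the `SK_*` and `CLEAN_*` flags, the `budget` counter and the demo cleaner) is hypothetical. Treating a failed allocation as an out-of-memory trigger and using `abort()` for the no-fail case are assumptions made only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* All names are hypothetical; the real kernel signatures are not shown here. */
enum { SK_NOWAIT = 1 << 0, SK_NOFAIL = 1 << 1 };        /* allocation flags */
enum { CLEAN_CAN_WAIT = 1 << 0, CLEAN_OOM = 1 << 1 };   /* cleaner flags */

/* A "waiting" cleaner: returns the number of bytes it released. */
typedef size_t (*cleaner_fn)(size_t wanted, unsigned clean_flags);

#define MAX_CLEANERS 4
static cleaner_fn cleaners[MAX_CLEANERS];
static size_t n_cleaners;
static size_t budget = 4096;        /* toy stand-in for free physical memory */
static size_t cache_bytes = 2048;   /* memory held by a reclaimable cache */

static bool register_cleaner(cleaner_fn fn)
{
    assert(fn != NULL);
    if (n_cleaners == MAX_CLEANERS)
        return false;
    cleaners[n_cleaners++] = fn;
    return true;
}

static void *raw_alloc(size_t size)
{
    if (size > budget)
        return NULL;
    void *p = malloc(size);
    if (p)
        budget -= size;
    return p;
}

/* Ask each cleaner in turn until enough memory has been released. */
static size_t sketch_clean(size_t wanted, unsigned clean_flags)
{
    size_t released = 0;
    for (size_t i = 0; i < n_cleaners && released < wanted; i++)
        released += cleaners[i](wanted - released, clean_flags);
    return released;
}

static void *sketch_alloc(size_t size, unsigned flags)
{
    void *p = raw_alloc(size);
    if (p)
        return p;
    unsigned cf = CLEAN_OOM;        /* assumption: a failed allocation counts as OOM */
    if (!(flags & SK_NOWAIT))
        cf |= CLEAN_CAN_WAIT;       /* cleaners may wait only if the caller can */
    if (sketch_clean(size, cf) > 0)
        p = raw_alloc(size);        /* retry once after cleaning */
    if (!p && (flags & SK_NOFAIL))
        abort();                    /* stand-in for halting on a no-fail allocation */
    return p;
}

/* Demo cleaner: gives cached bytes back to the budget. */
static size_t cache_cleaner(size_t wanted, unsigned clean_flags)
{
    (void)clean_flags;
    size_t n = wanted < cache_bytes ? wanted : cache_bytes;
    cache_bytes -= n;
    budget += n;
    return n;
}
```

With `cache_cleaner` registered, an allocation that exhausts the budget is followed by one that succeeds only because the cleaner released memory. A request larger than what all cleaners can release still returns NULL unless the no-fail flag is set.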
For memory allocations done at boot time that are required for starting the kernel (including klibs, if any) and the user program, replace calls to the allocate() macro with calls to the mem_alloc() function (with the MEM_NOFAIL flag set): this allows removing the relevant assert() statements and therefore decreases code size. Use the additional MEM_NOWAIT flag for allocations done before (or during) kernel_runtime_init(), because at that time the functionality to suspend a context or wait for other CPUs is not in place yet. Make the same change for allocations that are not supposed to fail in user programs (unit and runtime tests).
When creating a new context, instead of asserting that memory allocation is successful, use the mem_alloc() function (with the appropriate flags) and handle any allocation errors gracefully: this avoids kernel crashes caused by the get_process_context() function (which can be called as a result of syscall execution) when the kernel runs out of memory.
If an allocation fails because of insufficient memory, allocate_region() should return INVALID_ADDRESS (which should be handled properly by the calling code) instead of crashing the kernel.
This function is called from a syscall context after instantiating a contextual closure that resumes the context upon file I/O completion. If this function suspends the current context while waiting for pages to be fetched from the cache, the contextual closure can be invoked during context suspension, which corrupts the stack of the suspended context and causes a kernel crash when the context is resumed. Fix this issue by using an asynchronous task (whose context can be safely suspended) to fetch the pages.
The memory cleaner for the virtio balloon device deflates the balloon when receiving a request to release memory. To follow the virtio specifications, instantiate the cleaner only when the "deflate_on_oom" feature is negotiated, and when the cleaner is instantiated, deflate the balloon only when the system is running out of memory. In addition, add support for deflating the balloon synchronously (i.e. suspending the current context to wait for the device to acknowledge the deflate operation): this allows the cleaner to be used to release memory when an allocation fails and waiting is permitted, and makes the cleaner more robust in out-of-memory conditions because it can operate on stack-allocated memory instead of requiring heap allocations. Make the deflate function more robust against allocation failures, and retry a deflate operation via a kernel timer if not all requested pages could be deflated (e.g. due to allocation failures).
When the user program is faulted-in on-demand, relevant pages are put in the page cache; if the program code modifies this memory (e.g. when writing to static variables), those changes are made to the page cache, and even though they are not persisted to disk (i.e. the executable file in the filesystem is not modified), they appear when the executable file is read via standard syscalls, as exemplified by the following code:

```
static int static_var;

void *b1 = malloc(100000);
int fd = open("executable_file", O_RDONLY);
int size = read(fd, b1, 100000);    /* first read of the executable */
static_var++;                       /* modify program memory backed by the file */
void *b2 = malloc(100000);
lseek(fd, 0, SEEK_SET);             /* rewind to the start of the file */
size = read(fd, b2, size);          /* second read of the same bytes */
printf("size %d memcmp %d\n", size, memcmp(b1, b2, size));
```

When the above code is run, the memcmp() function returns a non-zero value, which indicates that the assignment to the static variable is reflected in the in-cache memory backed by the executable file.

Fix the above issue by adding a `private` boolean parameter to the pagecache_get_page() and pagecache_get_page_if_filled() functions: when the argument value is true, "steal" the page from the cache so that it becomes "owned" by the user program; this prevents writes to page memory by the program from being reflected in any file-backed pages, while avoiding the overhead of the copy-on-write mechanism.
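The "steal" behavior can be modeled with a toy cache. This is a sketch under assumptions, not the pagecache implementation: `struct toy_cache`, `fill()`, `toy_get_page()`, the 16-byte page size and the in-memory `backing_file` are all hypothetical, chosen only to show that a stolen page is detached from the cache and that a later lookup refills from the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 16   /* tiny pages keep the example readable */
#define NPAGES  4

/* Hypothetical in-memory "file" that backs the toy cache. */
static const unsigned char backing_file[NPAGES * PAGE_SZ] = "page0 contents..";

struct toy_cache {
    unsigned char *slot[NPAGES];    /* NULL: page is not cached */
};

/* Fill a slot from the backing file if it is not cached yet. */
static unsigned char *fill(struct toy_cache *c, size_t idx)
{
    if (!c->slot[idx]) {
        c->slot[idx] = malloc(PAGE_SZ);
        if (!c->slot[idx])
            return NULL;
        memcpy(c->slot[idx], backing_file + idx * PAGE_SZ, PAGE_SZ);
    }
    return c->slot[idx];
}

/*
 * Return the page at idx. With private == true the page is removed from
 * the cache and handed to the caller, who owns it and must free() it.
 */
static unsigned char *toy_get_page(struct toy_cache *c, size_t idx, bool private)
{
    assert(idx < NPAGES);
    unsigned char *p = fill(c, idx);
    if (p && private)
        c->slot[idx] = NULL;        /* steal: the cache forgets this page */
    return p;
}
```

A caller that takes page 0 with `private` set and then writes to it leaves the cache untouched: the next non-private lookup of page 0 allocates a fresh page and copies the original bytes from `backing_file`, with no copy-on-write fault involved.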
This makes on-demand retrieval of file-backed pages more robust in low-memory conditions, e.g. it allows purging existing pages from the cache in order to be able to create a new file-backed page to be accessed by the user program (similarly to what is done when creating a new page for anonymous mappings).
The unmap_and_free_phys() function uses a range_handler to deallocate physical memory pages as they are encountered during page table traversal, and then calls page_invalidate_sync() to perform a TLB shootdown. Between physical memory deallocation and PTE invalidation, another CPU can potentially re-allocate the page just freed and use it for other purposes while it is still referenced in the TLB as associated with the mapping being torn down; this can cause memory corruption issues. The same issue affects the pagecache_node_unmap_pages() function.

A separate but similar issue affects the pagecache code that scans shared file-backed mappings: when it finds a dirty page, it clears the dirty flag, and then invokes a TLB shootdown; between the clearing of the dirty flag and the PTE invalidation, another CPU can potentially write to the mapped page and fail to re-set the dirty flag in the PTE (because this flag may already be set in the local TLB), causing the write to fail to be persisted on disk.

Fix the first issue by ensuring that a physical memory page is deallocated only after TLB shootdown is performed. Try to do the shootdown asynchronously (i.e. avoid waiting for the other CPUs to do the invalidation) if possible, and resort to a synchronous shootdown on memory allocation failure (the unmap_and_free_phys_sync() and pagecache_node_unmap_pages_sync() functions operate on stack memory only). Amend the TLB shootdown code to implement a CPU "rendezvous" when a synchronous shootdown is requested: the requesting CPU sends an IPI, waits for the other CPUs to join the rendezvous, executes the completion closure associated with the flush entry, then releases the other CPUs, which can then leave the rendezvous.

Fix the second issue by invoking a synchronous TLB shootdown and clearing the dirty flag during the CPU rendezvous.
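The "free only after shootdown" rule for the first issue can be sketched with a stack-allocated batch and an event log that records the order of operations. Everything here is hypothetical (`struct free_batch`, `flush_batch()`, `toy_unmap_and_free()`, the event log, and modeling a PTE as a plain integer holding a physical address). The shootdown and the free are logging stubs, so the sketch shows only the ordering, not real TLB or IPI handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 8
#define LOG_MAX   32

/* Event log used only to make the ordering observable. */
enum event { EV_PTE_CLEARED, EV_SHOOTDOWN, EV_PAGE_FREED };
static enum event log_buf[LOG_MAX];
static size_t log_len;

static void log_event(enum event e)
{
    assert(log_len < LOG_MAX);
    log_buf[log_len++] = e;
}

/* Pages whose PTEs are already cleared but which may still be in a TLB. */
struct free_batch {
    uint64_t phys[BATCH_MAX];
    size_t count;
};

/* Stubs: a real kernel would send IPIs and return pages to an allocator. */
static void toy_tlb_shootdown(void) { log_event(EV_SHOOTDOWN); }
static void toy_free_phys(uint64_t phys) { (void)phys; log_event(EV_PAGE_FREED); }

/* Invalidate first, then release every page collected so far. */
static void flush_batch(struct free_batch *b)
{
    toy_tlb_shootdown();
    for (size_t i = 0; i < b->count; i++)
        toy_free_phys(b->phys[i]);
    b->count = 0;
}

/* Unmap n entries; a zero entry means "not mapped". Uses stack memory only. */
static void toy_unmap_and_free(uint64_t *ptes, size_t n)
{
    struct free_batch b = { .count = 0 };
    for (size_t i = 0; i < n; i++) {
        if (ptes[i] == 0)
            continue;
        b.phys[b.count++] = ptes[i];    /* remember the page, do not free yet */
        ptes[i] = 0;
        log_event(EV_PTE_CLEARED);
        if (b.count == BATCH_MAX)
            flush_batch(&b);            /* batch full: shootdown, then free */
    }
    if (b.count > 0)
        flush_batch(&b);
}
```

Running `toy_unmap_and_free()` over two mapped entries logs two PTE clears, one shootdown, and only then two page frees, so no page becomes reusable while a stale translation could still exist.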
Opportunistically amend the pagecache node data structure to handle both shared and private mappings (in preparation for a future commit which will implement scanning of private mappings), using an embedded struct rangemap instead of a pointer to a separately allocated struct, and adding proper locking.
Page cache pages that are mapped by the user process and faulted-in on-demand are never evicted from the cache until they are unmapped; this includes all the read-only sections of the user program ELF (which are never unmapped). To allow releasing memory from the above pages during low-memory or out-of-memory conditions, enhance the memory cleaner implementation so that it scans both shared and private mappings, and evicts "old" pages, i.e. pages that have not been accessed recently (in OOM conditions, be more aggressive by evicting even recently accessed pages). To do the eviction safely on SMP machines, invoke a synchronous (rendezvous-based) TLB shootdown. This optimizes memory utilization by the page cache, e.g. it allows more memory to be released to the host OS via a virtio balloon device. This change can cause a SIGBUS signal to be delivered to the user process when a page cannot be faulted-in during the "ruby_alloc" end-to-end test; the Ruby program handles this signal by dumping process state information and then aborting (i.e. raising a SIGABRT signal); therefore, to make this test pass, add "6" to the list of expected exit codes for this test.
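One way to picture the "old page" policy is a scan over per-page accessed flags. The sketch below is an assumption-laden illustration, not the cleaner added by this change: `struct toy_mapping` and `toy_evict_scan()` are hypothetical, and clearing the accessed flag on pages that are skipped is a clock-style choice made here so that a page untouched between two scans looks old on the second one. The TLB shootdown that the real eviction needs is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; a real kernel would read these bits from PTEs. */
struct toy_mapping {
    bool present;    /* page is currently mapped and resident */
    bool accessed;   /* page was touched since the previous scan */
};

/*
 * Evict up to `wanted` pages and return how many were evicted.
 * Normal mode evicts only pages that were not accessed recently.
 * OOM mode evicts recently accessed pages as well.
 */
static size_t toy_evict_scan(struct toy_mapping *pages, size_t n,
                             size_t wanted, bool oom)
{
    assert(pages != NULL);
    size_t evicted = 0;
    for (size_t i = 0; i < n && evicted < wanted; i++) {
        if (!pages[i].present)
            continue;
        if (pages[i].accessed && !oom) {
            /* Recently used: keep it, but make it a candidate next time. */
            pages[i].accessed = false;
            continue;
        }
        pages[i].present = false;   /* a later access would fault it back in */
        pages[i].accessed = false;
        evicted++;
    }
    return evicted;
}
```

A non-OOM scan over a mix of touched and untouched pages evicts only the untouched ones, while the same scan with `oom` set evicts every present page up to the requested count.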
e4ae114 to 1f18e49
This change set:
In addition, there are fixes to miscellaneous issues that have been identified when testing the kernel under different low-memory and out-of-memory conditions.