FWIW, I looked at "read_ahead_kb" values on my Fedora 40 notebook and they are all set to 128KiB. I'm not so sure if they really should be that small ...
It depends on the use case. For our Hadoop servers, we set it to 4MB, as they prioritize throughput over latency. However, for our Kubernetes servers, we keep it at 128KB since those services are more latency-sensitive, and increasing it could lead to more frequent latency spikes.
Thanks for sharing.
or if large folio readahead code should just be able to exceed it.
"mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if we have no history of readahead being successful".
So not about exceeding the configured limit, but exceeding the "readahead history".
So I consider VM_HUGEPAGE the sign here to "ignore readahead history" and not to "violate the config".
MADV_HUGEPAGE is definitely a new addition to readahead, and its behavior isn't yet defined in the documentation. All we need to do is clarify its behavior there. The documentation isn't set in stone; we can update it as long as it doesn't disrupt existing applications.
If Willy thinks this is the way to go, then we should document that MADV_HUGEPAGE may ignore the parameter, agreed.
I'll submit an additional patch to update the documentation for MADV_HUGEPAGE.
I still don't understand your one comment:
"It's worth noting that if read_ahead_kb is set to a larger value that isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail to map to hugepages."
Do you mean that MADV_HUGEPAGE+read_ahead_kb<=4M will give you 2M pages, but MADV_HUGEPAGE+read_ahead_kb>4M won't? Or is this the case without MADV_HUGEPAGE?
Typically, users set read_ahead_kb to aligned sizes, such as 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, or 8MB. With this patch, MADV_HUGEPAGE functions well for all these settings. However, if read_ahead_kb is set to a non-hugepage-aligned size (e.g., 4MB + 128KB), MADV_HUGEPAGE won’t work. This is because the initial readahead size for MADV_HUGEPAGE is set to 4MB, as established in commit 4687fdbb805a:
	ra->size = HPAGE_PMD_NR;
	if (!(vmf->vma->vm_flags & VM_RAND_READ))
		ra->size *= 2;
However, as Willy noted, non-aligned settings are quite stupid, so we should disregard them.
Right. What I've been wondering, to make this code easier to understand, is whether there should be some kind of ra->size_fixed=true parameter that tells the readahead code to simply not mess with ra->size until something changes. (below)
[...]
A quick tip for you: the readahead size already exceeds read_ahead_kb even without MADV_HUGEPAGE. You might want to spend some time tracing that behavior.
Care to save me some time and point me at what you mean?
I reached this conclusion by tracing ra->size in each page_cache_ra_order() call, but I’m not fully equipped to provide all the details ;)
I've been staring at the readahead code for 30 minutes and I am still lost. Either I'm too stupid for the code or the code is too complicated.
If we'd have something like
	ra->start += ra->size;
	/*
	 * If someone like VM_HUGEPAGE fixed the size, just don't mess
	 * with it.
	 */
	if (!ra->size_fixed)
		ra->size = get_next_ra_size(ra, max_pages);
	ra->async_size = ra->size;
It would be a lot clearer, at least to me, and we'd likely be able to get rid of the "unaligned readahead" oddity.
If we'd grep for who sets "size_fixed", we could even figure out how all of this belongs together.