On Tue, Nov 12, 2024 at 2:31 AM David Hildenbrand david@redhat.com wrote:
On 11.11.24 17:08, Yafang Shao wrote:
On Mon, Nov 11, 2024 at 11:05 PM David Hildenbrand david@redhat.com wrote:
On 11.11.24 15:28, Yafang Shao wrote:
On Mon, Nov 11, 2024 at 6:33 PM David Hildenbrand david@redhat.com wrote:
On 08.11.24 15:17, Yafang Shao wrote:
When testing large folio support with XFS on our servers, we observed that only a few large folios are mapped when reading large files via mmap. After a thorough analysis, I identified it was caused by the `/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this parameter is set to 128KB. After I tune it to 2MB, the large folio can work as expected. However, I believe the large folio behavior should not be dependent on the value of read_ahead_kb. It would be more robust if the kernel can automatically adopt to it.
Now I am extremely confused.
Documentation/ABI/stable/sysfs-block:
"[RW] Maximum number of kilobytes to read-ahead for filesystems on this block device."
So, with your patch, will we also be changing the readahead size to exceed that, or simply allocate larger folios and not exceeding the readahead size (e.g., leaving them partially non-filled)?
Exceeding the readahead size for the MADV_HUGEPAGE case is straightforward; this is what the current patch accomplishes.
Okay, so this only applies with MADV_HUGEPAGE I assume. Likely we should also make that clearer in the subject.
mm/readahead: allow exceeding configured read_ahead_kb with MADV_HUGEPAGE
If this is really a fix, especially one that deserves CC-stable, I cannot tell. Willy is the obvious expert :)
If you're also changing the readahead behavior to exceed the configuration parameter it would sound to me like "I am pushing the brake pedal and my care brakes; fix the brakes to adopt whether to brake automatically" :)
Likely I am missing something here, and how the read_ahead_kb parameter is used after your patch.
The read_ahead_kb parameter continues to function for non-MADV_HUGEPAGE scenarios, whereas special handling is required for the MADV_HUGEPAGE case. It appears that we ought to update the Documentation/ABI/stable/sysfs-block to reflect the changes related to large folios, correct?
Yes, how it related to MADV_HUGEPAGE. I would assume that it would get ignored, but ...
... staring at get_next_ra_size(), it's not quite ignored, because we still us it as a baseline to detect how much we want to bump up the limit when the requested size is small? (*2 vs *4 etc) :/
So the semantics are really starting to get weird, unless I am missing something important.
[...]
Perhaps a more straightforward solution would be to implement it directly at the callsite, as demonstrated below?
Likely something into this direction might be better, but Willy is the expert that code.
diff --git a/mm/readahead.c b/mm/readahead.c index 3dc6c7a128dd..187efae95b02 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -642,7 +642,11 @@ void page_cache_async_ra(struct readahead_control *ractl, 1UL << order); if (index == expected) { ra->start += ra->size;
ra->size = get_next_ra_size(ra, max_pages);
/*
* Allow the actual size to exceed the readahead window for a
* large folio.
"a large folio" -> "with MADV_HUGEPAGE" ? Or can this be hit on different paths that are not covered in the patch description?
This branch may also be triggered by other large folios that are not necessarily order-9. Therefore, I’ve referred to it as a 'large folio' rather than associating it specifically with MADV_HUGEPAGE. If we were to handle only the MADV_HUGEPAGE case, we would proceed as outlined in the initial RFC patch[0]. However, following Willy's recommendation, I implemented it this way, as he likely has a deeper understanding of the intended behavior.
Sorry, but this code is getting quite confusing, especially with such misleading "large folio" comments.
Even without MADV_HUGEPAGE we will be allocating large folios, as emphasized by Willy [1]. So the only thing MADV_HUGEPAGE controls is *which* large folios we allocate. .. as Willy says [2]: "We were only intending to breach the 'max' for the MADV_HUGE case, not for all cases."
I have no idea how *anybody* should derive from the code here that we treat MADV_HUGEPAGE in a special way.
Simply completely confusing.
My interpretation of "I don't know if we should try to defend a stupid sysadmin against the consequences of their misconfiguration like this" means" would be "drop this patch and don't change anything".
Without this change, large folios won’t function as expected. Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to 2MB, 4MB, or more. However, many applications run without MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for them.
No changes to API, no confusing code.
New features like large folios can often create confusion with existing rules or APIs, correct?
Maybe pr_info_once() when someone uses MADV_HUGEPAGE with such backends to tell the sysadmin that something stupid is happening ...
It's not a flawed setup; it's just that this new feature doesn’t work well with the existing settings, and updating those settings to accommodate it isn't always feasible.