On Wed, Feb 23, 2022 at 02:54:43PM +0100, Vlastimil Babka wrote:
We have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read") and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache"). The latter commit is part of the folio rework and thus likely not stable material, so it would be nice to have a small fix for e.g. the 5.15 LTS. Preferably from someone who understands the XArray :)
[...]
I've hacked some printks on top of 5.16 (attached debug.patch), which gives this output:
i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
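(To make the fields legible: a sketch of the kind of printk that produces this output, placed in filemap_read()'s per-page copy loop; the mapping of uoff/bytes onto the loop's locals is an assumption here -- see the attached debug.patch for the real thing:)

        /* Hypothetical debug printk in mm/filemap.c:filemap_read(), right
         * after "bytes" has been computed for pvec.pages[i]. */
        printk("i=%u page=%px page_offset=%llu uoff=%zu bytes=%zu\n",
               i, page, (unsigned long long)page_offset(page), offset, bytes);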
It seems filemap_get_read_batch() should be returning pages ffffea0004340000 and ffffea0004470000 consecutively in the pvec, but it returns the first one 8 times, so that page is read twice and the rest of the batch is just skipped over, as it lies beyond the requested read size.
I suspect these lines:

        xas.xa_index = head->index + thp_nr_pages(head) - 1;
        xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
Commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch earlier), so some self-contained fix should be possible for prior kernels? But I don't understand the XArray well enough.
I figured it out!
In v5.15 (indeed, in everything before commit 6b24ca4a1a8d), an order-9 page is stored in 512 consecutive slots. The XArray stores 64 entries per level. So what happens is that we start looking at index 0, walk down to the bottom of the tree, and find the THP at index 0. Then we run:
        xas.xa_index = head->index + thp_nr_pages(head) - 1;
        xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
So we've advanced xas.xa_index to 511, but only advanced xas.xa_offset to 63, while xas.xa_node still points at the bottom-level node covering indices 0-63. Then we call xas_next(), which calls __xas_next(), which moves us along to array index 64 while we think we're looking at index 512.
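To make the arithmetic concrete, here's a tiny userspace sketch (not kernel code; the constants mirror the XArray's XA_CHUNK_SHIFT/XA_CHUNK_MASK and the variable names are mine) of what those two assignments compute for our order-9 THP at index 0:

        #include <stdio.h>

        #define XA_CHUNK_SHIFT  6
        #define XA_CHUNK_MASK   ((1UL << XA_CHUNK_SHIFT) - 1)  /* 64 slots per node */

        int main(void)
        {
                unsigned long head_index = 0;   /* the THP's head is at index 0 */
                unsigned long nr_pages = 512;   /* order-9: occupies 512 slots */
                unsigned long xa_shift = 0;     /* the cursor sits at the bottom level */

                unsigned long xa_index = head_index + nr_pages - 1;     /* 511 */
                unsigned long xa_offset = (xa_index >> xa_shift) & XA_CHUNK_MASK;

                /* Prints "xa_index=511 xa_offset=63".  Offset 63 would be correct
                 * inside the node covering indices 448-511, but xa_node still
                 * points at the node covering 0-63, where slot 63 means index 63. */
                printf("xa_index=%lu xa_offset=%lu\n", xa_index, xa_offset);
                return 0;
        }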
We could make __xas_next() more resistant to this kind of abuse (by extracting the correct offset in the parent node from xa_index), but as you say, we're looking for a small fix for LTS. I suggest this will probably do the right thing:
+++ b/mm/filemap.c
@@ -2354,8 +2354,7 @@ static void filemap_get_read_batch(struct address_space *mapping,
                         break;
                 if (PageReadahead(head))
                         break;
-                xas.xa_index = head->index + thp_nr_pages(head) - 1;
-                xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
+                xas_set(&xas, head->index + thp_nr_pages(head) - 1);
                 continue;
 put_page:
                 put_page(head);
but I'll start trying the reproducer now.
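FWIW, my understanding of why xas_set() should be safe here: unlike the two open-coded assignments, it also resets the cursor's node state (xa_node goes back to XAS_RESTART), so the next xas_next() re-walks the tree from the root for the new index and derives a correct offset at every level, instead of reusing a stale node/offset pair. That's one extra tree walk per THP, which seems a fair price for a minimal stable fix.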