On Sun, Aug 18, 2019 at 2:31 PM Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Aug 15, 2019 at 01:09:11AM +1000, Monthero Ronald wrote:
The patch checks for this condition of NULL pointer for the buffer_head returned from page_buffers() and also a check placed within the list traversal loop for next buffer_head structs.
crash scenario: The buffer_head returned from page_buffers() is not checked in block_invalidatepage_range function. The struct buffer_head* pointer returned by page_buffers(page) was 0x0, although this page had its private flag PG_private bit set and was expected to have buffer_head structs attached.The NULL pointer buffer_head was dereferenced in block_invalidatepage_range function at bh->b_size, where bh returned by page_buffers(page) was 0x0.
The stack frames were truncate_inode_page() => do_invalidatepage_range() => xfs_vm_invalidatepage() => [exception RIP: block_invalidatepage_range+132]
The inode for truncate in this case was valid and had proper inode.i_state = 0x20 - FREEING and had a valid mapped address space to xfs. And the struct page in context of block_invalidatepage_range() had its page flag PG_private set but the page.private was 0x0. So page_buffers(page) returned 0x0 and hence the crash. This patch performs NULL pointer check for returned buffer_head. Applies to 3.16 and later kernels.
... and adds BUG_ON() for that. The only real difference from an oops is that it's a bit easier to recognize. Which may or may not be a good debugging strategy, but what's the point of having it in -stable? Or anywhere other than the build on the boxen you are testing on...
It doesn't fix the underlying bug. It doesn't tell where the problem is. It's definitely *not* a way to fix any bugs. And while we are at it, the stuff in -stable ought to be backports from mainline.
Can you reproduce your crashes on mainline?
No I don't have a reproducer. This was an abrupt crash, where the page has its private flag set but the buffer_head returned by page_buffers( ) was 0x0. Having the BUG_ON check for the returned buffer_head from page_buffers( ) would help to to see the BUG line readily than to analyze the vmcore and check the fault condition why kernel panicked. Two cents here, but open to thoughts. ( This crash to reproduce is not trivial, nevertheless will attempt to check if there are any mm tests or page mapping tests that can induce this condition )