Hi Trond and Greg:
LTS 4.19 reported null-ptr-deref BUG as follows:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 Call Trace: nfs_inode_add_request+0x1cc/0x5b8 nfs_setup_write_request+0x1fa/0x1fc nfs_writepage_setup+0x2d/0x7d nfs_updatepage+0x8b8/0x936 nfs_write_end+0x61d/0xd45 generic_perform_write+0x19a/0x3f0 nfs_file_write+0x2cc/0x6e5 new_sync_write+0x442/0x560 __vfs_write+0xda/0xef vfs_write+0x176/0x48b ksys_write+0x10a/0x1e9 __se_sys_write+0x24/0x29 __x64_sys_write+0x79/0x93 do_syscall_64+0x16d/0x4bb entry_SYSCALL_64_after_hwframe+0x5c/0xc1
The reason is: generic_error_remove_page set page->mapping to NULL when nfs server have a fatal error:
nfs_updatepage nfs_writepage_setup nfs_setup_write_request nfs_try_to_update_request // return NULL nfs_wb_page // return 0 nfs_writepage_locked // return 0 nfs_do_writepage // return 0 nfs_page_async_flush // return 0 nfs_error_is_fatal_on_server generic_error_remove_page truncate_inode_page delete_from_page_cache __delete_from_page_cache page_cache_tree_delete page->mapping = NULL // this is point nfs_create_request req->wb_page = page // the page is freed nfs_inode_add_request mapping = page_file_mapping(req->wb_page) return page->mapping spin_lock(&mapping->private_lock) // mapping is NULL
It is reasonable by reverting the patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors" to fix this bug?
This patch is one patch of patchset [Fix up soft mounts for NFSv4.x](https://lore.kernel.org/all/20190407175912.23528-1-trond.myklebust@hammerspa...), the patchset replace custom error reporting mechanism. it seams that we should merge all the patchset to LTS 4.19, or all patchs should not be merged. And the "Fixes:" label is not correct, this patch is a refactoring patch, not for fixing bugs.
On Mon, Oct 30, 2023 at 04:39:11PM +0800, ChenXiaoSong wrote:
Hi Trond and Greg:
LTS 4.19 reported null-ptr-deref BUG as follows:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 Call Trace: nfs_inode_add_request+0x1cc/0x5b8 nfs_setup_write_request+0x1fa/0x1fc nfs_writepage_setup+0x2d/0x7d nfs_updatepage+0x8b8/0x936 nfs_write_end+0x61d/0xd45 generic_perform_write+0x19a/0x3f0 nfs_file_write+0x2cc/0x6e5 new_sync_write+0x442/0x560 __vfs_write+0xda/0xef vfs_write+0x176/0x48b ksys_write+0x10a/0x1e9 __se_sys_write+0x24/0x29 __x64_sys_write+0x79/0x93 do_syscall_64+0x16d/0x4bb entry_SYSCALL_64_after_hwframe+0x5c/0xc1
The reason is: generic_error_remove_page set page->mapping to NULL when nfs server have a fatal error:
nfs_updatepage nfs_writepage_setup nfs_setup_write_request nfs_try_to_update_request // return NULL nfs_wb_page // return 0 nfs_writepage_locked // return 0 nfs_do_writepage // return 0 nfs_page_async_flush // return 0 nfs_error_is_fatal_on_server generic_error_remove_page truncate_inode_page delete_from_page_cache __delete_from_page_cache page_cache_tree_delete page->mapping = NULL // this is point nfs_create_request req->wb_page = page // the page is freed nfs_inode_add_request mapping = page_file_mapping(req->wb_page) return page->mapping spin_lock(&mapping->private_lock) // mapping is NULL
It is reasonable by reverting the patch "89047634f5ce NFS: Don't interrupt file writeout due to fatal errors" to fix this bug?
Try it and see, but note, that came from the 4.19.99 release which was released years ago, are you sure you are using the most recent 4.19.y release?
This patch is one patch of patchset [Fix up soft mounts for NFSv4.x](https://lore.kernel.org/all/20190407175912.23528-1-trond.myklebust@hammerspa...), the patchset replace custom error reporting mechanism. it seams that we should merge all the patchset to LTS 4.19, or all patchs should not be merged. And the "Fixes:" label is not correct, this patch is a refactoring patch, not for fixing bugs.
If we missed some patches, that should be added on top of the current tree, please let us know the git commit ids of them after you have tested them that they work properly, and we will gladly apply them.
thanks,
greg k-h
On 2023/10/30 16:43, Greg KH wrote:
Try it and see, but note, that came from the 4.19.99 release which was released years ago, are you sure you are using the most recent 4.19.y release?
I use the most recent release 1b540579cf66 Linux 4.19.296.
If we missed some patches, that should be added on top of the current tree, please let us know the git commit ids of them after you have tested them that they work properly, and we will gladly apply them.
Merging the entire patchset may not be the best option. Perhaps a better choice is to revert this patch. And I would like to see Trond's suggestion.
On Mon, Oct 30, 2023 at 04:54:02PM +0800, ChenXiaoSong wrote:
On 2023/10/30 16:43, Greg KH wrote:
Try it and see, but note, that came from the 4.19.99 release which was released years ago, are you sure you are using the most recent 4.19.y release?
I use the most recent release 1b540579cf66 Linux 4.19.296.
If we missed some patches, that should be added on top of the current tree, please let us know the git commit ids of them after you have tested them that they work properly, and we will gladly apply them.
Merging the entire patchset may not be the best option. Perhaps a better choice is to revert this patch. And I would like to see Trond's suggestion.
If you just revert that one patch, is the issue resolved? And what about other stable releases that have that change in it?
If this does need to be reverted, please submit a patch that does so and we can review it that way.
thanks,
greg k-h
On 2023/10/30 16:58, Greg KH wrote:
If you just revert that one patch, is the issue resolved? And what about other stable releases that have that change in it?
If this does need to be reverted, please submit a patch that does so and we can review it that way.
thanks,
greg k-h
In my opinion, this patch is not for fixing a bug, but is part of a refactoring patchset. The 'Fixes:' tag should not be added.
On Mon, Oct 30, 2023 at 05:04:35PM +0800, ChenXiaoSong wrote:
On 2023/10/30 16:58, Greg KH wrote:
If you just revert that one patch, is the issue resolved? And what about other stable releases that have that change in it?
If this does need to be reverted, please submit a patch that does so and we can review it that way.
thanks,
greg k-h
In my opinion, this patch is not for fixing a bug, but is part of a refactoring patchset. The 'Fixes:' tag should not be added.
Nothing we can do about that now, right? And to try to ask developers about a change they made in 2019 is a bit rough, don't you think? If you think the change is incorrect, please submit a patch to resolve it after you have tested that it works properly.
thanks,
greg k-h
On Mon, 2023-10-30 at 17:04 +0800, ChenXiaoSong wrote:
[You don't often get email from chenxiaosongemail@foxmail.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification%C2%A0]
On 2023/10/30 16:58, Greg KH wrote:
If you just revert that one patch, is the issue resolved? And what about other stable releases that have that change in it?
If this does need to be reverted, please submit a patch that does so and we can review it that way.
thanks,
greg k-h
In my opinion, this patch is not for fixing a bug, but is part of a refactoring patchset. The 'Fixes:' tag should not be added.
A refactoring is by definition a change that does not affect code behaviour. It is obvious that this was never intended to be such a patch.
The reason that the bug is occurring in 4.19.x, and not in the latest kernels, is because the former is missing another bugfix (one which actually is missing a "Fixes:" tag).
Can you therefore please check if applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks") fixes the issue.
Note that the latter patch is needed in any case in order to fix a read deadlock (as indicated on the label).
Thanks, Trond
On 2023/10/30 22:56, Trond Myklebust wrote:
A refactoring is by definition a change that does not affect code behaviour. It is obvious that this was never intended to be such a patch.
The reason that the bug is occurring in 4.19.x, and not in the latest kernels, is because the former is missing another bugfix (one which actually is missing a "Fixes:" tag).
Can you therefore please check if applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks") fixes the issue.
Note that the latter patch is needed in any case in order to fix a read deadlock (as indicated on the label).
Thanks, Trond
After applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks"), I encountered an issue of infinite loop:
write ... nfs_updatepage nfs_writepage_setup nfs_setup_write_request nfs_try_to_update_request nfs_wb_page if (clear_page_dirty_for_io(page)) // true nfs_writepage_locked // return 0 nfs_do_writepage // return 0 nfs_page_async_flush // return 0 nfs_error_is_fatal_on_server nfs_write_error_remove_page SetPageError // instead of generic_error_remove_page // loop begin if (clear_page_dirty_for_io(page)) // false if (!PagePrivate(page)) // false ret = nfs_commit_inode = 0 // loop again, never quit
before applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks"), generic_error_remove_page() will clear PG_private, and infinite loop will never happen:
generic_error_remove_page truncate_inode_page truncate_cleanup_page do_invalidatepage nfs_invalidate_page nfs_wb_page_cancel nfs_inode_remove_request ClearPagePrivate(head->wb_page)
If applying this patch, are other patches required? And I cannot reproducethe read deadlock bug that the patch want to fix, are there specific conditions required to reproduce this read deadlock bug?
On 2023/10/30 22:56, Trond Myklebust wrote:
A refactoring is by definition a change that does not affect code behaviour. It is obvious that this was never intended to be such a patch.
The reason that the bug is occurring in 4.19.x, and not in the latest kernels, is because the former is missing another bugfix (one which actually is missing a "Fixes:" tag).
Can you therefore please check if applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks") fixes the issue.
Note that the latter patch is needed in any case in order to fix a read deadlock (as indicated on the label).
Thanks, Trond
Sorry, the previous email had formatting issues. I'll resend it.
After applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks"), I encountered an issue of infinite loop:
write ... nfs_updatepage nfs_writepage_setup nfs_setup_write_request nfs_try_to_update_request nfs_wb_page if (clear_page_dirty_for_io(page)) // true nfs_writepage_locked // return 0 nfs_do_writepage // return 0 nfs_page_async_flush // return 0 nfs_error_is_fatal_on_server nfs_write_error_remove_page SetPageError // instead of generic_error_remove_page // loop begin if (clear_page_dirty_for_io(page)) // false if (!PagePrivate(page)) // false ret = nfs_commit_inode = 0 // loop again, never quit
before applying commit 22876f540bdf ("NFS: Don't call generic_error_remove_page() while holding locks"), generic_error_remove_page() will clear PG_private, and infinite loop will never happen:
generic_error_remove_page truncate_inode_page truncate_cleanup_page do_invalidatepage nfs_invalidate_page nfs_wb_page_cancel nfs_inode_remove_request ClearPagePrivate(head->wb_page)
If applying this patch, are other patches required? And I cannot reproducethe read deadlock bug that the patch want to fix, are there specific conditions required to reproduce this read deadlock bug?
linux-stable-mirror@lists.linaro.org