On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara jack@suse.cz wrote:
On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
We recently got an Oops report:
BUG: unable to handle kernel NULL pointer dereference at (null) IP: jbd2__journal_start+0x38/0x1a2 [...] Call Trace: ext4_page_mkwrite+0x307/0x52b _ext4_get_block+0xd8/0xd8 do_page_mkwrite+0x6e/0xd8 handle_mm_fault+0x686/0xf9b mntput_no_expire+0x1f/0x21e __do_page_fault+0x21d/0x465 dput+0x4a/0x2f7 page_fault+0x22/0x30 copy_user_generic_string+0x2c/0x40 copy_page_to_iter+0x8c/0x2b8 generic_file_read_iter+0x26e/0x845 timerqueue_del+0x31/0x90 ceph_read_iter+0x697/0xa33 [ceph] hrtimer_cancel+0x23/0x41 futex_wait+0x1c8/0x24d get_futex_key+0x32c/0x39a __vfs_read+0xe0/0x130 vfs_read.part.1+0x6c/0x123 handle_mm_fault+0x831/0xf9b __fget+0x7e/0xbf SyS_read+0x4d/0xb5
ceph_read_iter() uses current->journal_info to pass context info to ceph_readpages(). Because ceph_readpages() needs to know if its caller has already gotten capability of using page cache (distinguish read from readahead/fadvise). ceph_read_iter() set current->journal_info, then calls generic_file_read_iter().
In above Oops, page fault happened when copying data to userspace. Page fault handler called ext4_page_mkwrite(). Ext4 code read current->journal_info and assumed it is journal handle.
I checked other filesystems, btrfs probably suffers similar problem for its readpage. (page fault happens when write() copies data from userspace memory and the memory is mapped to a file in btrfs. verify_parent_transid() can be called during readpage)
Cc: stable@vger.kernel.org Signed-off-by: "Yan, Zheng" zyan@redhat.com
I agree with the analysis but the patch is too ugly too live. Ceph just should not be abusing current->journal_info for passing information between two random functions or when it does a hackery like this, it should just make sure the pieces hold together. Poluting generic code to accommodate this hack in Ceph is not acceptable. Also bear in mind there are likely other code paths (e.g. memory reclaim) which could recurse into another filesystem confusing it with non-NULL current->journal_info in the same way.
But ...
some filesystem set journal_info in its write_begin(), then clear it in write_end(). If buffer for write is mapped to another filesystem, current->journal can leak to the later filesystem's page_readpage(). The later filesystem may read current->journal and treat it as its own journal handle. Besides, most filesystem's vm fault handle is filemap_fault(), filemap also may tigger memory reclaim.
In this particular case I'm not sure why does ceph pass 'filp' into readpage() / readpages() handler when it already gets that pointer as part of arguments...
It actually a flag which tells ceph_readpages() if its caller is ceph_read_iter or readahead/fadvise/madvise. because when there are multiple clients read/write a file a the same time, page cache should be disabled.
Regards Yan, Zheng
Honza
diff --git a/mm/memory.c b/mm/memory.c index a728bed16c20..db2a50233c49 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) { int ret;
void *old_journal_info; __set_current_state(TASK_RUNNING);
@@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, if (flags & FAULT_FLAG_USER) mem_cgroup_oom_enable();
/*
* Fault can happen when filesystem A's read_iter()/write_iter()
* copies data to/from userspace. Filesystem A may have set
* current->journal_info. If the userspace memory is MAP_SHARED
* mapped to a file in filesystem B, we later may call filesystem
* B's vm operation. Filesystem B may also want to read/set
* current->journal_info.
*/
old_journal_info = current->journal_info;
current->journal_info = NULL;
if (unlikely(is_vm_hugetlb_page(vma))) ret = hugetlb_fault(vma->vm_mm, vma, address, flags); else ret = __handle_mm_fault(vma, address, flags);
current->journal_info = old_journal_info;
if (flags & FAULT_FLAG_USER) { mem_cgroup_oom_disable(); /*
-- 2.13.6
-- Jan Kara jack@suse.com SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html