On Wed, Mar 16, 2022 at 07:49:38PM +0530, Charan Teja Kalla wrote:
Thanks Andrew and Minchan.
On 3/16/2022 7:13 AM, Minchan Kim wrote:
On Tue, Mar 15, 2022 at 04:48:07PM -0700, Andrew Morton wrote:
On Tue, 15 Mar 2022 15:58:28 -0700 Minchan Kim minchan@kernel.org wrote:
On Fri, Mar 11, 2022 at 08:59:06PM +0530, Charan Teja Kalla wrote:
The process_madvise() system call is expected to skip holes in the VMAs passed through the 'struct iovec' vector list. But do_madvise(), which process_madvise() calls for each VMA, returns ENOMEM for unmapped holes even though the VMA has been processed. Thus process_madvise() should treat ENOMEM as expected, consider the VMA passed to it as processed, and continue processing the other VMAs in the vector list. If -ENOMEM is returned to the user even though the VMA was processed, the user cannot figure out where to start the next madvise.

Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Cc: stable@vger.kernel.org # 5.10+
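The behaviour described in the patch can be modelled in userspace roughly as follows. This is only a sketch and not the kernel code: advise_vector() and stub_advise() are hypothetical names, and the stub (a NULL base stands for a hole, an unaligned base for an invalid range) is an assumption made purely for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for do_madvise(): 0 on success, negative errno on error. */
typedef int (*advise_fn)(void *base, size_t len);

/* Illustrative stub only: unaligned base is invalid, NULL base is a hole. */
static int stub_advise(void *base, size_t len)
{
    (void)len;
    if ((uintptr_t)base & 4095)
        return -EINVAL;
    if (base == NULL)
        return -ENOMEM;
    return 0;
}

/*
 * Model of the vector loop: -ENOMEM (a hole) is treated as "processed"
 * and the walk continues; any other error stops the walk.
 * Returns the bytes accounted so far, or the error if nothing was accounted.
 */
static ssize_t advise_vector(const struct iovec *vec, size_t n, advise_fn advise)
{
    size_t done = 0;
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        ret = advise(vec[i].iov_base, vec[i].iov_len);
        if (ret < 0 && ret != -ENOMEM)
            break;
        done += vec[i].iov_len;
    }
    return done ? (ssize_t)done : ret;
}
```

In this model a vector whose only entry is a hole still reports its full length as processed, since the loop cannot tell a partly mapped range from a fully unmapped one.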
Hmm, not sure whether it's stable material since it changes the semantics of the API. It would be better to change the semantics from 5.19, with a man page update to specify the change.
It's a very desirable change and it makes the code match the manpage and it's cc:stable. I think we should just absorb any transitory damage which this causes people. I doubt if there will be much - if anyone was affected by this they would have already told us that it's broken?
process_madvise fails to return the exact number of processed bytes in several cases if it encounters an error, such as -EINVAL, -EINTR, or -ENOMEM, in the middle of processing VMAs. And now we are trying to make an exception for holes only?
I think EINTR will never be returned in the middle of processing VMAs for the behaviours supported by process_madvise().
It can return EINTR when:
- PTRACE_MODE_READ is being checked in mm_access(), where it is waiting on task->signal->exec_update_lock. EINTR returned from here guarantees that process_madvise() didn't even start processing.
  https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1264 --> https://elixir.bootlin.com/linux/v5.16.14/source/kernel/fork.c#L1318
- process_madvise() has started processing VMAs, but the required behaviour on a VMA needs mmap_write_lock_killable(), from where EINTR is returned. The behaviours currently supported by process_madvise() (MADV_COLD, MADV_PAGEOUT, MADV_WILLNEED) only need the read lock here.
  https://elixir.bootlin.com/linux/v5.16.14/source/mm/madvise.c#L1164
**Thus I think there is no way EINTR can be returned by process_madvise() in the middle of processing.** No?
for EINVAL:
The only case I can think of where EINVAL can be returned in the middle of processing is, for example, when the given range contains VMAs with a hole in between and one of the VMAs contains pages that fail the can_madv_lru_vma() condition. So, it is a limitation that this returns -EINVAL though some bytes are processed. Or, since some of the bytes passed are still invalid, is it valid to return -EINVAL here and leave the user to check the address range sent?
for ENOMEM:
Though the complete range is processed, it still returns ENOMEM. IMO, this shouldn't be treated as an error, which is what the patch targets. Then there is the limitation you mentioned below, where it returns a positive processed-bytes count even though it didn't process anything, if it couldn't find any VMA in the first iteration of madvise_walk_vmas().
I think the above limitations with EINVAL and ENOMEM arise because we rely on do_madvise(), which the madvise() call uses to process a single VMA. When a 'struct iovec' vector processing interface is offered by a system call, the caller expects the system call to return the correct number of bytes processed, so that the user can take the correct decisions. Please correct me if I am wrong here.
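To illustrate why an exact byte count matters to the caller, here is a minimal userspace sketch that maps a returned count back to a position in the vector. find_resume_point() and struct resume_point are hypothetical helpers, not part of any kernel or libc API, and the result is only meaningful under the assumption that the count really is exact.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper type, for illustration only. */
struct resume_point {
    size_t index;   /* first iovec entry that was not fully processed */
    size_t offset;  /* bytes of that entry that were already processed */
};

/*
 * Given the vector that was submitted and the number of bytes reported as
 * processed, find where the next call should start.
 * index == n means the whole vector was consumed.
 */
static struct resume_point find_resume_point(const struct iovec *vec, size_t n,
                                             size_t done)
{
    struct resume_point rp = { 0, 0 };

    while (rp.index < n && done >= vec[rp.index].iov_len) {
        done -= vec[rp.index].iov_len;
        rp.index++;
    }
    rp.offset = (rp.index < n) ? done : 0;
    return rp;
}
```

If the call instead fails with an error after processing some entries, the caller has no count to feed into such a helper, which is the problem being discussed.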
So, should we add a new function, say do_process_madvise(), which takes care of the above limitations? Or any alternative suggestions here, please?
What I am thinking now is that process_madvise() needs its own iterator (i.e., do_process_madvise) and it should report the exact bytes it addressed, with exact ranges, like process_vm_readv/writev. Providing valid ranges is the responsibility of the user.
IMO, it's worth noting in the man page.
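One possible reading of the "own iterator" idea, modelled in userspace: stop at the first byte that is not mapped and report exactly how many bytes were covered before it. Everything here is an assumption for illustration (struct region, struct range, mapped_prefix(), exact_advise(), and the choice of -ENOMEM when nothing was covered); it is not a proposed kernel implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical model of a mapped area [start, end); not a kernel structure. */
struct region { uintptr_t start; uintptr_t end; };

/* Hypothetical request entry, standing in for one iovec element. */
struct range { uintptr_t addr; size_t len; };

/*
 * Bytes of [addr, addr + len) that are contiguously mapped starting at addr.
 * maps must be sorted by address and non-overlapping.
 */
static size_t mapped_prefix(const struct region *maps, size_t nmaps,
                            uintptr_t addr, size_t len)
{
    uintptr_t end = addr + len;
    uintptr_t cur = addr;

    for (size_t i = 0; i < nmaps && cur < end; i++) {
        if (maps[i].end <= cur)
            continue;   /* region lies entirely below the cursor */
        if (maps[i].start > cur)
            break;      /* hole at the cursor: stop counting */
        cur = maps[i].end < end ? maps[i].end : end;
    }
    return cur - addr;
}

/* Walk the vector; stop at the first hole and report the exact bytes covered. */
static ssize_t exact_advise(const struct region *maps, size_t nmaps,
                            const struct range *vec, size_t n)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        size_t got = mapped_prefix(maps, nmaps, vec[i].addr, vec[i].len);

        done += got;
        if (got < vec[i].len)
            return done ? (ssize_t)done : -ENOMEM;
    }
    return (ssize_t)done;
}
```

With this shape a positive return value never counts bytes that fall in a hole, and the caller is responsible for passing ranges that are actually mapped.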
Or is the current patch for just ENOMEM sufficient here, and we just have to update the man page?
In addition, this change returns a positive processed-bytes count even though it didn't process anything, if it couldn't find any VMA in the first iteration of madvise_walk_vmas().
Thanks, Charan