On Wed, 13 Nov 2024 00:13:13 +0000 Michael Pratt mcpratt@pm.me wrote:
Why is stable Cc'd?
I believe this should be backported, if accepted, so that the behavior between kernel versions is matching.
That's not the purpose of stable. In fact, I would argue that it's the opposite of what stable is for. A stable kernel should *not* change behavior as that can cause regressions. If you want the newest behavior, then you should use the newest kernels.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation due to it being represented in a different way.
True it is exposed via files, but wouldn't this be the first change to make it visible via a system call?
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
I read a little about pidfd, but I'm not seeing the exact connection here, perhaps it will reduce the race condition but it cannot eliminate it as far as I see. For example, I am not finding a function that uses it to adjust niceness.
We can always add a system call do to that ;-) In fact, there's a lot of system calls that need to be converted to use pidfd over pid.
It's not that the "exit before modify" race condition is the only concern, it's just one of the less obvious factors making up my rationale for this change. I'm also concerned with efficiency. Why do we need to call another syscall if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler() the policy can be changed but not the priority, when there is a standardized function dedicated to just that.
My concern is the man page that has (in Debian):
$ man sched_setscheduler [..] Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy:
SCHED_OTHER the standard round-robin time-sharing policy;
SCHED_BATCH for "batch" style execution of processes; and
SCHED_IDLE for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Where we already document that the sched_priority "must be 0".
The difference between RT and normal processes is simply that for normal processes, we use "niceness" instead, so this patch simply translates the priority to "niceness", which cannot be expressed separately with the relevant POSIX functions.
I agree that POSIX has never been that great, but instead of modifying an existing documented system call to do something that is documented not to do, I believe we should either use other existing system calls or make a new one.
-- Steve