Hi Steven, thanks for the reply,
On Tuesday, November 12th, 2024 at 10:34, Steven Rostedt rostedt@goodmis.org wrote:
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c
Why is stable Cc'd?
I believe this should be backported, if accepted, so that the behavior between kernel versions is matching.
Signed-off-by: Michael C. Pratt mcpratt@pm.me
kernel/sched/syscalls.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 24f9f90b6574..43eb283e6281 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy, attr.sched_policy = policy; }
- if (attr.sched_priority > MAX_PRIO-1)
- return -EINVAL;
- /*
- If priority is set for SCHED_NORMAL or SCHED_BATCH,
- set the niceness instead, but only for user calls.
- */
- if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
- ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
- attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
- attr.sched_priority = 0;
- }
This really looks like a hack.
If you have what you would consider to be a "non-hack" fix in order for standard function posix_spawnattr_setschedparam() to be usable instead of always fail for non-RT processes, let me know...
Looking at the larger picture, because Linux translates the sched_param struct into the local sched_attr struct, and because of changes in history, there are several caveats to be handled. This internal preparation function _sched_setscheduler() is essentially only consisting of "hacks", except for the member copying between structs.
Specifically that we are exposing how the kernel records priority to user space. That is the greater than MAX_RT_PRIO-1. 120 may be the priority the kernel sees on nice values, but it is not something that we should expose to user space system calls.
Is the default static priority value not already exposed, perhaps everywhere??? It's not a secret, but rather common knowledge that when observing a "niceness" of 0 or a "priority" of 20 as seen through common programs like "top", that these values actually represent 120 as the real priority value, and that the niceness value is a simple addition to the default for the final effective value.
Also, we have in newer kernel versions, maybe or maybe not dependent on configuration, the procfs object called "sched" which already shows the actual value.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation due to it being represented in a different way.
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
I read a little about pidfd, but I'm not seeing the exact connection here, perhaps it will reduce the race condition but it cannot eliminate it as far as I see. For example, I am not finding a function that uses it to adjust niceness.
It's not that the "exit before modify" race condition is the only concern, it's just one of the less obvious factors making up my rationale for this change. I'm also concerned with efficiency. Why do we need to call another syscall if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler() the policy can be changed but not the priority, when there is a standardized function dedicated to just that.
The difference between RT and normal processes is simply that for normal processes, we use "niceness" instead, so this patch simply translates the priority to "niceness", which cannot be expressed separately with the relevant POSIX functions.
-- MCP