On Mon, 11 Nov 2024 07:03:51 +0000 "Michael C. Pratt" mcpratt@pm.me wrote:
From userspace, spawning a new process with, for example,
posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior that existed since Linux 2.6 or earlier.
Linux translates the usage of the sched_param struct into it's own internal sched_attr struct during the syscall, but the user currently has no way to manage the other values within the sched_attr struct using only POSIX functions.
The only other way to adjust niceness when using posix_spawn() would be to set the value after the process has started, but this introduces the risk of the process being dead before the syscall can set the priority afterward.
To resolve this, allow the use of the priority value originally from the POSIX sched_param struct in order to set the niceness value instead of rejecting the priority value.
Edit the sched_get_priority_*() POSIX syscalls in order to reflect the range of values accepted.
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c
Why is stable Cc'd?
Signed-off-by: Michael C. Pratt mcpratt@pm.me
kernel/sched/syscalls.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 24f9f90b6574..43eb283e6281 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy, attr.sched_policy = policy; }
- if (attr.sched_priority > MAX_PRIO-1)
return -EINVAL;
- /*
* If priority is set for SCHED_NORMAL or SCHED_BATCH,
* set the niceness instead, but only for user calls.
*/
- if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
attr.sched_priority = 0;
- }
This really looks like a hack. Specifically that we are exposing how the kernel records priority to user space. That is the greater than MAX_RT_PRIO-1. 120 may be the priority the kernel sees on nice values, but it is not something that we should every expose to user space system calls.
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
-- Steve
- return __sched_setscheduler(p, &attr, check, true);
} /** @@ -1532,9 +1545,11 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy) case SCHED_RR: ret = MAX_RT_PRIO-1; break;
- case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH:
ret = MAX_PRIO-1;
break;
- case SCHED_DEADLINE: case SCHED_IDLE: case SCHED_EXT: ret = 0;
@@ -1560,9 +1575,11 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy) case SCHED_RR: ret = 1; break;
- case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH:
ret = MAX_RT_PRIO;
break;
- case SCHED_DEADLINE: case SCHED_IDLE: case SCHED_EXT: ret = 0;
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623