From userspace, spawning a new process with, for example, posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior that existed since Linux 2.6 or earlier.
Linux translates the usage of the sched_param struct into it's own internal sched_attr struct during the syscall, but the user currently has no way to manage the other values within the sched_attr struct using only POSIX functions.
The only other way to adjust niceness when using posix_spawn() would be to set the value after the process has started, but this introduces the risk of the process being dead before the syscall can set the priority afterward.
To resolve this, allow the use of the priority value originally from the POSIX sched_param struct in order to set the niceness value instead of rejecting the priority value.
Edit the sched_get_priority_*() POSIX syscalls in order to reflect the range of values accepted.
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c Signed-off-by: Michael C. Pratt mcpratt@pm.me --- kernel/sched/syscalls.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 24f9f90b6574..43eb283e6281 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy, attr.sched_policy = policy; }
+ if (attr.sched_priority > MAX_PRIO-1) + return -EINVAL; + + /* + * If priority is set for SCHED_NORMAL or SCHED_BATCH, + * set the niceness instead, but only for user calls. + */ + if (check && attr.sched_priority > MAX_RT_PRIO-1 && + ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) { + attr.sched_nice = PRIO_TO_NICE(attr.sched_priority); + attr.sched_priority = 0; + } + return __sched_setscheduler(p, &attr, check, true); } /** @@ -1532,9 +1545,11 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy) case SCHED_RR: ret = MAX_RT_PRIO-1; break; - case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH: + ret = MAX_PRIO-1; + break; + case SCHED_DEADLINE: case SCHED_IDLE: case SCHED_EXT: ret = 0; @@ -1560,9 +1575,11 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy) case SCHED_RR: ret = 1; break; - case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH: + ret = MAX_RT_PRIO; + break; + case SCHED_DEADLINE: case SCHED_IDLE: case SCHED_EXT: ret = 0;
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
On Mon, 11 Nov 2024 07:03:51 +0000 "Michael C. Pratt" mcpratt@pm.me wrote:
From userspace, spawning a new process with, for example,
posix_spawn(), only allows the user to work with the scheduling priority value defined by POSIX in the sched_param struct.
However, sched_setparam() and similar syscalls lead to __sched_setscheduler() which rejects any new value for the priority other than 0 for non-RT schedule classes, a behavior that existed since Linux 2.6 or earlier.
Linux translates the usage of the sched_param struct into it's own internal sched_attr struct during the syscall, but the user currently has no way to manage the other values within the sched_attr struct using only POSIX functions.
The only other way to adjust niceness when using posix_spawn() would be to set the value after the process has started, but this introduces the risk of the process being dead before the syscall can set the priority afterward.
To resolve this, allow the use of the priority value originally from the POSIX sched_param struct in order to set the niceness value instead of rejecting the priority value.
Edit the sched_get_priority_*() POSIX syscalls in order to reflect the range of values accepted.
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c
Why is stable Cc'd?
Signed-off-by: Michael C. Pratt mcpratt@pm.me
kernel/sched/syscalls.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 24f9f90b6574..43eb283e6281 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy, attr.sched_policy = policy; }
- if (attr.sched_priority > MAX_PRIO-1)
return -EINVAL;
- /*
* If priority is set for SCHED_NORMAL or SCHED_BATCH,
* set the niceness instead, but only for user calls.
*/
- if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
attr.sched_priority = 0;
- }
This really looks like a hack. Specifically that we are exposing how the kernel records priority to user space. That is the greater than MAX_RT_PRIO-1. 120 may be the priority the kernel sees on nice values, but it is not something that we should every expose to user space system calls.
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
-- Steve
- return __sched_setscheduler(p, &attr, check, true);
} /** @@ -1532,9 +1545,11 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy) case SCHED_RR: ret = MAX_RT_PRIO-1; break;
- case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH:
ret = MAX_PRIO-1;
break;
- case SCHED_DEADLINE: case SCHED_IDLE: case SCHED_EXT: ret = 0;
@@ -1560,9 +1575,11 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy) case SCHED_RR: ret = 1; break;
- case SCHED_DEADLINE: case SCHED_NORMAL: case SCHED_BATCH:
ret = MAX_RT_PRIO;
break;
- case SCHED_DEADLINE: case SCHED_IDLE: case SCHED_EXT: ret = 0;
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
Hi Steven, thanks for the reply,
On Tuesday, November 12th, 2024 at 10:34, Steven Rostedt rostedt@goodmis.org wrote:
Cc: stable@vger.kernel.org # Apply to kernel/sched/core.c
Why is stable Cc'd?
I believe this should be backported, if accepted, so that the behavior between kernel versions is matching.
Signed-off-by: Michael C. Pratt mcpratt@pm.me
kernel/sched/syscalls.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index 24f9f90b6574..43eb283e6281 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy, attr.sched_policy = policy; }
- if (attr.sched_priority > MAX_PRIO-1)
- return -EINVAL;
- /*
- If priority is set for SCHED_NORMAL or SCHED_BATCH,
- set the niceness instead, but only for user calls.
- */
- if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
- ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
- attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
- attr.sched_priority = 0;
- }
This really looks like a hack.
If you have what you would consider to be a "non-hack" fix in order for standard function posix_spawnattr_setschedparam() to be usable instead of always fail for non-RT processes, let me know...
Looking at the larger picture, because Linux translates the sched_param struct into the local sched_attr struct, and because of changes in history, there are several caveats to be handled. This internal preparation function _sched_setscheduler() is essentially only consisting of "hacks", except for the member copying between structs.
Specifically that we are exposing how the kernel records priority to user space. That is the greater than MAX_RT_PRIO-1. 120 may be the priority the kernel sees on nice values, but it is not something that we should expose to user space system calls.
Is the default static priority value not already exposed, perhaps everywhere??? It's not a secret, but rather common knowledge that when observing a "niceness" of 0 or a "priority" of 20 as seen through common programs like "top", that these values actually represent 120 as the real priority value, and that the niceness value is a simple addition to the default for the final effective value.
Also, we have in newer kernel versions, maybe or maybe not dependent on configuration, the procfs object called "sched" which already shows the actual value.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation due to it being represented in a different way.
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
I read a little about pidfd, but I'm not seeing the exact connection here, perhaps it will reduce the race condition but it cannot eliminate it as far as I see. For example, I am not finding a function that uses it to adjust niceness.
It's not that the "exit before modify" race condition is the only concern, it's just one of the less obvious factors making up my rationale for this change. I'm also concerned with efficiency. Why do we need to call another syscall if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler() the policy can be changed but not the priority, when there is a standardized function dedicated to just that.
The difference between RT and normal processes is simply that for normal processes, we use "niceness" instead, so this patch simply translates the priority to "niceness", which cannot be expressed separately with the relevant POSIX functions.
-- MCP
On Wed, 13 Nov 2024 00:13:13 +0000 Michael Pratt mcpratt@pm.me wrote:
Why is stable Cc'd?
I believe this should be backported, if accepted, so that the behavior between kernel versions is matching.
That's not the purpose of stable. In fact, I would argue that it's the opposite of what stable is for. A stable kernel should *not* change behavior as that can cause regressions. If you want the newest behavior, then you should use the newest kernels.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation due to it being represented in a different way.
True it is exposed via files, but wouldn't this be the first change to make it visible via a system call?
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
I read a little about pidfd, but I'm not seeing the exact connection here, perhaps it will reduce the race condition but it cannot eliminate it as far as I see. For example, I am not finding a function that uses it to adjust niceness.
We can always add a system call do to that ;-) In fact, there's a lot of system calls that need to be converted to use pidfd over pid.
It's not that the "exit before modify" race condition is the only concern, it's just one of the less obvious factors making up my rationale for this change. I'm also concerned with efficiency. Why do we need to call another syscall if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler() the policy can be changed but not the priority, when there is a standardized function dedicated to just that.
My concern is the man page that has (in Debian):
$ man sched_setscheduler [..] Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy:
SCHED_OTHER the standard round-robin time-sharing policy;
SCHED_BATCH for "batch" style execution of processes; and
SCHED_IDLE for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Where we already document that the sched_priority "must be 0".
The difference between RT and normal processes is simply that for normal processes, we use "niceness" instead, so this patch simply translates the priority to "niceness", which cannot be expressed separately with the relevant POSIX functions.
I agree that POSIX has never been that great, but instead of modifying an existing documented system call to do something that is documented not to do, I believe we should either use other existing system calls or make a new one.
-- Steve
Hi again Steven,
On Tuesday, November 12th, 2024 at 19:36, Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 13 Nov 2024 00:13:13 +0000 Michael Pratt mcpratt@pm.me wrote:
Why is stable Cc'd?
I believe this should be backported, if accepted, so that the behavior between kernel versions is matching.
That's not the purpose of stable. In fact, I would argue that it's the opposite of what stable is for. A stable kernel should not change behavior as that can cause regressions. If you want the newest behavior, then you should use the newest kernels.
Ok that's fair. I assumed that the backport policy would be similar in this case as it would be for downstream distributions. Maybe that's a bad assumption from me.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation due to it being represented in a different way.
True it is exposed via files, but wouldn't this be the first change to make it visible via a system call?
If the "it" means "the accepted range" then no, but if "it" means "the (priority + niceness) range" then yes. I still don't see the impact of whatever number happens to get returned. You would have to explain to me whatever magical security implication you have in mind.
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
I read a little about pidfd, but I'm not seeing the exact connection here, perhaps it will reduce the race condition but it cannot eliminate it as far as I see. For example, I am not finding a function that uses it to adjust niceness.
We can always add a system call do to that ;-) In fact, there's a lot of system calls that need to be converted to use pidfd over pid.
We can also convert system calls to be fully functional instead of mostly functional. I consider this a functionality gap, not just something annoying.
It's not that the "exit before modify" race condition is the only concern, it's just one of the less obvious factors making up my rationale for this change. I'm also concerned with efficiency. Why do we need to call another syscall if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler() the policy can be changed but not the priority, when there is a standardized function dedicated to just that.
My concern is the man page that has (in Debian):
$ man sched_setscheduler [..] SCHED_OTHER the standard round-robin time-sharing policy;
SCHED_BATCH for "batch" style execution of processes; and
SCHED_IDLE for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Where we already document that the sched_priority "must be 0".
I think we should all agree that documentation is a summary of development, not the other way around. Not only that, but this is poor documentation. The kernel is subject to change, imagine using the word "always" for design decisions that are not standardized. A more appropriate description would be "for each policy, sched_priority must be within the range provided by the return of [the query system calls]" just as POSIX describes the relationship.
As far as I can see, the "must be 0" requirement is completely arbitrary, or, if there is a reason, it must be a fairly poor one. However, I do recognize that the actual static priority cannot change, hence the adjustment to niceness instead is the obvious intention to any attempt to adjust the priority on the kernel-side from userspace.
I consider this patch to be a fix for a design decision that makes no sense when reading about the intended purpose of these values, not that it's the only way to achieve the priority adjustment. If anyone considers that something this simple should have been done already, the fact that documentation would have to be adjusted should not block it. Besides, a well-written program would already have been using the functions that return the accepted range before executing the sched_setscheduler() system call with a value that would be rejected.
Am I really the only one to read that you can't set the priority with this system call when I can do it on the command line with the "nice" program which uses a different system call, and ask "what's the point of this restriction?"
The difference between RT and normal processes is simply that for normal processes, we use "niceness" instead, so this patch simply translates the priority to "niceness", which cannot be expressed separately with the relevant POSIX functions.
I agree that POSIX has never been that great, but instead of modifying an existing documented system call to do something that is documented not to do, I believe we should either use other existing system calls or make a new one.
Is a POSIX function going to allow me a way to decide which set of system calls will get used to process it? Again, a functionality gap exists in functions that already exist and that gap would continue to exist...
This system call is not exactly allowing the user to do what POSIX says its purpose is for when it's clearly capable of doing so. I got it to work in about 8 LOC. Which set of documentations matters more? To me, anything else is a workaround that leaves this system call in an inconsistent state, instead, this is a solution.
-- MCP
On Wed, 13 Nov 2024 06:04:59 +0000 Michael Pratt mcpratt@pm.me wrote:
$ man sched_setscheduler [..] SCHED_OTHER the standard round-robin time-sharing policy;
SCHED_BATCH for "batch" style execution of processes; and
SCHED_IDLE for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Where we already document that the sched_priority "must be 0".
I think we should all agree that documentation is a summary of development, not the other way around. Not only that, but this is poor documentation. The kernel is subject to change, imagine using the word "always" for design decisions that are not standardized. A more appropriate description would be "for each policy, sched_priority must be within the range provided by the return of [the query system calls]" just as POSIX describes the relationship.
As far as I can see, the "must be 0" requirement is completely arbitrary, or, if there is a reason, it must be a fairly poor one. However, I do recognize that the actual static priority cannot change, hence the adjustment to niceness instead is the obvious intention to any attempt to adjust the priority on the kernel-side from userspace.
I consider this patch to be a fix for a design decision that makes no sense when reading about the intended purpose of these values, not that it's the only way to achieve the priority adjustment. If anyone considers that something this simple should have been done already, the fact that documentation would have to be adjusted should not block it. Besides, a well-written program would already have been using the functions that return the accepted range before executing the sched_setscheduler() system call with a value that would be rejected.
Am I really the only one to read that you can't set the priority with this system call when I can do it on the command line with the "nice" program which uses a different system call, and ask "what's the point of this restriction?"
Honestly, I would actually prefer your change. But modifying an existing API is above my pay grade ;-) I think you really need Linus to answer that.
Linus?
-- Steve
Hello,
kernel test robot noticed "ltp.sched_get_priority_min01.fail" on:
commit: ecd04eb1e1d00bbd158b2f7d1353af709d8131a7 ("[PATCH RESEND 2 1/1] sched/syscalls: Allow setting niceness using sched_param struct") url: https://github.com/intel-lab-lkp/linux/commits/Michael-C-Pratt/sched-syscall... base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git fe9beaaa802d44d881b165430b3239a9d7bebf30 patch link: https://lore.kernel.org/all/20241111070152.9781-2-mcpratt@pm.me/ patch subject: [PATCH RESEND 2 1/1] sched/syscalls: Allow setting niceness using sched_param struct
in testcase: ltp version: ltp-x86_64-14c1f76-1_20241111 with following parameters:
disk: 1HDD fs: xfs test: syscalls-07
config: x86_64-rhel-8.3-ltp compiler: gcc-12 test machine: 4 threads 1 sockets Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz (Ivy Bridge) with 8G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot oliver.sang@intel.com | Closes: https://lore.kernel.org/oe-lkp/202411252241.2d75c2f6-lkp@intel.com
<<<test_start>>> tag=sched_get_priority_min01 stime=1732078544 cmdline="sched_get_priority_min01" contacts="" analysis=exit <<<test_output>>> tst_test.c:1890: TINFO: LTP version: 20240930-63-g6408294d8 tst_test.c:1894: TINFO: Tested kernel: 6.12.0-rc4-00035-gecd04eb1e1d0 #1 SMP PREEMPT_DYNAMIC Sun Nov 17 15:27:30 CST 2024 x86_64 tst_test.c:1725: TINFO: Timeout per run is 0h 02m 30s sched_get_priority_min01.c:42: TFAIL: SCHED_BATCH retval 100 != 0: SUCCESS (0) sched_get_priority_min01.c:42: TPASS: SCHED_DEADLINE passed sched_get_priority_min01.c:42: TPASS: SCHED_FIFO passed sched_get_priority_min01.c:42: TPASS: SCHED_IDLE passed sched_get_priority_min01.c:42: TFAIL: SCHED_OTHER retval 100 != 0: SUCCESS (0) sched_get_priority_min01.c:42: TPASS: SCHED_RR passed
Summary: passed 4 failed 2 broken 0 skipped 0 warnings 0 <<<execution_status>>> initiation_status="ok" duration=0 termination_type=exited termination_id=1 corefile=no cutime=0 cstime=1 <<<test_end>>>
The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20241125/202411252241.2d75c2f6-lkp@i...
linux-stable-mirror@lists.linaro.org