Hi again Steven,
On Tuesday, November 12th, 2024 at 19:36, Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 13 Nov 2024 00:13:13 +0000 Michael Pratt mcpratt@pm.me wrote:
Why is stable Cc'd?
I believe this should be backported, if accepted, so that the behavior between kernel versions is matching.
That's not the purpose of stable. In fact, I would argue that it's the opposite of what stable is for. A stable kernel should not change behavior as that can cause regressions. If you want the newest behavior, then you should use the newest kernels.
Ok that's fair. I assumed that the backport policy would be similar in this case as it would be for downstream distributions. Maybe that's a bad assumption from me.
I can do:
$ cat /proc/$$/sched
and see the 120 without needing interpretation due to it being represented in a different way.
True it is exposed via files, but wouldn't this be the first change to make it visible via a system call?
If the "it" means "the accepted range" then no, but if "it" means "the (priority + niceness) range" then yes. I still don't see the impact of whatever number happens to get returned. You would have to explain to me whatever magical security implication you have in mind.
That said, you are worried about the race of spawning a new task and setting its nice value because the new task may have exited. What about using pidfd? Create a task returning the pidfd and use that to set its nice value.
I read a little about pidfd, but I'm not seeing the exact connection here, perhaps it will reduce the race condition but it cannot eliminate it as far as I see. For example, I am not finding a function that uses it to adjust niceness.
We can always add a system call do to that ;-) In fact, there's a lot of system calls that need to be converted to use pidfd over pid.
We can also convert system calls to be fully functional instead of mostly functional. I consider this a functionality gap, not just something annoying.
It's not that the "exit before modify" race condition is the only concern, it's just one of the less obvious factors making up my rationale for this change. I'm also concerned with efficiency. Why do we need to call another syscall if the syscall we are already in can handle it?
Personally, I find it strange that in sched_setscheduler() the policy can be changed but not the priority, when there is a standardized function dedicated to just that.
My concern is the man page that has (in Debian):
$ man sched_setscheduler [..] SCHED_OTHER the standard round-robin time-sharing policy;
SCHED_BATCH for "batch" style execution of processes; and
SCHED_IDLE for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Where we already document that the sched_priority "must be 0".
I think we should all agree that documentation is a summary of development, not the other way around. Not only that, but this is poor documentation. The kernel is subject to change, imagine using the word "always" for design decisions that are not standardized. A more appropriate description would be "for each policy, sched_priority must be within the range provided by the return of [the query system calls]" just as POSIX describes the relationship.
As far as I can see, the "must be 0" requirement is completely arbitrary, or, if there is a reason, it must be a fairly poor one. However, I do recognize that the actual static priority cannot change, hence the adjustment to niceness instead is the obvious intention to any attempt to adjust the priority on the kernel-side from userspace.
I consider this patch to be a fix for a design decision that makes no sense when reading about the intended purpose of these values, not that it's the only way to achieve the priority adjustment. If anyone considers that something this simple should have been done already, the fact that documentation would have to be adjusted should not block it. Besides, a well-written program would already have been using the functions that return the accepted range before executing the sched_setscheduler() system call with a value that would be rejected.
Am I really the only one to read that you can't set the priority with this system call when I can do it on the command line with the "nice" program which uses a different system call, and ask "what's the point of this restriction?"
The difference between RT and normal processes is simply that for normal processes, we use "niceness" instead, so this patch simply translates the priority to "niceness", which cannot be expressed separately with the relevant POSIX functions.
I agree that POSIX has never been that great, but instead of modifying an existing documented system call to do something that is documented not to do, I believe we should either use other existing system calls or make a new one.
Is a POSIX function going to allow me a way to decide which set of system calls will get used to process it? Again, a functionality gap exists in functions that already exist and that gap would continue to exist...
This system call is not exactly allowing the user to do what POSIX says its purpose is for when it's clearly capable of doing so. I got it to work in about 8 LOC. Which set of documentations matters more? To me, anything else is a workaround that leaves this system call in an inconsistent state, instead, this is a solution.
-- MCP