On Thu, Oct 26, 2023 at 9:49 AM Luca Boccassi bluca@debian.org wrote:
On Thu, 26 Oct 2023 at 17:41, Johannes Weiner hannes@cmpxchg.org wrote:
519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") breaks unprivileged psi polling on cgroups.
Historically, we had a privilege check for polling in the open() of a pressure file in /proc, but were erroneously missing it for the open() of cgroup pressure files.
When unprivileged polling was introduced in d82caa273565 ("sched/psi: Allow unprivileged polling of N*2s period"), it needed to filter privileges depending on the exact polling parameters, and as such moved the CAP_SYS_RESOURCE check from the proc open() callback to psi_trigger_create(). Both the proc files as well as cgroup files go through this during write(). This implicitly added the missing check for privileges required for HT polling for cgroups.
When 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") followed right after to remove further restrictions on the RT polling window, it incorrectly assumed the cgroup privilege check was still missing and added it to the cgroup open(), mirroring what we used to do for proc files in the past.
As a result, unprivileged poll requests that would be supported now get rejected when opening the cgroup pressure file for writing.
Ah, I see the problem. In our discussion https://lore.kernel.org/all/ZADj4YX4uftK%2FFrh@cmpxchg.org/ we decided to have the check in open() to fail early but we never considered unprivileged processes which only poll and never create any triggers. Makes sense.
Remove the cgroup open() check. psi_trigger_create() handles it.
Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") Cc: stable@vger.kernel.org # 6.5+ Reported-by: Luca Boccassi bluca@debian.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Acked-by: Luca Boccassi bluca@debian.org
Acked-by: Suren Baghdasaryan surenb@google.com
Thank you very much for the quick fix - this was reported originally on the systemd bug tracker by Daniel Black (I do not have an email address):
https://github.com/systemd/systemd/issues/29723
It is very important for systemd services to be able to do this without capabilities, as using capabilities means in turn user namespaces cannot be used (PrivateUsers=yes in systemd parlance).
Kind regards, Luca Boccassi