On Thu, May 25, 2023 at 12:06:07PM -0400, Waiman Long wrote:
On 5/25/23 12:01, Waiman Long wrote:
As noted by Michal, the blkg_iostat_set's in the lockless list hold references to their blkg's to protect against their removal. Those blkg's in turn hold references to the blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called from css_release_work_fn(), which runs when the blkcg reference count reaches 0. This circular dependency prevents the blkcg and some blkg's from being freed after they are taken offline.
It is less of a problem if the cgroup being destroyed also has other controllers, like memory, that call cgroup_rstat_flush() and clean up the reference counts. If block is the only controller that uses rstat, these offline blkcg's and blkg's may never be freed, leaking more and more memory over time.
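The reference cycle above can be illustrated with a toy userspace model. This is NOT kernel code: the struct and function names below are illustrative stand-ins for blkg/blkcg refcounting, reduced to plain integers so the leak is visible.

```c
/*
 * Toy model of the cycle: a pending stat on the lockless list pins the
 * blkg, the blkg pins the blkcg, and the flush that would drop the stat
 * reference only runs once the blkcg refcount reaches zero -- which it
 * never does, so both objects leak.
 */
#include <assert.h>

struct blkcg_model { int refcnt; int freed; };
struct blkg_model  { int refcnt; int freed; struct blkcg_model *blkcg; };

/* adding a stat to the lockless list grabs a blkg reference */
static void stat_list_add(struct blkg_model *blkg)
{
	blkg->refcnt++;
}

/* dropping the last blkg reference also drops its blkcg reference */
static void blkg_put(struct blkg_model *blkg)
{
	if (--blkg->refcnt == 0) {
		blkg->freed = 1;
		if (--blkg->blkcg->refcnt == 0)
			blkg->blkcg->freed = 1;	/* flush would only run here */
	}
}
```

Offlining the cgroup drops the creation reference, but the stat reference is never dropped because the flush that would drop it is gated on the blkcg refcount reaching zero, so neither object is ever freed.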
To prevent this potential memory leak:

- Flush the blkcg per-cpu stats list in __blkg_release(), when no new stat can be added, to avoid a use-after-free of the percpu blkg_iostat_set in future cgroup_rstat_flush*() calls.

- Add a cgroup_rstat_flush_acquire() helper and call it to acquire cgroup_rstat_lock and block concurrent execution of other cgroup_rstat_flush*() calls.

- Don't grab a bio->bi_blkg reference when adding stats to the blkcg's per-cpu stat list, since all stats are guaranteed to be consumed before the blkg instance is released; grabbing a blkg reference for stats was the most fragile part of the original patch.
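The fixed release ordering in the steps above can be sketched as another userspace toy model. Again, this is NOT the actual kernel code: a pthread mutex stands in for cgroup_rstat_lock, and the struct and function names are illustrative.

```c
/*
 * Sketch of the fixed release path: once the blkg refcount hits zero,
 * no new stats can be queued, so __blkg_release() can drain the per-cpu
 * list itself while holding the same lock that serializes
 * cgroup_rstat_flush*() callers -- stats no longer need to pin the blkg.
 */
#include <pthread.h>

/* stands in for cgroup_rstat_lock */
static pthread_mutex_t rstat_lock = PTHREAD_MUTEX_INITIALIZER;

struct blkg_model { int pending_stats; int flushed_stats; int freed; };

/* models the cgroup_rstat_flush_acquire() helper added by the patch */
static void rstat_flush_acquire(void) { pthread_mutex_lock(&rstat_lock); }
static void rstat_flush_release(void) { pthread_mutex_unlock(&rstat_lock); }

static void blkg_release(struct blkg_model *blkg)
{
	rstat_flush_acquire();	/* exclude concurrent cgroup_rstat_flush*() */
	blkg->flushed_stats += blkg->pending_stats;	/* drain per-cpu list */
	blkg->pending_stats = 0;
	rstat_flush_release();
	blkg->freed = 1;	/* list is empty: no later flush can touch it */
}
```

Because the drain happens before the free and under the flush lock, a concurrent cgroup_rstat_flush*() call can never observe a freed blkg_iostat_set.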
Based on Waiman's patch:
https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.c...
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: mkoutny@suse.com
Cc: Yosry Ahmed <yosryahmed@google.com>
Co-developed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
 block/blk-cgroup.c     | 57 +++++++++++++++++++++++++++++++-----------
 include/linux/cgroup.h |  1 +
 kernel/cgroup/rstat.c  | 15 ++++++++++-
 3 files changed, 57 insertions(+), 16 deletions(-)
> This is my counter-proposal to Ming's v3 patch. The major difference is
> that I used the existing cgroup_rstat_lock instead of adding a new
> internal lock. This minimizes performance impact to existing
> cgroup_rstat_flush*() calls

The added internal lock has ~zero perf impact on rstat flush, because the lock is basically never contended.

> while achieving the same objective. I am fine with Ming's current v3
> patch if we decide to go that way.

As I mentioned, the main motivation for the internal lock is to keep the fix as simple as possible, since no cross-subsystem change is involved, and I am fine with any follow-up cleanup or improvement of the current blkg rstat flush.

Another benefit of the internal lock is that the race in blkcg_reset_stats() can be avoided.
Thanks, Ming