Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The above dependencies can cause an ABBA deadlock. For example in the following scenario:
(1) Task A running on CPU #1:
    crypto_alloc_acomp_node()
      Holds scomp_lock
      Enters reclaim
      Reads per_cpu_ptr(pool->acomp_ctx, 1)

(2) Task A is descheduled

(3) CPU #1 goes offline
    zswap_cpu_comp_dead(CPU #1)
      Holds per_cpu_ptr(pool->acomp_ctx, 1)
      Calls crypto_free_acomp()
      Waits for scomp_lock

(4) Task A running on CPU #2:
    Waits for per_cpu_ptr(pool->acomp_ctx, 1) // Read on CPU #1
    DEADLOCK
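To make the ABBA shape concrete, the following is a minimal userspace sketch of the same two lock orders (pthreads; an approximation with invented names, not kernel code). Built with "gcc -pthread", it never exits:

/*
 * Userspace analogue of the deadlock above. Thread 1 mimics
 * crypto_alloc_acomp_node(): takes "scomp_lock", then needs the
 * "acomp_ctx mutex" (as reclaim would). Thread 2 mimics
 * zswap_cpu_comp_dead(): takes the "acomp_ctx mutex", then needs
 * "scomp_lock" (as crypto_free_acomp() would).
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t scomp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t acomp_ctx_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *alloc_path(void *arg)
{
	pthread_mutex_lock(&scomp_lock);	/* crypto_scomp_init_tfm() */
	sleep(1);				/* widen the race window */
	pthread_mutex_lock(&acomp_ctx_mutex);	/* reclaim -> acomp_ctx */
	pthread_mutex_unlock(&acomp_ctx_mutex);
	pthread_mutex_unlock(&scomp_lock);
	return arg;
}

static void *offline_path(void *arg)
{
	pthread_mutex_lock(&acomp_ctx_mutex);	/* zswap_cpu_comp_dead() */
	sleep(1);
	pthread_mutex_lock(&scomp_lock);	/* crypto_free_acomp() */
	pthread_mutex_unlock(&scomp_lock);
	pthread_mutex_unlock(&acomp_ctx_mutex);
	return arg;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, alloc_path, NULL);
	pthread_create(&t2, NULL, offline_path, NULL);
	pthread_join(t1, NULL);	/* never returns: A->B vs B->A */
	pthread_join(t2, NULL);
	return 0;
}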
Since there is no requirement to call crypto_free_acomp() with the per-CPU acomp_ctx mutex held in zswap_cpu_comp_dead(), move it after the mutex is unlocked. Also move the acomp_request_free() and kfree() calls for consistency and to avoid any potential subtle locking dependencies in the future.
With this, only setting acomp_ctx fields to NULL occurs with the mutex held. This is similar to how zswap_cpu_comp_prepare() only initializes acomp_ctx fields with the mutex held, having performed all allocations before acquiring it.
Opportunistically, move the NULL check on acomp_ctx so that it takes place before the mutex dereference.
Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug")
Reported-by: syzbot+1a517ccfcbc6a7ab0f82@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67bcea51.050a0220.bbfd1.0096.GAE@google.com/
Cc: stable@vger.kernel.org
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
---
v1 -> v2:
- Explained the problem more clearly in the commit message.
- Moved all freeing calls outside the lock critical section.

v1: https://lore.kernel.org/all/Z72FJnbA39zWh4zS@gondor.apana.org.au/
---
 mm/zswap.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index ac9d299e7d0c1..adf745c66aa1d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -881,18 +881,32 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	struct acomp_req *req;
+	struct crypto_acomp *acomp;
+	u8 *buffer;
+
+	if (IS_ERR_OR_NULL(acomp_ctx))
+		return 0;
 
 	mutex_lock(&acomp_ctx->mutex);
-	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
-		acomp_ctx->req = NULL;
-		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
-			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
-	}
+	req = acomp_ctx->req;
+	acomp = acomp_ctx->acomp;
+	buffer = acomp_ctx->buffer;
+	acomp_ctx->req = NULL;
+	acomp_ctx->acomp = NULL;
+	acomp_ctx->buffer = NULL;
 	mutex_unlock(&acomp_ctx->mutex);
 
+	/*
+	 * Do the actual freeing after releasing the mutex to avoid subtle
+	 * locking dependencies causing deadlocks.
+	 */
+	if (!IS_ERR_OR_NULL(req))
+		acomp_request_free(req);
+	if (!IS_ERR_OR_NULL(acomp))
+		crypto_free_acomp(acomp);
+	kfree(buffer);
+
 	return 0;
 }
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The bug is in acomp. crypto_free_acomp() should never have to wait for a memory allocation. That is what needs to be fixed.
But really the bounce buffering in acomp (which is what is causing this problem) should not exist at all. There is really no practical use case for it; it's just there because of the Crypto API's insistence on shoehorning everything into scatterlists for no reason...
- Eric
On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The bug is in acomp. crypto_free_acomp() should never have to wait for a memory allocation. That is what needs to be fixed.
crypto_free_acomp() does not explicitly wait for an allocation, but it waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be held while allocating memory from crypto_scomp_init_tfm().
Are you suggesting that crypto_exit_scomp_ops_async() should not be holding scomp_lock?
But really the bounce buffering in acomp (which is what is causing this problem) should not exist at all. There is really no practical use case for it; it's just there because of the Crypto API's insistence on shoehorning everything into scatterlists for no reason...
I am assuming this is about the scomp_scratch logic, which is what we need to hold the scomp_lock for, resulting in this problem.
If this is something that can be done right away I am fine with dropping this patch for an alternative fix, although it may be nice to reduce the lock critical section in zswap_cpu_comp_dead() to the bare minimum anyway.
On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The bug is in acomp. crypto_free_acomp() should never have to wait for a memory allocation. That is what needs to be fixed.
crypto_free_acomp() does not explicitly wait for an allocation, but it waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be held while allocating memory from crypto_scomp_init_tfm().
Are you suggesting that crypto_exit_scomp_ops_async() should not be holding scomp_lock?
I think the solution while keeping the bounce buffer in place would be to do what the patch https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does, i.e. make the actual allocation and free happen outside the lock.
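Concretely, that pattern installs or detaches the resource under the lock and does the sleeping allocation/free outside it. A rough sketch under invented names (scratch, scratch_lock, SCRATCH_SIZE), not the actual crypto/scompress.c code:

static DEFINE_MUTEX(scratch_lock);
static void *scratch;	/* protected by scratch_lock */

static int scratch_get(void)
{
	/* May sleep and enter reclaim: called with no lock held. */
	void *mem = kvmalloc(SCRATCH_SIZE, GFP_KERNEL);

	if (!mem)
		return -ENOMEM;

	mutex_lock(&scratch_lock);
	if (!scratch) {
		scratch = mem;	/* publish our copy */
		mem = NULL;
	}
	mutex_unlock(&scratch_lock);

	/* If we lost the race, free our unused copy outside the lock. */
	kvfree(mem);
	return 0;
}

static void scratch_put(void)
{
	void *mem;

	mutex_lock(&scratch_lock);
	mem = scratch;		/* detach under the lock... */
	scratch = NULL;
	mutex_unlock(&scratch_lock);

	kvfree(mem);		/* ...free after dropping it */
}

The teardown side is the same shape as the zswap patch above: detach the pointers under the lock, free after unlocking.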
But really the bounce buffering in acomp (which is what is causing this problem) should not exist at all. There is really no practical use case for it; it's just there because of the Crypto API's insistence on shoehorning everything into scatterlists for no reason...
I am assuming this is about the scomp_scratch logic, which is what we need to hold the scomp_lock for, resulting in this problem.
Yes.
If this is something that can be done right away I am fine with dropping this patch for an alternative fix, although it may be nice to reduce the lock critical section in zswap_cpu_comp_dead() to the bare minimum anyway.
Well, unfortunately the whole Crypto API philosophy of having a single interface for software and for hardware offload doesn't really work. This is just yet another example of that; it's a problem caused by shoehorning software compression into an interface designed for hardware offload. zcomp really should just use the compression libs directly (like most users of compression in the kernel already do), and have an alternate code path specifically for hardware offload (using acomp) for the few people who really want that.
- Eric
On Wed, Feb 26, 2025 at 09:16:28PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The bug is in acomp. crypto_free_acomp() should never have to wait for a memory allocation. That is what needs to be fixed.
crypto_free_acomp() does not explicitly wait for an allocation, but it waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be held while allocating memory from crypto_scomp_init_tfm().
Are you suggesting that crypto_exit_scomp_ops_async() should not be holding scomp_lock?
I think the solution while keeping the bounce buffer in place would be to do what the patch https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does, i.e. make the actual allocation and free happen outside the lock.
I am fine with a solution like that if Herbert is fine with it. Although as I mentioned, I think this patch is nice to have anyway.
But really the bounce buffering in acomp (which is what is causing this problem) should not exist at all. There is really no practical use case for it; it's just there because of the Crypto API's insistence on shoehorning everything into scatterlists for no reason...
I am assuming this is about the scomp_scratch logic, which is what we need to hold the scomp_lock for, resulting in this problem.
Yes.
If this is something that can be done right away I am fine with dropping this patch for an alternative fix, although it may be nice to reduce the lock critical section in zswap_cpu_comp_dead() to the bare minimum anyway.
Well, unfortunately the whole Crypto API philosophy of having a single interface for software and for hardware offload doesn't really work. This is just yet another example of that; it's a problem caused by shoehorning software compression into an interface designed for hardware offload. zcomp really should just use the compression libs directly (like most users of compression in the kernel already do), and have an alternate code path specifically for hardware offload (using acomp) for the few people who really want that.
zcomp is for zram, zswap does not use it. If zswap is not going to use the crypto API we'll want something like zcomp or maybe reuse zcomp itself. That's a problem for another day :)
On Wed, Feb 26, 2025 at 1:23 PM Yosry Ahmed yosry.ahmed@linux.dev wrote:
On Wed, Feb 26, 2025 at 09:16:28PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The bug is in acomp. crypto_free_acomp() should never have to wait for a memory allocation. That is what needs to be fixed.
crypto_free_acomp() does not explicitly wait for an allocation, but it waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be held while allocating memory from crypto_scomp_init_tfm().
Are you suggesting that crypto_exit_scomp_ops_async() should not be holding scomp_lock?
I think the solution while keeping the bounce buffer in place would be to do what the patch https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does, i.e. make the actual allocation and free happen outside the lock.
I am fine with a solution like that if Herbert is fine with it. Although as I mentioned, I think this patch is nice to have anyway.
But really the bounce buffering in acomp (which is what is causing this problem) should not exist at all. There is really no practical use case for it; it's just there because of the Crypto API's insistence on shoehorning everything into scatterlists for no reason...
I am assuming this is about the scomp_scratch logic, which is what we need to hold the scomp_lock for, resulting in this problem.
Yes.
If this is something that can be done right away I am fine with dropping this patch for an alternative fix, although it may be nice to reduce the lock critical section in zswap_cpu_comp_dead() to the bare minimum anyway.
Well, unfortunately the whole Crypto API philosophy of having a single interface for software and for hardware offload doesn't really work. This is just yet another example of that; it's a problem caused by shoehorning software compression into an interface designed for hardware offload. zcomp really should just use the compression libs directly (like most users of compression in the kernel already do), and have an alternate code path specifically for hardware offload (using acomp) for the few people who really want that.
zcomp is for zram, zswap does not use it. If zswap is not going to use the crypto API we'll want something like zcomp or maybe reuse zcomp itself. That's a problem for another day :)
I'm actually thinking whether we should expose the zcomp API and use it for zswap. There are a couple of parameters for zstd I wanna play with, which zcomp/zram seems to already support, but not the crypto API (zstd level, dictionary, etc.).
But yes, a different problem for another day :)
On 2025/2/27 07:47, Nhat Pham wrote:
On Wed, Feb 26, 2025 at 1:23 PM Yosry Ahmed yosry.ahmed@linux.dev wrote:
On Wed, Feb 26, 2025 at 09:16:28PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The bug is in acomp. crypto_free_acomp() should never have to wait for a memory allocation. That is what needs to be fixed.
crypto_free_acomp() does not explicitly wait for an allocation, but it waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be held while allocating memory from crypto_scomp_init_tfm().
Are you suggesting that crypto_exit_scomp_ops_async() should not be holding scomp_lock?
I think the solution while keeping the bounce buffer in place would be to do what the patch https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does, i.e. make the actual allocation and free happen outside the lock.
I am fine with a solution like that if Herbert is fine with it. Although as I mentioned, I think this patch is nice to have anyway.
But really the bounce buffering in acomp (which is what is causing this problem) should not exist at all. There is really no practical use case for it; it's just there because of the Crypto API's insistence on shoehorning everything into scatterlists for no reason...
I am assuming this is about the scomp_scratch logic, which is what we need to hold the scomp_lock for, resulting in this problem.
Yes.
If this is something that can be done right away I am fine with dropping this patch for an alternative fix, although it may be nice to reduce the lock critical section in zswap_cpu_comp_dead() to the bare minimum anyway.
Well, unfortunately the whole Crypto API philosophy of having a single interface for software and for hardware offload doesn't really work. This is just yet another example of that; it's a problem caused by shoehorning software compression into an interface designed for hardware offload. zcomp really should just use the compression libs directly (like most users of compression in the kernel already do), and have an alternate code path specifically for hardware offload (using acomp) for the few people who really want that.
zcomp is for zram, zswap does not use it. If zswap is not going to use the crypto API we'll want something like zcomp or maybe reuse zcomp itself. That's a problem for another day :)
I'm actually thinking whether we should expose the zcomp API and use it for zswap. There are a couple of parameters for zstd I wanna play with, which zcomp/zram seems to already support, but not the crypto API (zstd level, dictionary, etc.).
Ah, agree! Actually I also think we should use the zcomp API in zswap, if its API meets our requirements.
But yes, a different problem for another day :)
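For what it's worth, the kernel's zstd library interface does take an explicit compression level, unlike the crypto API's bare "zstd" algorithm name. A rough sketch of direct use (zstd_compress_page() is an invented name and error handling is abbreviated; the <linux/zstd.h> calls are believed correct but unverified here):

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/zstd.h>

/*
 * Sketch: compress one page at an explicit zstd level.
 * Returns the compressed size, or 0 on failure.
 */
static size_t zstd_compress_page(void *dst, size_t dst_cap,
				 const void *src, int level)
{
	zstd_parameters params = zstd_get_params(level, PAGE_SIZE);
	size_t wksp_size = zstd_cctx_workspace_bound(&params.cParams);
	void *wksp = vmalloc(wksp_size);
	zstd_cctx *cctx;
	size_t ret = 0;

	if (!wksp)
		return 0;

	cctx = zstd_init_cctx(wksp, wksp_size);
	if (cctx)
		ret = zstd_compress_cctx(cctx, dst, dst_cap, src,
					 PAGE_SIZE, &params);
	vfree(wksp);
	return zstd_is_error(ret) ? 0 : ret;
}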
On 2025/2/27 02:56, Yosry Ahmed wrote:
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()).
On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex.
The above dependencies can cause an ABBA deadlock. For example in the following scenario:
(1) Task A running on CPU #1:
    crypto_alloc_acomp_node()
      Holds scomp_lock
      Enters reclaim
      Reads per_cpu_ptr(pool->acomp_ctx, 1)

(2) Task A is descheduled

(3) CPU #1 goes offline
    zswap_cpu_comp_dead(CPU #1)
      Holds per_cpu_ptr(pool->acomp_ctx, 1)
      Calls crypto_free_acomp()
      Waits for scomp_lock

(4) Task A running on CPU #2:
    Waits for per_cpu_ptr(pool->acomp_ctx, 1) // Read on CPU #1
    DEADLOCK
Since there is no requirement to call crypto_free_acomp() with the per-CPU acomp_ctx mutex held in zswap_cpu_comp_dead(), move it after the mutex is unlocked. Also move the acomp_request_free() and kfree() calls for consistency and to avoid any potential subtle locking dependencies in the future.
With this, only setting acomp_ctx fields to NULL occurs with the mutex held. This is similar to how zswap_cpu_comp_prepare() only initializes acomp_ctx fields with the mutex held, having performed all allocations before acquiring it.
Opportunistically, move the NULL check on acomp_ctx so that it takes place before the mutex dereference.
Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug")
Reported-by: syzbot+1a517ccfcbc6a7ab0f82@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67bcea51.050a0220.bbfd1.0096.GAE@google.com/
Cc: stable@vger.kernel.org
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Looks good to me:
Reviewed-by: Chengming Zhou chengming.zhou@linux.dev
Thanks!
v1 -> v2:
- Explained the problem more clearly in the commit message.
- Moved all freeing calls outside the lock critical section.
v1: https://lore.kernel.org/all/Z72FJnbA39zWh4zS@gondor.apana.org.au/
 mm/zswap.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index ac9d299e7d0c1..adf745c66aa1d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -881,18 +881,32 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	struct acomp_req *req;
+	struct crypto_acomp *acomp;
+	u8 *buffer;
+
+	if (IS_ERR_OR_NULL(acomp_ctx))
+		return 0;
 
 	mutex_lock(&acomp_ctx->mutex);
-	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
-		acomp_ctx->req = NULL;
-		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
-			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
-	}
+	req = acomp_ctx->req;
+	acomp = acomp_ctx->acomp;
+	buffer = acomp_ctx->buffer;
+	acomp_ctx->req = NULL;
+	acomp_ctx->acomp = NULL;
+	acomp_ctx->buffer = NULL;
 	mutex_unlock(&acomp_ctx->mutex);
 
+	/*
+	 * Do the actual freeing after releasing the mutex to avoid subtle
+	 * locking dependencies causing deadlocks.
+	 */
+	if (!IS_ERR_OR_NULL(req))
+		acomp_request_free(req);
+	if (!IS_ERR_OR_NULL(acomp))
+		crypto_free_acomp(acomp);
+	kfree(buffer);
+
 	return 0;
 }