sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
sg_device_destroy() is accessing the device queue, which will be set to NULL if scsi_device_put() removes the last reference to the sg device.
Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
---
This is my best shot at a real fix for the issue. I confirmed with printk()s that I get the NULL pointer freeze only when scsi_device_put() is deleting the last reference to the device. In the cases where it's not crashing there is still a reference left after the call.
I don't see any obvious downside to simply swapping the calls. The alternative would be my first patch, just without the WARN_ON.
Alexander
---
 drivers/scsi/sg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 86210e4dd0d3..80e0d1981191 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -2232,8 +2232,8 @@ sg_remove_sfp_usercontext(struct work_struct *work)
 			"sg_remove_sfp: sfp=0x%p\n", sfp));
 	kfree(sfp);

-	scsi_device_put(sdp->device);
 	kref_put(&sdp->d_ref, sg_device_destroy);
+	scsi_device_put(sdp->device);
 	module_put(THIS_MODULE);
 }
sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
The resulting NULL pointer exception will then crash the kernel.
Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
---
Changes compared to V1: Reworked the commit message
Alexander
---
 drivers/scsi/sg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 86210e4dd0d3..80e0d1981191 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -2232,8 +2232,8 @@ sg_remove_sfp_usercontext(struct work_struct *work)
 			"sg_remove_sfp: sfp=0x%p\n", sfp));
 	kfree(sfp);

-	scsi_device_put(sdp->device);
 	kref_put(&sdp->d_ref, sg_device_destroy);
+	scsi_device_put(sdp->device);
 	module_put(THIS_MODULE);
 }
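The ordering problem described in the commit message can be modelled in userspace. The following is only a sketch under stated assumptions: every `model_*` type and function is a hypothetical stand-in invented for illustration, not the kernel's actual structures or API. It models one behaviour taken from the commit message, namely that dropping the last reference to the parent device clears its queue pointer, and records whether the destroy step observed a NULL queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct model_queue { int unused; };

struct model_scsi_device {
	int refcount;                 /* models the parent device reference count */
	struct model_queue *queue;    /* models the request_queue pointer */
};

struct model_sg_device {
	int d_ref;                        /* models sdp->d_ref */
	struct model_scsi_device *device; /* models sdp->device */
	bool destroy_saw_null_queue;      /* what the destroy step observed */
};

/* Models scsi_device_put(): dropping the last reference clears the queue. */
static void model_scsi_device_put(struct model_scsi_device *sdev)
{
	assert(sdev->refcount > 0);
	if (--sdev->refcount == 0)
		sdev->queue = NULL;
}

/* Models sg_device_destroy(): it expects the parent's queue to still exist. */
static void model_sg_device_destroy(struct model_sg_device *sdp)
{
	sdp->destroy_saw_null_queue = (sdp->device->queue == NULL);
}

/* Models kref_put(&sdp->d_ref, sg_device_destroy). */
static void model_sg_put(struct model_sg_device *sdp)
{
	assert(sdp->d_ref > 0);
	if (--sdp->d_ref == 0)
		model_sg_device_destroy(sdp);
}

/* Returns true if the destroy step ran against a NULL queue. */
static bool teardown_hits_null_queue(bool put_scsi_first)
{
	struct model_queue q = { 0 };
	struct model_scsi_device sdev = { .refcount = 1, .queue = &q };
	struct model_sg_device sdp = { .d_ref = 1, .device = &sdev };

	if (put_scsi_first) {
		model_scsi_device_put(&sdev); /* order before the patch */
		model_sg_put(&sdp);
	} else {
		model_sg_put(&sdp);           /* order after the patch */
		model_scsi_device_put(&sdev);
	}
	return sdp.destroy_saw_null_queue;
}
```

Calling `teardown_hits_null_queue(true)` exercises the pre-patch order, in which the destroy step sees a NULL queue; `teardown_hits_null_queue(false)` exercises the swapped order, in which it does not. The model deliberately holds only one reference on each object, which corresponds to the crashing case.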
On Wed, Mar 20, 2024 at 12:08:09PM +0100, Alexander Wetzel wrote:
> sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
> sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
> The resulting NULL pointer exception will then crash the kernel.
> Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
> Cc: stable@vger.kernel.org
> Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
> ---
> Changes compared to V1: Reworked the commit message
What commit id does this fix?
thanks,
greg k-h
On 20.03.24 12:16, Greg KH wrote:
> On Wed, Mar 20, 2024 at 12:08:09PM +0100, Alexander Wetzel wrote:
> > sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
> > sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
> > The resulting NULL pointer exception will then crash the kernel.
> > Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
> > ---
> > Changes compared to V1: Reworked the commit message
> What commit id does this fix?
It's a combination of patches. I think db59133e9279 ("scsi: sg: fix blktrace debugfs entries leakage") was the one which finally broke it.
The sequence that in hindsight was wrong was introduced via c6517b7942fa ("[SCSI] sg: fix races during device removal") and cc833acbee9d ("sg: O_EXCL and other lock handling").
Alexander
On 3/20/24 04:08, Alexander Wetzel wrote:
> sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
> sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
> The resulting NULL pointer exception will then crash the kernel.
> Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
> Cc: stable@vger.kernel.org
> Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
> ---
> Changes compared to V1: Reworked the commit message
> Alexander
> ---
>  drivers/scsi/sg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index 86210e4dd0d3..80e0d1981191 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -2232,8 +2232,8 @@ sg_remove_sfp_usercontext(struct work_struct *work)
>  			"sg_remove_sfp: sfp=0x%p\n", sfp));
>  	kfree(sfp);
>
> -	scsi_device_put(sdp->device);
>  	kref_put(&sdp->d_ref, sg_device_destroy);
> +	scsi_device_put(sdp->device);
>  	module_put(THIS_MODULE);
>  }
Is it guaranteed that the above kref_put() call is the last kref_put() call on sdp->d_ref? If not, how about inserting code between the kref_put() call and the scsi_device_put() call that waits until sg_device_destroy() has finished?
Thanks,
Bart.
On 20.03.24 16:02, Bart Van Assche wrote:
> On 3/20/24 04:08, Alexander Wetzel wrote:
> > sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
> > sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
> > The resulting NULL pointer exception will then crash the kernel.
> > Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
> > ---
> > Changes compared to V1: Reworked the commit message
> > Alexander
> > ---
> >  drivers/scsi/sg.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> > index 86210e4dd0d3..80e0d1981191 100644
> > --- a/drivers/scsi/sg.c
> > +++ b/drivers/scsi/sg.c
> > @@ -2232,8 +2232,8 @@ sg_remove_sfp_usercontext(struct work_struct *work)
> >  			"sg_remove_sfp: sfp=0x%p\n", sfp));
> >  	kfree(sfp);
> >
> > -	scsi_device_put(sdp->device);
> >  	kref_put(&sdp->d_ref, sg_device_destroy);
> > +	scsi_device_put(sdp->device);
> >  	module_put(THIS_MODULE);
> >  }
> Is it guaranteed that the above kref_put() call is the last kref_put() call on sdp->d_ref? If not, how about inserting code between the kref_put() call and the scsi_device_put() call that waits until sg_device_destroy() has finished?
While I'm not familiar with the code, I'm pretty sure kref_put() is removing the last reference to d_ref here. Anything else would be odd, based on my - really sketchy - understanding of the flows.
Also waiting for another process looks wrong. I guess we would then have to delay the call to sg_release().
And at least for me it's always the last d_ref reference. I changed the section to:

	kref_put(&sdp->d_ref, sg_device_destroy);
	printk("XXXX scsi=%u, dref=%u\n", \
	       kref_read(&sdp->device->sdev_gendev.kobj.kref), \
	       kref_read(&sdp->d_ref));
	scsi_device_put(sdp->device);

And connected/disconnected my test USB device a few times:

XXXX scsi=2, dref=0
XXXX scsi=1, dref=0
XXXX scsi=2, dref=0
XXXX scsi=1, dref=0
XXXX scsi=1, dref=0
XXXX scsi=1, dref=0
XXXX scsi=1, dref=0
XXXX scsi=1, dref=0
XXXX scsi=1, dref=0
XXXX scsi=1, dref=0

(scsi=1 are the cases which would cause the NULL pointer exceptions with the unpatched driver.)
Alexander
On 3/20/24 09:58, Alexander Wetzel wrote:
> While I'm not familiar with the code, I'm pretty sure kref_put() is removing the last reference to d_ref here. Anything else would be odd, based on my - really sketchy - understanding of the flows.
Please document this by adding a WARN_ON_ONCE() statement before the kref_put() call that checks that the refcount equals one.
Thanks,
Bart.
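The kind of check requested here can be sketched in userspace. This is a minimal illustration, assuming a hypothetical `model_warn_on_once()` helper that only imitates the warn-once idea; it is not the kernel's WARN_ON_ONCE() macro (which tracks its "once" state per call site, whereas this sketch uses a single global flag), and `model_final_put()` is likewise an invented name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a warn-once facility. One global flag is used
 * here for brevity; this is an assumption of the sketch. */
static bool warned_once;
static unsigned int warn_reports;

/* Reports a violated expectation the first time only; returns the condition. */
static bool model_warn_on_once(bool cond, const char *what)
{
	if (cond && !warned_once) {
		warned_once = true;
		warn_reports++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Drops one reference and returns true if it was the final one. The check
 * documents the assumption that the caller holds the last reference; it
 * does not change the behaviour of the put itself. */
static bool model_final_put(unsigned int *refcount)
{
	assert(*refcount > 0);
	model_warn_on_once(*refcount != 1, "refcount != 1 before final put");
	return --(*refcount) == 0;
}
```

With a count of one, `model_final_put()` returns true silently. With a larger count it returns false and the warning is reported once, which makes a broken assumption visible without stopping execution.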
On 3/20/24 04:08, Alexander Wetzel wrote:
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index 86210e4dd0d3..80e0d1981191 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -2232,8 +2232,8 @@ sg_remove_sfp_usercontext(struct work_struct *work)
>  			"sg_remove_sfp: sfp=0x%p\n", sfp));
>  	kfree(sfp);
>
> -	scsi_device_put(sdp->device);
>  	kref_put(&sdp->d_ref, sg_device_destroy);
> +	scsi_device_put(sdp->device);
>  	module_put(THIS_MODULE);
>  }
Since sg_device_destroy() frees struct sg_device and since the scsi_device_put() call reads from struct sg_device, does this patch introduce a use-after-free? Has it been tested with KASAN enabled?
Thanks,
Bart.
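The use-after-free concern raised here can be illustrated with a small userspace sketch. All `model_*` names are hypothetical stand-ins, not kernel API: the child object is freed when its last reference is dropped, so any pointer stored inside it has to be read before that put and used from a local copy afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical parent object; in this sketch it is never freed. */
struct model_parent { int refcount; };

/* Hypothetical child object holding a pointer to its parent. */
struct model_child {
	int refcount;
	struct model_parent *parent;
};

/* Drops one child reference and frees the child if it was the last one. */
static void model_child_put(struct model_child *child)
{
	assert(child->refcount > 0);
	if (--child->refcount == 0)
		free(child);
}

/* Drops one parent reference. */
static void model_parent_put(struct model_parent *parent)
{
	assert(parent->refcount > 0);
	parent->refcount--;
}

/* Releases the child first and the parent second without touching freed
 * memory. Returns the parent's remaining reference count. */
static int model_release(struct model_child *child)
{
	/* Read the pointer while the child is still guaranteed to be valid. */
	struct model_parent *parent = child->parent;

	model_child_put(child);   /* may free child; child->parent is off limits */
	model_parent_put(parent); /* uses the cached pointer instead */
	return parent->refcount;
}
```

Writing `model_parent_put(child->parent)` after `model_child_put(child)` would read from freed memory whenever the child's last reference was just dropped, which is the pattern a tool such as KASAN is designed to flag.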
sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
The resulting NULL pointer exception will then crash the kernel.
Link: https://lore.kernel.org/r/20240305150509.23896-1-Alexander@wetzel-home.de
Fixes: db59133e9279 ("scsi: sg: fix blktrace debugfs entries leakage")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Wetzel <Alexander@wetzel-home.de>
---
Changes compared to V2:
- Fixed the use-after-free pointed out by Bart
- Added the WARN_ON_ONCE() requested by Bart
- Added the Fixes tag pointed out by Greg
This patch has now been tested with KASAN enabled. I also verified that db59133e9279 ("scsi: sg: fix blktrace debugfs entries leakage") introduced the issue.
Thanks for all your help!
Alexander
---
 drivers/scsi/sg.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 86210e4dd0d3..ff6894ce5404 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -2207,6 +2207,7 @@ sg_remove_sfp_usercontext(struct work_struct *work)
 {
 	struct sg_fd *sfp = container_of(work, struct sg_fd, ew.work);
 	struct sg_device *sdp = sfp->parentdp;
+	struct scsi_device *device = sdp->device;
 	Sg_request *srp;
 	unsigned long iflags;

@@ -2232,8 +2233,9 @@ sg_remove_sfp_usercontext(struct work_struct *work)
 			"sg_remove_sfp: sfp=0x%p\n", sfp));
 	kfree(sfp);

-	scsi_device_put(sdp->device);
+	WARN_ON_ONCE(kref_read(&sdp->d_ref) != 1);
 	kref_put(&sdp->d_ref, sg_device_destroy);
+	scsi_device_put(device);
 	module_put(THIS_MODULE);
 }
On 3/20/24 14:30, Alexander Wetzel wrote:
> sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
> sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
> The resulting NULL pointer exception will then crash the kernel.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On Wed, 20 Mar 2024 22:30:32 +0100, Alexander Wetzel wrote:
> sg_remove_sfp_usercontext() must not use sg_device_destroy() after calling scsi_device_put().
> sg_device_destroy() is accessing the parent scsi device request_queue. Which will already be set to NULL when the preceding call to scsi_device_put() removed the last reference to the parent scsi device.
> [...]
Applied to 6.9/scsi-fixes, thanks!
[1/1] scsi: sg: Avoid sg device teardown race
      https://git.kernel.org/mkp/scsi/c/27f58c04a8f4