Boris managed to create a device capable of changing its maj:min without altering its device path.
Only multi-devices can be scanned. A device that gets scanned and remains in the Btrfs kernel cache might end up with an incorrect maj:min.
Despite the tempfsid feature patch did not introduce this bug, it could lead to issues if the above multi-device is converted to a single device with a stale maj:min. Subsequently, attempting to mount the same device with the correct maj:min might mistake it for another device with the same fsid, potentially resulting in wrongly auto-enabling the tempfsid feature.
To address this, this patch validates the device's maj:min at the time of device open and updates it if it has changed since the last scan.
CC: stable@vger.kernel.org # 6.7+ Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability") Reported-by: Boris Burkov boris@bur.io Co-developed-by: Boris Burkov boris@bur.io Signed-off-by: Anand Jain anand.jain@oracle.com --- v2: Drop using lookup_bdev() instead, get it from device->bdev->bd_dev.
v1: https://lore.kernel.org/linux-btrfs/752b8526be21d984e0ee58c7f66d312664ff5ac5...
fs/btrfs/volumes.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e49935a54da0..c318640b4472 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -692,6 +692,16 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, device->bdev = bdev_handle->bdev; clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
+ if (device->devt != device->bdev->bd_dev) { + btrfs_warn(NULL, + "device %s maj:min changed from %d:%d to %d:%d", + device->name->str, MAJOR(device->devt), + MINOR(device->devt), MAJOR(device->bdev->bd_dev), + MINOR(device->bdev->bd_dev)); + + device->devt = device->bdev->bd_dev; + } + fs_devices->open_devices++; if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) && device->devid != BTRFS_DEV_REPLACE_DEVID) {
On Fri, Mar 08, 2024 at 08:15:07AM +0530, Anand Jain wrote:
@@ -692,6 +692,16 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, device->bdev = bdev_handle->bdev; clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
- if (device->devt != device->bdev->bd_dev) {
btrfs_warn(NULL,
"device %s maj:min changed from %d:%d to %d:%d",
device->name->str, MAJOR(device->devt),
MINOR(device->devt), MAJOR(device->bdev->bd_dev),
MINOR(device->bdev->bd_dev));
device->devt = device->bdev->bd_dev;
- }
Just above this calls btrfs_get_bdev_and_sb, which calls bdev_open_by_path. bdev_open_by_path bdev_open_by_path calls lookup_bdev to translate the path to a dev_t and then calls bdev_open_by_dev on the dev_t, which stored the passes in dev_t in bdev->bd_dev. I see absolutely no way how this check could ever trigger.
On 3/8/24 21:18, Christoph Hellwig wrote:
On Fri, Mar 08, 2024 at 08:15:07AM +0530, Anand Jain wrote:
@@ -692,6 +692,16 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, device->bdev = bdev_handle->bdev; clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
- if (device->devt != device->bdev->bd_dev) {
btrfs_warn(NULL,
"device %s maj:min changed from %d:%d to %d:%d",
device->name->str, MAJOR(device->devt),
MINOR(device->devt), MAJOR(device->bdev->bd_dev),
MINOR(device->bdev->bd_dev));
device->devt = device->bdev->bd_dev;
- }
Just above this calls btrfs_get_bdev_and_sb, which calls bdev_open_by_path. bdev_open_by_path bdev_open_by_path calls lookup_bdev to translate the path to a dev_t and then calls bdev_open_by_dev on the dev_t, which stored the passes in dev_t in bdev->bd_dev. I see absolutely no way how this check could ever trigger.
Prior to this patch, the device->devt value of the device could become stale, as it might not have been updated since the last scan of the device. During this interval, the device could have undergone changes to its devt.
Thanks, Anand
On Fri, Mar 08, 2024 at 09:34:03PM +0530, Anand Jain wrote:
bdev_open_by_path. bdev_open_by_path bdev_open_by_path calls lookup_bdev to translate the path to a dev_t and then calls bdev_open_by_dev on the dev_t, which stored the passes in dev_t in bdev->bd_dev. I see absolutely no way how this check could ever trigger.
Prior to this patch, the device->devt value of the device could become stale, as it might not have been updated since the last scan of the device. During this interval, the device could have undergone changes to its devt.
How can it become stale here? btrfs_open_one_device exits early if device->bdev is set, so you set up a new device->bdev and stash the just opened bdev there. The dev_t of an existing struct block_device never changes, so it must match the one in the btrfs_device that was just initialized from it.
On 3/8/24 21:40, Christoph Hellwig wrote:
On Fri, Mar 08, 2024 at 09:34:03PM +0530, Anand Jain wrote:
bdev_open_by_path. bdev_open_by_path bdev_open_by_path calls lookup_bdev to translate the path to a dev_t and then calls bdev_open_by_dev on the dev_t, which stored the passes in dev_t in bdev->bd_dev. I see absolutely no way how this check could ever trigger.
Prior to this patch, the device->devt value of the device could become stale, as it might not have been updated since the last scan of the device. During this interval, the device could have undergone changes to its devt.
How can it become stale here? btrfs_open_one_device exits early if device->bdev is set, so you set up a new device->bdev and stash the just opened bdev there. The dev_t of an existing struct block_device never changes, so it must match the one in the btrfs_device that was just initialized from it.
It's a bit complex, as Boris discovered and has provided a testcase for here:
https://lore.kernel.org/fstests/f40e347d5a4b4b28201b1a088d38a3c75dd10ebd.170...
In essence:
- Create two devices, d1 and d2. - Both devices will be scanned into the kernel by Mfks. - Use an external method to alter the devt of the d2 device. - Mount using d1. - You end up with a 2 devices Btrfs with an incorrect device->devt. - Delete d1. - Now you have a single-device Btrfs on d2 with a stale device->devt.
Thanks, Anand
On Fri, Mar 08, 2024 at 09:53:07PM +0530, Anand Jain wrote:
It's a bit complex, as Boris discovered and has provided a testcase for here:
https://lore.kernel.org/fstests/f40e347d5a4b4b28201b1a088d38a3c75dd10ebd.170...
In essence:
- Create two devices, d1 and d2.
- Both devices will be scanned into the kernel by Mfks.
- Use an external method to alter the devt of the d2 device.
- Mount using d1.
- You end up with a 2 devices Btrfs with an incorrect device->devt.
- Delete d1.
- Now you have a single-device Btrfs on d2 with a stale device->devt.
But how do you get mismatching devices in this exact place?
- bdev->bd_dev is immutable and never updated - device->devt can be changed by device_list_add, but if that happens underneath us here between btrfs_get_bdev_and_sb and the code a few lines below the call to it in btrfs_open_one_device there is a huge synchronization problem
On Fri, Mar 08, 2024 at 09:23:52AM -0800, Christoph Hellwig wrote:
On Fri, Mar 08, 2024 at 09:53:07PM +0530, Anand Jain wrote:
It's a bit complex, as Boris discovered and has provided a testcase for here:
https://lore.kernel.org/fstests/f40e347d5a4b4b28201b1a088d38a3c75dd10ebd.170...
In essence:
- Create two devices, d1 and d2.
- Both devices will be scanned into the kernel by Mfks.
- Use an external method to alter the devt of the d2 device.
- Mount using d1.
- You end up with a 2 devices Btrfs with an incorrect device->devt.
- Delete d1.
- Now you have a single-device Btrfs on d2 with a stale device->devt.
But how do you get mismatching devices in this exact place?
- bdev->bd_dev is immutable and never updated
- device->devt can be changed by device_list_add, but if that happens underneath us here between btrfs_get_bdev_and_sb and the code a few lines below the call to it in btrfs_open_one_device there is a huge synchronization problem
You remove/add the device in a way that results in a new bd_dev while the filesystem is unmounted but btrfs is still caching the struct btrfs_device. When we unmount a multi-device fs, we don't clear the device cache, since we need it to remount with just one device name later.
The mechanism I used for getting a different bd_dev was partitioning two different devices in two different orders.
I sent the repro script as an fstest last week: https://lore.kernel.org/linux-btrfs/f40e347d5a4b4b28201b1a088d38a3c75dd10ebd...
Boris
On Fri, Mar 08, 2024 at 09:32:54AM -0800, Boris Burkov wrote:
You remove/add the device in a way that results in a new bd_dev while the filesystem is unmounted but btrfs is still caching the struct btrfs_device. When we unmount a multi-device fs, we don't clear the device cache, since we need it to remount with just one device name later.
The mechanism I used for getting a different bd_dev was partitioning two different devices in two different orders.
Ok, so we have a btrfs_device without a bdev around, which seems a bit dangerous. Also relying on the dev_t for any kind of device identify seems very dangerous. Aren't there per-device UUIDs or similar identifiers that are actually reliabe and can be used instead of the dev_t?
On Fri, Mar 08, 2024 at 09:42:56AM -0800, Christoph Hellwig wrote:
On Fri, Mar 08, 2024 at 09:32:54AM -0800, Boris Burkov wrote:
You remove/add the device in a way that results in a new bd_dev while the filesystem is unmounted but btrfs is still caching the struct btrfs_device. When we unmount a multi-device fs, we don't clear the device cache, since we need it to remount with just one device name later.
The mechanism I used for getting a different bd_dev was partitioning two different devices in two different orders.
Ok, so we have a btrfs_device without a bdev around, which seems a bit dangerous. Also relying on the dev_t for any kind of device identify seems very dangerous. Aren't there per-device UUIDs or similar identifiers that are actually reliabe and can be used instead of the dev_t?
I was led to believe this wasn't possible while still actually implementing temp_fsid. But now that I think of it again, I am less sure. You could imagine them having identical images except a device uuid and the code being smart enough to handle that.
Maybe Anand can explain why that wouldn't work :)
On 3/8/24 23:21, Boris Burkov wrote:
On Fri, Mar 08, 2024 at 09:42:56AM -0800, Christoph Hellwig wrote:
On Fri, Mar 08, 2024 at 09:32:54AM -0800, Boris Burkov wrote:
You remove/add the device in a way that results in a new bd_dev while the filesystem is unmounted but btrfs is still caching the struct btrfs_device. When we unmount a multi-device fs, we don't clear the device cache, since we need it to remount with just one device name later.
The mechanism I used for getting a different bd_dev was partitioning two different devices in two different orders.
Ok, so we have a btrfs_device without a bdev around, which seems a bit dangerous.
The 'btrfs_device' without a 'bdev' is just a hint for assembly in the kernel, with validation occurring before 'bdev' is saved. Hence, it's not clear how this could be dangerous.
Also relying on the dev_t for any kind of device identify seems very dangerous.
dev_t is used as the device identity for the tempfsid feature. In the case of two cloned single devices, the fsid + UUID + devid will match, but they will differ by their dev_t. Therefore, the tempfsid feature will mount them separately, assigning a temporary fsid (in memory only) to one of the latter mounting device.
However, in the mount thread, if the dev_t also matches, it is a subvol mount.
The actual use case for tempfsid is from the Steam Deck, where two identical images created by disk dump are simultaneously mounted on the host for validation. Ext4 supports cloned device mounting.
Aren't there per-device UUIDs or similar identifiers that are actually reliabe and can be used instead of the dev_t?
In this use case, when cloning the entire disk, the per-device UUID also gets copied/duplicated. Using special clone tools to update the device UUID will result in non-identical images, making them unsuitable for the use case.
Thanks, Anand
I was led to believe this wasn't possible while still actually implementing temp_fsid. But now that I think of it again, I am less sure. You could imagine them having identical images except a device uuid and the code being smart enough to handle that.
Maybe Anand can explain why that wouldn't work :)
On Fri, Mar 08, 2024 at 08:15:07AM +0530, Anand Jain wrote:
Boris managed to create a device capable of changing its maj:min without altering its device path.
Only multi-devices can be scanned. A device that gets scanned and remains in the Btrfs kernel cache might end up with an incorrect maj:min.
Despite the tempfsid feature patch did not introduce this bug, it could lead to issues if the above multi-device is converted to a single device with a stale maj:min. Subsequently, attempting to mount the same device with the correct maj:min might mistake it for another device with the same fsid, potentially resulting in wrongly auto-enabling the tempfsid feature.
To address this, this patch validates the device's maj:min at the time of device open and updates it if it has changed since the last scan.
You and Dave have convinced me that it is important to fix this in the kernel. I still have a hope of simplifying this further, while we are here and have the code kicking around in our heads.
CC: stable@vger.kernel.org # 6.7+ Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability") Reported-by: Boris Burkov boris@bur.io Co-developed-by: Boris Burkov boris@bur.io Signed-off-by: Anand Jain anand.jain@oracle.com
v2: Drop using lookup_bdev() instead, get it from device->bdev->bd_dev.
v1: https://lore.kernel.org/linux-btrfs/752b8526be21d984e0ee58c7f66d312664ff5ac5...
fs/btrfs/volumes.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e49935a54da0..c318640b4472 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -692,6 +692,16 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, device->bdev = bdev_handle->bdev; clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
- if (device->devt != device->bdev->bd_dev) {
btrfs_warn(NULL,
"device %s maj:min changed from %d:%d to %d:%d",
device->name->str, MAJOR(device->devt),
MINOR(device->devt), MAJOR(device->bdev->bd_dev),
MINOR(device->bdev->bd_dev));
device->devt = device->bdev->bd_dev;
- }
If we are permanently maintaining an invariant that device->devt == device->bdev->bd_dev, do we even need device->devt? As far as I can tell, all the logic that uses device->devt assumes that the device is not stale, both in the temp_fsid found_by_devt lookup and in the "device changed name" check. If so, we could just always use device->bdev->bd_dev and eliminate this confusion/source of bugs entirely.
fs_devices->open_devices++; if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) && device->devid != BTRFS_DEV_REPLACE_DEVID) { -- 2.38.1
On Fri, Mar 08, 2024 at 09:41:38AM -0800, Boris Burkov wrote:
On Fri, Mar 08, 2024 at 08:15:07AM +0530, Anand Jain wrote:
Boris managed to create a device capable of changing its maj:min without altering its device path.
Only multi-devices can be scanned. A device that gets scanned and remains in the Btrfs kernel cache might end up with an incorrect maj:min.
Despite the tempfsid feature patch did not introduce this bug, it could lead to issues if the above multi-device is converted to a single device with a stale maj:min. Subsequently, attempting to mount the same device with the correct maj:min might mistake it for another device with the same fsid, potentially resulting in wrongly auto-enabling the tempfsid feature.
To address this, this patch validates the device's maj:min at the time of device open and updates it if it has changed since the last scan.
You and Dave have convinced me that it is important to fix this in the kernel. I still have a hope of simplifying this further, while we are here and have the code kicking around in our heads.
I don't want to get stuck on this forever, so feel free to add Reviewed-by: Boris Burkov boris@bur.io
However, I would still love to get rid of device->devt if possible. It seems like it might be needed for that other grub bug you fixed. Though perhaps not, since we do skip stale devices in much of the logic.
Anyway, let's move forward with this! Thanks for hacking on it with me.
CC: stable@vger.kernel.org # 6.7+ Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability") Reported-by: Boris Burkov boris@bur.io Co-developed-by: Boris Burkov boris@bur.io Signed-off-by: Anand Jain anand.jain@oracle.com
v2: Drop using lookup_bdev() instead, get it from device->bdev->bd_dev.
v1: https://lore.kernel.org/linux-btrfs/752b8526be21d984e0ee58c7f66d312664ff5ac5...
fs/btrfs/volumes.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e49935a54da0..c318640b4472 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -692,6 +692,16 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, device->bdev = bdev_handle->bdev; clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
- if (device->devt != device->bdev->bd_dev) {
btrfs_warn(NULL,
"device %s maj:min changed from %d:%d to %d:%d",
device->name->str, MAJOR(device->devt),
MINOR(device->devt), MAJOR(device->bdev->bd_dev),
MINOR(device->bdev->bd_dev));
device->devt = device->bdev->bd_dev;
- }
If we are permanently maintaining an invariant that device->devt == device->bdev->bd_dev, do we even need device->devt? As far as I can tell, all the logic that uses device->devt assumes that the device is not stale, both in the temp_fsid found_by_devt lookup and in the "device changed name" check. If so, we could just always use device->bdev->bd_dev and eliminate this confusion/source of bugs entirely.
fs_devices->open_devices++; if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) && device->devid != BTRFS_DEV_REPLACE_DEVID) { -- 2.38.1
On 3/13/24 00:47, Boris Burkov wrote:
On Fri, Mar 08, 2024 at 09:41:38AM -0800, Boris Burkov wrote:
On Fri, Mar 08, 2024 at 08:15:07AM +0530, Anand Jain wrote:
Boris managed to create a device capable of changing its maj:min without altering its device path.
Only multi-devices can be scanned. A device that gets scanned and remains in the Btrfs kernel cache might end up with an incorrect maj:min.
Despite the tempfsid feature patch did not introduce this bug, it could lead to issues if the above multi-device is converted to a single device with a stale maj:min. Subsequently, attempting to mount the same device with the correct maj:min might mistake it for another device with the same fsid, potentially resulting in wrongly auto-enabling the tempfsid feature.
To address this, this patch validates the device's maj:min at the time of device open and updates it if it has changed since the last scan.
You and Dave have convinced me that it is important to fix this in the kernel. I still have a hope of simplifying this further, while we are here and have the code kicking around in our heads.
I don't want to get stuck on this forever, so feel free to add Reviewed-by: Boris Burkov boris@bur.io
Thanks. ...
However, I would still love to get rid of device->devt if possible. It seems like it might be needed for that other grub bug you fixed. Though perhaps not, since we do skip stale devices in much of the logic.
Removing btrfs_device::devt and then performing lookup_bdev() when needed or access it as bdev->bd_dev is possible. I wrote a patch for it but didn't send it. It contains too many lines changed, which is not a good idea for backporting. We have other critical issues such as the device disappearing and reappearing with a newer devt, while the device is still mounted. In this case, both devts will still be valid as per lookup_bdev().
Anyway, let's move forward with this! Thanks for hacking on it with me.
Yep, this fix is fine. It resolves the reported problem.
CC: stable@vger.kernel.org # 6.7+ Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability") Reported-by: Boris Burkov boris@bur.io Co-developed-by: Boris Burkov boris@bur.io Signed-off-by: Anand Jain anand.jain@oracle.com
v2: Drop using lookup_bdev() instead, get it from device->bdev->bd_dev.
v1: https://lore.kernel.org/linux-btrfs/752b8526be21d984e0ee58c7f66d312664ff5ac5...
fs/btrfs/volumes.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e49935a54da0..c318640b4472 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -692,6 +692,16 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, device->bdev = bdev_handle->bdev; clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
- if (device->devt != device->bdev->bd_dev) {
btrfs_warn(NULL,
"device %s maj:min changed from %d:%d to %d:%d",
device->name->str, MAJOR(device->devt),
MINOR(device->devt), MAJOR(device->bdev->bd_dev),
MINOR(device->bdev->bd_dev));
device->devt = device->bdev->bd_dev;
- }
If we are permanently maintaining an invariant that device->devt == device->bdev->bd_dev, do we even need device->devt? As far as I can tell, all the logic that uses device->devt assumes that the device is not stale, both in the temp_fsid found_by_devt lookup and in the "device changed name" check. If so, we could just always use device->bdev->bd_dev and eliminate this confusion/source of bugs entirely.
Yeah, it's possible to do cleanup, but there's something more urgent and critical to fix.
Thanks, Anand
fs_devices->open_devices++; if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) && device->devid != BTRFS_DEV_REPLACE_DEVID) { -- 2.38.1
linux-stable-mirror@lists.linaro.org