On 2022/4/11 15:18, Christoph Hellwig wrote:
On Mon, Apr 11, 2022 at 02:12:50PM +0800, Qu Wenruo wrote:
[BUG] Test case generic/475 has a very high chance (almost 100%) of hitting a fs hang, where a data page will never be unlocked, hanging all later operations.
Question: how can we even get an error? The submission already stopped returning errors with patch 1. btrfs_get_chunk_map, called from btrfs_zoned_get_device and calc_bio_boundaries, really just has sanity checks that should be fatal if not met, and the same goes for btrfs_get_io_geometry.
Yep, btrfs_get_chunk_map() related call sites can only get an error from the sanity checks.
But the sanity checks still make sense, and the cases they catch cannot easily be rejected by our existing tree-checkers.
So we still need to do the error handling.
So yes, we could fix the nasty error handling here. Or just remove it entirely, which would reduce the possibility of bugs even more.
I don't see a super good way to remove the sanity check.
The get_chunk_map() one is here to catch a possible bad bio which wants to write into a non-existing chunk. To completely get rid of that, we need a bulletproof solution to make sure none of our bios will ever point to a non-existing logical bytenr.
Especially considering we have extra mechanisms that make chunk allocation/deletion more dynamic, which makes such a situation harder to catch at other locations.
That can be more challenging than the error handling.
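
To illustrate the kind of check being discussed, here is a minimal userspace sketch. It is not the actual btrfs_get_chunk_map() code: struct chunk_range and check_chunk_range() are hypothetical names, and the flat array stands in for the real chunk mapping lookup. It only shows why a bio pointing at a non-existing logical bytenr has to surface as an error that the caller must handle.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified chunk mapping covering [start, start + len). */
struct chunk_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Return 0 if [logical, logical + length) lies entirely inside one chunk,
 * -ENOENT if no chunk covers @logical, and -EINVAL if the range is empty
 * or runs past the end of the chunk that contains @logical.
 */
static int check_chunk_range(const struct chunk_range *chunks, size_t nr,
			     uint64_t logical, uint64_t length)
{
	/* A NULL array with entries is a caller bug, not bad on-disk data. */
	assert(chunks != NULL || nr == 0);

	if (length == 0)
		return -EINVAL;

	for (size_t i = 0; i < nr; i++) {
		const struct chunk_range *c = &chunks[i];

		/* Skip chunks that do not contain the start of the range. */
		if (logical < c->start || logical - c->start >= c->len)
			continue;

		/* The start is inside this chunk; the end must be as well. */
		if (length > c->len - (logical - c->start))
			return -EINVAL;
		return 0;
	}

	/* No chunk covers this logical bytenr. */
	return -ENOENT;
}
```

With chunks that can be allocated and removed dynamically, the set passed to such a check changes over time, so a caller cannot assume the lookup always succeeds.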
Anyway, we still need to handle the error from __get_extent_map(), so we just need to do the same here.
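
The error handling in question can be sketched roughly as below. This is a userspace illustration only, not the real btrfs read path: struct fake_page, lookup_mapping() and submit_read() are hypothetical names. The point is that when the lookup fails before any I/O is submitted, no completion handler will ever run, so the error path itself has to unlock the page, otherwise it stays locked forever and later operations hang.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct fake_page {
	bool locked;
	bool error;
};

/* Hypothetical lookup that can fail, standing in for a sanity-checked mapping lookup. */
static int lookup_mapping(bool should_fail)
{
	return should_fail ? -EIO : 0;
}

/*
 * Submit a read for a page.
 * On success the page stays locked; in this sketch the (not modelled)
 * I/O completion handler would unlock it later.
 * On failure no I/O is in flight, so this function must unlock the page.
 */
static int submit_read(struct fake_page *page, bool lookup_fails)
{
	int ret;

	assert(page != NULL);
	page->locked = true;

	ret = lookup_mapping(lookup_fails);
	if (ret < 0) {
		/* Nobody else will ever unlock this page, so do it here. */
		page->error = true;
		page->locked = false;
		return ret;
	}

	return 0;
}
```

Dropping the unlock in the error branch would leave the page locked after submit_read() returns, which is the shape of the hang described in the bug report.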
Thanks, Qu