Currently nvme_tcp_try_send_data() does not use kernel_sendpage() to send slab pages. However, pages allocated by __get_free_pages() without __GFP_COMP, which also have a refcount of 0, are still sent to the remote end by kernel_sendpage(). This is problematic.
When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache device, writing metadata such as cache_set->disk_buckets to the remote SSD may trigger a kernel panic due to the above problem, because the metadata pages for cache_set->disk_buckets are allocated by __get_free_pages() without __GFP_COMP.
This problem should be fixed in both the upper-layer driver (bcache) and the nvme-over-tcp code. This patch fixes the nvme-over-tcp side by checking whether the page refcount is 0; if so, kernel_sendpage() is avoided and sock_no_sendpage() is called instead to send the page into the network stack.
The code comment in this patch is copied and modified from drbd, where a similar problem was already solved by Philipp Reisner. It is a better comment than my own version.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: stable@vger.kernel.org
---
Changelog:
v2: fix typo in patch subject.
v1: the initial version.

 drivers/nvme/host/tcp.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..faa71db7522a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		else
 			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
-		/* can't zcopy slab pages */
-		if (unlikely(PageSlab(page))) {
+		/*
+		 * e.g. XFS meta- & log-data is in slab pages, or bcache meta
+		 * data pages, or other high order pages allocated by
+		 * __get_free_pages() without __GFP_COMP, which have a page_count
+		 * of 0 and/or have PageSlab() set. We cannot use send_page for
+		 * those, as that does get_page(); put_page(); and would cause
+		 * either a VM_BUG directly, or __page_cache_release a page that
+		 * would actually still be referenced by someone, leading to some
+		 * obscure delayed Oops somewhere else.
+		 */
+		if (unlikely(PageSlab(page) || page_count(page) < 1)) {
 			ret = sock_no_sendpage(queue->sock, page, offset, len,
 					flags);
 		} else {
Hi
[This is an automated email]
This commit has been processed because it contains a -stable tag. The stable tag indicates that it's relevant for the following trees: all
The bot has tested the following trees: v5.7.8, v5.4.51, v4.19.132, v4.14.188, v4.9.230, v4.4.230.
v5.7.8: Build OK!
v5.4.51: Build OK!
v4.19.132: Failed to apply! Possible dependencies:
    37c15219599f7 ("nvme-tcp: don't use sendpage for SLAB pages")
    3f2304f8c6d6e ("nvme-tcp: add NVMe over TCP host driver")

v4.14.188: Failed to apply! Possible dependencies:
    37c15219599f7 ("nvme-tcp: don't use sendpage for SLAB pages")
    3f2304f8c6d6e ("nvme-tcp: add NVMe over TCP host driver")

v4.9.230: Failed to apply! Possible dependencies:
    37c15219599f7 ("nvme-tcp: don't use sendpage for SLAB pages")
    3f2304f8c6d6e ("nvme-tcp: add NVMe over TCP host driver")
    b1ad1475b447a ("nvme-fabrics: Add FC transport FC-NVME definitions")
    d6d20012e1169 ("nvme-fabrics: Add FC transport LLDD api definitions")
    e399441de9115 ("nvme-fabrics: Add host support for FC transport")

v4.4.230: Failed to apply! Possible dependencies:
    07bfcd09a2885 ("nvme-fabrics: add a generic NVMe over Fabrics library")
    1673f1f08c887 ("nvme: move block_device_operations and ns/ctrl freeing to common code")
    1c63dc66580d4 ("nvme: split a new struct nvme_ctrl out of struct nvme_dev")
    21d147880e489 ("nvme: fix Kconfig description for BLK_DEV_NVME_SCSI")
    21d34711e1b59 ("nvme: split command submission helpers out of pci.c")
    3f2304f8c6d6e ("nvme-tcp: add NVMe over TCP host driver")
    4160982e75944 ("nvme: split __nvme_submit_sync_cmd")
    4490733250b8b ("nvme: make SG_IO support optional")
    6f3b0e8bcf3cb ("blk-mq: add a flags parameter to blk_mq_alloc_request")
    7110230719602 ("nvme-rdma: add a NVMe over Fabrics RDMA host driver")
    a07b4970f464f ("nvmet: add a generic NVMe target")
    b1ad1475b447a ("nvme-fabrics: Add FC transport FC-NVME definitions")
    d6d20012e1169 ("nvme-fabrics: Add FC transport LLDD api definitions")
    e399441de9115 ("nvme-fabrics: Add host support for FC transport")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
On 7/13/20 5:44 AM, Coly Li wrote:
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..faa71db7522a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
 		else
 			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
-		/* can't zcopy slab pages */
-		if (unlikely(PageSlab(page))) {
+		/*
+		 * e.g. XFS meta- & log-data is in slab pages, or bcache meta
+		 * data pages, or other high order pages allocated by
+		 * __get_free_pages() without __GFP_COMP, which have a page_count
+		 * of 0 and/or have PageSlab() set. We cannot use send_page for
+		 * those, as that does get_page(); put_page(); and would cause
+		 * either a VM_BUG directly, or __page_cache_release a page that
+		 * would actually still be referenced by someone, leading to some
+		 * obscure delayed Oops somewhere else.
+		 */
+		if (unlikely(PageSlab(page) || page_count(page) < 1)) {
Can we unify these checks into a common sendpage_ok(page) helper?