The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages.
This patch introduces a helper sendpage_ok(), it returns true if the checking page, - is not slab page: PageSlab(page) is false. - has page refcount: page_count(page) is not zero
All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page.
Signed-off-by: Coly Li colyli@suse.de Cc: Chaitanya Kulkarni chaitanya.kulkarni@wdc.com Cc: Christoph Hellwig hch@lst.de Cc: Hannes Reinecke hare@suse.de Cc: Jan Kara jack@suse.com Cc: Jens Axboe axboe@kernel.dk Cc: Mikhail Skorzhinskii mskorzhinskiy@solarflare.com Cc: Philipp Reisner philipp.reisner@linbit.com Cc: Sagi Grimberg sagi@grimberg.me Cc: Vlastimil Babka vbabka@suse.com Cc: stable@vger.kernel.org --- Changelog: v5, include linux/mm.h in include/linux/net.h v4, change sendpage_ok() as an inline helper, and post it as separate patch. v3, introduce a more common sendpage_ok() v2, fix typo in patch subject v1, the initial version. include/linux/net.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/include/linux/net.h b/include/linux/net.h index d48ff1180879..a807fad31958 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -21,6 +21,7 @@ #include <linux/rcupdate.h> #include <linux/once.h> #include <linux/fs.h> +#include <linux/mm.h> #include <linux/sockptr.h>
#include <uapi/linux/net.h> @@ -286,6 +287,21 @@ do { \ #define net_get_random_once_wait(buf, nbytes) \ get_random_once_wait((buf), (nbytes))
+/* + * E.g. XFS meta- & log-data is in slab pages, or bcache meta + * data pages, or other high order pages allocated by + * __get_free_pages() without __GFP_COMP, which have a page_count + * of 0 and/or have PageSlab() set. We cannot use send_page for + * those, as that does get_page(); put_page(); and would cause + * either a VM_BUG directly, or __page_cache_release a page that + * would actually still be referenced by someone, leading to some + * obscure delayed Oops somewhere else. + */ +static inline bool sendpage_ok(struct page *page) +{ + return (!PageSlab(page) && page_count(page) >= 1); +} + int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec, size_t num, size_t len); int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to send slab pages. But for pages allocated by __get_free_pages() without __GFP_COMP, which also have refcount as 0, they are still sent by kernel_sendpage() to remote end, this is problematic.
The new introduced helper sendpage_ok() checks both PageSlab tag and page_count counter, and returns true if the checking page is OK to be sent by kernel_sendpage().
This patch fixes the page checking issue of nvme_tcp_try_send_data() with sendpage_ok(). If sendpage_ok() returns true, send this page by kernel_sendpage(), otherwise use sock_no_sendpage to handle this page.
Signed-off-by: Coly Li colyli@suse.de Cc: Chaitanya Kulkarni chaitanya.kulkarni@wdc.com Cc: Christoph Hellwig hch@lst.de Cc: Hannes Reinecke hare@suse.de Cc: Jan Kara jack@suse.com Cc: Jens Axboe axboe@kernel.dk Cc: Mikhail Skorzhinskii mskorzhinskiy@solarflare.com Cc: Philipp Reisner philipp.reisner@linbit.com Cc: Sagi Grimberg sagi@grimberg.me Cc: Vlastimil Babka vbabka@suse.com Cc: stable@vger.kernel.org --- Changelog: v4, change sendpage_ok() as an inline helper, and post it as separate patch. v3, introduce a more common sendpage_ok() v2, fix typo in patch subject v1, the initial version. drivers/nvme/host/tcp.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 62fbaecdc960..902fe742762b 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -912,12 +912,11 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) else flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
- /* can't zcopy slab pages */ - if (unlikely(PageSlab(page))) { - ret = sock_no_sendpage(queue->sock, page, offset, len, + if (sendpage_ok(page)) { + ret = kernel_sendpage(queue->sock, page, offset, len, flags); } else { - ret = kernel_sendpage(queue->sock, page, offset, len, + ret = sock_no_sendpage(queue->sock, page, offset, len, flags); } if (ret <= 0)
In _drbd_send_page() a page is checked by following code before sending it by kernel_sendpage(), (page_count(page) < 1) || PageSlab(page) If the check is true, this page won't be send by kernel_sendpage() and handled by sock_no_sendpage().
This kind of check is exactly what macro sendpage_ok() does, which is introduced into include/linux/net.h to solve a similar send page issue in nvme-tcp code.
This patch uses macro sendpage_ok() to replace the open coded checks to page type and refcount in _drbd_send_page(), as a code cleanup.
Signed-off-by: Coly Li colyli@suse.de Cc: Philipp Reisner philipp.reisner@linbit.com Cc: Sagi Grimberg sagi@grimberg.me --- Changelog: v3, introduce a more common sendpage_ok() v2, fix typo in patch subject v1, the initial version. drivers/block/drbd/drbd_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index cb687ccdbd96..55dc0c91781e 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -1553,7 +1553,7 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa * put_page(); and would cause either a VM_BUG directly, or * __page_cache_release a page that would actually still be referenced * by someone, leading to some obscure delayed Oops somewhere else. */ - if (drbd_disable_sendpage || (page_count(page) < 1) || PageSlab(page)) + if (drbd_disable_sendpage || !sendpage_ok(page)) return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
msg_flags |= MSG_NOSIGNAL;
On Sun, Aug 16, 2020 at 1:36 AM Coly Li colyli@suse.de wrote:
The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages.
This patch introduces a helper sendpage_ok(), it returns true if the checking page,
- is not slab page: PageSlab(page) is false.
- has page refcount: page_count(page) is not zero
All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page.
Can we leave this helper to mm subsystem?
I know it is for sendpage, but its implementation is all about some mm details and its two callers do not belong to net subsystem either.
Think this in another way: who would fix it if it is buggy? I bet mm people should. ;)
Thanks.
On Sun, Aug 16, 2020 at 10:55:09AM -0700, Cong Wang wrote:
On Sun, Aug 16, 2020 at 1:36 AM Coly Li colyli@suse.de wrote:
The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages.
This patch introduces a helper sendpage_ok(), it returns true if the checking page,
- is not slab page: PageSlab(page) is false.
- has page refcount: page_count(page) is not zero
All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page.
Can we leave this helper to mm subsystem?
I know it is for sendpage, but its implementation is all about some mm details and its two callers do not belong to net subsystem either.
Think this in another way: who would fix it if it is buggy? I bet mm people should. ;)
No. This is all about a really unusual imitation in sendpage, which is pretty much unexpected. In fact the best thing would be to make sock_sendpage do the right thing and call sock_no_sendpage based on this condition, so that driver writers don't have to worry at all.
On Sun, Aug 16, 2020 at 10:45 PM Christoph Hellwig hch@lst.de wrote:
On Sun, Aug 16, 2020 at 10:55:09AM -0700, Cong Wang wrote:
On Sun, Aug 16, 2020 at 1:36 AM Coly Li colyli@suse.de wrote:
The original problem was from nvme-over-tcp code, who mistakenly uses kernel_sendpage() to send pages allocated by __get_free_pages() without __GFP_COMP flag. Such pages don't have refcount (page_count is 0) on tail pages, sending them by kernel_sendpage() may trigger a kernel panic from a corrupted kernel heap, because these pages are incorrectly freed in network stack as page_count 0 pages.
This patch introduces a helper sendpage_ok(), it returns true if the checking page,
- is not slab page: PageSlab(page) is false.
- has page refcount: page_count(page) is not zero
All drivers who want to send page to remote end by kernel_sendpage() may use this helper to check whether the page is OK. If the helper does not return true, the driver should try other non sendpage method (e.g. sock_no_sendpage()) to handle the page.
Can we leave this helper to mm subsystem?
I know it is for sendpage, but its implementation is all about some mm details and its two callers do not belong to net subsystem either.
Think this in another way: who would fix it if it is buggy? I bet mm people should. ;)
No. This is all about a really unusual imitation in sendpage, which
So netdev people will have to understand and support PageSlab() or page_count()?
If it is unusual even for mm people, how could netdev people suppose to understand this unusual mm bug? At least not any better.
is pretty much unexpected. In fact the best thing would be to make sock_sendpage do the right thing and call sock_no_sendpage based on this condition, so that driver writers don't have to worry at all.
Agreed, but kernel_sendpage() still relies on mm to provide a helper to make the decision and ensure this helper is always up-to-date.
In short, it is all about ownership.
Thanks.
On Mon, Aug 17, 2020 at 12:12:12PM -0700, Cong Wang wrote:
So netdev people will have to understand and support PageSlab() or page_count()?
Yes. As they came up with that contrived rule what is acceptable for sendpage. No one else really knows and other subsystems like the block layer are perfectly fine with it.
If it is unusual even for mm people, how could netdev people suppose to understand this unusual mm bug? At least not any better.
It is not a mm bug, it is a networking quirk.
linux-stable-mirror@lists.linaro.org