On Tue, Feb 14, 2023 at 08:51:09AM +0100, Juergen Gross wrote:
On 13.02.23 22:01, Demi Marie Obenour wrote:
On Mon, Feb 13, 2023 at 10:26:11AM +0100, Juergen Gross wrote:
On 07.02.23 03:10, Demi Marie Obenour wrote:
When a grant entry is still in use by the remote domain, Linux must put it on a deferred list. Normally, this list is very short, because the PV network and block protocols expect the backend to unmap the grant first. However, Qubes OS's GUI protocol is subject to the constraints of the X Window System, and as such winds up with the frontend unmapping the window first. As a result, the list can grow very large, resulting in a massive memory leak and eventual VM freeze.
Fix this problem by bumping the number of entries that the VM will attempt to free at each iteration to 10000. This is an ugly hack that may well make a denial of service easier, but for Qubes OS that is less bad than the problem Qubes OS users are facing today.
There really needs to be a way for a frontend to be notified when the backend has unmapped the grants.
Please remove this sentence from the commit message, or move it below the "---" marker.
Will fix in v2.
There are still some flag bits unallocated in struct grant_entry_v1 or struct grant_entry_header. You could suggest some patches for Xen to use one of the bits as a marker to get an event from the hypervisor if a grant with such a bit set has been unmapped.
That is indeed a good idea. There are other problems with the grant interface as well, but those can be dealt with later.
I have no idea, whether such an interface would be accepted by the maintainers, though.
Additionally, a module parameter is provided to allow tuning the reclaim speed.
The code previously used printk(KERN_DEBUG) whenever it had to defer reclaiming a page because the grant was still mapped. This resulted in a large volume of log messages that bothered users. Use pr_debug instead, which suppresses the messages by default. Developers can enable them using the dynamic debug mechanism.
Fixes: QubesOS/qubes-issues#7410 (memory leak) Fixes: QubesOS/qubes-issues#7359 (excessive logging) Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic") Cc: stable@vger.kernel.org Signed-off-by: Demi Marie Obenour demi@invisiblethingslab.com
Anyone have suggestions for improving the grant mechanism? Argo isn't a good option, as in the GUI protocol there are substantial performance wins to be had by using true shared memory. Resending as I forgot the Signed-off-by on the first submission. Sorry about that.
drivers/xen/grant-table.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c index 5c83d41..2c2faa7 100644 --- a/drivers/xen/grant-table.c +++ b/drivers/xen/grant-table.c @@ -355,14 +355,20 @@ static void gnttab_handle_deferred(struct timer_list *); static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred); +static atomic64_t deferred_count; +static atomic64_t leaked_count; +static unsigned int free_per_iteration = 10000;
As you are adding a kernel parameter to change this value, please set the default to a value not potentially causing any DoS problems. Qubes OS can still use a higher value then.
Do you have any suggestions? I don’t know if this is actually a DoS concern anymore. Shrinking the interval between iterations would be.
Why don't you use today's value of 10 for the default?
Will do. I now remember that the DoS concern is that the kernel could be made to use excess CPU trying and failing to reclaim memory.
- static void gnttab_handle_deferred(struct timer_list *unused) {
- unsigned int nr = 10;
- unsigned int nr = READ_ONCE(free_per_iteration);
I don't see why you are needing READ_ONCE() here.
free_per_iteration can be concurrently modified via sysfs.
My remark was based on the wrong assumption that ignore_limit could be dropped.
Even if ignore_limit could not be dropped, READ_ONCE is still necessary to avoid a data race.