On Fri, Apr 04, 2025 at 12:55:09PM +0200, David Hildenbrand wrote:
On 04.04.25 12:00, David Hildenbrand wrote:
On 04.04.25 06:36, Halil Pasic wrote:
On Thu, 3 Apr 2025 16:28:31 +0200 David Hildenbrand david@redhat.com wrote:
Sorry I have to have a look at that discussion. Maybe it will answer some my questions.
Yes, I think so.
Let's fix it without affecting existing setups for now by properly ignoring the non-existing queues, so the indicator bits will match the queue indexes.
Just one question. My understanding is that the crux is that Linux and QEMU (or the driver and the device) disagree at which index reporting_vq is actually sitting. Is that right?
I thought I made it clear: this is only about the airq indicator bit. That's where both disagree.
Not the actual queue index (see above).
I did some more research including having a look at that discussion. Let me try to sum up how did we end up here.
Let me add some more details after digging as well:
Before commit a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") the kernel behavior used to be in spec, but QEMU and possibly other hypervisor were out of spec and things did not work.
It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that, we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very end. So there was no possibility for holes "in-between".
In the Linux driver, we created the stats queue only if the feature bit VIRTIO_BALLOON_F_STATS_VQ was actually around:
nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would unconditionally create 4 queues. QEMU always supported the first 3 queues unconditionally, but old QEMU did obviously not support the (new) VIRTIO_BALLOON_F_FREE_PAGE_HINT queue.
390x didn't particularly like getting queried for non-existing queues. [1] So the fix was not for a hypervisor that was out of spec, but because quering non-existing queues didn't work.
The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well.
Again, as QEMU always implemented the 3 first queues unconditionally, this was not a problem.
[1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/...
Possibly because of the complexity of fixing the hypervisor(s) commit a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted for changing the guest side so that it does not fit the spec but fits the hypervisor(s). It unfortunately also broke notifiers (for the with holes) scenario for virtio-ccw only.
Yes, it broke the notifiers.
But note that everything was in spec at that point, because we only documented "free_page_vq == 3" in the spec *2 years later*, in 2020:
commit 38448268eba0c105200d131c3f7f660129a4d673 Author: Alexander Duyck alexander.h.duyck@linux.intel.com Date: Tue Aug 25 07:45:02 2020 -0700
content: Document balloon feature free page hints Free page hints allow the balloon driver to provide information on what pages are not currently in use so that we can avoid the cost of copying them in migration scenarios. Add a feature description for free page hints describing basic functioning and requirements.
At that point, what we documented in the spec *did not match reality* in Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is unconditionally set.
QEMU and Linux kept using that queue index assignment model, and the spec was wrong (out of sync?) at that point. The spec got more wrong with
commit d917d4a8d552c003e046b0e3b1b529d98f7e695b Author: Alexander Duyck alexander.h.duyck@linux.intel.com Date: Tue Aug 25 07:45:17 2020 -0700
content: Document balloon feature free page reporting Free page reporting is a feature that allows the guest to proactively report unused pages to the host. By making use of this feature is is possible to reduce the overall memory footprint of the guest in cases where some significant portion of the memory is idle. Add documentation for the free page reporting feature describing the functionality and requirements.
Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to QEMU+Linux implementation, so the spec did not reflect reality.
I'll note also cloud-hypervisor [2] today follows that model.
In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented in the spec to be *4*.
So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending on the availability of the other two features/queues.
[2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-device...
Now we had another look at this, and have concluded that fixing the hypervisor(s) and fixing the kernel, and making sure that the fixed kernel can tolerate the old broken hypervisor(s) is way to complicated if possible at all. So we decided to give the spec a reality check and fix the notifier bit assignment for virtio-ccw which is broken beyond doubt if we accept that the correct virtqueue index is the one that the hypervisor(s) use and not the one that the spec says they should use.
In case of virtio-balloon, it's unfortunate that it went that way, but the spec simply did not / does not reflect reality when it was added to the spec.
With the spec fixed, the whole notion of "holes" will be something that does not make sense any more. With that the merit of the kernel interface virtio_find_vqs() supporting "holes" is quite questionable. Now we need it because the drivers within the Linux kernel still think of the queues in terms of the current spec, i.e. they try to have the "holes" as mandated by the spec, and the duty of making it work with the broken device implementations falls to the transports.
Right, the "holes" only exist in the input array.
Under the assumption that the spec is indeed going to be fixed:
For virito-balloon, we should probably do the following:
From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand david@redhat.com Date: Fri, 4 Apr 2025 12:53:16 +0200 Subject: [PATCH] virtio-balloon: Fix queue index assignment for non-existing queues
Signed-off-by: David Hildenbrand david@redhat.com
device-types/balloon/description.tex | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex index a1d9603..a7396ff 100644 --- a/device-types/balloon/description.tex +++ b/device-types/balloon/description.tex @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I 5 \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
+\begin{description} +\item[inflateq] Exists unconditionally. +\item[deflateq] Exists unconditionally. +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. +\end{description}
+\begin{note} +Virtqueue indexes are assigned sequentially for existing queues, starting +with index 0; consequently, if a virtqueue does not exist, it does not get +an index assigned. Assuming all virtqueues exist for a device, the indexes +are:
\begin{description} \item[0] inflateq \item[1] deflateq @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque \item[3] free_page_vq \item[4] reporting_vq \end{description}
- statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
- free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
- reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
+\end{note} \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} \begin{description} -- 2.48.1
If something along these lines sounds reasonable, I can send a proper patch to the proper audience.
Indeed, but do we want to add a note about previous spec versions saying something different? Maybe, with a hint how devices following old spec can be detected?
-- Cheers,
David / dhildenb