[PATCH 0/3] exec: further refine address_space_get_iotlb_entry()

[PATCH 0/3] exec: further refine address_space_get_iotlb_entry()

Peter Xu
With the patch applied:

  [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
  (already in Paolo's pull request but not yet merged)

Now we can have valid address masks. However, it is still not ideal:
the mask may not be aligned to guest page sizes. One example is when
huge pages are used in the guest (please see the commit message of
patch 1 for details), but it applies to normal pages too. So we not
only need a valid address mask, we should make sure it is a page mask
(for x86, one of the 4K/2M/1G page sizes).
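
To make "page mask" concrete, here is a minimal sketch (not part of
the series) of the check implied above, using the iotlb.addr_mask
convention of size - 1 (low bits set):

    #include <stdbool.h>
    #include <stdint.h>

    /* True iff addr_mask corresponds to a supported x86 page size. */
    static bool is_x86_page_mask(uint64_t addr_mask)
    {
        uint64_t size = addr_mask + 1;
        return size == (1ULL << 12) ||   /* 4K */
               size == (1ULL << 21) ||   /* 2M */
               size == (1ULL << 30);     /* 1G */
    }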

Patches 1 and 2 fix the problem. Tested with both the kernel net
driver and testpmd, on both 4K and 2M pages, to make sure the page
mask is correct.

Patch 3 is cherry-picked from the PT series; after the fixes in
patches 1 and 2, we definitely want it now. Here's the simplest TCP
streaming test, using vhost dmar and iommu=pt in the guest:

  without patch 3:    12.0Gbps
  with patch 3:       33.5Gbps

Please review, thanks.

Peter Xu (3):
  exec: add page_mask for address_space_do_translate
  exec: simplify address_space_get_iotlb_entry
  vhost: iommu: cache static mapping if there is

 exec.c                 | 73 +++++++++++++++++++++++++++++++++-----------------
 hw/virtio/trace-events |  4 +++
 hw/virtio/vhost.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+), 24 deletions(-)

--
2.7.4



[PATCH 1/3] exec: add page_mask for address_space_do_translate

Peter Xu
The function was originally written for address_space_translate(),
where what we care about most is the (xlat, plen) range. However, for
IOTLB requests we don't really care about "plen" but about the size of
the page that "xlat" is located on, and plen cannot carry this
information.

A simple example to show why "plen" is not good for IOTLB translations:

E.g., with huge pages, it is possible that the guest mapped a 1G huge
page for the device covering this GPA range:

  0x100000000 - 0x13fffffff

Then let's say we want to translate one IOVA that is finally mapped
to GPA 0x13ffffe00 (which is located on this 1G huge page). Here
we'll get:

  (xlat, plen) = (0x13ffffe00, 0x200)

So the IOTLB would cover only a very small range, since from "plen"
(which is 0x200 bytes) we cannot tell the size of the page.

Actually we do know that this is a huge page - we just throw that
information away in address_space_do_translate().

This patch introduces an optional "page_mask" parameter to capture
that page mask info. "plen" becomes an optional parameter as well, and
the whole function gains a documentation comment.

No functional change yet.
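
For illustration, a caller that only wants the page mask would look
roughly like this (a sketch; patch 2 in this series does exactly this
for the IOTLB path):

    hwaddr xlat, page_mask;
    MemoryRegionSection section;

    /* plen_out is NULL: we don't care about the length here;
     * page_mask_out is set: we do want the page mask. */
    section = address_space_do_translate(as, addr, &xlat, NULL,
                                         &page_mask, is_write, false);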

Signed-off-by: Peter Xu <[hidden email]>
---
 exec.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 8fc0e78..63a3ff0 100644
--- a/exec.c
+++ b/exec.c
@@ -465,21 +465,45 @@ address_space_translate_internal(AddressSpaceDispatch *d, hwaddr addr, hwaddr *x
     return section;
 }
 
-/* Called from RCU critical section */
+/**
+ * address_space_do_translate - translate an address in AddressSpace
+ *
+ * @as: the address space that we want to translate on
+ * @addr: the address to be translated in above address space
+ * @xlat: the translated address offset within the memory region. It
+ *        cannot be @NULL.
+ * @plen_out: valid read/write length of the translated address. It
+ *            can be @NULL when we don't care about it.
+ * @page_mask_out: page mask for the translated address. This is only
+ *            meaningful for IOMMU-translated addresses, where the
+ *            mask reflects huge pages. It can be @NULL if we don't
+ *            care about it.
+ * @is_write: whether the translation operation is for write
+ * @is_mmio: whether this can be MMIO, set true if it can
+ *
+ * This function must be called from an RCU critical section.
+ */
 static MemoryRegionSection address_space_do_translate(AddressSpace *as,
                                                       hwaddr addr,
                                                       hwaddr *xlat,
-                                                      hwaddr *plen,
+                                                      hwaddr *plen_out,
+                                                      hwaddr *page_mask_out,
                                                       bool is_write,
                                                       bool is_mmio)
 {
     IOMMUTLBEntry iotlb;
     MemoryRegionSection *section;
     MemoryRegion *mr;
+    hwaddr page_mask = ~TARGET_PAGE_MASK; /* low bits set, as iotlb.addr_mask */
+    hwaddr plen = (hwaddr)(-1);
+
+    if (plen_out) {
+        plen = *plen_out;
+    }
 
     for (;;) {
         AddressSpaceDispatch *d = atomic_rcu_read(&as->dispatch);
-        section = address_space_translate_internal(d, addr, &addr, plen, is_mmio);
+        section = address_space_translate_internal(d, addr, &addr, &plen, is_mmio);
         mr = section->mr;
 
         if (!mr->iommu_ops) {
@@ -490,7 +514,8 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
                                          IOMMU_WO : IOMMU_RO);
         addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                 | (addr & iotlb.addr_mask));
-        *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);
+        page_mask = iotlb.addr_mask;
+        plen = MIN(plen, (addr | iotlb.addr_mask) - addr + 1);
         if (!(iotlb.perm & (1 << is_write))) {
             goto translate_fail;
         }
@@ -500,6 +525,14 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
 
     *xlat = addr;
 
+    if (page_mask_out) {
+        *page_mask_out = page_mask;
+    }
+
+    if (plen_out) {
+        *plen_out = plen;
+    }
+
     return *section;
 
 translate_fail:
@@ -518,7 +551,7 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
 
     /* This can never be MMIO. */
     section = address_space_do_translate(as, addr, &xlat, &plen,
-                                         is_write, false);
+                                         NULL, is_write, false);
 
     /* Illegal translation */
     if (section.mr == &io_mem_unassigned) {
@@ -560,7 +593,8 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
     MemoryRegionSection section;
 
     /* This can be MMIO, so setup MMIO bit. */
-    section = address_space_do_translate(as, addr, xlat, plen, is_write, true);
+    section = address_space_do_translate(as, addr, xlat, plen,
+                                         NULL, is_write, true);
     mr = section.mr;
 
     if (xen_enabled() && memory_access_is_direct(mr, is_write)) {
--
2.7.4



[PATCH 2/3] exec: simplify address_space_get_iotlb_entry

Peter Xu
In reply to this post by Peter Xu
This patch lets address_space_get_iotlb_entry() use the newly
introduced page_mask parameter of address_space_do_translate(). Then
we can be sure the IOTLB is aligned to a page mask, and we also nicely
support huge pages now, which a764040 did not.
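
For example (a sketch, assuming a 2M guest page for which the IOMMU
reports addr_mask = 0x1fffff), any IOVA inside that page now yields:

    .iova            = addr & ~0x1fffff   /* 2M-aligned IOVA base  */
    .translated_addr = xlat & ~0x1fffff   /* 2M-aligned GPA base   */
    .addr_mask       = 0x1fffff           /* covers the whole page */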

Fixes: a764040 ("exec: abstract address_space_do_translate()")
Signed-off-by: Peter Xu <[hidden email]>
---
 exec.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/exec.c b/exec.c
index 63a3ff0..1f86253 100644
--- a/exec.c
+++ b/exec.c
@@ -544,14 +544,14 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
                                             bool is_write)
 {
     MemoryRegionSection section;
-    hwaddr xlat, plen;
+    hwaddr xlat, page_mask;
 
-    /* Try to get maximum page mask during translation. */
-    plen = (hwaddr)-1;
-
-    /* This can never be MMIO. */
-    section = address_space_do_translate(as, addr, &xlat, &plen,
-                                         NULL, is_write, false);
+    /*
+     * This can never be MMIO, and we don't care about plen here,
+     * only about the page mask.
+     */
+    section = address_space_do_translate(as, addr, &xlat, NULL,
+                                         &page_mask, is_write, false);
 
     /* Illegal translation */
     if (section.mr == &io_mem_unassigned) {
@@ -562,20 +562,11 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
     xlat += section.offset_within_address_space -
         section.offset_within_region;
 
-    if (plen == (hwaddr)-1) {
-        /* If not specified during translation, use default mask */
-        plen = TARGET_PAGE_MASK;
-    } else {
-        /* Make it a valid page mask */
-        assert(plen);
-        plen = pow2floor(plen) - 1;
-    }
-
     return (IOMMUTLBEntry) {
         .target_as = section.address_space,
-        .iova = addr & ~plen,
-        .translated_addr = xlat & ~plen,
-        .addr_mask = plen,
+        .iova = addr & ~page_mask,
+        .translated_addr = xlat & ~page_mask,
+        .addr_mask = page_mask,
         /* IOTLBs are for DMAs, and DMA only allows on RAMs. */
         .perm = IOMMU_RW,
     };
--
2.7.4



[PATCH 3/3] vhost: iommu: cache static mapping if there is

Peter Xu
In reply to this post by Peter Xu
This patch pre-heats the vhost IOTLB cache when passthrough mode is
enabled.

Sometimes, even if the user specified iommu_platform for vhost
devices, the IOMMU might still be disabled. One case is passthrough
mode in the VT-d implementation. We can detect this by observing
iommu_list: if it's empty, IOMMU translation is disabled, and we can
pre-heat the translation (it'll be a static mapping then) by first
invalidating the whole IOTLB, then caching the existing memory ranges
into the vhost backend IOTLB using a 1:1 mapping.

Reviewed-by: Jason Wang <[hidden email]>
Signed-off-by: Peter Xu <[hidden email]>
---
 hw/virtio/trace-events |  4 +++
 hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 1f7a7c1..54dcbb3 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
+
+# hw/virtio/vhost.c
+vhost_iommu_commit(void) ""
+vhost_iommu_static_preheat(void) ""
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 03a46a7..d03d720 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -27,6 +27,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "migration/blocker.h"
 #include "sysemu/dma.h"
+#include "trace.h"
 
 /* enabled until disconnected backend stabilizes */
 #define _VHOST_DEBUG 1
@@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 }
 
+static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
+{
+    return !QLIST_EMPTY(&dev->iommu_list);
+}
+
 static void vhost_iommu_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
 {
@@ -782,6 +788,65 @@ static void vhost_iommu_region_del(MemoryListener *listener,
     }
 }
 
+static void vhost_iommu_commit(MemoryListener *listener)
+{
+    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
+                                         iommu_listener);
+    struct vhost_memory_region *r;
+    int i;
+
+    trace_vhost_iommu_commit();
+
+    if (!vhost_iommu_mr_enabled(dev)) {
+        /*
+         * This means iommu_platform is enabled, but the IOMMU memory
+         * region is disabled, e.g., when device passthrough is set
+         * up. Then no translation is needed any more.
+         *
+         * Let's first invalidate the whole IOTLB, then pre-heat the
+         * static mapping by looping over the vhost memory ranges.
+         */
+
+        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
+                                                          UINT64_MAX)) {
+            error_report("%s: flush existing IOTLB failed", __func__);
+            return;
+        }
+
+        /*
+         * The current VHOST_IOTLB_INVALIDATE API has a small defect:
+         * the invalidation for (start=0, size=UINT64_MAX) cannot
+         * really invalidate a cached range of (start=UINT64_MAX-1,
+         * size=1). We send this 2nd invalidation to work around it.
+         * Frankly speaking, QEMU does not have a problem with this,
+         * since we will never have a vhost cache with range
+         * (start=UINT64_MAX-1, size=1) - see
+         * address_space_get_iotlb_entry(): all IOTLBs are page
+         * aligned.
+         */
+        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, UINT64_MAX,
+                                                          1)) {
+            error_report("%s: flush existing IOTLB failed", __func__);
+            return;
+        }
+
+        for (i = 0; i < dev->mem->nregions; i++) {
+            r = &dev->mem->regions[i];
+            /* Vhost regions are writable RAM, so IOMMU_RW suits. */
+            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
+                                                          r->guest_phys_addr,
+                                                          r->userspace_addr,
+                                                          r->memory_size,
+                                                          IOMMU_RW)) {
+                error_report("%s: pre-heat static mapping failed", __func__);
+                return;
+            }
+        }
+
+        trace_vhost_iommu_static_preheat();
+    }
+}
+
 static void vhost_region_nop(MemoryListener *listener,
                              MemoryRegionSection *section)
 {
@@ -1298,6 +1363,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->iommu_listener = (MemoryListener) {
         .region_add = vhost_iommu_region_add,
         .region_del = vhost_iommu_region_del,
+        .commit = vhost_iommu_commit,
     };
 
     if (hdev->migration_blocker == NULL) {
--
2.7.4



Re: [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Fri, Jun 02, 2017 at 07:50:51PM +0800, Peter Xu wrote:

> With the patch applied:
>
>   [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
>   (already in Paolo's pull request but not yet merged)
>
> Now we can have valid address masks. However it is still not ideal,
> considering that the mask may not be aligned to guest page sizes. One
> example would be when huge page is used in guest (please see commit
> message in patch 1 for details). It applies to normal pages too. So we
> not only need a valid address mask, we should make sure it is page
> mask (for x86, it should be either 4K/2M/1G pages).

Why should we? To get better performance, right?

> Patch 1+2 fixes the problem. Tested with both kernel net driver or
> testpmd, on either 4K/2M pages, to make sure the page mask is correct.
>
> Patch 3 is cherry picked from PT series, after fixing from 1+2, we'll
> definitely want patch 3 now. Here's the simplest TCP streaming test
> using vhost dmar and iommu=pt in guest:
>
>   without patch 3:    12.0Gbps

And what happens without patches 1-2?

>   with patch 3:       33.5Gbps

This is the part I don't get. Patches 1-2 will return a bigger region to
callers. The result should be better performance - instead it seems to
slow down vhost for some reason and we need tricks to get
performance back. What's going on?

> Please review, thanks.
>
> Peter Xu (3):
>   exec: add page_mask for address_space_do_translate
>   exec: simplify address_space_get_iotlb_entry
>   vhost: iommu: cache static mapping if there is
>
>  exec.c                 | 73 +++++++++++++++++++++++++++++++++-----------------
>  hw/virtio/trace-events |  4 +++
>  hw/virtio/vhost.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 119 insertions(+), 24 deletions(-)
>
> --
> 2.7.4


Re: [PATCH 3/3] vhost: iommu: cache static mapping if there is

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:

> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
>
> Sometimes, even if user specified iommu_platform for vhost devices,
> IOMMU might still be disabled. One case is passthrough mode in VT-d
> implementation. We can detect this by observing iommu_list. If it's
> empty, it means IOMMU translation is disabled, then we can actually
> pre-heat the translation (it'll be static mapping then) by first
> invalidating all IOTLB, then cache existing memory ranges into vhost
> backend iotlb using 1:1 mapping.
>
> Reviewed-by: Jason Wang <[hidden email]>
> Signed-off-by: Peter Xu <[hidden email]>

This is still a hack I think. What if there's an invalidation?
I think the right thing is to send updates only when requested,
but send the largest mapping including the iova, not from the iova
until the end of the page. Thoughts?

> ---
>  hw/virtio/trace-events |  4 +++
>  hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
>
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 1f7a7c1..54dcbb3 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> +
> +# hw/virtio/vhost.c
> +vhost_iommu_commit(void) ""
> +vhost_iommu_static_preheat(void) ""
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 03a46a7..d03d720 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -27,6 +27,7 @@
>  #include "hw/virtio/virtio-access.h"
>  #include "migration/blocker.h"
>  #include "sysemu/dma.h"
> +#include "trace.h"
>  
>  /* enabled until disconnected backend stabilizes */
>  #define _VHOST_DEBUG 1
> @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      }
>  }
>  
> +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> +{
> +    return !QLIST_EMPTY(&dev->iommu_list);
> +}
> +
>  static void vhost_iommu_region_add(MemoryListener *listener,
>                                     MemoryRegionSection *section)
>  {
> @@ -782,6 +788,65 @@ static void vhost_iommu_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static void vhost_iommu_commit(MemoryListener *listener)
> +{
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         iommu_listener);
> +    struct vhost_memory_region *r;
> +    int i;
> +
> +    trace_vhost_iommu_commit();
> +
> +    if (!vhost_iommu_mr_enabled(dev)) {
> +        /*
> +        * This means iommu_platform is enabled, however iommu memory
> +        * region is disabled, e.g., when device passthrough is setup.
> +        * Then, no translation is needed any more.
> +        *
> +        * Let's first invalidate the whole IOTLB, then pre-heat the
> +        * static mapping by looping over vhost memory ranges.
> +        */
> +
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> +                                                          UINT64_MAX)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        /*
> +         * Current VHOST_IOTLB_INVALIDATE API has a small defect that
> +         * the invalidation for (start=0, size=UINT64_MAX) cannot
> +         * really invalidate an cached range of (start=UINT64_MAX-1,
> +         * size=1). We send this 2nd invalidation to workaround this.
> +         * But, frankly speaking for QEMU we don't have a problem with
> +         * this since we will never have a vhost cache with range
> +         * (start=UINT64_MAX-1, size=1) - if you see
> +         * address_space_get_iotlb_entry() all IOTLBs are page
> +         * aligned.
> +         */
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, UINT64_MAX,
> +                                                          1)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        for (i = 0; i < dev->mem->nregions; i++) {
> +            r = &dev->mem->regions[i];
> +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> +                                                          r->guest_phys_addr,
> +                                                          r->userspace_addr,
> +                                                          r->memory_size,
> +                                                          IOMMU_RW)) {
> +                error_report("%s: pre-heat static mapping failed", __func__);
> +                return;
> +            }
> +        }
> +
> +        trace_vhost_iommu_static_preheat();
> +    }
> +}
> +
>  static void vhost_region_nop(MemoryListener *listener,
>                               MemoryRegionSection *section)
>  {
> @@ -1298,6 +1363,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>      hdev->iommu_listener = (MemoryListener) {
>          .region_add = vhost_iommu_region_add,
>          .region_del = vhost_iommu_region_del,
> +        .commit = vhost_iommu_commit,
>      };
>  
>      if (hdev->migration_blocker == NULL) {
> --
> 2.7.4


Re: [PATCH 1/3] exec: add page_mask for address_space_do_translate

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Fri, Jun 02, 2017 at 07:50:52PM +0800, Peter Xu wrote:

> The function is originally used for address_space_translate() and what
> we care about most is (xlat, plen) range. However for iotlb requests, we
> don't really care about "plen", but the size of the page that "xlat" is
> located on. While, plen cannot really contain this information.
>
> A simple example to show why "plen" is not good for IOTLB translations:
>
> E.g., for huge pages, it is possible that guest mapped 1G huge page on
> device side that used this GPA range:
>
>   0x100000000 - 0x13fffffff
>
> Then let's say we want to translate one IOVA that finally mapped to GPA
> 0x13ffffe00 (which is located on this 1G huge page). Then here we'll
> get:
>
>   (xlat, plen) = (0x13fffe00, 0x200)
>
> So the IOTLB would be only covering a very small range since from
> "plen" (which is 0x200 bytes) we cannot tell the size of the page.
>
> Actually we can really know that this is a huge page - we just throw the
> information away in address_space_do_translate().
>
> This patch introduced "page_mask" optional parameter to capture that
> page mask info. Also, I made "plen" an optional parameter as well, with
> some comments for the whole function.
>
> No functional change yet.
>
> Signed-off-by: Peter Xu <[hidden email]>
> ---
>  exec.c | 46 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 8fc0e78..63a3ff0 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -465,21 +465,45 @@ address_space_translate_internal(AddressSpaceDispatch *d, hwaddr addr, hwaddr *x
>      return section;
>  }
>  
> -/* Called from RCU critical section */
> +/**
> + * address_space_do_translate - translate an address in AddressSpace
> + *
> + * @as: the address space that we want to translate on
> + * @addr: the address to be translated in above address space
> + * @xlat: the translated address offset within memory region. It
> + *        cannot be @NULL.
> + * @plen_out: valid read/write length of the translated address. It
> + *            can be @NULL when we don't care about it.
> + * @page_mask_out: page mask for the translated address. This
> + *            should only be meaningful for IOMMU translated
> + *            addresses, since there may be huge pages that this bit
> + *            would tell. It can be @NULL if we don't care about it.

Why do we need plen or mask at all? It seems MemoryRegionSection
has address and length already. So if you want to find out
distance to section end, do section.size - xlat and you are done.


> + * @is_write: whether the translation operation is for write
> + * @is_mmio: whether this can be MMIO, set true if it can
> + *
> + * This function is called from RCU critical section
> + */
>  static MemoryRegionSection address_space_do_translate(AddressSpace *as,
>                                                        hwaddr addr,
>                                                        hwaddr *xlat,
> -                                                      hwaddr *plen,
> +                                                      hwaddr *plen_out,
> +                                                      hwaddr *page_mask_out,
>                                                        bool is_write,
>                                                        bool is_mmio)
>  {
>      IOMMUTLBEntry iotlb;
>      MemoryRegionSection *section;
>      MemoryRegion *mr;
> +    hwaddr page_mask = TARGET_PAGE_MASK;
> +    hwaddr plen = (hwaddr)(-1);
> +
> +    if (plen_out) {
> +        plen = *plen_out;
> +    }
>  
>      for (;;) {
>          AddressSpaceDispatch *d = atomic_rcu_read(&as->dispatch);
> -        section = address_space_translate_internal(d, addr, &addr, plen, is_mmio);
> +        section = address_space_translate_internal(d, addr, &addr, &plen, is_mmio);
>          mr = section->mr;
>  
>          if (!mr->iommu_ops) {
> @@ -490,7 +514,8 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
>                                           IOMMU_WO : IOMMU_RO);
>          addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
>                  | (addr & iotlb.addr_mask));
> -        *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);
> +        page_mask = iotlb.addr_mask;
> +        plen = MIN(plen, (addr | iotlb.addr_mask) - addr + 1);
>          if (!(iotlb.perm & (1 << is_write))) {
>              goto translate_fail;
>          }
> @@ -500,6 +525,14 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
>  
>      *xlat = addr;
>  
> +    if (page_mask_out) {
> +        *page_mask_out = page_mask;
> +    }
> +
> +    if (plen_out) {
> +        *plen_out = plen;
> +    }
> +
>      return *section;
>  
>  translate_fail:
> @@ -518,7 +551,7 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>  
>      /* This can never be MMIO. */
>      section = address_space_do_translate(as, addr, &xlat, &plen,
> -                                         is_write, false);
> +                                         NULL, is_write, false);
>  
>      /* Illegal translation */
>      if (section.mr == &io_mem_unassigned) {
> @@ -560,7 +593,8 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
>      MemoryRegionSection section;
>  
>      /* This can be MMIO, so setup MMIO bit. */
> -    section = address_space_do_translate(as, addr, xlat, plen, is_write, true);
> +    section = address_space_do_translate(as, addr, xlat, plen,
> +                                         NULL, is_write, true);
>      mr = section.mr;
>  
>      if (xen_enabled() && memory_access_is_direct(mr, is_write)) {
> --
> 2.7.4


Re: [PATCH 2/3] exec: simplify address_space_get_iotlb_entry

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Fri, Jun 02, 2017 at 07:50:53PM +0800, Peter Xu wrote:

> This patch let address_space_get_iotlb_entry() to use the newly
> introduced page_mask parameter in address_space_do_translate(). Then we
> will be sure the IOTLB can be aligned to page mask, also we should
> nicely support huge pages now when introducing a764040.
>
> Fixes: a764040 ("exec: abstract address_space_do_translate()")
> Signed-off-by: Peter Xu <[hidden email]>
> ---
>  exec.c | 29 ++++++++++-------------------
>  1 file changed, 10 insertions(+), 19 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 63a3ff0..1f86253 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -544,14 +544,14 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>                                              bool is_write)
>  {
>      MemoryRegionSection section;
> -    hwaddr xlat, plen;
> +    hwaddr xlat, page_mask;
>  
> -    /* Try to get maximum page mask during translation. */
> -    plen = (hwaddr)-1;
> -
> -    /* This can never be MMIO. */
> -    section = address_space_do_translate(as, addr, &xlat, &plen,
> -                                         NULL, is_write, false);
> +    /*
> +     * This can never be MMIO, and we don't really care about plen,
> +     * but page mask.
> +     */
> +    section = address_space_do_translate(as, addr, &xlat, NULL,
> +                                         &page_mask, is_write, false);
>  
>      /* Illegal translation */
>      if (section.mr == &io_mem_unassigned) {


Can we just use section.size - xlat here?

> @@ -562,20 +562,11 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>      xlat += section.offset_within_address_space -
>          section.offset_within_region;
>  
> -    if (plen == (hwaddr)-1) {
> -        /* If not specified during translation, use default mask */
> -        plen = TARGET_PAGE_MASK;
> -    } else {
> -        /* Make it a valid page mask */
> -        assert(plen);
> -        plen = pow2floor(plen) - 1;
> -    }
> -
>      return (IOMMUTLBEntry) {
>          .target_as = section.address_space,
> -        .iova = addr & ~plen,
> -        .translated_addr = xlat & ~plen,
> -        .addr_mask = plen,
> +        .iova = addr & ~page_mask,
> +        .translated_addr = xlat & ~page_mask,
> +        .addr_mask = page_mask,
>          /* IOTLBs are for DMAs, and DMA only allows on RAMs. */

BTW this comment is pretty confusing. What does it mean?

>          .perm = IOMMU_RW,
>      };

Looks like we should change IOMMUTLBEntry to pass size and not mask -
then we could simply pass info from section as is. iova would be
addr - xlat.

> --
> 2.7.4


Re: [PATCH 3/3] vhost: iommu: cache static mapping if there is

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:

> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
>
> Sometimes, even if user specified iommu_platform for vhost devices,
> IOMMU might still be disabled. One case is passthrough mode in VT-d
> implementation. We can detect this by observing iommu_list. If it's
> empty, it means IOMMU translation is disabled, then we can actually
> pre-heat the translation (it'll be static mapping then) by first
> invalidating all IOTLB, then cache existing memory ranges into vhost
> backend iotlb using 1:1 mapping.
>
> Reviewed-by: Jason Wang <[hidden email]>
> Signed-off-by: Peter Xu <[hidden email]>

Can't say I like this; it does not help the more important use case
of dpdk, which has a small static mapping that can't be detected by
QEMU. vhost should just heat up whatever's actually used. Why isn't
that enough?

> ---
>  hw/virtio/trace-events |  4 +++
>  hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
>
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 1f7a7c1..54dcbb3 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> +
> +# hw/virtio/vhost.c
> +vhost_iommu_commit(void) ""
> +vhost_iommu_static_preheat(void) ""
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 03a46a7..d03d720 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -27,6 +27,7 @@
>  #include "hw/virtio/virtio-access.h"
>  #include "migration/blocker.h"
>  #include "sysemu/dma.h"
> +#include "trace.h"
>  
>  /* enabled until disconnected backend stabilizes */
>  #define _VHOST_DEBUG 1
> @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      }
>  }
>  
> +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> +{
> +    return !QLIST_EMPTY(&dev->iommu_list);
> +}
> +
>  static void vhost_iommu_region_add(MemoryListener *listener,
>                                     MemoryRegionSection *section)
>  {
> @@ -782,6 +788,65 @@ static void vhost_iommu_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static void vhost_iommu_commit(MemoryListener *listener)
> +{
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         iommu_listener);
> +    struct vhost_memory_region *r;
> +    int i;
> +
> +    trace_vhost_iommu_commit();
> +
> +    if (!vhost_iommu_mr_enabled(dev)) {
> +        /*
> +        * This means iommu_platform is enabled, however iommu memory
> +        * region is disabled, e.g., when device passthrough is setup.
> +        * Then, no translation is needed any more.
> +        *
> +        * Let's first invalidate the whole IOTLB, then pre-heat the
> +        * static mapping by looping over vhost memory ranges.
> +        */
> +
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> +                                                          UINT64_MAX)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        /*
> +         * Current VHOST_IOTLB_INVALIDATE API has a small defect that
> +         * the invalidation for (start=0, size=UINT64_MAX) cannot
> +         * really invalidate an cached range of (start=UINT64_MAX-1,
> +         * size=1). We send this 2nd invalidation to workaround this.
> +         * But, frankly speaking for QEMU we don't have a problem with
> +         * this since we will never have a vhost cache with range
> +         * (start=UINT64_MAX-1, size=1) - if you see
> +         * address_space_get_iotlb_entry() all IOTLBs are page
> +         * aligned.
> +         */
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, UINT64_MAX,
> +                                                          1)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        for (i = 0; i < dev->mem->nregions; i++) {
> +            r = &dev->mem->regions[i];
> +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> +                                                          r->guest_phys_addr,
> +                                                          r->userspace_addr,
> +                                                          r->memory_size,
> +                                                          IOMMU_RW)) {
> +                error_report("%s: pre-heat static mapping failed", __func__);
> +                return;
> +            }
> +        }
> +
> +        trace_vhost_iommu_static_preheat();
> +    }
> +}
> +
>  static void vhost_region_nop(MemoryListener *listener,
>                               MemoryRegionSection *section)
>  {
> @@ -1298,6 +1363,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>      hdev->iommu_listener = (MemoryListener) {
>          .region_add = vhost_iommu_region_add,
>          .region_del = vhost_iommu_region_del,
> +        .commit = vhost_iommu_commit,
>      };
>  
>      if (hdev->migration_blocker == NULL) {
> --
> 2.7.4


Re: [PATCH 1/3] exec: add page_mask for address_space_do_translate

Peter Xu
In reply to this post by Michael S. Tsirkin-4
On Fri, Jun 02, 2017 at 07:45:05PM +0300, Michael S. Tsirkin wrote:

> On Fri, Jun 02, 2017 at 07:50:52PM +0800, Peter Xu wrote:
> > The function is originally used for address_space_translate() and what
> > we care about most is (xlat, plen) range. However for iotlb requests, we
> > don't really care about "plen", but the size of the page that "xlat" is
> > located on. While, plen cannot really contain this information.
> >
> > A simple example to show why "plen" is not good for IOTLB translations:
> >
> > E.g., for huge pages, it is possible that guest mapped 1G huge page on
> > device side that used this GPA range:
> >
> >   0x100000000 - 0x13fffffff
> >
> > Then let's say we want to translate one IOVA that finally mapped to GPA
> > 0x13ffffe00 (which is located on this 1G huge page). Then here we'll
> > get:
> >
> >   (xlat, plen) = (0x13fffe00, 0x200)
> >
> > So the IOTLB would be only covering a very small range since from
> > "plen" (which is 0x200 bytes) we cannot tell the size of the page.
> >
> > Actually we can really know that this is a huge page - we just throw the
> > information away in address_space_do_translate().
> >
> > This patch introduced "page_mask" optional parameter to capture that
> > page mask info. Also, I made "plen" an optional parameter as well, with
> > some comments for the whole function.
> >
> > No functional change yet.
> >
> > Signed-off-by: Peter Xu <[hidden email]>
> > ---
> >  exec.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/exec.c b/exec.c
> > index 8fc0e78..63a3ff0 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -465,21 +465,45 @@ address_space_translate_internal(AddressSpaceDispatch *d, hwaddr addr, hwaddr *x
> >      return section;
> >  }
> >  
> > -/* Called from RCU critical section */
> > +/**
> > + * address_space_do_translate - translate an address in AddressSpace
> > + *
> > + * @as: the address space that we want to translate on
> > + * @addr: the address to be translated in above address space
> > + * @xlat: the translated address offset within memory region. It
> > + *        cannot be @NULL.
> > + * @plen_out: valid read/write length of the translated address. It
> > + *            can be @NULL when we don't care about it.
> > + * @page_mask_out: page mask for the translated address. This
> > + *            should only be meaningful for IOMMU translated
> > + *            addresses, since there may be huge pages that this bit
> > + *            would tell. It can be @NULL if we don't care about it.
>
> Why do we need plen or mask at all? It seems MemoryRegionSection
> has address and length already. So if you want to find out
> distance to section end, do section.size - xlat and you are done.

Hi, Michael,

When you say:

  section.size - xlat

Do you really mean this?

  section.offset_within_address_space + section.size - xlat

Since otherwise it doesn't make much sense to me.

Anyway, I don't know whether it'll be okay to remove plen...

In address_space_do_translate(), the logic is basically:

1. do internal translation (basically to find the section info from
   current address space)
2. do IOMMU translation if the MR is IOMMU typed
3. goto 1.

Along the way (1 -> 2 -> 3 -> 1 -> ...), until the translation is
finished (I don't really know whether we'll have cases of nested IOMMU
translation, but anyway there is a while loop there, so assume the
loop can be executed many times), plen can keep shrinking, either by
this in address_space_translate_internal():

    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));

Or this in address_space_do_translate():

    *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);

And I don't know whether using only the final section.size to decide
plen would be enough.

Also, for the page_mask information - I'm not sure
MemoryRegionSection can express that info. Again, huge pages are one
example: MemoryRegionSection doesn't really contain huge page
information, while MemoryRegionIOMMUOps.translate() does contain it
(via the addr_mask field).
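
For instance (a sketch), a translation of an IOVA on that 1G huge page
would come back from MemoryRegionIOMMUOps.translate() carrying the
page size only in addr_mask:

    IOMMUTLBEntry entry = {
        .iova            = 0x0,
        .translated_addr = 0x100000000,
        .addr_mask       = 0x3fffffff,   /* 1G page: size - 1 */
        .perm            = IOMMU_RW,
    };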

(I see that you would like the IOTLB to use an arbitrary length
 rather than page masks. Maybe we can first decide which would be the
 best interface for the IOTLB. I'll reply in that context later.)

Thanks,

--
Peter Xu


Re: [PATCH 2/3] exec: simplify address_space_get_iotlb_entry

Peter Xu
In reply to this post by Michael S. Tsirkin-4
On Fri, Jun 02, 2017 at 07:49:58PM +0300, Michael S. Tsirkin wrote:

> On Fri, Jun 02, 2017 at 07:50:53PM +0800, Peter Xu wrote:
> > This patch let address_space_get_iotlb_entry() to use the newly
> > introduced page_mask parameter in address_space_do_translate(). Then we
> > will be sure the IOTLB can be aligned to page mask, also we should
> > nicely support huge pages now when introducing a764040.
> >
> > Fixes: a764040 ("exec: abstract address_space_do_translate()")
> > Signed-off-by: Peter Xu <[hidden email]>
> > ---
> >  exec.c | 29 ++++++++++-------------------
> >  1 file changed, 10 insertions(+), 19 deletions(-)
> >
> > diff --git a/exec.c b/exec.c
> > index 63a3ff0..1f86253 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -544,14 +544,14 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
> >                                              bool is_write)
> >  {
> >      MemoryRegionSection section;
> > -    hwaddr xlat, plen;
> > +    hwaddr xlat, page_mask;
> >  
> > -    /* Try to get maximum page mask during translation. */
> > -    plen = (hwaddr)-1;
> > -
> > -    /* This can never be MMIO. */
> > -    section = address_space_do_translate(as, addr, &xlat, &plen,
> > -                                         NULL, is_write, false);
> > +    /*
> > +     * This can never be MMIO, and we don't really care about plen,
> > +     * but page mask.
> > +     */
> > +    section = address_space_do_translate(as, addr, &xlat, NULL,
> > +                                         &page_mask, is_write, false);
> >  
> >      /* Illegal translation */
> >      if (section.mr == &io_mem_unassigned) {
>
>
> Can we just use section.size - xlat here?

I replied in the other thread about what I thought... So will skip
here.

>
> > @@ -562,20 +562,11 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
> >      xlat += section.offset_within_address_space -
> >          section.offset_within_region;
> >  
> > -    if (plen == (hwaddr)-1) {
> > -        /* If not specified during translation, use default mask */
> > -        plen = TARGET_PAGE_MASK;
> > -    } else {
> > -        /* Make it a valid page mask */
> > -        assert(plen);
> > -        plen = pow2floor(plen) - 1;
> > -    }
> > -
> >      return (IOMMUTLBEntry) {
> >          .target_as = section.address_space,
> > -        .iova = addr & ~plen,
> > -        .translated_addr = xlat & ~plen,
> > -        .addr_mask = plen,
> > +        .iova = addr & ~page_mask,
> > +        .translated_addr = xlat & ~page_mask,
> > +        .addr_mask = page_mask,
> >          /* IOTLBs are for DMAs, and DMA only allows on RAMs. */
>
> BTW this comment is pretty confusing. What does it mean?

This function, address_space_get_iotlb_entry(), is for devices to get
an IOTLB entry when they request page translations for DMA, and DMA
should only be allowed to RAM, right? So we need IOMMU_RW permission
here.

Maybe I should add one more check above on the returned MR - it should
be RAM-typed as well. But I don't really know whether that's too
strict: if the guest sets up the IOMMU page table to let one IOVA
point to a non-RAM region, I thought it should still be legal from the
hypervisor's POV (I'd consider it a guest OS bug though)?

>
> >          .perm = IOMMU_RW,
> >      };
>
> Looks like we should change IOMMUTLBEntry to pass size and not mask -
> then we could simply pass info from section as is. iova would be
> addr - xlat.

I'm not sure whether it'll be a good interface for the IOTLB. AFAIU,
at least for VT-d the IOMMU translation is page aligned, as defined by
the spec, so it makes sense that (again, at least for VT-d) we'd
better just use page_mask/addr_mask here.

That's also how I understand IOMMUs in general - I assume they always
translate with page masks (never an arbitrary length), though the page
size can differ from platform to platform; that's why the IOTLB
interface here uses addr_mask, so that it works for all platforms. I
don't know whether I'm 100% correct here though.

Maybe David/Paolo/... would comment as well?

(CC David)

Thanks,

--
Peter Xu


Re: [PATCH 3/3] vhost: iommu: cache static mapping if there is

Peter Xu
In reply to this post by Michael S. Tsirkin-4
On Fri, Jun 02, 2017 at 06:45:05PM +0300, Michael S. Tsirkin wrote:

> On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
> > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> >
> > Sometimes, even if user specified iommu_platform for vhost devices,
> > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > implementation. We can detect this by observing iommu_list. If it's
> > empty, it means IOMMU translation is disabled, then we can actually
> > pre-heat the translation (it'll be static mapping then) by first
> > invalidating all IOTLB, then cache existing memory ranges into vhost
> > backend iotlb using 1:1 mapping.
> >
> > Reviewed-by: Jason Wang <[hidden email]>
> > Signed-off-by: Peter Xu <[hidden email]>
>
> This is still a hack I think. What if there's an invalidation?
> I think the right thing is to send updates only when requested,
> but sent the largest mapping including the iova, not from iova until end
> of page. Thoughts?

Indeed it's kind of a hack, but it does not hurt anything and will
definitely boost performance in most cases...

Yes, "send the largest mapping including the iova" is okay, but the
first IO on a region would be delayed as well, so IMHO it's not the
best solution either. I think the best solution should be (for sure)
that vhost knows it's PT, then it just skips the translation
completely. I'm just not sure whether there's a simple/good way to do
this.

Thanks,

--
Peter Xu


Re: [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()

Peter Xu
In reply to this post by Michael S. Tsirkin-4
On Fri, Jun 02, 2017 at 05:51:07PM +0300, Michael S. Tsirkin wrote:

> On Fri, Jun 02, 2017 at 07:50:51PM +0800, Peter Xu wrote:
> > With the patch applied:
> >
> >   [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
> >   (already in Paolo's pull request but not yet merged)
> >
> > Now we can have valid address masks. However it is still not ideal,
> > considering that the mask may not be aligned to guest page sizes. One
> > example would be when huge page is used in guest (please see commit
> > message in patch 1 for details). It applies to normal pages too. So we
> > not only need a valid address mask, we should make sure it is page
> > mask (for x86, it should be either 4K/2M/1G pages).
>
> Why should we? To get better performance, right?

IMHO one point is performance; the other point is how we should
define the IOTLB interface. My opinion is that valid masks are
better.

>
> > Patch 1+2 fixes the problem. Tested with both kernel net driver or
> > testpmd, on either 4K/2M pages, to make sure the page mask is correct.
> >
> > Patch 3 is cherry picked from PT series, after fixing from 1+2, we'll
> > definitely want patch 3 now. Here's the simplest TCP streaming test
> > using vhost dmar and iommu=pt in guest:
> >
> >   without patch 3:    12.0Gbps
>
> And what happens without patches 1-2?

Without 1-2, performance is good. But I think it is hacky to have such
a good result (I explained why the performance is good in the VT-d PT
support thread with some logs)...

>
> >   with patch 3:       33.5Gbps
>
> This is the part I don't get. Patches 1-2 will return a bigger region to
> callers. The result should be better performance - instead it seems to
> slow down vhost for some reason and we need tricks to get
> performance back. What's going on?

Yes. The problem is that without patches 1/2 I think the code lacks
correctness. With correctness, we lost performance, so I picked up
patch 3 as well.

Again, I think the first thing we need to settle is the best
definition for the IOTLB (addr_mask or an arbitrary length).

Thanks,

--
Peter Xu


Re: [PATCH 3/3] vhost: iommu: cache static mapping if there is

Jason Wang-8
In reply to this post by Peter Xu


On 2017-06-05 11:15, Peter Xu wrote:

> On Fri, Jun 02, 2017 at 06:45:05PM +0300, Michael S. Tsirkin wrote:
>> On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
>>> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
>>>
>>> Sometimes, even if user specified iommu_platform for vhost devices,
>>> IOMMU might still be disabled. One case is passthrough mode in VT-d
>>> implementation. We can detect this by observing iommu_list. If it's
>>> empty, it means IOMMU translation is disabled, then we can actually
>>> pre-heat the translation (it'll be static mapping then) by first
>>> invalidating all IOTLB, then cache existing memory ranges into vhost
>>> backend iotlb using 1:1 mapping.
>>>
>>> Reviewed-by: Jason Wang <[hidden email]>
>>> Signed-off-by: Peter Xu <[hidden email]>
>> This is still a hack I think. What if there's an invalidation?
>> I think the right thing is to send updates only when requested,
>> but sent the largest mapping including the iova, not from iova until end
>> of page. Thoughts?
> Indeed it's kind of a hack, but it does not hurt anything but will
> definitely boost performance in most cases...
>
> Yes "sent the largest mapping including the iova" is okay, but the
> first IO on one region would be delayed as well, so IMHO it's not the
> best solution as well. I think the best solution should be (for sure)
> that vhost knows it's PT, then it just skips the translation
> completely. I just don't sure whether there's simple/good way to do
> this.
>
> Thanks,

We can disable the device IOTLB completely in this case. But it looks
like there's a minor kernel bug preventing us from doing this. Let me
post a fix and let's see then.

Thanks


Re: [PATCH 3/3] vhost: iommu: cache static mapping if there is

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Mon, Jun 05, 2017 at 11:15:11AM +0800, Peter Xu wrote:

> On Fri, Jun 02, 2017 at 06:45:05PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
> > > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> > >
> > > Sometimes, even if user specified iommu_platform for vhost devices,
> > > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > > implementation. We can detect this by observing iommu_list. If it's
> > > empty, it means IOMMU translation is disabled, then we can actually
> > > pre-heat the translation (it'll be static mapping then) by first
> > > invalidating all IOTLB, then cache existing memory ranges into vhost
> > > backend iotlb using 1:1 mapping.
> > >
> > > Reviewed-by: Jason Wang <[hidden email]>
> > > Signed-off-by: Peter Xu <[hidden email]>
> >
> > This is still a hack I think. What if there's an invalidation?
> > I think the right thing is to send updates only when requested,
> > but sent the largest mapping including the iova, not from iova until end
> > of page. Thoughts?
>
> Indeed it's kind of a hack, but it does not hurt anything but will
> definitely boost performance in most cases...
>
> Yes "sent the largest mapping including the iova" is okay, but the
> first IO on one region would be delayed as well, so IMHO it's not the
> best solution as well. I think the best solution should be (for sure)
> that vhost knows it's PT, then it just skips the translation
> completely. I just don't sure whether there's simple/good way to do
> this.
>
> Thanks,

If you send the whole 64 bit area, then the backend can detect this
easily.
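
A sketch of how that detection could look on the backend side (the
struct fields here are hypothetical, not existing vhost code):

    /* A single IOTLB update spanning the whole 64-bit space would
     * mark the device as passthrough; later lookups could then skip
     * translation entirely. */
    if (update->iova == 0 && update->size == UINT64_MAX) {
        dev->iotlb_passthrough = true;
    }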


> --
> Peter Xu


Re: [PATCH 2/3] exec: simplify address_space_get_iotlb_entry

Paolo Bonzini-5
In reply to this post by Peter Xu


On 05/06/2017 05:07, Peter Xu wrote:

> I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> least for VT-d, the IOMMU translation is page aligned which is defined
> by spec, so it makes sense that (again at least for VT-d) here we'd
> better just use page_mask/addr_mask.
>
> That's also how I know about IOMMU in general - I assume it do the
> translations always with page masks (never arbitary length), though
> page size can differ from platfrom to platform, that's why here the
> IOTLB interface used addr_mask, then it works for all platforms. I
> don't know whether I'm 100% correct here though.
>
> Maybe David/Paolo/... would comment as well?

I would ask David.  There are PowerPC MMUs that allow fast lookup of
arbitrarily-sized windows (not necessarily power of two), so maybe the
IOMMUs can do the same.

Paolo


Re: [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()

Michael S. Tsirkin-4
In reply to this post by Peter Xu
On Mon, Jun 05, 2017 at 11:20:13AM +0800, Peter Xu wrote:

> On Fri, Jun 02, 2017 at 05:51:07PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 02, 2017 at 07:50:51PM +0800, Peter Xu wrote:
> > > With the patch applied:
> > >
> > >   [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
> > >   (already in Paolo's pull request but not yet merged)
> > >
> > > Now we can have valid address masks. However it is still not ideal,
> > > considering that the mask may not be aligned to guest page sizes. One
> > > example would be when huge page is used in guest (please see commit
> > > message in patch 1 for details). It applies to normal pages too. So we
> > > not only need a valid address mask, we should make sure it is page
> > > mask (for x86, it should be either 4K/2M/1G pages).
> >
> > Why should we? To get better performance, right?
>
> IMHO one point is for performance, the other point is on how we should
> define the IOTLB interface. My opinion is that it is better valid
> masks.
>
> >
> > > Patch 1+2 fixes the problem. Tested with both kernel net driver or
> > > testpmd, on either 4K/2M pages, to make sure the page mask is correct.
> > >
> > > Patch 3 is cherry picked from PT series, after fixing from 1+2, we'll
> > > definitely want patch 3 now. Here's the simplest TCP streaming test
> > > using vhost dmar and iommu=pt in guest:
> > >
> > >   without patch 3:    12.0Gbps
> >
> > And what happens without patches 1-2?
>
> Without 1-2, performance is good. But I think it is hacky to have such
> a good result (I explained why the performance is good in the VT-d PT
> support thread with some logs)...
>
> >
> > >   with patch 3:       33.5Gbps
> >
> > This is the part I don't get. Patches 1-2 will return a bigger region to
> > callers. The result should be better performance - instead it seems to
> > slow down vhost for some reason and we need tricks to get
> > performance back. What's going on?
>
> Yes. The problem is that if without patch 1/2 I think the codes lacks
> correctness. With correctness, we lost performance, then I picked
> patch 3 as well.
>
> Again, I think the first thing we need to settle is what should be the
> best definition for IOTLB (addr_mask or arbitary length).
>
> Thanks,

If arbitrary length means we don't require prefaulting hacks,
I'm for using arbitrary length.


> --
> Peter Xu


Re: [PATCH 2/3] exec: simplify address_space_get_iotlb_entry

David Gibson
In reply to this post by Paolo Bonzini-5
On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:

>
>
> On 05/06/2017 05:07, Peter Xu wrote:
> > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > least for VT-d, the IOMMU translation is page aligned which is defined
> > by spec, so it makes sense that (again at least for VT-d) here we'd
> > better just use page_mask/addr_mask.
> >
> > That's also how I know about IOMMU in general - I assume it do the
> > translations always with page masks (never arbitary length), though
> > page size can differ from platfrom to platform, that's why here the
> > IOTLB interface used addr_mask, then it works for all platforms. I
> > don't know whether I'm 100% correct here though.
> >
> > Maybe David/Paolo/... would comment as well?
>
> I would ask David.  There are PowerPC MMUs that allow fast lookup of
> arbitrarily-sized windows (not necessarily power of two),
Uh.. I'm not sure what you mean here.  You might be thinking of the
BATs which really old (32-bit) PowerPC MMUs had - those allow
arbitrarily large block translations, but they do have to be a power
of two.

> so maybe the
> IOMMUs can do the same.

The only Power IOMMU I know about uses a fixed, power-of-two page size
per DMA window.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


Re: [PATCH 2/3] exec: simplify address_space_get_iotlb_entry

Peter Xu
On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:

> On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> >
> >
> > On 05/06/2017 05:07, Peter Xu wrote:
> > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > better just use page_mask/addr_mask.
> > >
> > > That's also how I know about IOMMU in general - I assume it do the
> > > translations always with page masks (never arbitary length), though
> > > page size can differ from platfrom to platform, that's why here the
> > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > don't know whether I'm 100% correct here though.
> > >
> > > Maybe David/Paolo/... would comment as well?
> >
> > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > arbitrarily-sized windows (not necessarily power of two),
>
> Uh.. I'm not sure what you mean here.  You might be thinking of the
> BATs which really old (32-bit) PowerPC MMUs had - those allow
> arbitrary large block translations, but they do have to be a power of
> two.
>
> > so maybe the
> > IOMMUs can do the same.
>
> The only Power IOMMU I know about uses a fixed, power-of-two page size
> per DMA window.

If so, I would still be inclined to keep using masks for the QEMU
IOTLB. Then, my first two patches should still stand.

I am just afraid that not using masks will diverge the emulation from
real hardware and bring trouble one day.

For the vhost IOTLB interface, it does not need to be strictly aligned
with the QEMU IOMMU IOTLB definition, and that's how it works now (the
current vhost iotlb allows an arbitrary length, and I think that's
good). So IMHO we don't really need to worry about the performance -
after all, we can do everything customized for vhost, just like what
patch 3 did (yeah, it can be better...).

Thanks,

--
Peter Xu


Re: [PATCH 2/3] exec: simplify address_space_get_iotlb_entry

Paolo Bonzini-5
In reply to this post by David Gibson


On 07/06/2017 01:47, David Gibson wrote:
>> I would ask David.  There are PowerPC MMUs that allow fast lookup of
>> arbitrarily-sized windows (not necessarily power of two),
>
> Uh.. I'm not sure what you mean here.  You might be thinking of the
> BATs which really old (32-bit) PowerPC MMUs had - those allow
> arbitrary large block translations, but they do have to be a power of
> two.

Oh, that is what I was thinking about.  Thanks for figuring it out! :)

Paolo
