KVM and variable-endianness guest CPUs


KVM and variable-endianness guest CPUs

Peter Maydell
[This seemed like a good jumping-off point for this question.]

On 16 January 2014 17:51, Alexander Graf <[hidden email]> wrote:

> Am 16.01.2014 um 18:41 schrieb Peter Maydell <[hidden email]>:
>> Also see my remarks on the previous patch series suggesting
>> that we should look at this in a more holistic way than
>> just randomly fixing small bits of things. A good place
>> to start would be "what should the semantics of stl_p()
>> be for a QEMU where the CPU is currently operating with
>> a reversed endianness to the TARGET_WORDS_BIGENDIAN
>> setting?".
>
> That'd open a giant can of worms that I'd rather not open.

Yeah, but you kind of have to open that can, because stl_p()
is used in the code path for KVM MMIO accesses to devices.

Specifically, the KVM API says "here's a uint8_t[] byte
array and a length", and the current QEMU code treats that
as "this is a byte array written as if the guest CPU
(a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
I/O access to this buffer rather than to the device".
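
For reference, this is the structure in question, as defined in Linux's
include/uapi/linux/kvm.h:

                /* KVM_EXIT_MMIO */
                struct {
                        __u64 phys_addr;
                        __u8  data[8];
                        __u32 len;
                        __u8  is_write;
                } mmio;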

The KVM API docs don't actually specify the endianness
semantics of the byte array, but I think that that really
needs to be nailed down. I can think of a couple of options:
 * always LE
 * always BE
   [these first two are non-starters because they would
   break either x86 or PPC existing code]
 * always the endianness the guest is at the time
 * always some arbitrary endianness based purely on the
   endianness the KVM implementation used historically
 * always the endianness of the host QEMU binary
 * something else?

Any preferences? Current QEMU code basically assumes
"always the endianness of TARGET_WORDS_BIGENDIAN",
which is pretty random.

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

Peter Maydell
On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:

> Specifically, the KVM API says "here's a uint8_t[] byte
> array and a length", and the current QEMU code treats that
> as "this is a byte array written as if the guest CPU
> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
> I/O access to this buffer rather than to the device".
>
> The KVM API docs don't actually specify the endianness
> semantics of the byte array, but I think that that really
> needs to be nailed down. I can think of a couple of options:
>  * always LE
>  * always BE
>    [these first two are non-starters because they would
>    break either x86 or PPC existing code]
>  * always the endianness the guest is at the time
>  * always some arbitrary endianness based purely on the
>    endianness the KVM implementation used historically
>  * always the endianness of the host QEMU binary
>  * something else?
>
> Any preferences? Current QEMU code basically assumes
> "always the endianness of TARGET_WORDS_BIGENDIAN",
> which is pretty random.

Having thought a little more about this, my opinion is:

 * we should specify that the byte order of the mmio.data
   array is host kernel endianness (ie same endianness
   as the QEMU process itself) [this is what it actually
   is, I think, for all the cases that work today]
 * we should fix the code path in QEMU for handling
   mmio.data which currently has the implicit assumption
   that when using KVM TARGET_WORDS_BIGENDIAN is the same
   as the QEMU host process endianness (because it's using
   load/store functions which swap if TARGET_WORDS_BIGENDIAN
   is different from HOST_WORDS_BIGENDIAN)
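
A minimal sketch of what that QEMU-side fix could look like, assuming the
host-endianness semantics proposed above (the helper name and placement
are hypothetical, not QEMU's actual API):

    #include <stdint.h>
    #include <string.h>

    /* Fill the mmio.data buffer in host byte order, with no
     * TARGET_WORDS_BIGENDIAN-dependent swap. The union keeps the
     * copy correct for every access size on both LE and BE hosts. */
    static void kvm_mmio_write_buf(uint8_t *buf, unsigned len, uint64_t val)
    {
        union {
            uint8_t  byte;
            uint16_t hword;
            uint32_t word;
            uint64_t dword;
        } tmp;

        switch (len) {
        case 1: tmp.byte  = val; break;
        case 2: tmp.hword = val; break;
        case 4: tmp.word  = val; break;
        case 8: tmp.dword = val; break;
        }
        memcpy(buf, &tmp, len);   /* byte-for-byte, host order */
    }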

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

Christoffer Dall
On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:

> On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
> > Specifically, the KVM API says "here's a uint8_t[] byte
> > array and a length", and the current QEMU code treats that
> > as "this is a byte array written as if the guest CPU
> > (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
> > I/O access to this buffer rather than to the device".
> >
> > The KVM API docs don't actually specify the endianness
> > semantics of the byte array, but I think that that really
> > needs to be nailed down. I can think of a couple of options:
> >  * always LE
> >  * always BE
> >    [these first two are non-starters because they would
> >    break either x86 or PPC existing code]
> >  * always the endianness the guest is at the time
> >  * always some arbitrary endianness based purely on the
> >    endianness the KVM implementation used historically
> >  * always the endianness of the host QEMU binary
> >  * something else?
> >
> > Any preferences? Current QEMU code basically assumes
> > "always the endianness of TARGET_WORDS_BIGENDIAN",
> > which is pretty random.
>
> Having thought a little more about this, my opinion is:
>
>  * we should specify that the byte order of the mmio.data
>    array is host kernel endianness (ie same endianness
>    as the QEMU process itself) [this is what it actually
>    is, I think, for all the cases that work today]

I completely agree, given that it's too late to be set on always LE/BE,
I think the natural choice is something that allows a user to cast the
byte array to an appropriate pointer type and dereference it.
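
For example, under that definition a userspace VMM could complete a
4-byte read with something like this sketch (assuming 'run' points at
the mmap'ed struct kvm_run):

    #include <stdint.h>
    #include <string.h>

    /* run->mmio.data is in host order, so a plain copy (or a direct
     * cast and dereference) yields the value with no byteswap. */
    uint32_t val;
    memcpy(&val, run->mmio.data, sizeof(val));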

And I think we need to amend the KVM API docs to specify this.

--
Christoffer


Re: KVM and variable-endianness guest CPUs

Alexander Graf


> Am 18.01.2014 um 05:24 schrieb Christoffer Dall <[hidden email]>:
>
>> On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
>>> On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>> array and a length", and the current QEMU code treats that
>>> as "this is a byte array written as if the guest CPU
>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>> I/O access to this buffer rather than to the device".
>>>
>>> The KVM API docs don't actually specify the endianness
>>> semantics of the byte array, but I think that that really
>>> needs to be nailed down. I can think of a couple of options:
>>> * always LE
>>> * always BE
>>>   [these first two are non-starters because they would
>>>   break either x86 or PPC existing code]
>>> * always the endianness the guest is at the time
>>> * always some arbitrary endianness based purely on the
>>>   endianness the KVM implementation used historically
>>> * always the endianness of the host QEMU binary
>>> * something else?
>>>
>>> Any preferences? Current QEMU code basically assumes
>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>> which is pretty random.
>>
>> Having thought a little more about this, my opinion is:
>>
>> * we should specify that the byte order of the mmio.data
>>   array is host kernel endianness (ie same endianness
>>   as the QEMU process itself) [this is what it actually
>>   is, I think, for all the cases that work today]
>
> I completely agree, given that it's too late to be set on always LE/BE,
> I think the natural choice is something that allows a user to cast the
> byte array to an appropriate pointer type and dereference it.
>
> And I think we need to amend the KVM API docs to specify this.

I don't see the problem. For ppc we always do mmio emulation as if the cpu was big endian. We've had an is_bigendian variable for that since the very first versions.
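
A simplified, approximate sketch of the behaviour Alex describes, loosely
based on the mmio-load completion path in arch/powerpc/kvm/powerpc.c
(function name and details approximate; kernel types and helpers such as
u64 and swab* assumed):

    /* The mmio.data buffer is interpreted in big-endian (host) order;
     * the result is byteswapped for a little-endian guest access. */
    static u64 mmio_data_to_gpr(struct kvm_run *run, int is_bigendian)
    {
            u64 gpr = 0;

            switch (run->mmio.len) {
            case 8: gpr = *(u64 *)run->mmio.data; break;
            case 4: gpr = *(u32 *)run->mmio.data; break;
            case 2: gpr = *(u16 *)run->mmio.data; break;
            case 1: gpr = *(u8 *)run->mmio.data;  break;
            }

            if (!is_bigendian) {        /* LE guest access */
                    switch (run->mmio.len) {
                    case 8: gpr = swab64(gpr); break;
                    case 4: gpr = swab32(gpr); break;
                    case 2: gpr = swab16(gpr); break;
                    }
            }
            return gpr;
    }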


Alex

>
> --
> Christoffer


Re: KVM and variable-endianness guest CPUs

Peter Maydell
On 18 January 2014 07:32, Alexander Graf <[hidden email]> wrote:

>> Am 18.01.2014 um 05:24 schrieb Christoffer Dall <[hidden email]>:
>>> On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
>>> Having thought a little more about this, my opinion is:
>>>
>>> * we should specify that the byte order of the mmio.data
>>>   array is host kernel endianness (ie same endianness
>>>   as the QEMU process itself) [this is what it actually
>>>   is, I think, for all the cases that work today]
>>
>> I completely agree, given that it's too late to be set on always LE/BE,
>> I think the natural choice is something that allows a user to cast the
>> byte array to an appropriate pointer type and dereference it.
>>
>> And I think we need to amend the KVM API docs to specify this.
>
> I don't see the problem.

The problem is (a) the docs aren't clear about the semantics,
and (b) people have picked whatever behaviour suited them
to implement, without documenting what it was.

> For ppc we always do mmio emulation
> as if the cpu was big endian.

Even if the guest, the host kernel and QEMU in userspace are
all little endian?

Also "mmio emulation as if the CPU was big endian"
doesn't make sense -- MMIO emulation doesn't depend
on CPU endianness.

> We've had an is_bigendian variable
> for that since the very first versions.

Where? In the kernel? In QEMU? What does it control?

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

Alexander Graf

On 18.01.2014, at 11:15, Peter Maydell <[hidden email]> wrote:

> On 18 January 2014 07:32, Alexander Graf <[hidden email]> wrote:
>>> Am 18.01.2014 um 05:24 schrieb Christoffer Dall <[hidden email]>:
>>>> On Fri, Jan 17, 2014 at 06:52:57PM +0000, Peter Maydell wrote:
>>>> Having thought a little more about this, my opinion is:
>>>>
>>>> * we should specify that the byte order of the mmio.data
>>>>  array is host kernel endianness (ie same endianness
>>>>  as the QEMU process itself) [this is what it actually
>>>>  is, I think, for all the cases that work today]
>>>
>>> I completely agree, given that it's too late to be set on always LE/BE,
>>> I think the natural choice is something that allows a user to cast the
>>> byte array to an appropriate pointer type and dereference it.
>>>
>>> And I think we need to amend the KVM API docs to specify this.
>>
>> I don't see the problem.
>
> The problem is (a) the docs aren't clear about the semantics,
> and (b) people have picked whatever behaviour suited them
> to implement, without documenting what it was.

I think I see the problem now. You're thinking about LE hosts, not LE guests.

I think the only really sensible options would be to

  a) Always use a statically defined target endianness (big for ppc)
  b) Always use host endianness

Currently QEMU apparently implements a), but that can easily be changed. Today we don't have kvm support for ppc64le hosts yet.

I personally prefer b). It's the natural thing to do for a host interface to be in host endianness and it's exactly what we expose for LE-on-BE systems with ppc already.

>
>> For ppc we always do mmio emulation
>> as if the cpu was big endian.
>
> Even if the guest, the host kernel and QEMU in userspace are
> all little endian?
>
> Also "mmio emulation as if the CPU was big endian"
> doesn't make sense -- MMIO emulation doesn't depend
> on CPU endianness.
>
>> We've had an is_bigendian variable
>> for that since the very first versions.
>
> Where? In the kernel? In QEMU? What does it control?

In KVM. Check out https://github.com/agraf/linux-2.6/commit/1c00e7c21e39e20be7b03b111d5ab90ce938f108


Alex



Re: KVM and variable-endianness guest CPUs

Alexander Graf

On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:

> On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>> Specifically, the KVM API says "here's a uint8_t[] byte
>> array and a length", and the current QEMU code treats that
>> as "this is a byte array written as if the guest CPU
>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>> I/O access to this buffer rather than to the device".
>>
>> The KVM API docs don't actually specify the endianness
>> semantics of the byte array, but I think that that really
>> needs to be nailed down. I can think of a couple of options:
>> * always LE
>> * always BE
>>   [these first two are non-starters because they would
>>   break either x86 or PPC existing code]
>> * always the endianness the guest is at the time
>> * always some arbitrary endianness based purely on the
>>   endianness the KVM implementation used historically
>> * always the endianness of the host QEMU binary
>> * something else?
>>
>> Any preferences? Current QEMU code basically assumes
>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>> which is pretty random.
>
> Having thought a little more about this, my opinion is:
>
> * we should specify that the byte order of the mmio.data
>   array is host kernel endianness (ie same endianness
>   as the QEMU process itself) [this is what it actually
>   is, I think, for all the cases that work today]
> * we should fix the code path in QEMU for handling
>   mmio.data which currently has the implicit assumption
>   that when using KVM TARGET_WORDS_BIGENDIAN is the same
>   as the QEMU host process endianness (because it's using
>   load/store functions which swap if TARGET_WORDS_BIGENDIAN
>   is different from HOST_WORDS_BIGENDIAN)

Yes, I fully agree :).


Alex



Re: KVM and variable-endianness guest CPUs

Peter Maydell
On 20 January 2014 14:20, Alexander Graf <[hidden email]> wrote:
> I think I see the problem now. You're thinking about LE hosts, not LE guests.
>
> I think the only really sensible options would be to
>
>   a) Always use a statically defined target endianness (big for ppc)
>   b) Always use host endianness

> Currently QEMU apparently implements a), but that can
> easily be changed. Today we don't have kvm support for
> ppc64le hosts yet.

Yes; I would ideally like us to be able to get rid of that
statically defined target endianness eventually, so if we
have the leeway to define the kernel<->userspace ABI in a
way that doesn't care about the current guest CPU endianness
(ie we haven't actually yet claimed support for
reverse-endianness guests in a way that locks us into an
unhelpful definition of the ABI) we should take it while
we still can.

Then the current QEMU restrictions boil down to "you can
only use QEMU for KVM on a host kernel with the same
endianness as QEMU's legacy TARGET_WORDS_BIGENDIAN
setting for that CPU" (but such a QEMU can deal with
guests whatever they do with the endianness control bits).

> I personally prefer b). It's the natural thing to do for
> a host interface to be in host endianness and it's exactly
> what we expose for LE-on-BE systems with ppc already.

Yes. Strictly speaking by "host endianness" here I guess
we mean "the endianness of the kernel-to-userspace ABI",
since it is at least in theory possible to have an LE
kernel which runs BE userspace processes.

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

Christoffer Dall
On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:

>
> On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:
>
> > On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
> >> Specifically, the KVM API says "here's a uint8_t[] byte
> >> array and a length", and the current QEMU code treats that
> >> as "this is a byte array written as if the guest CPU
> >> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
> >> I/O access to this buffer rather than to the device".
> >>
> >> The KVM API docs don't actually specify the endianness
> >> semantics of the byte array, but I think that that really
> >> needs to be nailed down. I can think of a couple of options:
> >> * always LE
> >> * always BE
> >>   [these first two are non-starters because they would
> >>   break either x86 or PPC existing code]
> >> * always the endianness the guest is at the time
> >> * always some arbitrary endianness based purely on the
> >>   endianness the KVM implementation used historically
> >> * always the endianness of the host QEMU binary
> >> * something else?
> >>
> >> Any preferences? Current QEMU code basically assumes
> >> "always the endianness of TARGET_WORDS_BIGENDIAN",
> >> which is pretty random.
> >
> > Having thought a little more about this, my opinion is:
> >
> > * we should specify that the byte order of the mmio.data
> >   array is host kernel endianness (ie same endianness
> >   as the QEMU process itself) [this is what it actually
> >   is, I think, for all the cases that work today]
> > * we should fix the code path in QEMU for handling
> >   mmio.data which currently has the implicit assumption
> >   that when using KVM TARGET_WORDS_BIGENDIAN is the same
> >   as the QEMU host process endianness (because it's using
> >   load/store functions which swap if TARGET_WORDS_BIGENDIAN
> >   is different from HOST_WORDS_BIGENDIAN)
>
> Yes, I fully agree :).
>
Great, I'll prepare a patch for the KVM API documentation.

-Christoffer


Re: KVM and variable-endianness guest CPUs

Victor Kamensky
Hi Guys,

Christoffer and I had a somewhat heated chat :) on this
subject last night. Christoffer, I really appreciate
your time! We did not reach agreement
during the chat, and Christoffer asked me to follow
up on this thread.
Here it goes. Sorry, it is a very long email.

I don't believe we can assign any endianness to the
mmio.data[] byte array. I believe mmio.data[] and
mmio.len act just like memcpy, and that is all. As
memcpy does not imply any endianness of the underlying
data, mmio.data[] should not either.

Here is my definition:

mmio.data[] is an array of bytes that contains memory
bytes in such a form that, for the read case, if those
bytes were placed in guest memory and the guest executed
the same read access instruction against that
memory, the result would be the same as a real h/w device
memory access. The rest of the KVM host and the hypervisor
part of the code should take care of the mmio.data[]
memory so that it is delivered to vcpu registers and
restored by the hypervisor part in such a way that the guest CPU
register value is the same as it would be for a real,
non-emulated h/w read access (that is the emulation part).
The same goes for write access: if the guest writes into
memory and those bytes are just copied to the emulated
h/w register, it has the same effect as a real
mapped h/w register write.

In shorter form, i.e. for a len=4 access: the endianness of the integer
at the &mmio.data[0] address should match the endianness
of the emulated h/w device behind the phys_addr address,
regardless of the endianness of the emulator, KVM host,
hypervisor, and guest

Examples that illustrate my definition
--------------------------------------

1) An LE guest (E bit off in ARM speak) reads an integer
(4 bytes) from a mapped h/w LE device register -
mmio.data[3] contains the MSB, mmio.data[0] contains the LSB.

2) A BE guest (E bit on in ARM speak) reads an integer
from a mapped h/w LE device register - mmio.data[3]
contains the MSB, mmio.data[0] contains the LSB. Note that
if the &mmio.data[0] memory were placed in the guest
address space and the instruction restarted with the new
address, it would meet the BE guest's expectations
- the guest knows that it reads LE h/w, so it will byteswap the
register before processing it further. This is the BE guest ARM
case (regardless of what the KVM host endianness is).

3) A BE guest reads an integer from a mapped h/w BE device
register - mmio.data[0] contains the MSB, mmio.data[3]
contains the LSB. Note that if the &mmio.data[0] memory were
placed in the guest address space and the instruction
restarted with the new address, it would meet the BE
guest's expectation - the guest knows that it reads
BE h/w, so it will proceed further without any other
work. I guess this is the BE ppc case.
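
To make those byte layouts concrete, for a hypothetical 4-byte register
value of 0x11223344:

    /* What mmio.data would contain under the memcpy semantics above: */
    uint8_t le_dev_data[4] = { 0x44, 0x33, 0x22, 0x11 }; /* LE device: data[0] is LSB (examples 1 and 2) */
    uint8_t be_dev_data[4] = { 0x11, 0x22, 0x33, 0x44 }; /* BE device: data[0] is MSB (example 3) */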


Arguments in favor of memcpy semantics of mmio.data[]
------------------------------------------------------

x) What are the possible values of 'len'? Previous discussions
imply it is always a power of 2. Why is that? Maybe
there will be a CPU that needs to do a 5-byte mmio
access, or 6 bytes. How do you assign endianness in
such a case? A 'len' of 5 or 6, or anything, works fine with
memcpy semantics. I admit it is a hypothetical case, but
IMHO it tests how clean the ABI definition is.

x) A byte array does not have endianness because it
does not have any structure. If one wanted to
imply structure, why is mmio not defined in such a way
that the structure is reflected in the mmio definition?
Something like:


                /* KVM_EXIT_MMIO */
                struct {
                          __u64 phys_addr;
                          union {
                               __u8 byte;
                               __u16 hword;
                               __u32 word;
                               __u64 dword;
                          }  data;
                          __u32 len;
                          __u8  is_write;
                } mmio;

where len really serves as a union discriminator and
the only allowed len values are 1, 2, 4, 8.
In that case, I agree, the endianness of the integer types
should be defined. I believe the use of a byte array strongly
implies that the original intent was to have the semantics of
a byte stream copy, just like memcpy.

x) Note there is nothing wrong with a user/kernel ABI
using just a byte stream as a parameter. There are already
precedents, like the 'read' and 'write' system calls :).

x) Consider the case when KVM works with emulated memory-mapped
h/w devices where some devices operate in LE mode and others
operate in BE mode. Which one it is is defined by the semantics
of the real h/w device, and it should be emulated by the emulator and KVM
given all other context. As far as the mmio.data[] array is concerned, if the
same integer value is read from these devices' registers, the mmio.data[]
memory should contain the integer in opposite endianness for these
two cases, i.e. the MSB is data[0] in one case and the MSB is
data[3] in the other case. It cannot be the same, because
apart from the emulator and the guest kernel, everything else, like the
KVM host and hypervisor, has no clue what the endianness of the device
actually is - it must treat mmio.data[] the same way in both cases.
But the resulting guest target CPU register would need to contain
the normal integer value in one case and a byteswapped one in the other,
because the guest kernel will use it directly in one case and
byteswap it in the other. Byte stream semantics allows
that. I don't see how it could work if you fix the mmio.data[]
endianness in such a way that it contains the integer in
the same format for BE and LE emulated device types.

If by this point you agree that the mmio.data[] user-land/kernel
ABI semantics should be just memcpy, stop reading :). If not,
you may like to take a look at the appendix below, where I
describe in great detail the endianness of data at different
points along the mmio processing code path of the existing ARM LE KVM
and the proposed ARM BE KVM. Note the appendix is very long and very
detailed, sorry about that, but I feel that earlier, more
digested explanations failed, so it drove me to write out
all the details as I see them. If I am wrong, I hope the detailed
explanation will make it easier for folks to point at the places
where my logic goes bad. Also, I am not sure whether this
mail thread is a good place to discuss all the details described
in the appendix. Christoffer, please advise whether I should take
that one back on [1]. But I hope this bigger picture may help to
see the mmio.data[] semantics issue in context.

More inline, and the appendix is at the end.

On 20 January 2014 11:19, Christoffer Dall <[hidden email]> wrote:

> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>
>> On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:
>>
>> > On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>> >> Specifically, the KVM API says "here's a uint8_t[] byte
>> >> array and a length", and the current QEMU code treats that
>> >> as "this is a byte array written as if the guest CPU
>> >> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>> >> I/O access to this buffer rather than to the device".
>> >>
>> >> The KVM API docs don't actually specify the endianness
>> >> semantics of the byte array, but I think that that really
>> >> needs to be nailed down. I can think of a couple of options:
>> >> * always LE
>> >> * always BE
>> >>   [these first two are non-starters because they would
>> >>   break either x86 or PPC existing code]
>> >> * always the endianness the guest is at the time
>> >> * always some arbitrary endianness based purely on the
>> >>   endianness the KVM implementation used historically
>> >> * always the endianness of the host QEMU binary
>> >> * something else?
>> >>
>> >> Any preferences? Current QEMU code basically assumes
>> >> "always the endianness of TARGET_WORDS_BIGENDIAN",
>> >> which is pretty random.
>> >
>> > Having thought a little more about this, my opinion is:
>> >
>> > * we should specify that the byte order of the mmio.data
>> >   array is host kernel endianness (ie same endianness
>> >   as the QEMU process itself) [this is what it actually
>> >   is, I think, for all the cases that work today]

In the above, please consider two types of mapped emulated
h/w devices: BE and LE. They cannot have mmio.data in the
same endianness. Currently, in all observable cases, LE ARM
and BE PPC device endianness matches the kernel/qemu
endianness, but that would break when BE ARM is introduced,
or LE PPC, or when one starts emulating BE devices on LE
ARM.

>> > * we should fix the code path in QEMU for handling
>> >   mmio.data which currently has the implicit assumption
>> >   that when using KVM TARGET_WORDS_BIGENDIAN is the same
>> >   as the QEMU host process endianness (because it's using
>> >   load/store functions which swap if TARGET_WORDS_BIGENDIAN
>> >   is different from HOST_WORDS_BIGENDIAN)

I do not follow the above. Maybe I am missing the bigger context.
What is the CPU under discussion above? On an ARM V7 system,
when an LE device is accessed as an integer, the &mmio.data[0] address
contains the integer in LE format, i.e. mmio.data[0] is the LSB.

Here is a gdb session of LE qemu running on a V7 LE kernel with
a TC1 LE guest. The guest kernel accesses the sys_cfgstat register, one
of the arm_sysctl registers, at offset 0xa8. Note: arm_sysctl is a
memory-mapped LE device.
Please check the run->mmio structure after the read
(cpu_physical_memory_rw) completes: it is a 4-byte integer in
LE format, mmio.data[0] is the LSB and is equal to 1
(the s->sys_cfgstat value):

(gdb) bt
#0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
/home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
#1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
mask=4294967295)
    at /home/root/20131219/qemu-be/memory.c:407
#2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
access_size_min=1,
    access_size_max=2357596, access=access@entry=0x23b96c
<memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
/home/root/20131219/qemu-be/memory.c:477
#3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
#4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
#5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
/home/root/20131219/qemu-be/memory.c:1743
#6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
<address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
len=4, is_write=false,
    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
#7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
    at /home/root/20131219/qemu-be/exec.c:2070
#8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
/home/root/20131219/qemu-be/kvm-all.c:1701
#9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
/home/root/20131219/qemu-be/cpus.c:874
#10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
#11 0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
#12 0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p /x s->sys_cfgstat
$25 = 0x1
(gdb) finish
Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
/home/root/20131219/qemu-be/memory.c:408
408        trace_memory_region_ops_read(mr, addr, tmp, size);
Value returned is $26 = 1
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
/home/root/20131219/qemu-be/kvm-all.c:1660
1660            kvm_arch_pre_run(cpu, run);
(gdb) bt
#0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
/home/root/20131219/qemu-be/kvm-all.c:1660
#1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
/home/root/20131219/qemu-be/cpus.c:874
#2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
#3  0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
#4  0xb69f5070 in ?? () at
../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p /x run->mmio
$27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0}, len = 0x4, is_write = 0x0}

Also please look at the adjust_endianness function and
the struct MemoryRegion 'endianness' field. IMHO in qemu it
works quite nicely already. The MemoryRegion 'read' and 'write'
callbacks return/take data in native format, and the adjust_endianness
function checks whether the emulated device endianness matches
the emulator endianness; if they differ, it does a byteswap
according to size. As in the above example, the arm_sysctl_ops memory
region should be marked as DEVICE_LITTLE_ENDIAN; when it
returns the s->sys_cfgstat value, LE qemu sees that the endianness
matches and does not byteswap the result, so the integer at
the &mmio.data[0] address is in LE form. When qemu
runs in BE mode on a BE kernel, it sees that the endianness
mismatches and byteswaps the s->sys_cfgstat native value
(BE), so mmio.data contains the integer in LE format again.
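
A simplified sketch of that logic (based on adjust_endianness in qemu's
memory.c; qemu-internal types and helpers assumed, details elided):

    /* Swap only when the device's declared endianness differs from
     * the host's; otherwise the native value passes through. */
    static void adjust_endianness(MemoryRegion *mr, uint64_t *data,
                                  unsigned size)
    {
        if (memory_region_wrong_endianness(mr)) {
            switch (size) {
            case 2: *data = bswap16(*data); break;
            case 4: *data = bswap32(*data); break;
            case 8: *data = bswap64(*data); break;
            }
        }
    }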

Note that in the currently committed code the arm_sysctl_ops endianness
is DEVICE_NATIVE_ENDIAN, which is wrong - the real vexpress
arm_sysctl device always gives/receives data in LE format regardless
of the current CPSR E bit value, so it cannot be marked as NATIVE.
LE and BE kernels always read it as an LE device; the BE kernel follows
up with a byteswap. It was OK while we only ran qemu in LE, but it
should be fixed to LITTLE_ENDIAN for BE qemu to work correctly
... and actually, changing that device and a few other ARM-specific
devices to LITTLE_ENDIAN was the only qemu change needed
to make BE KVM work.
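
The change amounts to one field in the ops declaration - a sketch, assuming
the usual MemoryRegionOps layout in hw/misc/arm_sysctl.c:

    static const MemoryRegionOps arm_sysctl_ops = {
        .read = arm_sysctl_read,
        .write = arm_sysctl_write,
        .endianness = DEVICE_LITTLE_ENDIAN,  /* was DEVICE_NATIVE_ENDIAN */
    };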

>>
>> Yes, I fully agree :).
>>
> Great, I'll prepare a patch for the KVM API documentation.
>
> -Christoffer

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186


    Appendix
    Data path endianness in ARM KVM mmio
    ====================================

This writeup considers several scenarios and tracks the endianness
of data as it travels from the emulator to a guest CPU register, in
the case of ARM KVM. It starts with the currently committed code for
the LE KVM host case and then discusses the proposed BE KVM host
arrangement.

To restrict the discussion, the writeup considers the code path of an
integer (4-byte) read from h/w mapped emulated device memory.
It considers the endianness at the essential places involved in such
a code path.

For all cases when endianness is defined, it is assumed that
the values under consideration are in memory (as opposed to in a
register, which does not have endianness). I.e. even if a function
variable could actually be allocated in a CPU register, the writeup
will refer to it as if it were in memory, just to keep the
discussion clean, except for the final guest CPU register.

Let's consider the following places along the data path from
the emulator to the guest CPU register:

1) the emulator code that holds the integer value to be read; assume
it is a global 'int emulated_hw_device_val' variable.
Normally in the emulator it is held in native endian format - i.e. the
emulator's CPSR E bit is the same as the kernel's CPSR E bit. Just for
discussion's sake, assume that this h/w device register
holds 5 as its value.

2) the KVM_EXIT_MMIO part of the 'struct kvm_run' structure, i.e. the
mmio.data byte array. A byte array does not have endianness,
but for this discussion we track the endianness of the integer
at the &mmio.data[0] address

3) the 'data' variable, of type 'unsigned long', in the
kvm_handle_mmio_return function before the vcpu_data_host_to_guest
call. The KVM host mmio_read_buf function is used to fill this
variable from the mmio.data buffer. mmio_read_buf effectively
acts as a memcpy from the mmio.data buffer address,
just taking the access size into account.

4) the same 'data' variable as above, but after the
vcpu_data_host_to_guest function call, just before it is copied
to the vcpu_reg target register location. Note that the
vcpu_data_host_to_guest function may byteswap the value of 'data',
depending on the current KVM host endianness and the value of the
guest CPSR E bit (a sketch of this helper follows the list below).

5) the guest CPU spilled register array, at the location of the target
register, i.e. the integer at the vcpu_reg(vcpu,
vcpu->arch.mmio_decode.rt) address

6) finally, the guest CPU register, filled from vcpu_reg just before the
guest resumes execution of the trapped emulated instruction. Note
this is done by the hypervisor part of the code, and the hypervisor EE
bit is the same as the KVM host CPSR E bit.
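
A sketch of the step-4 helper referenced above, simplified to the 4-byte
case and loosely based on vcpu_data_host_to_guest() in
arch/arm/include/asm/kvm_emulate.h (kernel types assumed; the proposed
BE-host patches would add a corresponding cpu_to_le32 swap for
E-bit-off accesses):

    static unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
                                                 unsigned long data)
    {
            if (kvm_vcpu_is_be(vcpu))         /* guest CPSR E bit set */
                    return cpu_to_be32(data); /* swaps on an LE host,
                                                 no-op on a BE host */
            return data;                      /* LE access left as-is */
    }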

Note again: the KVM host, emulator, and hypervisor part of the code
(the guest CPU register save and restore code) always run in the same
endianness. The endianness of the accessed emulated devices and the
endianness of the guest vary independently of the KVM host endianness.

The sections below consider all permutations of all possible cases;
it may be quite boring to read. I've created a summary table at
the end; you can jump to the table after reading a few cases.
But if you have objections and you see things happening differently,
please comment inline on the use case steps.

LE KVM host
===========

Use case 1
----------

The emulated h/w device gives data in LE form; the emulator and KVM
host endianness is LE (host CPSR E bit off); the guest is compiled
in LE mode; and the guest does the access with the CPSR E bit off

1) the 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in LE format, matching the device
endianness
3) 'data' is LE
4) 'data' is LE (since the guest CPSR E bit is off, no byteswap)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) the final guest target CPU register contains 5 (0x00000005)

The guest resumes execution ... let's say after an 'ldr r1, [r0]'
instruction, where r0 holds the address of the device; the guest
knows that it reads LE-mapped h/w, so no additional processing is
needed

Use case 2
----------

The emulated h/w device gives data in LE form; the emulator and KVM
host endianness is LE (host CPSR E bit off); the guest is compiled
in BE mode; and the guest does the access with the CPSR E bit on

1) the 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in LE format, matching the device
endianness
3) 'data' is LE
4) 'data' is BE (since the guest CPSR E bit is on, vcpu_data_host_to_guest
does a byteswap: cpu_to_be)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) the final guest target CPU register contains 0x05000000

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in BE mode (E bit on) and that it reads
LE device memory, so it needs to byteswap r1 before further
processing; it does 'rev r1, r1' and proceeds with the result

Use case 3
----------

The emulated h/w device gives data in BE form; the emulator and KVM
host endianness is LE (host CPSR E bit off); the guest is compiled
in LE mode; and the guest does the access with the CPSR E bit off

1) the 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in BE format; the emulator byteswaps
it because it knows the device endianness is opposite to native,
and the buffer should match the device endianness
3) 'data' is BE
4) 'data' is BE (since the guest CPSR E bit is off, no byteswap)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) the final guest target CPU register contains 0x05000000

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in LE mode (E bit off) and that it
reads BE device memory, so it needs to byteswap r1 before further
processing; it does 'rev r1, r1' and proceeds with the result

Use case 4
----------

The emulated h/w device gives data in BE form; the emulator and KVM
host endianness is LE (host CPSR E bit off); the guest is compiled
in BE mode; and the guest does the access with the CPSR E bit on

1) the 'emulated_hw_device_val' emulator variable is LE
2) &mmio.data[0] holds the integer in BE format; the emulator byteswaps
it because it knows the device endianness is opposite to native,
and the buffer should match the device endianness
3) 'data' is BE
4) 'data' is LE (since the guest CPSR E bit is on, vcpu_data_host_to_guest
does a byteswap: cpu_to_be)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) the final guest target CPU register contains 5 (0x00000005)

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in BE mode and that it reads BE device
memory, so it does not need to do anything before further
processing.


The above use cases are exactly what we have now, after Marc's
commit to support BE guests on an LE KVM host. The further use
cases describe how it would work with the BE KVM patches I proposed.
It is understood that this is subject to further discussion.


BE KVM host
===========

Use case 5
----------

The emulated h/w device gives data in LE form; the emulator and KVM
host endianness is BE (host CPSR E bit on); the guest is compiled
in BE mode; and the guest does the access with the CPSR E bit on

1) the 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in LE format; the emulator byteswaps
it because it knows the device endianness is opposite to native, and
the buffer matches the device endianness
3) 'data' is LE
4) 'data' is LE (since the guest CPSR E bit is on, the BE KVM host kernel
does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) the final guest target CPU register contains 0x05000000, because
the hypervisor runs in BE mode, so the load of an LE integer gives a
byteswapped value in the register

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in BE mode and that it reads LE device
memory, so it needs to byteswap r1 before further processing; it
does 'rev r1, r1' and proceeds with the result

Use case 6
----------

The emulated h/w device gives data in LE form; the emulator and KVM
host endianness is BE (host CPSR E bit on); the guest is compiled
in LE mode; and the guest does the access with the CPSR E bit off

1) the 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in LE format; the emulator byteswaps
it because it knows the device endianness is opposite to native, and
the buffer matches the device endianness
3) 'data' is LE
4) 'data' is BE (since the guest CPSR E bit is off, the BE KVM host kernel
does a byteswap: cpu_to_le)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) the final guest target CPU register contains 5 (0x00000005), because
the hypervisor runs in BE mode, so the load of a BE integer is OK

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in LE mode and that it reads LE device
memory, so it does not need to do anything else; it just proceeds

Use case 7
----------

The emulated h/w device gives data in BE form; the emulator and KVM
host endianness is BE (host CPSR E bit on); the guest is compiled
in BE mode; and the guest does the access with the CPSR E bit on

1) the 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in BE format, matching the device
endianness
3) 'data' is BE
4) 'data' is BE (since the guest CPSR E bit is on, the BE KVM host kernel
does *not* byteswap: cpu_to_be has no effect in a BE host kernel)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
6) the final guest target CPU register contains 5 (0x00000005), because
the hypervisor runs in BE mode, so the load of a BE integer is OK

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in BE mode and that it reads BE device
memory, so it does not need to do anything else; it just proceeds

Use case 8
----------

The emulated h/w device gives data in BE form; the emulator and KVM
host endianness is BE (host CPSR E bit on); the guest is compiled
in LE mode; and the guest does the access with the CPSR E bit off

1) the 'emulated_hw_device_val' emulator variable is BE
2) &mmio.data[0] holds the integer in BE format, matching the device
endianness
3) 'data' is BE
4) 'data' is LE (since the guest CPSR E bit is off, the BE KVM host kernel
does a byteswap: cpu_to_le)
5) the integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
6) the final guest target CPU register contains 0x05000000, because
the hypervisor runs in BE mode, so the load of an LE integer gives a
byteswapped value in the register

The guest resumes execution after 'ldr r1, [r0]'. The guest kernel
knows that it runs in LE mode and that it reads BE device
memory, so it needs to byteswap r1 before further processing; it
does 'rev r1, r1' and proceeds with the result

Note that with a BE guest kernel we actually have some initial portion
of assembler code that is executed with the CPSR E bit off and reads
LE h/w - i.e. it falls into use case 1.

Summary Table (please use a fixed font to see it correctly)
===========================================================

--------------------------------------------------------------
| Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
--------------------------------------------------------------
| KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
| Emulator,  |     |     |     |     |     |     |     |     |
| Hypervisor |     |     |     |     |     |     |     |     |
| Endianity  |     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
| Endianity  |     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
| Access     |     |     |     |     |     |     |     |     |
| Endianity  |     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
--------------------------------------------------------------
| Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
--------------------------------------------------------------
| Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
--------------------------------------------------------------
| Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
--------------------------------------------------------------
| Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
--------------------------------------------------------------
| Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
| value      |     |     |     |     |     |     |     |     |
| byteswapped|     |     |     |     |     |     |     |     |
--------------------------------------------------------------
| Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
| Follows    |     |     |     |     |     |     |     |     |
| with rev   |     |     |     |     |     |     |     |     |
--------------------------------------------------------------

A few observations
==================

x) Note the above table is symmetric w.r.t. the BE<->LE change:
       1<-->7
       2<-->8
       3<-->5
       4<-->6

x) The &mmio.data[0] address always holds the integer in the same
format as the emulated device endianness

x) During step 4), where the vcpu_data_host_to_guest function
is used, if the guest E bit value differs but everything else
is the same, opposite results are produced (1&2, 3&4, 5&6,
7&8)

If you reached the end :), again, thank you very much for
reading it!

- Victor


Re: KVM and variable-endianness guest CPUs

Anup Patel
On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
<[hidden email]> wrote:

> Hi Guys,
>
> Christoffer and I had a bit heated chat :) on this
> subject last night. Christoffer, really appreciate
> your time! We did not really reach agreement
> during the chat and Christoffer asked me to follow
> up on this thread.
> Here it goes. Sorry, it is very long email.
>
> I don't believe we can assign any endianity to
> mmio.data[] byte array. I believe mmio.data[] and
> mmio.len acts just memcpy and that is all. As
> memcpy does not imply any endianity of underlying
> data mmio.data[] should not either.
>
> Here is my definition:
>
> mmio.data[] is array of bytes that contains memory
> bytes in such form, for read case, that if those
> bytes are placed in guest memory and guest executes
> the same read access instruction with address to this
> memory, result would be the same as real h/w device
> memory access. Rest of KVM host and hypervisor
> part of code should really take care of mmio.data[]
> memory so it will be delivered to vcpu registers and
> restored by hypervisor part in such way that guest CPU
> register value is the same as it would be for real
> non-emulated h/w read access (that is emulation part).
> The same goes for write access, if guest writes into
> memory and those bytes are just copied to emulated
> h/w register it would have the same effect as real
> mapped h/w register write.
>
> In shorter form, i.e for len=4 access: endianity of integer
> at &mmio.data[0] address should match endianity
> of emulated h/w device behind phys_addr address,
> regardless what is endianity of emulator, KVM host,
> hypervisor, and guest
>
> Examples that illustrate my definition
> --------------------------------------
>
> 1) LE guest (E bit is off in ARM speak) reads integer
> (4 bytes) from mapped h/w LE device register -
> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>
> 2) BE guest (E bit is on in ARM speak) reads integer
> from mapped h/w LE device register - mmio.data[3]
> contains MSB, mmio.data[0] contains LSB. Note that
> if &mmio.data[0] memory would be placed in guest
> address space and instruction restarted with new
> address, then it would meet BE guest expectations
> - the guest knows that it reads LE h/w so it will byteswap
> register before processing it further. This is BE guest ARM
> case (regardless of what KVM host endianity is).
>
> 3) BE guest reads integer from mapped h/w BE device
> register - mmio.data[0] contains MSB, mmio.data[3]
> contains LSB. Note that if &mmio.data[0] memory would
> be placed in guest address space and instruction
> restarted with new address, then it would meet BE
> guest expectation - the guest knows that it reads
> BE h/w so it will proceed further without any other
> work. I guess, it is BE ppc case.
>
>
> Arguments in favor of memcpy semantics of mmio.data[]
> ------------------------------------------------------
>
> x) What are possible values of 'len'? Previous discussions
> imply that is always powers of 2. Why is that? Maybe
> there will be CPU that would need to do 5 bytes mmio
> access, or 6 bytes. How do you assign endianity to
> such case? 'len' 5 or 6, or any works fine with
> memcpy semantics. I admit it is hypothetical case, but
> IMHO it tests how clean ABI definition is.
>
> x) Byte array does not have endianity because it
> does not have any structure. If one would want to
> imply structure why mmio is not defined in such way
> so structure reflected in mmio definition?
> Something like:
>
>
>                 /* KVM_EXIT_MMIO */
>                 struct {
>                           __u64 phys_addr;
>                           union {
>                                __u8 byte;
>                                __u16 hword;
>                                __u32 word;
>                                __u64 dword;
>                           }  data;
>                           __u32 len;
>                           __u8  is_write;
>                 } mmio;
>
> where len is really serves as union discriminator and
> only allowed len values are 1, 2, 4, 8.
> In this case, I agree, endianity of integer types
> should be defined. I believe, use of byte array strongly
> implies that original intent was to have semantics of
> byte stream copy, just like memcpy does.
>
> x) Note there is nothing wrong with user kernel ABI to
> use just bytes stream as parameter. There is already
> precedents like 'read' and 'write' system calls :).
>
> x) Consider case when KVM works with emulated memory mapped
> h/w devices where some devices operate in LE mode and others
> operate in BE mode. It is defined by semantics of real h/w
> device which is it, and should be emulated by emulator and KVM
> given all other context. As far as mmio.data[] array concerned, if the
> same integer value is read from these devices registers, mmio.data[]
> memory should contain integer in opposite endianity for these
> two cases, i.e MSB is data[0] in one case and MSB is
> data[3] is in another case. It cannot be the same, because
> except emulator and guest kernel, all other, like KVM host
> and hypervisor, have no clue what endianity of device
> actually is - it should treat mmio.data[] in the same way.
> But resulting guest target CPU register would need to contain
> normal integer value in one case and byteswapped in another,
> because guest kernel would use it directly in one case and
> byteswap it in another. Byte stream semantics allows to do
> that. I don't see how it could happen if you fixate mmio.data[]
> endianity in such way that it would contain integer in
> the same format for BE and LE emulated device types.
>
> If by this point you agree, that mmio.data[] user-land/kernel
> ABI semantics should be just memcpy, stop reading :). If not,
> you may would like to take a look at below appendix where I
> described in great details endianity of data at different
> points along mmio processing code path of existing ARM LE KVM,
> and proposed ARM BE KVM. Note appendix, is very long and very
> detailed, sorry about that, but I feel that earlier more
> digested explanations failed, so it driven me to write out
> all details how I see them. If I am wrong, I hope it would be
> easier for folks to point in detailed explanation places
> where my logic goes bad. Also, I am not sure whether this
> mail thread is good place to discuss all details described
> in the appendix. Christoffer, please advise whether I should take
> that one back on [1]. But I hope this bigger picture may help to
> see the mmio.data[] semantics issue in context.
>
> More inline and appendix is at the end.
>
> On 20 January 2014 11:19, Christoffer Dall <[hidden email]> wrote:
>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>
>>> On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:
>>>
>>> > On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>>> >> Specifically, the KVM API says "here's a uint8_t[] byte
>>> >> array and a length", and the current QEMU code treats that
>>> >> as "this is a byte array written as if the guest CPU
>>> >> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>> >> I/O access to this buffer rather than to the device".
>>> >>
>>> >> The KVM API docs don't actually specify the endianness
>>> >> semantics of the byte array, but I think that that really
>>> >> needs to be nailed down. I can think of a couple of options:
>>> >> * always LE
>>> >> * always BE
>>> >>   [these first two are non-starters because they would
>>> >>   break either x86 or PPC existing code]
>>> >> * always the endianness the guest is at the time
>>> >> * always some arbitrary endianness based purely on the
>>> >>   endianness the KVM implementation used historically
>>> >> * always the endianness of the host QEMU binary
>>> >> * something else?
>>> >>
>>> >> Any preferences? Current QEMU code basically assumes
>>> >> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>> >> which is pretty random.
>>> >
>>> > Having thought a little more about this, my opinion is:
>>> >
>>> > * we should specify that the byte order of the mmio.data
>>> >   array is host kernel endianness (ie same endianness
>>> >   as the QEMU process itself) [this is what it actually
>>> >   is, I think, for all the cases that work today]
>
> In above please consider two types of mapped emulated
> h/w devices: BE and LE they cannot have mmio.data in the
> same endianity. Currently in all observable cases LE ARM
> and BE PPC devices endianity matches kernel/qemu
> endianity but it would break when BE ARM is introduced
> or LE PPC or one would start emulating BE devices on LE
> ARM.
>
>>> > * we should fix the code path in QEMU for handling
>>> >   mmio.data which currently has the implicit assumption
>>> >   that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>> >   as the QEMU host process endianness (because it's using
>>> >   load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>> >   is different from HOST_WORDS_BIGENDIAN)
>
> I do not follow above. Maybe I am missing bigger context.
> What is CPU under discussion in above? On ARM V7 system
> when LE device is accessed as integer &mmio.data[0] address
> would contain integer is in LE format, ie mmio.data[0] is LSB.
>
> Here is gdb session of LE qemu running on V7 LE kernel and
> TC1 LE guest. Guest kernel accesses sys_cfgstat register which is
> arm_sysctl registers with offset of 0xa8. Note.arm_sysct is memory
> mapped LE device.
> Please check run->mmio structure after read
> (cpu_physical_memory_rw) completes it is in 4 bytes integer in
> LE format mmio.data[0] is LSB and is equal to 1
> (s->syscfgstat value):
>
> (gdb) bt
> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
> mask=4294967295)
>     at /home/root/20131219/qemu-be/memory.c:407
> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
> access_size_min=1,
>     access_size_max=2357596, access=access@entry=0x23b96c
> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
> /home/root/20131219/qemu-be/memory.c:477
> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
> /home/root/20131219/qemu-be/memory.c:1743
> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
> len=4, is_write=false,
>     is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>     at /home/root/20131219/qemu-be/exec.c:2070
> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
> /home/root/20131219/qemu-be/kvm-all.c:1701
> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
> /home/root/20131219/qemu-be/cpus.c:874
> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
> #11 0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> #12 0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> (gdb) p /x s->sys_cfgstat
> $25 = 0x1
> (gdb) finish
> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
> /home/root/20131219/qemu-be/memory.c:408
> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
> Value returned is $26 = 1
> (gdb) enable 2
> (gdb) cont
> Continuing.
>
> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
> /home/root/20131219/qemu-be/kvm-all.c:1660
> 1660            kvm_arch_pre_run(cpu, run);
> (gdb) bt
> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
> /home/root/20131219/qemu-be/kvm-all.c:1660
> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
> /home/root/20131219/qemu-be/cpus.c:874
> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
> #3  0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> #4  0xb69f5070 in ?? () at
> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> (gdb) p /x run->mmio
> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>
> Also please look at the adjust_endianness function and the
> struct MemoryRegion 'endianness' field. IMHO in qemu this
> already works quite nicely. MemoryRegion 'read' and 'write'
> callbacks return/take data in native format; the adjust_endianness
> function checks whether the emulated device endianness matches
> the emulator endianness and, if it differs, byteswaps
> according to the access size. In the example above, the arm_sysctl_ops
> memory region should be marked as DEVICE_LITTLE_ENDIAN: when it
> returns the s->sys_cfgstat value, LE qemu sees that the endianity
> matches and does not byteswap the result, so the integer at
> the &mmio.data[0] address is in LE form. If qemu were to
> run in BE mode on a BE kernel, it would see that the endianity
> mismatches and would byteswap the native (BE) s->sys_cfgstat value,
> so mmio.data would contain the integer in LE format again.
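>
> As a rough sketch of that decision (simplified from the memory.c
> logic seen in the backtrace above; 4-byte accesses only, and
> DEVICE_NATIVE_ENDIAN left aside since that case is exactly what is
> under dispute):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     enum device_endian { DEVICE_BIG_ENDIAN, DEVICE_LITTLE_ENDIAN };
>
>     /* true if the device's byte order differs from the host's */
>     static bool wrong_endianness(enum device_endian e)
>     {
>     #ifdef HOST_WORDS_BIGENDIAN
>         return e == DEVICE_LITTLE_ENDIAN;
>     #else
>         return e == DEVICE_BIG_ENDIAN;
>     #endif
>     }
>
>     /* native-format value from the device callback -> mmio.data form */
>     static uint32_t adjust_endianness_4(enum device_endian e, uint32_t val)
>     {
>         return wrong_endianness(e) ? __builtin_bswap32(val) : val;
>     }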
>
> Note that in the currently committed code the arm_sysctl_ops endianity
> is DEVICE_NATIVE_ENDIAN, which is wrong - the real vexpress
> arm_sysctl device always gives/receives data in LE format regardless
> of the current CPSR E bit value, so it cannot be marked as NATIVE.
> LE and BE kernels always read it as an LE device; a BE kernel follows
> up with a byteswap. That was OK while we only ran qemu in LE, but it
> should be fixed to LITTLE_ENDIAN for BE qemu to work correctly
> ... and in fact changing the endianity of that device and a few other
> ARM-specific devices to LITTLE_ENDIAN was the only qemu change
> needed to make BE KVM work.
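>
> The fix being described amounts to a one-field change in the
> device's MemoryRegionOps (sketch; the field and constants are real
> QEMU ones, the rest is abbreviated):
>
>     static const MemoryRegionOps arm_sysctl_ops = {
>         .read = arm_sysctl_read,
>         .write = arm_sysctl_write,
>         .endianness = DEVICE_LITTLE_ENDIAN, /* was DEVICE_NATIVE_ENDIAN */
>     };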
>
>>>
>>> Yes, I fully agree :).
>>>
>> Great, I'll prepare a patch for the KVM API documentation.
>>
>> -Christoffer
>
> Thanks,
> Victor
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>
>
>     Appendix
>     Data path endianity in ARM KVM mmio
>     ===================================
>
> This writeup considers several scenarios and tracks the endianity
> of data as it travels from the emulator to the guest CPU register
> in the ARM KVM case. It starts with the currently committed code for
> the LE KVM host case and then discusses the proposed BE KVM host
> arrangement.
>
> To keep the discussion bounded, the writeup considers the code path
> of an integer (4-byte) read from h/w mapped emulated device memory,
> and looks at the endianity at the essential places involved in that
> code path.
>
> For all cases where endianity is defined, it is assumed that the
> values under consideration are in memory (as opposed to in a
> register, which has no endianity). I.e. even if a function
> variable might actually be allocated in a CPU register, the writeup
> treats it as if it were in memory, just to keep the
> discussion clean - except for the final guest CPU register.
>
> Let's consider the following places along the data path from the
> emulator to the guest CPU register:
>
> 1) the emulator code that holds the integer value to be read; assume
> it is a global 'int emulated_hw_device_val' variable.
> Normally the emulator holds it in native endian format - i.e.
> the emulator's CPSR E bit is the same as the kernel's CPSR E bit. For
> discussion's sake assume that this h/w device register
> holds 5 as its value.
>
> 2) the KVM_EXIT_MMIO part of 'struct kvm_run', i.e. the
> mmio.data byte array. A byte array does not have endianity,
> but for this discussion we track the endianity of the integer
> at the &mmio.data[0] address
>
> 3) the 'data' variable of type 'unsigned long' in the
> kvm_handle_mmio_return function, before the vcpu_data_host_to_guest
> call. The KVM host mmio_read_buf function is used to fill this
> variable from the mmio.data buffer. mmio_read_buf effectively
> acts as a memcpy from the mmio.data buffer address,
> just taking the access size into account.
>
> 4) the same 'data' variable as above, but after the
> vcpu_data_host_to_guest function call, just before it is copied
> to the vcpu_reg target register location. Note that the
> vcpu_data_host_to_guest function may byteswap the value of 'data'
> depending on the current KVM host endianity and the value of the
> guest CPSR E bit (see the sketch after this list).
>
> 5) the guest CPU spilled register array, at the location of the
> target register, i.e. the integer at the vcpu_reg(vcpu,
> vcpu->arch.mmio_decode.rt) address
>
> 6) finally, the guest CPU register filled from vcpu_reg just before
> the guest resumes execution of the trapped emulated instruction. Note
> this is done by the hypervisor part of the code, and the hypervisor EE
> bit is the same as the KVM host CPSR E bit.
>
> Note again: the KVM host, emulator, and hypervisor part of the code
> (the guest CPU register save and restore code) always run in the same
> endianity. The endianity of the accessed emulated devices and the
> endianity of the guest vary independently of the KVM host endianity.
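>
> As a self-contained model of steps 3) and 4) (the helper names
> mirror the kernel functions named above, but this is plain userspace
> C, reduced to the 4-byte case; it also folds in the proposed BE-host
> behavior, not only the committed LE-host one):
>
>     #include <stdint.h>
>     #include <string.h>
>
>     static int host_is_be(void)
>     {
>         uint16_t probe = 1;
>         return *(uint8_t *)&probe == 0;
>     }
>
>     /* step 3): mmio_read_buf is effectively a memcpy from mmio.data */
>     static uint32_t mmio_read_buf4(const uint8_t *buf)
>     {
>         uint32_t data;
>         memcpy(&data, buf, sizeof(data));
>         return data;
>     }
>
>     /* step 4): byteswap iff guest CPSR.E disagrees with host order;
>      * this is cpu_to_be32 for E=1 and cpu_to_le32 for E=0 */
>     static uint32_t vcpu_data_host_to_guest4(int guest_e_bit, uint32_t data)
>     {
>         if (guest_e_bit != host_is_be())
>             data = __builtin_bswap32(data);
>         return data;
>     }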
>
> The sections below consider all permutations of the possible cases;
> it may be quite boring to read. I've put a summary table at
> the end - you can jump to the table after reading a few cases.
> But if you have objections and you see things happening differently,
> please comment inline on the use case steps.
>
> LE KVM host
> ===========
>
> Use case 1
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianity is LE (host CPSR E bit is off); guest compiled
> in LE mode; and guest does access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds integer in LE format, matches device
> endianity
> 3) 'data' is LE
> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 5 (0x00000005)
>
> guest resumes execution ... let's say after an 'ldr r1, [r0]'
> instruction, where r0 holds the device address; the guest knows
> that it reads LE-mapped h/w, so no additional processing is
> needed
>
> Use case 2
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianity is LE (host CPSR E bit is off); guest compiled
> in BE mode; and guest does access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds integer in LE format; matches device
> endianity
> 3) 'data' is LE
> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
> will do byteswap: cpu_to_be)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 0x05000000
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode (E bit on) and that it reads
> LE device memory, so it needs to byteswap r1 before further
> processing: it does 'rev r1, r1' and proceeds with the result
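>
> As a byte-level trace of this use case using the model functions
> sketched after the step list (value 5, LE device, LE host, BE guest):
>
>     uint8_t mmio_data[4] = { 0x05, 0x00, 0x00, 0x00 }; /* step 2): LE */
>     uint32_t data = mmio_read_buf4(mmio_data);  /* step 3): 0x00000005 */
>     data = vcpu_data_host_to_guest4(1, data);   /* step 4): 0x05000000 */
>     /* step 6): guest r1 = 0x05000000; 'rev r1, r1' recovers 5 */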
>
> Use case 3
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianity is LE (host CPSR E bit is off); guest compiled
> in LE mode; and guest does access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
> it because it knows that device endianity is opposite to native,
> and it should match device endianity
> 3) 'data' is BE
> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 0x05000000
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in LE mode (E bit off) and that it
> reads BE device memory, so it needs to byteswap r1 before further
> processing: it does 'rev r1, r1' and proceeds with the result
>
> Use case 4
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianity is LE (host CPSR E bit is off); guest compiled
> in BE mode; and guest does access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is LE
> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
> it because it knows that device endianity is opposite to native,
> and should match device endianity
> 3) 'data' is BE
> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
> will do byteswap: cpu_to_be)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 5 (0x00000005)
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode and that it reads BE device
> memory, so it does not need to do anything before further
> processing.
>
>
> The use cases above are exactly what we have now after Marc's
> commit to support BE guests on an LE KVM host. The further use
> cases describe how it would work with the BE KVM patches I proposed.
> It is understood that this is subject to further discussion.
>
>
> BE KVM host
> ===========
>
> Use case 5
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianity is BE (host CPSR E bit is on); guest compiled
> in BE mode; and guest does access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
> it because it knows that device endianity is opposite to native;
> matches device endianity
> 3) 'data' is LE
> 4) 'data' is LE (since guest CPSR E bit is on, BE KVM host kernel
> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 0x05000000 because
> hypervisor runs in BE mode, so load of LE integer will be
> byteswapped value in register
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode and that it reads LE device
> memory, so it needs to byteswap r1 before further processing: it
> does 'rev r1, r1' and proceeds with the result
>
> Use case 6
> ----------
>
> Emulated h/w device gives data in LE form; emulator and KVM
> host endianity is BE (host CPSR E bit is on); guest compiled
> in LE mode; and guest does access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
> it because it knows that device endianity is opposite to native;
> matches device endianity
> 3) 'data' is LE
> 4) 'data' is BE (since guest CPSR E bit is off, BE KVM host kernel
> does byteswap: cpu_to_le)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 5 (0x00000005) because
> hypervisor runs in BE mode, so load of BE integer will be OK
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in LE mode and that it reads LE device
> memory, so it does not need to do anything else; it just proceeds
>
> Use case 7
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianity is BE (host CPSR E bit is on); guest compiled
> in BE mode; and guest does access with CPSR E bit on
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds integer in BE format; matches device
> endianity
> 3) 'data' is BE
> 4) 'data' is BE (since guest CPSR E bit is on, BE KVM host kernel
> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
> 6) final guest target CPU register contains 5 (0x00000005) because
> hypervisor runs in BE mode, so load of BE integer will be OK
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in BE mode and that it reads BE device
> memory, so it does not need to do anything else; it just proceeds
>
> Use case 8
> ----------
>
> Emulated h/w device gives data in BE form; emulator and KVM
> host endianity is BE (host CPSR E bit is on); guest compiled
> in LE mode; and guest does access with CPSR E bit off
>
> 1) 'emulated_hw_device_val' emulator variable is BE
> 2) &mmio.data[0] holds integer in BE format; matches device
> endianity
> 3) 'data' is BE
> 4) 'data' is LE (since guest CPSR E bit is off, BE KVM host kernel
> does byteswap: cpu_to_le)
> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
> 6) final guest target CPU register contains 0x05000000 because
> hypervisor runs in BE mode, so load of LE integer will be
> byteswapped value in register
>
> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
> knows that it runs in LE mode and that it reads BE device
> memory, so it needs to byteswap r1 before further processing: it
> does 'rev r1, r1' and proceeds with the result
>
> Note that with a BE kernel we actually have some initial portion
> of assembler code that executes with the CPSR E bit off and reads
> LE h/w - i.e. it falls into use case 1.
>
> Summary Table (please use fixed font to see it correctly)
> ========================================
>
> --------------------------------------------------------------
> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
> --------------------------------------------------------------
> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
> | Emulator,  |     |     |     |     |     |     |     |     |
> | Hypervisor |     |     |     |     |     |     |     |     |
> | Endianity  |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
> | Endianity  |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
> | Access     |     |     |     |     |     |     |     |     |
> | Endianity  |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
> --------------------------------------------------------------
> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
> --------------------------------------------------------------
> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
> --------------------------------------------------------------
> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
> --------------------------------------------------------------
> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
> --------------------------------------------------------------
> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
> | value      |     |     |     |     |     |     |     |     |
> | byteswapped|     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
> | Follows    |     |     |     |     |     |     |     |     |
> | with rev   |     |     |     |     |     |     |     |     |
> --------------------------------------------------------------
>
> A few observations
> ==================
>
> x) Note the table above is symmetric with respect to a BE<->LE swap:
>        1<-->7
>        2<-->8
>        3<-->5
>        4<-->6
>
> x) the &mmio.data[0] address always holds the integer in the same
> format as the emulated device endianity
>
> x) During step 4), where the vcpu_data_host_to_guest function
> is used, if the guest E bit value differs but everything else
> is the same, opposite results are produced (1&2, 3&4, 5&6,
> 7&8)
>
> If you reached the end :), again, thank you very much for
> reading it!
>
> - Victor

Hi Victor,

First of all I really appreciate the thorough description with
all the use-cases.

Below would be a summary of what I understood from your
analysis:

1. Any MMIO device marked as NATIVE ENDIAN in a user
space tool (QEMU or KVMTOOL) is bad for a cross-endian
Guest. To support cross-endian Guests we need to have
all MMIO devices with a fixed ENDIANNESS.

2. We don't need to do any endianness conversions in KVM
for MMIO writes that are being forwarded to user space. It is
the job of user space (QEMU or KVMTOOL) to interpret the
endianness of MMIO write data based on device endianness.

3. The MMIO read operation is the one which will need
explicit handling in KVM because the target VCPU register
of MMIO read operation should be loaded with MMIO data
(returned from user space) based upon current VCPU
endianness (i.e. VCPU CPSR.E bit).

4. In-kernel emulated devices (such as VGIC) will not
require any explicit endianness conversion of MMIO data for
MMIO write operations (same as point 2).

5. In-kernel emulated devices (such as VGIC) will have to do
explicit endianness conversion of MMIO data for MMIO read
operations based on device endianness (same as point 3); see
the sketch below.
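A hypothetical in-kernel device with a fixed-LE register makes points
3 and 5 concrete (all names here are illustrative, not the real VGIC
code):

    #include <stdint.h>
    #include <string.h>

    static uint32_t dev_reg = 5;   /* device state, host-native order */

    /* point 5: the in-kernel device hands out bytes in the device's
     * (here fixed-LE) order, whatever the host endianness is */
    static void emulated_dev_read4(uint8_t out[4])
    {
        uint32_t probe = 1;
        int host_is_be = (*(uint8_t *)&probe == 0);
        uint32_t le = host_is_be ? __builtin_bswap32(dev_reg) : dev_reg;
        memcpy(out, &le, sizeof(le));
        /* point 3: the final load into the VCPU register then honors
         * the VCPU's CPSR.E bit (vcpu_data_host_to_guest) */
    }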

I hope the above summary of my understanding matches your
description. If so, then I am in support of it.

I think your description (and the above 5 points) takes care of
all cross-endianness use cases without changing the current
MMIO ABI.

Regards,
Anup


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

Alexander Graf-4


> On 22.01.2014 at 07:31, Anup Patel <[hidden email]> wrote:
>
> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
> <[hidden email]> wrote:
>> Hi Guys,
>>
>> Christoffer and I had a bit of a heated chat :) on this
>> subject last night. Christoffer, I really appreciate
>> your time! We did not really reach agreement
>> during the chat, and Christoffer asked me to follow
>> up on this thread.
>> Here it goes. Sorry, it is a very long email.
>>
>> I don't believe we can assign any endianity to the
>> mmio.data[] byte array. I believe mmio.data[] and
>> mmio.len act just like memcpy and that is all. As
>> memcpy does not imply any endianity of the underlying
>> data, mmio.data[] should not either.
>>
>> Here is my definition:
>>
>> mmio.data[] is an array of bytes that contains memory
>> bytes in such a form that, for the read case, if those
>> bytes were placed in guest memory and the guest executed
>> the same read access instruction against that
>> memory, the result would be the same as for a real h/w device
>> memory access. The rest of the KVM host and the hypervisor
>> part of the code should handle the mmio.data[]
>> memory so that it is delivered to the vcpu registers and
>> restored by the hypervisor part in such a way that the guest CPU
>> register value is the same as it would be for a real,
>> non-emulated h/w read access (that is the emulation part).
>> The same goes for write access: if the guest writes into
>> memory and those bytes are just copied to the emulated
>> h/w register, it has the same effect as a real
>> mapped h/w register write.
>>
>> In shorter form, i.e. for a len=4 access: the endianity of the
>> integer at the &mmio.data[0] address should match the endianity
>> of the emulated h/w device behind the phys_addr address,
>> regardless of the endianity of the emulator, KVM host,
>> hypervisor, and guest
>>
>> Examples that illustrate my definition
>> --------------------------------------
>>
>> 1) An LE guest (E bit off in ARM speak) reads an integer
>> (4 bytes) from a mapped h/w LE device register -
>> mmio.data[3] contains the MSB, mmio.data[0] contains the LSB.
>>
>> 2) A BE guest (E bit on in ARM speak) reads an integer
>> from a mapped h/w LE device register - mmio.data[3]
>> contains the MSB, mmio.data[0] contains the LSB. Note that
>> if the &mmio.data[0] memory were placed in the guest
>> address space and the instruction restarted with the new
>> address, it would meet the BE guest's expectations
>> - the guest knows that it reads LE h/w, so it will byteswap the
>> register before processing it further. This is the BE guest ARM
>> case (regardless of the KVM host endianity).
>>
>> 3) A BE guest reads an integer from a mapped h/w BE device
>> register - mmio.data[0] contains the MSB, mmio.data[3]
>> contains the LSB. Note that if the &mmio.data[0] memory were
>> placed in the guest address space and the instruction
>> restarted with the new address, it would meet the BE
>> guest's expectation - the guest knows that it reads
>> BE h/w, so it will proceed without any other
>> work. I guess this is the BE ppc case.
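>>
>> For a 4-byte read of a made-up value 0x11223344, the three examples
>> above imply these byte layouts:
>>
>>     uint8_t le_device[4] = { 0x44, 0x33, 0x22, 0x11 }; /* examples 1, 2 */
>>     uint8_t be_device[4] = { 0x11, 0x22, 0x33, 0x44 }; /* example 3 */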
>>
>>
>> Arguments in favor of memcpy semantics for mmio.data[]
>> ------------------------------------------------------
>>
>> x) What are the possible values of 'len'? Previous discussions
>> imply it is always a power of 2. Why is that? Maybe
>> there will be a CPU that needs to do a 5-byte or 6-byte mmio
>> access. How do you assign endianity to
>> such a case? A 'len' of 5 or 6, or anything else, works fine with
>> memcpy semantics. I admit it is a hypothetical case, but
>> IMHO it tests how clean the ABI definition is.
>>
>> x) A byte array does not have endianity because it
>> does not have any structure. If one wanted to
>> imply structure, why is mmio not defined in such a way
>> that the structure is reflected in the mmio definition?
>> Something like:
>>
>>
>>                /* KVM_EXIT_MMIO */
>>                struct {
>>                          __u64 phys_addr;
>>                          union {
>>                               __u8 byte;
>>                               __u16 hword;
>>                               __u32 word;
>>                               __u64 dword;
>>                          }  data;
>>                          __u32 len;
>>                          __u8  is_write;
>>                } mmio;
>>
>> where len really serves as the union discriminator and
>> the only allowed len values are 1, 2, 4, 8.
>> In this case, I agree, the endianity of the integer types
>> should be defined. I believe the use of a byte array strongly
>> implies that the original intent was to have the semantics of
>> a byte stream copy, just like memcpy.
>>
>> x) Note there is nothing wrong with a user/kernel ABI
>> using just a byte stream as a parameter. There are already
>> precedents, like the 'read' and 'write' system calls :).
>>
>> x) Consider the case where KVM works with emulated memory-mapped
>> h/w devices of which some operate in LE mode and others
>> in BE mode. The semantics of the real h/w device define
>> which it is, and the emulator and KVM should emulate it
>> given all the other context. As far as the mmio.data[] array is
>> concerned, if the same integer value is read from the registers of
>> such devices, the mmio.data[] memory should contain the integer in
>> opposite endianity for the two cases, i.e. the MSB is data[0] in one
>> case and the MSB is data[3] in the other. It cannot be the same,
>> because, apart from the emulator and the guest kernel, everything
>> else - the KVM host and the hypervisor - has no clue what the
>> endianity of the device actually is, so it must treat mmio.data[]
>> the same way in both cases. But the resulting guest target CPU
>> register would need to contain the normal integer value in one case
>> and a byteswapped one in the other, because the guest kernel will use
>> it directly in one case and byteswap it in the other. Byte stream
>> semantics allow exactly that. I don't see how it could work if you
>> fix the mmio.data[] endianity such that it contains the integer in
>> the same format for both BE and LE emulated device types.
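>>
>> Under these memcpy semantics the emulator side of a read reduces to
>> a sketch like the following (names are illustrative; htobe32/htole32
>> are the usual <endian.h> helpers):
>>
>>     #include <endian.h>
>>     #include <stdint.h>
>>     #include <string.h>
>>
>>     /* mmio.data always carries device-order bytes, nothing more */
>>     static void fill_mmio_read4(uint8_t mmio_data[4], uint32_t val,
>>                                 int dev_is_be)
>>     {
>>         uint32_t wire = dev_is_be ? htobe32(val) : htole32(val);
>>         memcpy(mmio_data, &wire, sizeof(wire));
>>     }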
>>
>> If by this point you agree that the mmio.data[] user-land/kernel
>> ABI semantics should be just memcpy, stop reading :). If not,
>> you may want to take a look at the appendix below, where I
>> describe in great detail the endianity of the data at different
>> points along the mmio processing code path of the existing ARM LE
>> KVM and the proposed ARM BE KVM. The appendix is very long and very
>> detailed, sorry about that, but I feel that the earlier, more
>> digested explanations failed, which drove me to write out
>> all the details as I see them. If I am wrong, I hope the detailed
>> explanation makes it easier for folks to point at the places
>> where my logic goes bad. Also, I am not sure whether this
>> mail thread is a good place to discuss all the details described
>> in the appendix. Christoffer, please advise whether I should take
>> that back to [1]. But I hope this bigger picture may help to
>> see the mmio.data[] semantics issue in context.
>>
>> More inline; the appendix is at the end.
>>
>>> On 20 January 2014 11:19, Christoffer Dall <[hidden email]> wrote:
>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>
>>>>> On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:
>>>>>
>>>>>> On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>> array and a length", and the current QEMU code treats that
>>>>>> as "this is a byte array written as if the guest CPU
>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>> I/O access to this buffer rather than to the device".
>>>>>>
>>>>>> The KVM API docs don't actually specify the endianness
>>>>>> semantics of the byte array, but I think that that really
>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>> * always LE
>>>>>> * always BE
>>>>>>  [these first two are non-starters because they would
>>>>>>  break either x86 or PPC existing code]
>>>>>> * always the endianness the guest is at the time
>>>>>> * always some arbitrary endianness based purely on the
>>>>>>  endianness the KVM implementation used historically
>>>>>> * always the endianness of the host QEMU binary
>>>>>> * something else?
>>>>>>
>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>> which is pretty random.
>>>>>
>>>>> Having thought a little more about this, my opinion is:
>>>>>
>>>>> * we should specify that the byte order of the mmio.data
>>>>>  array is host kernel endianness (ie same endianness
>>>>>  as the QEMU process itself) [this is what it actually
>>>>>  is, I think, for all the cases that work today]
>>
>> In above please consider two types of mapped emulated
>> h/w devices: BE and LE they cannot have mmio.data in the
>> same endianity. Currently in all observable cases LE ARM
>> and BE PPC devices endianity matches kernel/qemu
>> endianity but it would break when BE ARM is introduced
>> or LE PPC or one would start emulating BE devices on LE
>> ARM.
>>
>> [...]
>
> Hi Victor,
>
> First of all I really appreciate the thorough description with
> all the use-cases.
>
> Below would be a summary of what I understood from your
> analysis:
>
> 1. Any MMIO device marked as NATIVE ENDIAN in user

"Native endian" really is just a shortcut for "target endian" which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le.

QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU.

A bus write on the CPU, however, honors the endianness setting of the CPU. So when we convert from a value in a register to a value on the bus, we need to take this endian configuration into account.

That's exactly what we are talking about here. KVM should do the cpu configured register->bus endian mapping while QEMU does the bus->device endian map.
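As a sketch of that split for a 4-byte access (condensed; the names
are illustrative, not from the real code):

    #include <stdint.h>

    /* KVM stage: register value <-> bus bytes, honoring CPSR.E */
    static uint32_t reg_to_bus(uint32_t reg, int guest_e, int host_be)
    {
        return guest_e != host_be ? __builtin_bswap32(reg) : reg;
    }

    /* QEMU stage: bus bytes <-> device value, honoring device endianness */
    static uint32_t bus_to_dev(uint32_t bus, int dev_be, int host_be)
    {
        return dev_be != host_be ? __builtin_bswap32(bus) : bus;
    }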


Alex

> space tool (QEMU or KVMTOOL) is bad for cross-endian
> Guest. For supporting cross-endian Guest we need to have
> all MMIO device with fixed ENDIANESS.
>
> [...]


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

Victor Kamensky
On 21 January 2014 22:41, Alexander Graf <[hidden email]> wrote:

>
>
>> On 22.01.2014 at 07:31, Anup Patel <[hidden email]> wrote:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <[hidden email]> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is very long email.
>>>
>>> I don't believe we can assign any endianity to
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len acts just memcpy and that is all. As
>>> memcpy does not imply any endianity of underlying
>>> data mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are possible values of 'len'? Previous discussions
>>> imply that is always powers of 2. Why is that? Maybe
>>> there will be CPU that would need to do 5 bytes mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such case? 'len' 5 or 6, or any works fine with
>>> memcpy semantics. I admit it is hypothetical case, but
>>> IMHO it tests how clean ABI definition is.
>>>
>>> x) Byte array does not have endianity because it
>>> does not have any structure. If one would want to
>>> imply structure why mmio is not defined in such way
>>> so structure reflected in mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len is really serves as union discriminator and
>>> only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, endianity of integer types
>>> should be defined. I believe, use of byte array strongly
>>> implies that original intent was to have semantics of
>>> byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with user kernel ABI to
>>> use just bytes stream as parameter. There is already
>>> precedents like 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device which is it, and should be emulated by emulator and KVM
>>> given all other context. As far as mmio.data[] array concerned, if the
>>> same integer value is read from these devices registers, mmio.data[]
>>> memory should contain integer in opposite endianity for these
>>> two cases, i.e MSB is data[0] in one case and MSB is
>>> data[3] is in another case. It cannot be the same, because
>>> except emulator and guest kernel, all other, like KVM host
>>> and hypervisor, have no clue what endianity of device
>>> actually is - it should treat mmio.data[] in the same way.
>>> But resulting guest target CPU register would need to contain
>>> normal integer value in one case and byteswapped in another,
>>> because guest kernel would use it directly in one case and
>>> byteswap it in another. Byte stream semantics allows to do
>>> that. I don't see how it could happen if you fixate mmio.data[]
>>> endianity in such way that it would contain integer in
>>> the same format for BE and LE emulated device types.
>>>
>>> If by this point you agree, that mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you may would like to take a look at below appendix where I
>>> described in great details endianity of data at different
>>> points along mmio processing code path of existing ARM LE KVM,
>>> and proposed ARM BE KVM. Note appendix, is very long and very
>>> detailed, sorry about that, but I feel that earlier more
>>> digested explanations failed, so it driven me to write out
>>> all details how I see them. If I am wrong, I hope it would be
>>> easier for folks to point in detailed explanation places
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is good place to discuss all details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that one back on [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <[hidden email]> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In the above, please consider two types of mapped emulated
>>> h/w devices: BE and LE. They cannot have mmio.data in the
>>> same endianness. Currently, in all observable cases, LE ARM
>>> and BE PPC device endianness matches the kernel/qemu
>>> endianness, but this would break when BE ARM is introduced,
>>> or LE PPC, or when one starts emulating BE devices on LE
>>> ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow the above. Maybe I am missing the bigger
>>> context. Which CPU is under discussion above? On an ARM V7
>>> system, when an LE device is accessed as an integer, the
>>> &mmio.data[0] address contains the integer in LE format, i.e.
>>> mmio.data[0] is the LSB.
>>>
>>> Here is a gdb session of LE qemu running on a V7 LE kernel with
>>> a TC1 LE guest. The guest kernel accesses the sys_cfgstat
>>> register, which is an arm_sysctl register at offset 0xa8. Note
>>> arm_sysctl is a memory-mapped LE device.
>>> Please check the run->mmio structure after the read
>>> (cpu_physical_memory_rw) completes: it holds a 4-byte integer in
>>> LE format, mmio.data[0] is the LSB and equals 1
>>> (the s->sys_cfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at the adjust_endianness function and the
>>> struct MemoryRegion 'endianness' field. IMHO it already works
>>> quite nicely in qemu. MemoryRegion 'read' and 'write' callbacks
>>> return/take data in native format; the adjust_endianness
>>> function checks whether the emulated device endianness matches
>>> the emulator endianness, and if it differs it byteswaps
>>> according to size. As in the above example, the arm_sysctl_ops
>>> memory region should be marked as DEVICE_LITTLE_ENDIAN; when it
>>> returns the s->sys_cfgstat value, LE qemu sees that the
>>> endianness matches and does not byteswap the result, so the
>>> integer at the &mmio.data[0] address is in LE form. When qemu
>>> runs in BE mode on a BE kernel, it will see that the endianness
>>> mismatches and will byteswap the native (BE) s->sys_cfgstat
>>> value, so mmio.data will contain the integer in LE format again.
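
For reference, the swap decision described above boils down to roughly the following (a condensed sketch of qemu's memory.c logic of that era, not a verbatim copy):

    /* Condensed sketch: swap only when the MemoryRegion's declared
     * device endianness differs from the endianness the emulator
     * binary was built for. */
    static void adjust_endianness(MemoryRegion *mr, uint64_t *data, unsigned size)
    {
        if (memory_region_wrong_endianness(mr)) {
            switch (size) {
            case 2: *data = bswap16(*data); break;
            case 4: *data = bswap32(*data); break;
            case 8: *data = bswap64(*data); break;
            }
        }
    }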
>>>
>>> Note that in the currently committed code the arm_sysctl_ops
>>> endianness is DEVICE_NATIVE_ENDIAN, which is wrong - the real
>>> vexpress arm_sysctl device always gives/receives data in LE
>>> format regardless of the current CPSR E bit value, so it cannot
>>> be marked as NATIVE. LE and BE kernels always read it as an LE
>>> device; the BE kernel follows up with a byteswap. It was OK
>>> while we only ran qemu in LE, but it should be fixed to
>>> LITTLE_ENDIAN for BE qemu to work correctly ... and in fact
>>> changing that device and a few other ARM-specific devices to
>>> LITTLE_ENDIAN was the only qemu change needed to make BE KVM
>>> work.
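
The fix itself is a one-line change per affected device, along these lines (a sketch following QEMU's MemoryRegionOps convention):

    static const MemoryRegionOps arm_sysctl_ops = {
        .read = arm_sysctl_read,
        .write = arm_sysctl_write,
        .endianness = DEVICE_LITTLE_ENDIAN,  /* was DEVICE_NATIVE_ENDIAN */
    };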
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianness in ARM KVM mmio
>>>    ====================================
>>>
>>> This writeup considers several scenarios and tracks the
>>> endianness of data as it travels from the emulator to the guest
>>> CPU register in the case of ARM KVM. It starts with the
>>> currently committed code for the LE KVM host case and then
>>> discusses the proposed BE KVM host arrangement.
>>>
>>> To restrict the discussion, the writeup considers the code path
>>> of an integer (4-byte) read from h/w-mapped emulated device
>>> memory, and the endianness at the essential places involved in
>>> that code path.
>>>
>>> Wherever endianness is defined, it is assumed that the values
>>> under consideration are in memory (as opposed to in a register,
>>> which has no endianness). I.e. even if a function variable could
>>> actually be allocated in a CPU register, the writeup treats it
>>> as if it were in memory, just to keep the discussion clean -
>>> except for the final guest CPU register.
>>>
>>> Let's consider the following places along the data path from
>>> the emulator to the guest CPU register:
>>>
>>> 1) the emulator code that holds the integer value to be read;
>>> assume it is a global 'int emulated_hw_device_val' variable.
>>> In the emulator it is normally held in native-endian format -
>>> i.e. the emulator's CPSR E bit is the same as the kernel's
>>> CPSR E bit. For discussion's sake assume this h/w device
>>> register holds 5 as its value.
>>>
>>> 2) the KVM_EXIT_MMIO part of the 'struct kvm_run' structure,
>>> i.e. the mmio.data byte array. A byte array has no endianness,
>>> but for this discussion we track the endianness of the integer
>>> at the &mmio.data[0] address
>>>
>>> 3) the 'data' variable of type 'unsigned long' in the
>>> kvm_handle_mmio_return function, before the
>>> vcpu_data_host_to_guest call. The KVM host mmio_read_buf
>>> function fills this variable from the mmio.data buffer;
>>> mmio_read_buf effectively acts as a memcpy from the mmio.data
>>> buffer address, just taking the access size into account (see
>>> the sketch after this list)
>>>
>>> 4) the same 'data' variable as above, but after the
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to the vcpu_reg target register location. Note that
>>> vcpu_data_host_to_guest may byteswap the value of 'data'
>>> depending on the current KVM host endianness and the value of
>>> the guest CPSR E bit (also sketched below)
>>>
>>> 5) the guest CPU spilled-register array, at the location of the
>>> target register, i.e. the integer at the
>>> vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally, the guest CPU register, filled from vcpu_reg just
>>> before the guest resumes execution of the trapped emulated
>>> instruction. Note this is done by the hypervisor part of the
>>> code, and the hypervisor EE bit is the same as the KVM host
>>> CPSR E bit.
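
Sketches of steps 3) and 4) as described above (simplified, with a hypothetical vcpu_cpsr_e_bit() helper; the real kernel code differs in detail, but the byte-stream copy and the single conditional swap are the essence):

    /* step 3: plain per-size copy out of mmio.data, no byte swapping */
    static unsigned long mmio_read_buf(const u8 *buf, unsigned int len)
    {
        unsigned long data = 0;
        u16 h;
        u32 w;

        switch (len) {
        case 1: data = *buf;                  break;
        case 2: memcpy(&h, buf, 2); data = h; break;
        case 4: memcpy(&w, buf, 4); data = w; break;
        }
        return data;
    }

    /* step 4: a single conditional swap driven by the guest CPSR E bit;
     * cpu_to_be*() is a no-op on a BE host, cpu_to_le*() on an LE host */
    static unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
                                                 unsigned long data,
                                                 unsigned int len)
    {
        if (vcpu_cpsr_e_bit(vcpu)) {          /* BE guest access */
            switch (len) {
            case 2: return cpu_to_be16(data);
            case 4: return cpu_to_be32(data);
            }
        } else {                              /* LE guest access */
            switch (len) {
            case 2: return cpu_to_le16(data);
            case 4: return cpu_to_le32(data);
            }
        }
        return data;                          /* len == 1 */
    }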
>>>
>>> Note again: the KVM host, the emulator, and the hypervisor part
>>> of the code (the guest CPU register save and restore code)
>>> always run in the same endianness. The endianness of the
>>> accessed emulated devices and the endianness of the guest vary
>>> independently of the KVM host endianness.
>>>
>>> The sections below consider all permutations of all possible
>>> cases; it may be quite boring to read. I've created a summary
>>> table at the end, so you can jump to the table after reading a
>>> few cases. But if you have objections and you see things
>>> happening differently, please comment inline on the use case
>>> steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in LE format; matches device
>>> endianness
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off, no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... let's say after an 'ldr r1, [r0]'
>>> instruction, where r0 holds the device address; the guest knows
>>> that it reads LE-mapped h/w, so no additional processing is
>>> needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in LE format; matches device
>>> endianness
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on,
>>> vcpu_data_host_to_guest will do a byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode (E bit on) and that it reads
>>> LE device memory, so it needs to byteswap r1 before further
>>> processing: it does 'rev r1, r1' and proceeds with the result
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in BE format; the emulator
>>> byteswaps it because it knows that the device endianness is
>>> opposite to native, and it should match the device endianness
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off, no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in LE mode (E bit off) and that it reads
>>> BE device memory, so it needs to byteswap r1 before further
>>> processing: it does 'rev r1, r1' and proceeds with the result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds the integer in BE format; the emulator
>>> byteswaps it because it knows that the device endianness is
>>> opposite to native, and it should match the device endianness
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on,
>>> vcpu_data_host_to_guest will do a byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode and that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> The above use cases are exactly what we have now after Marc's
>>> commit to support BE guests on an LE KVM host. The further use
>>> cases describe how it would work with the BE KVM patches I
>>> proposed. It is understood that this is subject to further
>>> discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in LE format; the emulator
>>> byteswaps it because it knows that the device endianness is
>>> opposite to native; matches device endianness
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, the BE KVM host
>>> kernel does *not* byteswap: cpu_to_be has no effect in a BE
>>> host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000, because
>>> the hypervisor runs in BE mode, so the load of an LE integer
>>> yields a byteswapped value in the register
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode and that it reads LE device
>>> memory, so it needs to byteswap r1 before further processing:
>>> it does 'rev r1, r1' and proceeds with the result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in LE format; the emulator
>>> byteswaps it because it knows that the device endianness is
>>> opposite to native; matches device endianness
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, the BE KVM host
>>> kernel does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005),
>>> because the hypervisor runs in BE mode, so the load of a BE
>>> integer is OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in LE mode and that it reads LE device
>>> memory, so it does not need to do anything else; it just
>>> proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in BE format; matches device
>>> endianness
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, the BE KVM host
>>> kernel does *not* byteswap: cpu_to_be has no effect in a BE
>>> host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005),
>>> because the hypervisor runs in BE mode, so the load of a BE
>>> integer is OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in BE mode and that it reads BE device
>>> memory, so it does not need to do anything else; it just
>>> proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianness is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds the integer in BE format; matches device
>>> endianness
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, the BE KVM host
>>> kernel does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000, because
>>> the hypervisor runs in BE mode, so the load of an LE integer
>>> yields a byteswapped value in the register
>>>
>>> guest resumes execution after 'ldr r1, [r0]'; the guest kernel
>>> knows that it runs in LE mode and that it reads BE device
>>> memory, so it needs to byteswap r1 before further processing:
>>> it does 'rev r1, r1' and proceeds with the result
>>>
>>> Note that with a BE kernel we actually have an initial portion
>>> of assembler code that executes with the CPSR E bit off and
>>> reads LE h/w - i.e. it falls under use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> =========================================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> A few observations
>>> ==================
>>>
>>> x) Note the above table is symmetric w.r.t. a BE<->LE swap:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) The &mmio.data[0] address always holds the integer in the
>>> same format as the emulated device endianness
>>>
>>> x) In step 4), where the vcpu_data_host_to_guest function
>>> is used, if the guest E bit value differs but everything else
>>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>>> 7&8)
>>>
>>> If you reached the end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian"
> which is LE for ARM and BE for PPC. There shouldn't be
> a qemu-system-armeb or qemu-system-ppc64le.

I disagree. A fully functional ARM BE system is what we've
been working on for the last few months. 'We' is the Linaro
Networking Group, its Endian subteam, and some other folks
in ARM and across the community. Why we are doing that is
a bit beyond this discussion.

ARM BE patches for both V7 and V8 are already in the mainline
kernel. But the ARM BE KVM host is broken now. It is a known
deficiency that I am trying to fix. Please look at [1]. Patches
for V7 BE KVM were proposed and are currently under active
discussion. I am currently working on the ARM V8 BE KVM changes.

So "native endian" on ARM is the value of the CPSR register E
bit. If it is off, native endian is LE; if it is on, it is BE.

Once and if we agree on the ARM BE KVM host changes, the
next step would be qemu patches, one of which introduces
qemu-system-armeb. Please see [2].

> QEMU emulates everything that comes after the CPU, so
> imagine the ioctl struct as a bus package. Your bus
> doesn't care what endianness the CPU is in - it just
> gets data from the CPU.

I am not sure I follow the above. Suppose I have

mov r1, #1
str r1, [r0]

where r0 is a device address. Now, depending on the CPSR
E bit value, the device address will receive 1 as an integer
either in LE order or in BE order. That is how the ARM v7 CPU
works, regardless of whether it is emulated or not.

So if E bit is off (LE case) after str is executed
 byte at r0 address will get 1
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 0

If E bit is on (BE case) after str is executed
 byte at r0 address will get 0
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 1

my point is that mmio.data[] just carries the bytes for
phys_addr: mmio.data[0] is the value of the byte at phys_addr,
mmio.data[1] is the value of the byte at phys_addr + 1, and
so on.
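
In code, that rule amounts to this (a sketch; device_write_byte() is a made-up helper standing in for whatever the emulator does with each byte of a guest store):

    /* byte-stream semantics of a guest MMIO write, as described above */
    for (unsigned i = 0; i < run->mmio.len; i++)
        device_write_byte(run->mmio.phys_addr + i, run->mmio.data[i]);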

> A bus write on the CPU however honors the endianness
> setting of the CPU. So when we convert from a value in
> register to a value on the bus we need to take this endian
> configuration into account.

for a read it is the same: mmio.data[0] just carries the memory
at the emulated phys_addr. It is the same as in the write case.

But if one wanted to look at the endianness of the integer
at the &mmio.data[0] address, its endianness would really be
defined by the endianness of the emulated memory-mapped
h/w device.

Not sure, maybe I am missing your point.

Also please consider that the endianness of device memory can
be BE or LE; it does not depend on the "native endianness", the
combinations can exist in any mix, and it all works because in
all the proper places an explicit byteswap is executed by the
code that works with device memory of the opposite endianness.
Admittedly, for ARM the dominant case now is LE devices, but
nothing prevents us from attaching memory-mapped devices that
work in BE mode. For example my parent company, Cisco, whose
Linaro assignee I am, has a lot of fabric chips that operate in
BE, and once attached to the system they would be treated
properly - read without a byteswap in BE mode and with a
byteswap in LE mode. Note the last point is an oversimplified
picture.

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/220973.html

[2] https://git.linaro.org/people/victor.kamensky/qemu-be.git/shortlog/refs/heads/armv7be

> That's exactly what we are talking about here. KVM
> should do the cpu configured register->bus endian
> mapping while QEMU does the bus->device endian map.
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for a cross-endian
>> Guest. For supporting cross-endian Guests we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding matches your
>> description. If so, then I am in support of your description.
>>
>> I think your description (and the above 5 points) takes care of
>> all the use cases of cross-endianness without changing the
>> current MMIO ABI.
>>
>> Regards,
>> Anup
>>


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

Anup Patel-5
In reply to this post by Alexander Graf-4
Hi Alex,

On Wed, Jan 22, 2014 at 12:11 PM, Alexander Graf <[hidden email]> wrote:

>
>
>> Am 22.01.2014 um 07:31 schrieb Anup Patel <[hidden email]>:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <[hidden email]> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is very long email.
>>>
>>> I don't believe we can assign any endianity to
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len acts just memcpy and that is all. As
>>> memcpy does not imply any endianity of underlying
>>> data mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are possible values of 'len'? Previous discussions
>>> imply that is always powers of 2. Why is that? Maybe
>>> there will be CPU that would need to do 5 bytes mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such case? 'len' 5 or 6, or any works fine with
>>> memcpy semantics. I admit it is hypothetical case, but
>>> IMHO it tests how clean ABI definition is.
>>>
>>> x) Byte array does not have endianity because it
>>> does not have any structure. If one would want to
>>> imply structure why mmio is not defined in such way
>>> so structure reflected in mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len is really serves as union discriminator and
>>> only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, endianity of integer types
>>> should be defined. I believe, use of byte array strongly
>>> implies that original intent was to have semantics of
>>> byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with user kernel ABI to
>>> use just bytes stream as parameter. There is already
>>> precedents like 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device which is it, and should be emulated by emulator and KVM
>>> given all other context. As far as mmio.data[] array concerned, if the
>>> same integer value is read from these devices registers, mmio.data[]
>>> memory should contain integer in opposite endianity for these
>>> two cases, i.e MSB is data[0] in one case and MSB is
>>> data[3] is in another case. It cannot be the same, because
>>> except emulator and guest kernel, all other, like KVM host
>>> and hypervisor, have no clue what endianity of device
>>> actually is - it should treat mmio.data[] in the same way.
>>> But resulting guest target CPU register would need to contain
>>> normal integer value in one case and byteswapped in another,
>>> because guest kernel would use it directly in one case and
>>> byteswap it in another. Byte stream semantics allows to do
>>> that. I don't see how it could happen if you fixate mmio.data[]
>>> endianity in such way that it would contain integer in
>>> the same format for BE and LE emulated device types.
>>>
>>> If by this point you agree, that mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you may would like to take a look at below appendix where I
>>> described in great details endianity of data at different
>>> points along mmio processing code path of existing ARM LE KVM,
>>> and proposed ARM BE KVM. Note appendix, is very long and very
>>> detailed, sorry about that, but I feel that earlier more
>>> digested explanations failed, so it driven me to write out
>>> all details how I see them. If I am wrong, I hope it would be
>>> easier for folks to point in detailed explanation places
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is good place to discuss all details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that one back on [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <[hidden email]> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <[hidden email]> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <[hidden email]> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In above please consider two types of mapped emulated
>>> h/w devices: BE and LE they cannot have mmio.data in the
>>> same endianity. Currently in all observable cases LE ARM
>>> and BE PPC devices endianity matches kernel/qemu
>>> endianity but it would break when BE ARM is introduced
>>> or LE PPC or one would start emulating BE devices on LE
>>> ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow above. Maybe I am missing bigger context.
>>> What is CPU under discussion in above? On ARM V7 system
>>> when LE device is accessed as integer &mmio.data[0] address
>>> would contain integer is in LE format, ie mmio.data[0] is LSB.
>>>
>>> Here is gdb session of LE qemu running on V7 LE kernel and
>>> TC1 LE guest. Guest kernel accesses sys_cfgstat register which is
>>> arm_sysctl registers with offset of 0xa8. Note.arm_sysct is memory
>>> mapped LE device.
>>> Please check run->mmio structure after read
>>> (cpu_physical_memory_rw) completes it is in 4 bytes integer in
>>> LE format mmio.data[0] is LSB and is equal to 1
>>> (s->syscfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at adjust_endianness function and
>>> struct MemoryRegion 'endianness' field. IMHO in qemu it
>>> works quite nicely already. MemoryRegion 'read' and 'write'
>>> callbacks return/get data in native format adjust_endianness
>>> function checks whether emulated device endianness matches
>>> emulator endianness and if it is different it does byteswap
>>> according to size. As in above example arm_sysctl_ops memory
>>> region should be marked as DEVICE_LITTLE_ENDIAN when it
>>> returns s->sys_cfgstat value LE qemu sees that endianity
>>> matches and it does not byteswap of result, so integer at
>>> &mmio.data[0] address is in LE form. When qemu would
>>> run in BE mode on BE kernel, it would see that endianity
>>> mismatches and it will byteswap s->sys_cfgstat native value
>>> (BE), so mmio.data would contain integer in LE format again.
>>>
>>> Note in currently committed code arm_sysctl_ops endianity
>>> is DEVICE_NATIVE_ENDIAN, which is wrong - real vexpress
>>> arm_sysctl device always gives/receives data in LE format regardless
>>> of current CPSR E bit value, so it cannot be marked as NATIVE.
>>> LE and BE kernels always read it as LE device; BE kernel follows
>>> with byteswap. It was OK while we just run qemu in LE, but it
>>> should be fixed to be LITTLE_ENDIAN for BE qemu work correctly
>>> ... and actually that device and few other ARM specific devices
>>> endianity change to LITTLE_ENDIAN was the only change in qemu
>>> to make BE KVM to work.
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> [hidden email]
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianity in ARM KVM mmio
>>>    ===================================
>>>
>>> This writeup considers several scenarios and tracks endianity
>>> of data how it travels from emulator to guest CPU register, in
>>> case of ARM KVM. It starts with currently committed code for LE
>>> KVM host case and further discusses proposed BE KVM host
>>> arrangement.
>>>
>>> Just to restrict discussion writeup considers code path of
>>> integer (4 bytes) read from h/w mapped emulated device memory.
>>> Writeup considers endianity of essential places involved in such
>>> code path.
>>>
>>> For all cases when endianity is defined, it is assumed that
>>> values under consideration are in memory (opposite to be in
>>> register that does not have endianity). I.e even if function
>>> variable could be actually allocated in CPU register writeup
>>> will reference to it as it is in memory, just to keep
>>> discussion clean, except for final guest CPU register.
>>>
>>> Let's consider the following places along data path from
>>> emulator to guest CPU register:
>>>
>>> 1) emulator code that holds integer value to be read, assume
>>> it would be global 'int emulated_hw_device_val' variable.
>>> Normally in emulator it is held in native endian format - i.e
>>> it is CPSR E bit is the same as kernel CPSR E bit. Just for
>>> discussion sake assume that this h/w device registers
>>> holds 5 as its value.
>>>
>>> 2) KVM_EXIT_MMIO part of 'struct kvm_run' structure, i.e
>>> mmio.data byte array. Byte array does not have endianity,
>>> but for this discussion it would track endianity of integer
>>> at &mmio.data[0] address
>>>
>>> 3) 'data' variable type of 'unsigned long' in
>>> kvm_handle_mmio_return function before vcpu_data_host_to_guest
>>> call. KVM host mmio_read_buf function is used to fill this
>>> variable from mmio.data buffer. mmio_read_buf actually
>>> acts as memcpy from mmio.data buffer address,
>>> just taking access size in account.
>>>
>>> 4) the same 'data' variable as above, but after
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to vcpu_reg target register location. Note
>>> vcpu_data_host_to_guest function may byteswap value of 'data'
>>> depending on current KVM host endianity and value of
>>> guest CPSR E bit.
>>>
>>> 5) guest CPU spilled register array, location of target register
>>> i.e integer at vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally guest CPU register filled from vcpu_reg just before
>>> guest resume execution of trapped emulated instruction. Note
>>> it is done by hypervisor part of code and hypervisor EE bit is
>>> the same as KVM host CPSR E bit.
>>>
>>> Note again, KVM host, emulator, and hypervisor part of code (guest
>>> CPU registers save and restore code) always run in the same
>>> endianity. Endianity of accessed emulated devices and endianity
>>> of guest varies independently of KVM host endianity.
>>>
>>> Below sections consider all permutations of all possible cases,
>>> it maybe quite boring to read. I've created summary table at
>>> the end, you can jump to the table, after reading few cases.
>>> But if you have objections and you see things happen differently
>>> please comment inline of the use cases steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format, matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... Let's say after 'ldr r1, [r0]'
>>> instruction, where r0 holds address of devices, it knows
>>> that it reads LE mapped h/w so no addition processing is
>>> needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format; matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode (E bit on), it knows that it reads
>>> LE device memory, it needs to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceed with result
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and it should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode (E bit off), it knows that it
>>> reads BE device memory, it need to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> Above uses cases that is exactly what we have now after Marc's
>>> commit to support BE guest on LE KVM host. Further use
>>> cases describe how it would work with BE KVM patches I proposed.
>>> It is understood that it is subject of further discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads LE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads LE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads BE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Note that with BE kernel we actually have some initial portion
>>> of assembler code that is executed with CPSR bit off and it reads
>>> LE h/w - i.e it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> ========================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> A few observations
>>> ==================
>>>
>>> x) Note the above table is symmetric w.r.t. the BE<->LE change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) The &mmio.data[0] address always holds the integer in the same
>>> format as the emulated device's endianness
>>>
>>> x) During step 4), when the vcpu_data_host_to_guest function
>>> is used, if the guest E bit value differs but everything else
>>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>>> 7&8)
>>>
>>> If you have reached the end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below is a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian" which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le.
>
> QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU.
>
> A bus write on the CPU however honors the endianness setting of the CPU. So when we convert from a value in register to a value on the bus we need to take this endian configuration into account.
>
> That's exactly what we are talking about here. KVM should do the cpu configured register->bus endian mapping while QEMU does the bus->device endian map.

Thanks for the info on the QEMU-side handling of MMIO data.

I was not aware that we would only have "target endian = LE"
for ARM/ARM64 in QEMU. I think Marc Z had mentioned a similar
thing about MMIO in our previous discussions on his patches.
(Please refer to http://www.spinics.net/lists/arm-kernel/msg283313.html)

This clearly means MMIO data passed to user space (QEMU) has
to be of host endianness so that QEMU can take care of the
bus->device endian map.

The current vcpu_data_guest_to_host() and vcpu_data_host_to_guest()
do not perform endianness conversion of MMIO data to LE when
we are running an LE guest on a BE host, so we do need Victor's patch
fixing vcpu_data_guest_to_host() and vcpu_data_host_to_guest().
(Already reported a long time back by me,
http://www.spinics.net/lists/arm-kernel/msg283308.html)
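
(To illustrate the rule that fix implements, here is a minimal sketch,
not the actual arch/arm KVM code: byteswap the MMIO value only when
the vCPU's current CPSR.E endianness disagrees with the host kernel's.)

#include <stdbool.h>
#include <stdint.h>

static inline uint32_t bswap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* 'host_is_be' is fixed at kernel build time; 'guest_cpsr_e' comes
 * from the trapping vCPU's saved state.  The same rule covers both
 * the guest_to_host and host_to_guest directions. */
static uint32_t mmio_convert(uint32_t data, bool host_is_be,
                             bool guest_cpsr_e)
{
    return (host_is_be != guest_cpsr_e) ? bswap32(data) : data;
}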

Regards,
Anup

>
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for a cross-endian
>> Guest. For supporting a cross-endian Guest we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding is as per your
>> description. If so, then I am in support of your description.
>>
>> I think your description (and the above 5 points) takes care of
>> all use cases of cross-endianness without changing the current
>> MMIO ABI.
>>
>> Regards,
>> Anup
>>


Re: KVM and variable-endianness guest CPUs

Peter Maydell-5
In reply to this post by Victor Kamensky
On 22 January 2014 05:39, Victor Kamensky <[hidden email]> wrote:

> Hi Guys,
>
> Christoffer and I had a bit of a heated chat :) on this
> subject last night. Christoffer, I really appreciate
> your time! We did not really reach agreement
> during the chat and Christoffer asked me to follow
> up on this thread.
> Here it goes. Sorry, it is a very long email.
>
> I don't believe we can assign any endianness to the
> mmio.data[] byte array. I believe mmio.data[] and
> mmio.len act just like memcpy and that is all. As
> memcpy does not imply any endianness of the underlying
> data, mmio.data[] should not either.

This email is about five times too long to be actually
useful, but the major issue here is that the data being
transferred is not just a bag of bytes. The data[]
array plus the size field are being (mis)used to indicate
that the memory transaction is one of:
 * an 8 bit access
 * a 16 bit access of some uint16_t value
 * a 32 bit access of some uint32_t value
 * a 64 bit access of some uint64_t value

exactly as a CPU hardware bus would do. It's
because the API is defined in this awkward way with
a uint8_t[] array that we need to specify how both
sides should go from the actual properties of the
memory transaction (value and size) to filling in the
array.
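
(To make that mapping concrete: a minimal sketch, not actual QEMU
code, assuming the byte array is kept in host byte order as proposed
earlier in this thread.)

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reconstruct the transaction value from (data, len).  The typed
 * memcpy amounts to a host-order load, which is what QEMU's ldl_p()
 * family does when the host order matches TARGET_WORDS_BIGENDIAN. */
static uint64_t mmio_value(const uint8_t *data, uint32_t len)
{
    switch (len) {
    case 1: { uint8_t  v; memcpy(&v, data, sizeof(v)); return v; }
    case 2: { uint16_t v; memcpy(&v, data, sizeof(v)); return v; }
    case 4: { uint32_t v; memcpy(&v, data, sizeof(v)); return v; }
    case 8: { uint64_t v; memcpy(&v, data, sizeof(v)); return v; }
    default: abort();   /* not a valid transaction size */
    }
}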

Furthermore, device endianness is entirely irrelevant
for deciding the properties of mmio.data[], because the
thing we're modelling here is essentially the CPU->bus
interface. In real hardware, the properties of individual
devices on the bus are irrelevant to how the CPU's
interface to the bus behaves, and similarly here the
properties of emulated devices don't affect how KVM's
interface to QEMU userspace needs to work.

MemoryRegion's 'endianness' field, incidentally, is
a dreadful mess that we should get rid of. It is attempting
to model the property that some buses/bridges have of
doing byte-lane-swaps on data that passes through as
a property of the device itself. It would be better if we
modelled it properly, with container regions having possible
byte-swapping and devices just being devices.

thanks
-- PMM


Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

Alexander Graf-4
In reply to this post by Victor Kamensky

On 22.01.2014, at 08:26, Victor Kamensky <[hidden email]> wrote:

> On 21 January 2014 22:41, Alexander Graf <[hidden email]> wrote:
>>
>>
>> "Native endian" really is just a shortcut for "target endian"
>> which is LE for ARM and BE for PPC. There shouldn't be
>> a qemu-system-armeb or qemu-system-ppc64le.
>
> I disagree. A fully functional ARM BE system is what we've
> been working on for the last few months. 'We' is the Linaro
> Networking Group, Endian subteam and some other guys
> in ARM and across the community. Why we do that is a bit
> beyond this discussion.
>
> ARM BE patches for both V7 and V8 are already in the mainline
> kernel. But the ARM BE KVM host is broken now. It is a known
> deficiency that I am trying to fix. Please look at [1]. Patches
> for V7 BE KVM were proposed and are currently under active
> discussion. Currently I am working on ARM V8 BE KVM changes.
>
> So "native endian" in ARM is value of CPSR register E bit.
> If it is off native endian is LE, if it is on it is BE.
>
> Once and if we agree on the ARM BE KVM host changes, the
> next step would be patches in qemu, one of which introduces
> qemu-system-armeb. Please see [2].

I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the "default endianness" for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode; the question is just what we declare the "default".

Think about the PPC bootstrap. We start off with a BE firmware, then boot into the Linux kernel which calls a hypercall to set the LE bit on every interrupt. But there's no reason this little endian kernel couldn't theoretically have big endian user space running with access to emulated device registers.

As Peter already pointed out, the actual breakage behind this is that we have a "default endianness" at all. But that's a very difficult thing to resolve and I don't think it should be our primary goal. Just live with the fact that we declare ARM little endian in QEMU and swap things accordingly - then everyone's happy.

This really only ever becomes a problem if you have devices that have awareness of the CPU's endian mode. The only one on PPC that I'm aware of that falls into this category is virtio, and there are patches pending to solve that. I don't know if there are any QEMU emulated devices outside of virtio with this issue on ARM, but you'll have to make the emulation code for those look at the CPU state then.

>
>> QEMU emulates everything that comes after the CPU, so
>> imagine the ioctl struct as a bus package. Your bus
>> doesn't care what endianness the CPU is in - it just
>> gets data from the CPU.
>
> I am not sure that I follow the above. Suppose I have
>
> mov r1, #1
> str r1, [r0]
>
> where r0 is a device address. Now, depending on the CPSR
> E bit value, the device address will receive 1 as an integer either
> in LE order or in BE order. That is how an ARM v7 CPU
> works, regardless of whether it is emulated or not.
>
> So if the E bit is off (LE case), after str is executed
> the byte at address r0 will get 1
> the byte at address r0 + 1 will get 0
> the byte at address r0 + 2 will get 0
> the byte at address r0 + 3 will get 0
>
> If the E bit is on (BE case), after str is executed
> the byte at address r0 will get 0
> the byte at address r0 + 1 will get 0
> the byte at address r0 + 2 will get 0
> the byte at address r0 + 3 will get 1
>
> my point is that mmio.data[] just carries bytes for phys_addr:
> mmio.data[0] would be the value for the byte at phys_addr,
> mmio.data[1] would be the value for the byte at phys_addr + 1, and
> so on.

What we get is an instruction that traps because it wants to "write r1 (which has value=1) into address x". So at that point we get the register value.

Then we need to take a look at the E bit to see whether the write was supposed to be in non-host endianness because we need to emulate exactly the LE/BE difference you're indicating above. The way we implement this on PPC is that we simply byte swap the register value when guest_endian != host_endian.

With this in place, QEMU can just memcpy() the value into a local register and feed it into its emulation code which expects a "register value as if the CPU was running in native endianness" as parameter - with "native" meaning "little endian" for qemu-system-arm. Device emulation code doesn't know what to do with a byte array.

Take a look at QEMU's MMIO handler:

        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            cpu_physical_memory_rw(run->mmio.phys_addr,
                                   run->mmio.data,
                                   run->mmio.len,
                                   run->mmio.is_write);
            ret = 0;
            break;

which translates to

                switch (l) {
                case 8:
                    /* 64 bit write access */
                    val = ldq_p(buf);
                    error |= io_mem_write(mr, addr1, val, 8);
                    break;
                case 4:
                    /* 32 bit write access */
                    val = ldl_p(buf);
                    error |= io_mem_write(mr, addr1, val, 4);
                    break;
                case 2:
                    /* 16 bit write access */
                    val = lduw_p(buf);
                    error |= io_mem_write(mr, addr1, val, 2);
                    break;
                case 1:
                    /* 8 bit write access */
                    val = ldub_p(buf);
                    error |= io_mem_write(mr, addr1, val, 1);
                    break;
                default:
                    abort();
                }

which calls the ldx_p primitives

#if defined(TARGET_WORDS_BIGENDIAN)
#define lduw_p(p) lduw_be_p(p)
#define ldsw_p(p) ldsw_be_p(p)
#define ldl_p(p) ldl_be_p(p)
#define ldq_p(p) ldq_be_p(p)
#define ldfl_p(p) ldfl_be_p(p)
#define ldfq_p(p) ldfq_be_p(p)
#define stw_p(p, v) stw_be_p(p, v)
#define stl_p(p, v) stl_be_p(p, v)
#define stq_p(p, v) stq_be_p(p, v)
#define stfl_p(p, v) stfl_be_p(p, v)
#define stfq_p(p, v) stfq_be_p(p, v)
#else
#define lduw_p(p) lduw_le_p(p)
#define ldsw_p(p) ldsw_le_p(p)
#define ldl_p(p) ldl_le_p(p)
#define ldq_p(p) ldq_le_p(p)
#define ldfl_p(p) ldfl_le_p(p)
#define ldfq_p(p) ldfq_le_p(p)
#define stw_p(p, v) stw_le_p(p, v)
#define stl_p(p, v) stl_le_p(p, v)
#define stq_p(p, v) stq_le_p(p, v)
#define stfl_p(p, v) stfl_le_p(p, v)
#define stfq_p(p, v) stfq_le_p(p, v)
#endif

and then passes the result as "originating register access" to the device emulation part of QEMU.


Maybe it becomes clearer if you understand the code flow that TCG goes through. With TCG, whenever a write traps into MMIO we go through these functions

void
glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                         DATA_TYPE val, int mmu_idx)
{
    helper_te_st_name(env, addr, val, mmu_idx, GETRA());
}

#ifdef TARGET_WORDS_BIGENDIAN
# define TGT_BE(X)  (X)
# define TGT_LE(X)  BSWAP(X)
#else
# define TGT_BE(X)  BSWAP(X)
# define TGT_LE(X)  (X)
#endif

void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                       int mmu_idx, uintptr_t retaddr)
{
[...]
    /* Handle an IO access.  */
    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
        hwaddr ioaddr;
        if ((addr & (DATA_SIZE - 1)) != 0) {
            goto do_unaligned_access;
        }
        ioaddr = env->iotlb[mmu_idx][index];

        /* ??? Note that the io helpers always read data in the target
           byte ordering.  We should push the LE/BE request down into io.  */
        val = TGT_LE(val);
        glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
        return;
    }
    [...]
}

static inline void glue(io_write, SUFFIX)(CPUArchState *env,
                                          hwaddr physaddr,
                                          DATA_TYPE val,
                                          target_ulong addr,
                                          uintptr_t retaddr)
{
    MemoryRegion *mr = iotlb_to_region(physaddr);

    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
        cpu_io_recompile(env, retaddr);
    }

    env->mem_io_vaddr = addr;
    env->mem_io_pc = retaddr;
    io_mem_write(mr, physaddr, val, 1 << SHIFT);
}

which at the end of the chain means that if you're running the same endianness on guest and host, you get the original register value as the function parameter. If you run different endianness you get a swapped value as the function parameter.

So at the end of all of this, if you're running qemu-system-arm (TCG) on a BE host, the request will come in as a register value and stay that way until it reaches the IO callback function. Unless you define a specific endianness for your device, in which case the callback may swizzle it again. But if your device defines DEVICE_LITTLE_ENDIAN or DEVICE_NATIVE_ENDIAN, it won't swizzle it.

What happens when you switch your guest to BE mode (or LE for PPC)? Very simple. The TCG frontend swizzles every memory read and write before it hits TCG's memory operations.

If you're running qemu-system-arm (KVM) on a BE host, the request will come into kvm-all.c, get read with swapped endianness (ldq_p) and then get passed that way into the IO callback function. That's where the bug lies. It should behave the same way as TCG, so it needs to know the value the register originally had. So instead of doing an ldq_p() it should go through a different path that does memcpy().
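
(A minimal sketch of that alternative path -- illustrative only, not
an actual QEMU patch; shown for a 32-bit access:)

#include <stdint.h>
#include <string.h>

/* Recover the originating register bytes without an endian decode:
 * a raw copy preserves whatever layout the kernel wrote, unlike the
 * ldl_p()/ldq_p() target-order loads. */
static uint32_t mmio_reg32(const uint8_t data[8])
{
    uint32_t reg;
    memcpy(&reg, data, sizeof(reg));
    return reg;
}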

But that doesn't fix the other-endian issue yet, right? Every value now would come in as the register value.

Well, unless you do the same thing TCG does inside the kernel. So the kernel would swap the reads and writes before it accesses the ioctl struct that connects kvm with QEMU. Then all abstraction layers work just fine again and we don't need any qemu-system-armeb.


Alex



Re: KVM and variable-endianness guest CPUs

Victor Kamensky
In reply to this post by Peter Maydell-5
Hi Peter,

On 22 January 2014 02:22, Peter Maydell <[hidden email]> wrote:

> On 22 January 2014 05:39, Victor Kamensky <[hidden email]> wrote:
>> Hi Guys,
>>
>> Christoffer and I had a bit of a heated chat :) on this
>> subject last night. Christoffer, I really appreciate
>> your time! We did not really reach agreement
>> during the chat and Christoffer asked me to follow
>> up on this thread.
>> Here it goes. Sorry, it is a very long email.
>>
>> I don't believe we can assign any endianness to the
>> mmio.data[] byte array. I believe mmio.data[] and
>> mmio.len act just like memcpy and that is all. As
>> memcpy does not imply any endianness of the underlying
>> data, mmio.data[] should not either.
>
> This email is about five times too long to be actually
> useful,

Sorry, you may be right about that.
My responses below are much shorter :)

> but the major issue here is that the data being
> transferred is not just a bag of bytes. The data[]
> array plus the size field are being (mis)used to indicate
> that the memory transaction is one of:
>  * an 8 bit access
>  * a 16 bit access of some uint16_t value
>  * a 32 bit access of some uint32_t value
>  * a 64 bit access of some uint64_t value
>
> exactly as a CPU hardware bus would do. It's
> because the API is defined in this awkward way with
> a uint8_t[] array that we need to specify how both
> sides should go from the actual properties of the
> memory transaction (value and size) to filling in the
> array.

While responding to Alex last night, I found what I think is the
easiest and shortest way to think about mmio.data[].

Just for discussion reference, here it is again:
                struct {
                        __u64 phys_addr;
                        __u8  data[8];
                        __u32 len;
                        __u8  is_write;
                } mmio;
I believe that in all cases it should be interpreted
in the following sense:
   byte data[0] goes into byte at phys_addr + 0
   byte data[1] goes into byte at phys_addr + 1
   byte data[2] goes into byte at phys_addr + 2
   and so on up to len size

Basically, if it were on a real bus: take the byte value
that corresponds to address phys_addr + 0 and place
it into data[0], take the byte value that corresponds to
address phys_addr + 1 and place it into data[1], etc.
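
(A minimal sketch of these pure-memcpy semantics -- illustrative, not
kernel code:)

#include <stdint.h>
#include <string.h>

/* On a write exit, data[i] holds the byte the guest put on the bus
 * at phys_addr + i; no endianness is attached to the array itself. */
static void complete_mmio_write(uint8_t *device_bytes /* at phys_addr */,
                                const uint8_t *data, uint32_t len)
{
    memcpy(device_bytes, data, len);
}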

I believe it is true for the current ARM LE case and
the PPC BE case. I am asking you to keep it this way
for all other cases. My ARM BE V7 KVM patches
still use it in the same sense.

What is wrong with it?

Note that nowhere in my above description have I talked
about the endianness of anything: device, access (E bit),
KVM host, guest, hypervisor. All these endiannesses
are irrelevant to the mmio interface.

> Furthermore, device endianness is entirely irrelevant
> for deciding the properties of mmio.data[], because the
> thing we're modelling here is essentially the CPU->bus
> interface. In real hardware, the properties of individual
> devices on the bus are irrelevant to how the CPU's
> interface to the bus behaves, and similarly here the
> properties of emulated devices don't affect how KVM's
> interface to QEMU userspace needs to work.

As far as the mmio interface is concerned, I claim that any
endianness is irrelevant here. I am utterly lost: the
endianness of what do you care about? Consider
the following ARM code snippets:

setend le
mov r1, #0x04030201
str r1, [r0]

and

setend be
mov r1, #0x01020304
str r1, [r0]

when the above snippets are executed the memory bus
sees absolutely the same thing; can you tell by
looking at this memory transaction what endianness
it is? And the endianness of what? I can't.
The only thing you can tell by looking at this bus
memory transaction is that the byte value 0x01 goes at
address r0, the byte value 0x02 goes at address r0 + 1,
etc.

Thanks,
Victor

> MemoryRegion's 'endianness' field, incidentally, is
> a dreadful mess that we should get rid of. It is attempting
> to model the property that some buses/bridges have of
> doing byte-lane-swaps on data that passes through as
> a property of the device itself. It would be better if we
> modelled it properly, with container regions having possible
> byte-swapping and devices just being devices.
>
> thanks
> -- PMM


Re: KVM and variable-endianness guest CPUs

Peter Maydell-5
On 22 January 2014 17:19, Victor Kamensky <[hidden email]> wrote:

> On 22 January 2014 02:22, Peter Maydell <[hidden email]> wrote:
>> but the major issue here is that the data being
>> transferred is not just a bag of bytes. The data[]
>> array plus the size field are being (mis)used to indicate
>> that the memory transaction is one of:
>>  * an 8 bit access
>>  * a 16 bit access of some uint16_t value
>>  * a 32 bit access of some uint32_t value
>>  * a 64 bit access of some uint64_t value
>>
>> exactly as a CPU hardware bus would do. It's
>> because the API is defined in this awkward way with
>> a uint8_t[] array that we need to specify how both
>> sides should go from the actual properties of the
>> memory transaction (value and size) to filling in the
>> array.
>
> While responding to Alex last night, I found what I think is the
> easiest and shortest way to think about mmio.data[].
>
> Just for discussion reference, here it is again:
>                 struct {
>                         __u64 phys_addr;
>                         __u8  data[8];
>                         __u32 len;
>                         __u8  is_write;
>                 } mmio;
> I believe that in all cases it should be interpreted
> in the following sense:
>    byte data[0] goes into byte at phys_addr + 0
>    byte data[1] goes into byte at phys_addr + 1
>    byte data[2] goes into byte at phys_addr + 2
>    and so on up to len size
>
> Basically, if it were on a real bus: take the byte value
> that corresponds to address phys_addr + 0 and place
> it into data[0], take the byte value that corresponds to
> address phys_addr + 1 and place it into data[1], etc.

This just isn't how real buses work. There is no
"address + 1, address + 2". There is a single address
for the memory transaction and a set of data on
data lines and some separate size information.
How the device at the far end of the bus chooses
to respond to 32 bit accesses to address X versus
8 bit accesses to addresses X through X+3 is entirely
its own business and unrelated to the CPU. (It would
be perfectly possible to have a device which when
you read from address X as 32 bits returned 0x12345678,
when you read from address X as 16 bits returned
0x9abc, returned 0x42 for an 8 bit read from X+1,
and so on. Having byte reads from X..X+3 return
values corresponding to parts of the 32 bit access
is purely a convention.)
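
(A sketch of such a hypothetical device, reusing the values from the
parenthetical above; all names are invented:)

#include <stdint.h>

/* A device is free to decode by (address, size); byte reads are not
 * obliged to return slices of the 32-bit value. */
static uint64_t odd_dev_read(uint64_t offset, unsigned size)
{
    if (offset == 0 && size == 4) return 0x12345678;  /* 32-bit read at X */
    if (offset == 0 && size == 2) return 0x9abc;      /* 16-bit read at X */
    if (offset == 1 && size == 1) return 0x42;        /* 8-bit read at X+1 */
    return 0;
}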

> Note that nowhere in my above description have I talked
> about the endianness of anything: device, access (E bit),
> KVM host, guest, hypervisor. All these endiannesses
> are irrelevant to the mmio interface.

As soon as you try to think of the mmio.data as a set
of bytes then you have to specify some endianness to
the data, so that both sides (kernel and userspace)
know how to reconstruct the actual data value from the
array of bytes.

>> Furthermore, device endianness is entirely irrelevant
>> for deciding the properties of mmio.data[], because the
>> thing we're modelling here is essentially the CPU->bus
>> interface. In real hardware, the properties of individual
>> devices on the bus are irrelevant to how the CPU's
>> interface to the bus behaves, and similarly here the
>> properties of emulated devices don't affect how KVM's
>> interface to QEMU userspace needs to work.
>
> As far as the mmio interface is concerned, I claim that any
> endianness is irrelevant here. I am utterly lost: the
> endianness of what do you care about?

I care about knowing which end of mmio.data is the
least significant byte, obviously.

> Consider
> the following ARM code snippets:
>
> setend le
> mov r1, #0x04030201
> str r1, [r0]
>
> and
>
> setend be
> mov r1, #0x01020304
> str r1, [r0]
>
> when the above snippets are executed the memory bus
> sees absolutely the same thing; can you tell by
> looking at this memory transaction what endianness
> it is? And the endianness of what? I can't.

That is correct. That is because the value sent out on
the bus from the CPU is always the same: it says
"32 bit transaction, value 0x04030201, address $whatever".

> The only thing you can tell by looking at this bus
> memory transaction is that the byte value 0x01 goes at
> address r0, the byte value 0x02 goes at address r0 + 1,
> etc.

No, this part is absolutely wrong, see above.

thanks
-- PMM


Re: KVM and variable-endianness guest CPUs

Victor Kamensky
On 22 January 2014 09:29, Peter Maydell <[hidden email]> wrote:

> On 22 January 2014 17:19, Victor Kamensky <[hidden email]> wrote:
>> On 22 January 2014 02:22, Peter Maydell <[hidden email]> wrote:
>>> but the major issue here is that the data being
>>> transferred is not just a bag of bytes. The data[]
>>> array plus the size field are being (mis)used to indicate
>>> that the memory transaction is one of:
>>>  * an 8 bit access
>>>  * a 16 bit access of some uint16_t value
>>>  * a 32 bit access of some uint32_t value
>>>  * a 64 bit access of some uint64_t value
>>>
>>> exactly as a CPU hardware bus would do. It's
>>> because the API is defined in this awkward way with
>>> a uint8_t[] array that we need to specify how both
>>> sides should go from the actual properties of the
>>> memory transaction (value and size) to filling in the
>>> array.
>>
>> While responding to Alex last night, I found what I think is the
>> easiest and shortest way to think about mmio.data[].
>>
>> Just for discussion reference, here it is again:
>>                 struct {
>>                         __u64 phys_addr;
>>                         __u8  data[8];
>>                         __u32 len;
>>                         __u8  is_write;
>>                 } mmio;
>> I believe that in all cases it should be interpreted
>> in the following sense:
>>    byte data[0] goes into byte at phys_addr + 0
>>    byte data[1] goes into byte at phys_addr + 1
>>    byte data[2] goes into byte at phys_addr + 2
>>    and so on up to len size
>>
>> Basically, if it were on a real bus: take the byte value
>> that corresponds to address phys_addr + 0 and place
>> it into data[0], take the byte value that corresponds to
>> address phys_addr + 1 and place it into data[1], etc.
>
> This just isn't how real buses work. There is no
> "address + 1, address + 2". There is a single address
> for the memory transaction and a set of data on
> data lines and some separate size information.

Yes, and those data lines are just binary signal lines,
not numbers. If one wants to describe the information
on the data lines as a number, he/she needs to assign integer
bit numbers to the lines, and that is an absolutely arbitrary
process. If one chooses one way to assign those
bits to lines and another chooses the reverse way, they will
talk about completely different numbers for the same
signals on the bus. Such data line enumeration has
no bearing on how the bus actually works. And I don't
even see why it should be described just as a single
integer; for example, one can describe the information on
the data lines as a set of 4 byte values, nothing wrong with
such a description.

> How the device at the far end of the bus chooses
> to respond to 32 bit accesses to address X versus
> 8 bit accesses to addresses X through X+3 is entirely
> its own business and unrelated to the CPU. (It would
> be perfectly possible to have a device which when
> you read from address X as 32 bits returned 0x12345678,
> when you read from address X as 16 bits returned
> 0x9abc, returned 0x42 for an 8 bit read from X+1,
> and so on. Having byte reads from X..X+3 return
> values corresponding to parts of the 32 bit access
> is purely a convention.)

I don't follow the above. One may have one read from
device address X as 32 bits return 0x12345678,
and another read from the same address X as 32 bits
return 0xabcdef12, so what? Maybe a real example
would help.

>> Note that nowhere in my above description have I talked
>> about the endianness of anything: device, access (E bit),
>> KVM host, guest, hypervisor. All these endiannesses
>> are irrelevant to the mmio interface.
>
> As soon as you try to think of the mmio.data as a set
> of bytes then you have to specify some endianness to
> the data, so that both sides (kernel and userspace)
> know how to reconstruct the actual data value from the
> array of bytes.

What actual value? In what sense? You need to bring
the semantics of this h/w address into the discussion to really tell
that. The driver that reads/writes is aware of the semantics
of those addresses. For example, a device gives a channel1 byte
value at phys_addr and a channel2 byte value at
phys_addr + 1, so a 16-bit integer read from phys_addr
will bring two channel values into the register, not one 16-bit
integer.
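
(A sketch of the kind of register layout meant here; the names are
invented:)

#include <stdint.h>

/* Two adjacent 8-bit channel registers: a 16-bit read at phys_addr
 * returns both channel bytes packed together, not one meaningful
 * "16-bit value". */
struct demo_regs {
    uint8_t channel1;   /* at phys_addr     */
    uint8_t channel2;   /* at phys_addr + 1 */
};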

>>> Furthermore, device endianness is entirely irrelevant
>>> for deciding the properties of mmio.data[], because the
>>> thing we're modelling here is essentially the CPU->bus
>>> interface. In real hardware, the properties of individual
>>> devices on the bus are irrelevant to how the CPU's
>>> interface to the bus behaves, and similarly here the
>>> properties of emulated devices don't affect how KVM's
>>> interface to QEMU userspace needs to work.
>>
>> As far as the mmio interface is concerned, I claim that any
>> endianness is irrelevant here. I am utterly lost: the
>> endianness of what do you care about?
>
> I care about knowing which end of mmio.data is the
> least significant byte, obviously.

LSB of what? Memory semantics does not have a
notion of LSB. It comes in only when one starts
interpreting memory content. memcpy does not
have any LSB involved, just bytes.

>> Consider
>> the following ARM code snippets:
>>
>> setend le
>> mov r1, #0x04030201
>> str r1, [r0]
>>
>> and
>>
>> setend be
>> mov r1, #0x01020304
>> str r1, [r0]
>>
>> when the above snippets are executed the memory bus
>> sees absolutely the same thing; can you tell by
>> looking at this memory transaction what endianness
>> it is? And the endianness of what? I can't.
>
> That is correct. That is because the value sent out on
> the bus from the CPU is always the same: it says
> "32 bit transaction, value 0x04030201, address $whatever".

The above is just *your* choice of how to describe the signals on
the actual bus lines. I can describe the same transaction
as "put the set of 4 bytes {0x01, 0x02, 0x03, 0x04} at
address $whatever". It does not change what the line
values would be during this memory transaction.

BTW, could you please propose how you would see such a
"32 bit transaction, value 0x04030201, address $whatever"
on an ARM LE CPU in mmio.data?

If it would be {0x01, 0x02, 0x03, 0x04}, that is fine with
me. That is the current ARM LE case, when the above
snippets are executed by the guest.

Would we agree that the same arrangement would be
true for all other cases on ARM, regardless of all the other
endiannesses of qemu, KVM host, guest, hypervisor, etc?
If we agree on that, I think we are talking about the
same thing, just in different concepts. My memcpy
semantics of mmio.data[] matches those values just
fine.

>> The only thing you can tell by looking at this bus
>> memory transaction is that the byte value 0x01 goes at
>> address r0, the byte value 0x02 goes at address r0 + 1,
>> etc.
>
> No, this part is absolutely wrong, see above.

I don't see why you are so attached to the desire to describe
the data part of a memory transaction as just one of the int
types. If we are talking about a bunch of hypothetical
cases, imagine a bus that allows transactions with a
size of 6 bytes. How do you describe such data in
your ints speak? What endianness can you assign to a
sequence of 6 bytes? Note that describing
such a transaction as a set of 6 byte values at address
$whatever makes perfect sense.

Thanks,
Victor

> thanks
> -- PMM


Re: KVM and variable-endianness guest CPUs

Peter Maydell-5
On 22 January 2014 19:29, Victor Kamensky <[hidden email]> wrote:

> On 22 January 2014 09:29, Peter Maydell <[hidden email]> wrote:
>> This just isn't how real buses work. There is no
>> "address + 1, address + 2". There is a single address
>> for the memory transaction and a set of data on
>> data lines and some separate size information.
>
> Yes, and those data lines are just binary signal lines,
> not numbers. If one wants to describe the information
> on the data lines as a number, he/she needs to assign integer
> bit numbers to the lines, and that is an absolutely arbitrary
> process.

It is part of the definition of the bus which signal pin is
D0 and which is D31...

> If one chooses one way to assign those
> bits to lines and another chooses the reverse way, they will
> talk about completely different numbers for the same
> signals on the bus. Such data line enumeration has
> no bearing on how the bus actually works. And I don't
> even see why it should be described just as a single
> integer; for example, one can describe the information on
> the data lines as a set of 4 byte values, nothing wrong with
> such a description.

It is not how the hardware works. If you describe it as
a set of 4 bytes, then you need to also say how you are
mapping from those 4 bytes to the actual 32 bit data
transaction the hardware is doing. Which is the question
we're trying to answer in this thread.

I've snipped a huge chunk of my initial reply to this email,
because it all boiled down to "sorry, you're just not correct
about how the hardware works" and it doesn't seem
necessary to repeat it three times. Devices really do see
"this is a transaction with this value and this size". They
do not in any way see a 32 bit word write as "this is a collection
of byte writes". Therefore:

 1) thinking about a 32 bit word write in terms of a byte array
    is confusing
 2) since the KVM API is unfortunately stuck with this byte
    array, we must define the semantics of what it actually
    contains, so that the kernel and QEMU can go between
    "the value being read/written in the transaction" and
    "the contents of the byte array".

>> As soon as you try to think of the mmio.data as a set
>> of bytes then you have to specify some endianness to
>> the data, so that both sides (kernel and userspace)
>> know how to reconstruct the actual data value from the
>> array of bytes.
>
> What actual value? In what sense? You need to bring
> the semantics of this h/w address into the discussion
> to really tell that.

I've just spent the last two emails doing exactly that.
The actual value, as in "this CPU just did a memory
transaction of a 32 bit data value".

> BTW, could you please propose how you would see such a
> "32 bit transaction, value 0x04030201, address $whatever"
> on an ARM LE CPU in mmio.data?

That is exactly the problem we're discussing in this thread.
Indeed, I proposed an answer to it, which is that the mmio.data
array should be in host kernel byte order, in which case it
would be (for an LE host kernel) 0x01 in mmio.data[0] and so
on up.

> If it would be {0x01, 0x02, 0x03, 0x04}, that is fine with
> me. That is the current ARM LE case, when the above
> snippets are executed by the guest.
>
> Would we agree that the same arrangement would be
> true for all other cases on ARM, regardless of all the other
> endiannesses of qemu, KVM host, guest, hypervisor, etc?

No; under my proposal, for a big-endian host kernel (and
thus big-endian QEMU) the order would be
mmio.data[0] = 0x04, etc. (This wouldn't change based
on the guest kernel endianness or whether it happened
to have set the E bit temporarily.)

Defining that mmio.data[] is always little-endian would
be a valid definition of an API if we were doing it from
scratch. It has the unfortunate property that it would
completely break the existing PPC BE setups, which
don't define it that way, so it is a non-starter.

Defining it as being always guest-order would mean that
userspace had to continually look at the guest CPU
endianness bit, which is annoying and awkward.

Defining it as always host-endian order is the most
reasonable option available. It also happens to work
for the current QEMU code, which is nice.
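
(To make the host-endian option concrete, a minimal sketch of the
userspace side -- illustrative only:)

#include <stdint.h>
#include <string.h>

/* Encode a 32-bit transaction value into mmio.data[] in host byte
 * order: a plain host store.  An LE host yields 01 02 03 04 for
 * 0x04030201; a BE host yields 04 03 02 01. */
static void encode_mmio32(uint8_t data[8], uint32_t val)
{
    memcpy(data, &val, sizeof(val));
}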

thanks
-- PMM
