outlined TLB lookup on x86


outlined TLB lookup on x86

Xin Tong
I am trying to implement an out-of-line TLB lookup for QEMU softmmu x86-64 on an x86-64 host, potentially for better instruction-cache performance. I have a few questions.

1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated when tcg_out_tb_finalize is called. When a TLB lookup misses, execution jumps to the generated slow path, which refills the TLB, performs the load/store, and jumps to the next emulated instruction. I am wondering: is it easy to outline the code for the slow path? I am thinking that when the TLB misses, the outlined TLB lookup code should call out to qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk the TLB after it is refilled. This code is off the critical path, so it is not as important as the TLB-hit code.
2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The TLB lookup is 10 x86 instructions, but every miss takes ~450 instructions (I measured this using Intel Pin), so even if the miss rate is low (say 3%), the overall time spent in cpu_x86_handle_mmu_fault is still significant. I am thinking the TLB may need to be organized in a set-associative fashion to reduce conflict misses, e.g. 2-way set associative to reduce the miss rate, or to have a victim TLB that is 4-way associative and use x86 SIMD instructions to do the lookup once the direct-mapped TLB misses. Has anybody done any work on this front?
3. What are some of the drawbacks of using a super-large TLB, i.e. a TLB with 4K entries?

Xin
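For reference, the direct-mapped fast path being discussed looks roughly like this C sketch. The names are modeled on QEMU's cputlb/softmmu code, but the entry layout and fields are simplified assumptions, not the code TCG actually emits:

```c
#include <stdint.h>
#include <stddef.h>

#define CPU_TLB_BITS     8
#define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)
#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_MASK (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))

/* Simplified entry: the real one has separate tags per access type. */
typedef struct {
    uint64_t  addr_read; /* guest virtual page tag */
    uintptr_t addend;    /* host minus guest address, for RAM pages */
} TLBEntry;

static TLBEntry tlb[CPU_TLB_SIZE];

/* Direct-mapped lookup: index by page number, compare the tag.
 * Returns a host pointer on hit, NULL on miss (the generated code
 * branches to a per-site slow path instead of returning NULL). */
static void *tlb_lookup(uint64_t vaddr)
{
    unsigned idx = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    if (tlb[idx].addr_read == (vaddr & TARGET_PAGE_MASK)) {
        return (void *)(uintptr_t)(vaddr + tlb[idx].addend);
    }
    return NULL; /* miss: call the helper, refill, retry */
}
```

The "10 x86 instructions" quoted above correspond to this handful of shifts, masks, a compare-and-branch, and the addend arithmetic.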

Re: outlined TLB lookup on x86

Lluís Vilanova
Xin Tong writes:

> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few questions.

> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and jumps
> to the next emulated instruction. I am wondering is it easy to outline the code
> for the slow path. I am thinking when a TLB misses, the outlined TLB lookup code
> should generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk
> the TLB after its refilled ? This code is off the critical path, so its not as
> important as the code when TLB hits.
> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the TLB
> lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the overall
> time spent in the cpu_x86_handle_mmu_fault is still signifcant. I am thinking
> the tlb may need to be organized in a set associative fashion to reduce conflict
> miss, e.g. 2 way set associative to reduce the miss rate. or have a victim tlb
> that is 4 way associative and use x86 simd instructions to do the lookup once
> the direct-mapped tlb misses. Has anybody done any work on this front ?
> 3. what are some of the drawbacks of using a superlarge TLB, i.e. a TLB with 4K
> entries ?

Using vector intrinsics for the TLB lookup will probably make the code less
portable. I don't know how compatible the GCC and LLVM vector intrinsics are
with each other (there have been some efforts to make QEMU compile with LLVM
as well).

A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE in
cputlb.c), but the higher hit ratio could pay off, although I don't know how the
current size was chosen.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: outlined TLB lookup on x86

Xin Tong
Hi Lluís

We can probably generate vector instructions via TCG, e.g. add support to TCG to emit vector instructions directly into the code cache.

Why would a larger TLB make some operations slower? The TLB is a direct-mapped hash, and lookup should be O(1) there. In cputlb, CPU_TLB_SIZE is always used to index into the TLB, i.e. (X & (CPU_TLB_SIZE - 1)).

Thank you
Xin


On Wed, Nov 27, 2013 at 5:12 AM, Lluís Vilanova <[hidden email]> wrote:
Xin Tong writes:

> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few questions.

> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and jumps
> to the next emulated instruction. I am wondering is it easy to outline the code
> for the slow path. I am thinking when a TLB misses, the outlined TLB lookup code
> should generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk
> the TLB after its refilled ? This code is off the critical path, so its not as
> important as the code when TLB hits.
> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the TLB
> lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the overall
> time spent in the cpu_x86_handle_mmu_fault is still signifcant. I am thinking
> the tlb may need to be organized in a set associative fashion to reduce conflict
> miss, e.g. 2 way set associative to reduce the miss rate. or have a victim tlb
> that is 4 way associative and use x86 simd instructions to do the lookup once
> the direct-mapped tlb misses. Has anybody done any work on this front ?
> 3. what are some of the drawbacks of using a superlarge TLB, i.e. a TLB with 4K
> entries ?

Using vector intrinsics for the TLB lookup will probably make the code less
portable. I don't know how compatible are the GCC and LLVM vectorizing
intrinsics between each other (since there has been some efforts on making QEMU
also compile with LLVM).

A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE in
cputlb.c), but the higher hit ratio could pay off, although I don't know how the
current size was chosen.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth



Re: outlined TLB lookup on x86

Richard Henderson
In reply to this post by Xin Tong
On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few  questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and
> jumps to the next emulated instruction. I am wondering is it easy to outline
> the code for the slow path.

Hard.  There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, where
to continue after slow path.

The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see.  If you've got an idea for improvement, please share.  ;-)


> I am thinking when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
> ~MO_SIGN] and rewalk the TLB after its refilled ? This code is off the critical
> path, so its not as important as the code when TLB hits.

That would work for true TLB misses to RAM, but does not work for memory mapped
I/O.

> 2. why not use a TLB or bigger size?  currently the TLB has 1<<8 entries. the
> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the
> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that the lack of TLB contexts means
that we wind up flushing the TLB more often than real hardware does, and
therefore a larger TLB merely takes longer to flush.

But be aware that we can't simply make the change universally.  E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.

>  I am
> thinking the tlb may need to be organized in a set associative fashion to
> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
> have a victim tlb that is 4 way associative and use x86 simd instructions to do
> the lookup once the direct-mapped tlb misses. Has anybody done any work on this
> front ?

Even with SIMD, I don't believe you could make the fast-path of a set
associative lookup fast.  This is the sort of thing for which you really need
the dedicated hardware of the real TLB.  Feel free to prove me wrong with code,
of course.


r~


Re: outlined TLB lookup on x86

Xin Tong



On Wed, Nov 27, 2013 at 6:12 PM, Richard Henderson <[hidden email]> wrote:
On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few  questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and
> jumps to the next emulated instruction. I am wondering is it easy to outline
> the code for the slow path.

Hard.  There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, where
to continue after slow path.

The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see.  If you've got an idea for improvement, please share.  ;-)


> I am thinking when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
> ~MO_SIGN] and rewalk the TLB after its refilled ? This code is off the critical
> path, so its not as important as the code when TLB hits.

That would work for true TLB misses to RAM, but does not work for memory mapped
I/O.

> 2. why not use a TLB or bigger size?  currently the TLB has 1<<8 entries. the
> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the
> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that lack of TLB contexts mean that we
wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.
 
Hardware TLBs are limited in size primarily because increasing their size increases their access latency as well, but a software TLB does not suffer from that problem, so I think the size of the soft TLB should not be influenced by the size of the hardware TLB.

Flushing the TLB is cheap unless we have a really, really large TLB, e.g. one with 1M entries. I vaguely remember seeing ~8% of the time spent in the cpu_x86_handle_mmu_fault function in one of the SPEC CPU2006 workloads a while ago, so if we increase the size of the TLB significantly and get rid of most of the TLB misses, we can recover most of that 8%. (There are still compulsory misses and a few conflict misses, but I think compulsory misses are not the major player here.)
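The arithmetic behind this argument can be sketched with the standard expected-cost formula; the 10-instruction hit path, ~450-instruction refill, and 3% miss rate are the figures measured upthread with Pin:

```c
/* Expected instructions per guest memory access:
 * hit cost plus miss rate times refill cost. */
static double avg_access_cost(double hit_cost, double miss_rate,
                              double miss_cost)
{
    return hit_cost + miss_rate * miss_cost;
}
```

With those numbers, 10 + 0.03 * 450 ≈ 23.5 instructions per access: a 3% miss rate already more than doubles the average cost of a memory access, which is why shaving the miss rate pays off even though each individual miss is rare.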

But be aware that we can't simply make the change universally.  E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.
 
This can be handled with ifdefs; most of the TLB code common to all CPUs need not be changed.

>  I am
> thinking the tlb may need to be organized in a set associative fashion to
> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
> have a victim tlb that is 4 way associative and use x86 simd instructions to do
> the lookup once the direct-mapped tlb misses. Has anybody done any work on this
> front ?

Even with SIMD, I don't believe you could make the fast-path of a set
associative lookup fast.  This is the sort of thing for which you really need
the dedicated hardware of the real TLB.  Feel free to prove me wrong with code,
of course.

I am thinking the primary TLB should remain what it is, i.e. direct mapped, but we can add a victim TLB with higher associativity. The victim TLB can be walked either sequentially or in parallel with SIMD instructions. It is going to be slower than hitting in the direct-mapped TLB, but (much) better than having to rewalk the page table.
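A sequentially walked victim TLB along those lines might look like the sketch below. This is hypothetical illustration, not the implementation later posted to the list; the FIFO replacement and field names are my assumptions:

```c
#include <stdint.h>

#define VTLB_SIZE 8 /* small, fully associative */

typedef struct {
    uint64_t  vpage;  /* guest virtual page tag; 0 = unused in this sketch */
    uintptr_t addend; /* host minus guest address */
} VTLBEntry;

static VTLBEntry vtlb[VTLB_SIZE];
static unsigned  vtlb_next; /* FIFO replacement cursor */

/* Called when the direct-mapped primary TLB evicts an entry. */
static void vtlb_insert(uint64_t vpage, uintptr_t addend)
{
    vtlb[vtlb_next].vpage  = vpage;
    vtlb[vtlb_next].addend = addend;
    vtlb_next = (vtlb_next + 1) % VTLB_SIZE;
}

/* Called on a primary-TLB miss, before falling back to the page walk.
 * On a hit the entry would also be promoted back into the primary TLB. */
static int vtlb_lookup(uint64_t vpage, uintptr_t *addend)
{
    for (unsigned i = 0; i < VTLB_SIZE; i++) {
        if (vtlb[i].vpage == vpage) {
            *addend = vtlb[i].addend;
            return 1;
        }
    }
    return 0; /* miss: walk the page table */
}
```

Because VTLB_SIZE is small, even the sequential scan is far cheaper than the ~450-instruction refill it avoids on a conflict miss.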

For an OS with ASIDs, we can have a shared TLB tagged by ASID, and this can potentially get rid of some compulsory misses for us: e.g. multiple threads in a process share the same ASID, so we can check the shared TLB on a miss before walking the page table.


r~


Re: outlined TLB lookup on x86

Lluís Vilanova
In reply to this post by Xin Tong
Xin Tong writes:

> Hi LIuis
> we can probably generate vector intrinsics using the tcg, e.g. add support to
> tcg to emit vector instructions directly in code cache

There was some discussion long ago about adding vector instructions to TCG, but
I don't remember what was the conclusion.

Also remember that using vector instructions will "emulate" a low-associativity
TLB; I don't know how much better than a 1-way TLB that will be, though.


> why would a larger TLB make some operations slower, the TLB is a direct-mapped
> hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is always
> used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).

It would make TLB invalidations slower (e.g., see 'tlb_flush' in
"cputlb.c"). And right now QEMU performs full TLB invalidations more frequently
than the equivalent HW needs to, although I suppose that should be quantified
too.
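The cost being pointed at is easy to see in a sketch: invalidating every entry is linear in the table size, so a 4096-entry TLB takes 16x as long to flush as a 256-entry one. The entry layout here is a simplified assumption, not cputlb.c's actual structure:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t  addr_read; /* guest virtual page tag */
    uintptr_t addend;
} TLBEntry;

/* tlb_flush-style full invalidation: touch every entry, O(n) in size. */
static void tlb_flush_all(TLBEntry *tlb, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        tlb[i].addr_read = UINT64_MAX; /* a tag no page address matches */
    }
}
```

UINT64_MAX works as the "invalid" tag because real page tags are page-aligned and so never have all their low bits set.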


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: outlined TLB lookup on x86

Xin Tong



On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <[hidden email]> wrote:
Xin Tong writes:

> Hi LIuis
> we can probably generate vector intrinsics using the tcg, e.g. add support to
> tcg to emit vector instructions directly in code cache

There was some discussion long ago about adding vector instructions to TCG, but
I don't remember what was the conclusion.

Also remember that using vector instructions will "emulate" a low-associativity
TLB; don't know how much better than a 1-way TLB will that be, though.


> why would a larger TLB make some operations slower, the TLB is a direct-mapped
> hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is always
> used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).

It would make TLB invalidations slower (e.g., see 'tlb_flush' in
"cputlb.c"). And right now QEMU performs full TLB invalidations more frequently
than the equivalent HW needs to, although I suppose that should be quantified
too.

You are right, Lluís. QEMU does context switch quite a bit more often than real hardware, probably primarily because QEMU is orders of magnitude slower than real hardware. I am wondering where the timer is emulated in qemu-system-x86_64; I imagine the guest OS must program the timers to raise interrupts for context switches.

Another question: what happens when a vCPU is stuck in an infinite loop? QEMU must need a timer interrupt somewhere as well?

Is my understanding correct ?

Xin  

Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: outlined TLB lookup on x86

Avi Kivity-2
In reply to this post by Richard Henderson
On 11/28/2013 04:12 AM, Richard Henderson wrote:
>> 2. why not use a TLB or bigger size?  currently the TLB has 1<<8 entries. the
>> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
>> measured this using Intel PIN. so even the miss rate is low (say 3%) the
>> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance.  But I suspect that lack of TLB contexts mean that we
> wind up flushing the TLB more often than real hardware does, and therefore a
> larger TLB merely takes longer to flush.
>


You could use a generation counter to flush the TLB in O(1) by
incrementing the counter.  That slows down the fast path though. Maybe
you can do that for the larger second level TLB only.
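A sketch of the generation-counter idea (field names are hypothetical; the extra compare on the fast path is visible in tlb_hit):

```c
#include <stdint.h>

typedef struct {
    uint64_t vpage; /* guest virtual page tag */
    uint32_t gen;   /* generation this entry was filled in */
} GenTLBEntry;

static uint32_t tlb_gen = 1;

/* O(1) flush: bump the generation; every stale entry stops matching. */
static void tlb_flush_o1(void)
{
    tlb_gen++;
}

/* The fast path now needs one extra load and compare -- the slowdown
 * mentioned above, and why this may fit a second-level TLB better. */
static int tlb_hit(const GenTLBEntry *e, uint64_t vpage)
{
    return e->vpage == vpage && e->gen == tlb_gen;
}
```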


Re: outlined TLB lookup on x86

Xin Tong
In reply to this post by Lluís Vilanova



On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <[hidden email]> wrote:
Xin Tong writes:

> Hi LIuis
> we can probably generate vector intrinsics using the tcg, e.g. add support to
> tcg to emit vector instructions directly in code cache

There was some discussion long ago about adding vector instructions to TCG, but
I don't remember what was the conclusion.



Hi Lluís

Can you please forward me that email if it is not too difficult to find? Otherwise it is OK.

Thank you 
Xin
Also remember that using vector instructions will "emulate" a low-associativity
TLB; don't know how much better than a 1-way TLB will that be, though.


> why would a larger TLB make some operations slower, the TLB is a direct-mapped
> hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is always
> used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).

It would make TLB invalidations slower (e.g., see 'tlb_flush' in
"cputlb.c"). And right now QEMU performs full TLB invalidations more frequently
than the equivalent HW needs to, although I suppose that should be quantified
too.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: outlined TLB lookup on x86

Lluís Vilanova
Xin Tong writes:

> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <[hidden email]> wrote:
>     Xin Tong writes:
   
>> Hi LIuis
>> we can probably generate vector intrinsics using the tcg, e.g. add support
>     to
>> tcg to emit vector instructions directly in code cache
   
   
>     There was some discussion long ago about adding vector instructions to TCG,
>     but
>     I don't remember what was the conclusion.
   

> Hi LIuis

> Can you please forward me that email if it is not difficult to find. otherwise,
> it is ok!.

Sorry, I can't remember when it was. You'll have to look it up in the mailing
list archive.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: outlined TLB lookup on x86

Xin Tong
In reply to this post by Xin Tong
On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <[hidden email]> wrote:

>
>
>
> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <[hidden email]> wrote:
>>
>> Xin Tong writes:
>>
>> > Hi LIuis
>> > we can probably generate vector intrinsics using the tcg, e.g. add
>> > support to
>> > tcg to emit vector instructions directly in code cache
>>
>> There was some discussion long ago about adding vector instructions to
>> TCG, but
>> I don't remember what was the conclusion.
>>
>> Also remember that using vector instructions will "emulate" a
>> low-associativity
>> TLB; don't know how much better than a 1-way TLB will that be, though.
>>
>>
>> > why would a larger TLB make some operations slower, the TLB is a
>> > direct-mapped
>> > hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is
>> > always
>> > used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).
>>
>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>> frequently
>> than the equivalent HW needs to, although I suppose that should be
>> quantified
>> too.

I see QEMU executing ~1M instructions per context switch in
qemu-system-x86_64. Is this because the periodic timer interrupt is
delivered in real time, while QEMU is significantly slower than real
hardware?

Xin

>>
> you are right LIuis. QEMU does context switch quite more often that real hw,
> this is probably primarily due to the fact that QEMU is magnitude slower
> than real hw.  I am wondering where timer is emulated in QEMU system-x86_64.
> I imagine the guest OS must program the timers to do interrupt for context
> switches.
>
> Another question, what happens when a vcpu is stuck in an infinite loop ?
> QEMU must need an timer interrupt somewhere as well ?
>
> Is my understanding correct ?
>
> Xin
>>
>>
>> Lluis
>>
>> --
>>  "And it's much the same thing with knowledge, for whenever you learn
>>  something new, the whole world becomes that much richer."
>>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>>  Tollbooth
>
>


Re: outlined TLB lookup on x86

Xin Tong
Why is the QEMU TLB organized by mode? E.g. on x86 there are 3 modes.
What I think is that there may be conflicts between virtual addresses
and physical addresses; organizing it by mode guarantees that QEMU does
not hit a physical-address translation entry when in user mode, and
vice versa?

Thank you,
Xin

On Tue, Dec 17, 2013 at 10:52 PM, Xin Tong <[hidden email]> wrote:

> On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <[hidden email]> wrote:
>>
>>
>>
>> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <[hidden email]> wrote:
>>>
>>> Xin Tong writes:
>>>
>>> > Hi LIuis
>>> > we can probably generate vector intrinsics using the tcg, e.g. add
>>> > support to
>>> > tcg to emit vector instructions directly in code cache
>>>
>>> There was some discussion long ago about adding vector instructions to
>>> TCG, but
>>> I don't remember what was the conclusion.
>>>
>>> Also remember that using vector instructions will "emulate" a
>>> low-associativity
>>> TLB; don't know how much better than a 1-way TLB will that be, though.
>>>
>>>
>>> > why would a larger TLB make some operations slower, the TLB is a
>>> > direct-mapped
>>> > hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is
>>> > always
>>> > used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).
>>>
>>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>>> frequently
>>> than the equivalent HW needs to, although I suppose that should be
>>> quantified
>>> too.
>
> I see QEMU executed ~1M instructions per context switch for
> qemu-system-x86_64. Is this because of the fact that the periodical
> time interval interrupt is delivered in real time while QEMU is
> significantly slower than real hw ?
>
> Xin
>
>>>
>> you are right LIuis. QEMU does context switch quite more often that real hw,
>> this is probably primarily due to the fact that QEMU is magnitude slower
>> than real hw.  I am wondering where timer is emulated in QEMU system-x86_64.
>> I imagine the guest OS must program the timers to do interrupt for context
>> switches.
>>
>> Another question, what happens when a vcpu is stuck in an infinite loop ?
>> QEMU must need an timer interrupt somewhere as well ?
>>
>> Is my understanding correct ?
>>
>> Xin
>>>
>>>
>>> Lluis
>>>
>>> --
>>>  "And it's much the same thing with knowledge, for whenever you learn
>>>  something new, the whole world becomes that much richer."
>>>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>>>  Tollbooth
>>
>>


Re: outlined TLB lookup on x86

Xin Tong
Hi

I have found that adding a small (8-entry) fully associative victim
TLB (http://en.wikipedia.org/wiki/Victim_Cache) in front of the refill
path (page-table walking) significantly improves the performance of
QEMU x86_64 system emulation on the SPECint2006 benchmarks. This is
primarily because the primary TLB is direct mapped and suffers from
conflict misses. I have this implemented on QEMU trunk and would like
to contribute it back to QEMU. Where should I start?

Xin

On Tue, Dec 17, 2013 at 8:22 PM, Xin Tong <[hidden email]> wrote:

> why is QEMU TLB organized based on the modes, e.g. on x86 there are 3
> modes. what i think is that there may be conflicts between virtual
> addresses and physical addresses. organizing it by modes guarantees
> that QEMU does not hit a physical address translation entry when in
> user mode and vice versa ?
>
> Thank you,
> Xin
>
> On Tue, Dec 17, 2013 at 10:52 PM, Xin Tong <[hidden email]> wrote:
>> On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <[hidden email]> wrote:
>>>
>>>
>>>
>>> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <[hidden email]> wrote:
>>>>
>>>> Xin Tong writes:
>>>>
>>>> > Hi LIuis
>>>> > we can probably generate vector intrinsics using the tcg, e.g. add
>>>> > support to
>>>> > tcg to emit vector instructions directly in code cache
>>>>
>>>> There was some discussion long ago about adding vector instructions to
>>>> TCG, but
>>>> I don't remember what was the conclusion.
>>>>
>>>> Also remember that using vector instructions will "emulate" a
>>>> low-associativity
>>>> TLB; don't know how much better than a 1-way TLB will that be, though.
>>>>
>>>>
>>>> > why would a larger TLB make some operations slower, the TLB is a
>>>> > direct-mapped
>>>> > hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is
>>>> > always
>>>> > used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).
>>>>
>>>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>>>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>>>> frequently
>>>> than the equivalent HW needs to, although I suppose that should be
>>>> quantified
>>>> too.
>>
>> I see QEMU executed ~1M instructions per context switch for
>> qemu-system-x86_64. Is this because of the fact that the periodical
>> time interval interrupt is delivered in real time while QEMU is
>> significantly slower than real hw ?
>>
>> Xin
>>
>>>>
>>> you are right LIuis. QEMU does context switch quite more often that real hw,
>>> this is probably primarily due to the fact that QEMU is magnitude slower
>>> than real hw.  I am wondering where timer is emulated in QEMU system-x86_64.
>>> I imagine the guest OS must program the timers to do interrupt for context
>>> switches.
>>>
>>> Another question, what happens when a vcpu is stuck in an infinite loop ?
>>> QEMU must need an timer interrupt somewhere as well ?
>>>
>>> Is my understanding correct ?
>>>
>>> Xin
>>>>
>>>>
>>>> Lluis
>>>>
>>>> --
>>>>  "And it's much the same thing with knowledge, for whenever you learn
>>>>  something new, the whole world becomes that much richer."
>>>>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>>>>  Tollbooth
>>>
>>>


Re: outlined TLB lookup on x86

Peter Maydell-5
On 21 January 2014 14:22, Xin Tong <[hidden email]> wrote:
> I have found that adding a small (8-entry) fully associative victim
> TLB (http://en.wikipedia.org/wiki/Victim_Cache) before the refill path
> (page table walking) improves the performance of QEMU x86_64 system
> emulation mode significantly on the specint2006 benchmarks. This is
> primarily due to the fact that the primary TLB is directly mapped and
> suffer from conflict misses. I have this implemented on QEMU trunk and
> would like to contribute this back to QEMU. Where should i start ?

The wiki page http://wiki.qemu.org/Contribute/SubmitAPatch
tries to describe our usual process for reviewing code
submissions. If you make sure your changes follow the
guidelines described there and then send them to the mailing
list as a series of patch emails in the right format, we
can start reviewing the code.

thanks
-- PMM


Re: outlined TLB lookup on x86

Xin Tong
In reply to this post by Richard Henderson
On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <[hidden email]> wrote:

> On 11/27/2013 08:41 PM, Xin Tong wrote:
>> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
>> x86-64 machine, potentially for better instruction cache performance, I have a
>> few  questions.
>>
>> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
>> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
>> the generated slow path and slow path refills the TLB, then load/store and
>> jumps to the next emulated instruction. I am wondering is it easy to outline
>> the code for the slow path.
>
> Hard.  There's quite a bit of code on that slow path that's unique to the
> surrounding code context -- which registers contain inputs and outputs, where
> to continue after slow path.
>
> The amount of code that's in the TB slow path now is approximately minimal, as
> far as I can see.  If you've got an idea for improvement, please share.  ;-)
>
>
>> I am thinking when a TLB misses, the outlined TLB
>> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
>> ~MO_SIGN] and rewalk the TLB after its refilled ? This code is off the critical
>> path, so its not as important as the code when TLB hits.
>
> That would work for true TLB misses to RAM, but does not work for memory mapped
> I/O.
>
>> 2. why not use a TLB or bigger size?  currently the TLB has 1<<8 entries. the
>> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
>> measured this using Intel PIN. so even the miss rate is low (say 3%) the
>> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
>
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance.  But I suspect that lack of TLB contexts mean that we
> wind up flushing the TLB more often than real hardware does, and therefore a
> larger TLB merely takes longer to flush.
>
> But be aware that we can't simply make the change universally.  E.g. ARM can
> use an immediate 8-bit operand during the TLB lookup, but would have to use
> several insns to perform a 9-bit mask.


Hi Richard

I've done some experiments on increasing the size of the TLB.
Increasing it from 256 entries to 4096 gives a significant performance
improvement on the SPECint2006 benchmarks under qemu-system-x86_64
running on an x86_64 Linux machine. I am in the process of exploring
more TLB sizes and will post the data when I am done.

Can you tell me whether ARM is the only architecture that requires
special treatment for increasing the TLB size beyond 256 entries, so
that I can whip up a patch for QEMU mainline?

Thank you,
Xin

>
>> I am
>> thinking the TLB may need to be organized in a set-associative fashion to
>> reduce conflict misses, e.g. 2-way set associative to reduce the miss rate, or
>> have a victim TLB that is 4-way associative and use x86 SIMD instructions to do
>> the lookup once the direct-mapped TLB misses. Has anybody done any work on this
>> front?
>
> Even with SIMD, I don't believe you could make the fast-path of a set
> associative lookup fast.  This is the sort of thing for which you really need
> the dedicated hardware of the real TLB.  Feel free to prove me wrong with code,
> of course.
>
>
> r~
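For what it's worth, the parallel probe Xin proposes is at least straightforward to sketch with SSE2 intrinsics; whether it could ever beat the direct-mapped fast path is exactly the open question Richard raises. Everything below is hypothetical: the set layout is invented, tags are assumed to be 32-bit page numbers, and invalid ways are assumed to hold a sentinel tag that can never match.

```c
#include <emmintrin.h>   /* SSE2, baseline on x86-64 */
#include <stdint.h>

/* One 4-way victim-TLB set: four 32-bit page tags stored contiguously so
 * a single 128-bit compare probes all four ways at once. */
typedef struct {
    uint32_t tag[4];
} VictimSet;

/* Probe all four ways in parallel.  Returns the matching way (0-3), or
 * -1 if no way holds the tag. */
static int victim_probe(const VictimSet *set, uint32_t tag)
{
    __m128i tags = _mm_loadu_si128((const __m128i *)set->tag);
    __m128i key  = _mm_set1_epi32((int)tag);
    __m128i eq   = _mm_cmpeq_epi32(tags, key);
    int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
    if (mask == 0) {
        return -1;
    }
    return __builtin_ctz(mask);  /* lowest matching way */
}
```

Even in this best case the probe is several instructions plus a branch on the miss mask, on top of the direct-mapped lookup that already failed, which illustrates why it only plausibly pays off as a victim structure rather than in the primary fast path.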


Re: outlined TLB lookup on x86

Richard Henderson
On 01/22/2014 07:28 AM, Xin Tong wrote:
> Can you tell me whether ARM is the only architecture that requires
> special treatment for increasing the TLB size beyond 256 entries, so
> that I can whip up a patch for QEMU mainline?

The major constraint for the non-arm ports is

    CPU_TLB_ENTRY_BITS + CPU_TLB_BITS <= immediate bit size

I.e.

    (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS

is representable as an immediate within an AND instruction.

MIPS has a 16-bit unsigned immediate, and as written would generate bad code
for CPU_TLB_BITS > 11.

I386 has a 32-bit signed immediate, and would generate bad code for
CPU_TLB_BITS > 26.  Though I can't imagine you want to make it that big.

SPARC has a 13-bit signed immediate, but it's written with a routine that
checks the size of the constant and loads it into a register if necessary.
Which is good, because that's clearly already happening for CPU_TLB_BITS > 7.

AArch64, ia64, ppc, ppc64 all use fully capable extract-bit-field type insns
and could handle any change you make.

S390 is written using generic routines like sparc, so it won't fail with any
change.  It ought to be adjusted to use the extract-bit-field type insns that
exist in the current generation of machines.  The oldest generation of machine
would have reduced performance with CPU_TLB_BITS > 11.

ARM is also a case in which armv6t2 and later could be written with an
extract-bit-field insn, but previous versions would need to use 2 insns to form
the constant.  But at least we'd be able to combine the shift and AND insns.


r~
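The per-host limits quoted above follow from a simple bit count. Assuming a 32-byte TLB entry (i.e. CPU_TLB_ENTRY_BITS == 5, as for a 64-bit target -- an assumption, but one consistent with every number in the message), the MIPS, i386, and SPARC cutoffs all fall out of:

```c
/* Bits needed to represent the fast-path AND mask
 * (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS as a host immediate. */
static int tlb_mask_bits(int cpu_tlb_bits, int cpu_tlb_entry_bits)
{
    return cpu_tlb_bits + cpu_tlb_entry_bits;
}

/* Does the mask fit the host's AND-immediate field?  An unsigned field
 * can use all its bits; a signed field loses one bit to the sign. */
static int mask_fits(int mask_bits, int imm_bits, int is_signed)
{
    return mask_bits <= imm_bits - (is_signed ? 1 : 0);
}
```

With CPU_TLB_ENTRY_BITS == 5: MIPS's 16-bit unsigned immediate fails above CPU_TLB_BITS == 11, i386's 32-bit signed immediate above 26, and SPARC's 13-bit signed immediate above 7, matching the message exactly.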


Re: outlined TLB lookup on x86

Peter Maydell-5
In reply to this post by Xin Tong
On 22 January 2014 15:28, Xin Tong <[hidden email]> wrote:
> On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <[hidden email]> wrote:
>> I'd be interested to experiment with different TLB sizes, to see what effect
>> that has on performance.  But I suspect that lack of TLB contexts mean that we
>> wind up flushing the TLB more often than real hardware does, and therefore a
>> larger TLB merely takes longer to flush.

> I've done some experiments on increasing the size of the TLB.
> Increasing it from 256 entries to 4096 entries gives a significant
> performance improvement on the SPECint2006 benchmarks on
> qemu-system-x86_64 running on an x86_64 Linux machine. I am in the
> process of exploring more TLB sizes and will post the data once I am
> done.

Of course a single big benchmark program is probably the best case
for "not having lots of TLB flushing". It would probably also be instructive
to benchmark other cases, like OS bootup, running multiple different
programs simultaneously and system call heavy workloads.

Has anybody ever looked at implementing proper TLB contexts?

thanks
-- PMM


Re: outlined TLB lookup on x86

Richard Henderson
On 01/22/2014 08:55 AM, Peter Maydell wrote:
> Has anybody ever looked at implementing proper TLB contexts?

I've thought about it.  The best I could come up with is a pointer within ENV
that points to the current TLB context.  It definitely adds another load insn
on the fast path, but we should be able to schedule that first, since it
depends on nothing but the mem_index constant.  Depending on the schedule, it
may require reserving another register on the fast path, which could be a
problem for i686.

It would also greatly expand the size of ENV.

E.g. Alpha would need to implement 256 contexts to match the hardware.  We
currently get away with pretending to implement contexts by implementing none
at all, and flushing the TLB at every context change.

Our current TLB size is 8k.  Times 256 contexts is 2MB.  Which might just be
within the range of possibility.  Certainly not if we expand the size of the
individual TLBs.

Although interestingly, for Alpha we don't need 256 * NB_MMU_MODES, because
MMU_KERNEL_IDX always uses context 0, the "global context".

I don't recall enough about the intimate details of other TLB hardware to know
if there are similar savings that can be had elsewhere.  But the amount of
memory involved is large enough to suggest that some sort of target-specific
sizing of the number of contexts might be required.


r~
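A minimal sketch of the ENV layout described above, with the current context selected through one pointer (the extra fast-path load), so that an ASN switch becomes a pointer swap rather than a flush. All names and the context count here are illustrative, not QEMU's actual structures.

```c
#include <stdint.h>

#define CPU_TLB_SIZE 256
#define NB_CONTEXTS  4      /* target-specific; Alpha would want 256 */

typedef struct {
    uintptr_t addr_read;
    uintptr_t addend;
} TLBEntryModel;

typedef struct {
    TLBEntryModel entries[CPU_TLB_SIZE];
} TLBContext;

typedef struct {
    TLBContext *cur_tlb;                /* the extra load on the fast path */
    TLBContext  contexts[NB_CONTEXTS];  /* this is what bloats ENV */
} CPUStateModel;

/* A context switch becomes a pointer store instead of a full TLB flush;
 * entries cached under other contexts survive untouched. */
static void switch_context(CPUStateModel *env, int ctx)
{
    env->cur_tlb = &env->contexts[ctx];
}
```

The fast path would then index `env->cur_tlb->entries[...]` instead of a fixed-offset array, which is the extra load (and possibly extra register) mentioned above.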


Re: outlined TLB lookup on x86

Peter Maydell-5
On 22 January 2014 17:32, Richard Henderson <[hidden email]> wrote:

> On 01/22/2014 08:55 AM, Peter Maydell wrote:
>> Has anybody ever looked at implementing proper TLB contexts?
>
> I've thought about it.  The best I could come up with is a pointer within ENV
> that points to the current TLB context.  It definitely adds another load insn
> on the fast path, but we should be able to schedule that first, since it
> depends on nothing but the mem_index constant.  Depending on the schedule, it
> may require reserving another register on the fast path, which could be a
> problem for i686.
>
> It would also greatly expand the size of ENV.
>
> E.g. Alpha would need to implement 256 contexts to match the hardware.  We
> currently get away with pretending to implement contexts by implementing none
> at all, and flushing the TLB at every context change.

I don't really know the details of Alpha, but can you get away with just
"we implement N contexts, and only actually keep the most recently
used N"? This is effectively what we're doing at the moment, with N==1.

(ASIDs on ARM are 16 bit now, so we definitely wouldn't want to keep
an entire TLB for each ASID; if we ever implemented virtualization there's
another 8 bits of VMID and each VMID has its own ASID range...)

thanks
-- PMM


Re: outlined TLB lookup on x86

Richard Henderson
On 01/22/2014 09:35 AM, Peter Maydell wrote:
> I don't really know the details of Alpha, but can you get away with just
> "we implement N contexts, and only actually keep the most recently
> used N"? This is effectively what we're doing at the moment, with N==1.

Yes, I suppose we could do that.  Rather than have the ASN switching code be a
simple store, it could be a full helper function.  At which point we can do
just about anything at all.


r~
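The "keep only the most recently used N" scheme amounts to a small LRU cache mapping guest ASN/ASID to a TLB slot, where eviction is where the flush cost lands: reusing a slot for a new ASN is the moment that slot's TLB must be flushed. A hypothetical sketch of the bookkeeping only:

```c
#include <string.h>

#define N_SLOTS 4   /* N cached contexts; the current code is N == 1 */

typedef struct {
    int asn[N_SLOTS];        /* which guest ASN each slot caches; -1 = free */
    unsigned stamp[N_SLOTS]; /* last-use tick, for LRU eviction */
    unsigned tick;
    int flushes;             /* slot reuses, each costing a TLB flush */
} ContextCache;

static void cc_init(ContextCache *cc)
{
    memset(cc, 0, sizeof *cc);
    for (int i = 0; i < N_SLOTS; i++) {
        cc->asn[i] = -1;
    }
}

/* Return the slot for this ASN.  A hit just refreshes the timestamp; a
 * miss evicts the least recently used slot (free slots, stamp 0, are
 * naturally oldest) and counts the flush that reuse would require. */
static int cc_lookup(ContextCache *cc, int asn)
{
    int victim = 0;
    cc->tick++;
    for (int i = 0; i < N_SLOTS; i++) {
        if (cc->asn[i] == asn) {
            cc->stamp[i] = cc->tick;
            return i;
        }
        if (cc->stamp[i] < cc->stamp[victim]) {
            victim = i;
        }
    }
    cc->asn[victim] = asn;
    cc->stamp[victim] = cc->tick;
    cc->flushes++;
    return victim;
}
```

With 16-bit ARM ASIDs (plus VMIDs under virtualization) the full context space is hopeless to cache, but a scheme like this only pays for the contexts actually in rotation.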
