
Re: [Qemu-devel] softmmu thoughts


From: Piotras
Subject: Re: [Qemu-devel] softmmu thoughts
Date: Wed, 20 Oct 2004 02:13:24 +0200

Hi!

I have already experimented with a similar approach. I started from Qemu-fast, as it 
already uses a signal handler and mmap to set up the guest address space.

Qemu-fast requires that the virtual address space visible inside the emulator is mapped 
directly into the qemu process address space. Because of this, Qemu-fast uses a special 
memory layout. This means that it will always be much less portable than 
qemu-softmmu, but on the other hand it is much faster and can support code-copy 
to achieve near-native performance.

The goal of my experiment was to evaluate the possibility of using mmap-ed memory 
to improve the speed of softmmu without introducing the portability limitations of 
Qemu-fast. Because of this I used an indirection table, and the memory access code 
was very similar to yours:
  TYPE mem_read (uint32_t virtual_addr)
    {
      uint32_t entry;
      uint32_t physical_addr;

      entry = virtual_addr >> MAP_BLOCK_BITS;
      /* the entries in indirection_table compensate for the high bits of 
         virtual_addr, avoiding an extra "and" operation */
      physical_addr = CPUState->indirection_table[entry] + virtual_addr;
      return *(TYPE *)physical_addr;
    }

Each indirection_table entry points to a block of 2^(MAP_BLOCK_BITS - 12)+1 
pages of virtual memory. Each block contains pages that should be accessible 
at contiguous virtual addresses. Because of this (and the +1 in the formula above), 
a memory access that crosses a page boundary runs at full speed.
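For what it's worth, the lookup and the base-minus-guest-offset trick can be sketched in plain, portable C. This is only my illustration, not code from the patch: MAP_BLOCK_BITS = 16 and the names mem_read32/map_block are made up, and uintptr_t replaces uint32_t so the sketch also runs on a 64-bit host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_BLOCK_BITS 16                      /* example value: 64 KiB blocks */
#define MAP_BLOCK_SIZE (1u << MAP_BLOCK_BITS)

/* One entry per guest block.  Each entry stores
   (host_block_base - guest_block_base), so adding the *full* guest address
   lands on the right host byte with no masking of the high bits. */
static uintptr_t indirection_table[1u << (32 - MAP_BLOCK_BITS)];

static uint32_t mem_read32(uint32_t virtual_addr)
{
    uintptr_t entry = virtual_addr >> MAP_BLOCK_BITS;
    return *(uint32_t *)(indirection_table[entry] + virtual_addr);
}

/* Point the guest block containing guest_addr at host memory host_base. */
static void map_block(uint32_t guest_addr, void *host_base)
{
    uintptr_t entry = guest_addr >> MAP_BLOCK_BITS;
    indirection_table[entry] =
        (uintptr_t)host_base - (entry << MAP_BLOCK_BITS);
}
```

Mapping a guest block onto a malloc'd buffer and reading back through mem_read32 then returns whatever was stored at the corresponding offset, with no "and" in the fast path.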

I use a pool of blocks that is much smaller than the guest virtual address space, 
and a special block of inaccessible memory to trap memory accesses via entries 
of indirection_table that are not mapped to a valid block. If I need to allocate 
a new block and my pool is empty, I unmap the least recently allocated block.
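The eviction policy described above can be sketched as a simple FIFO ring over the pool slots. Again this is only an illustration of the idea: POOL_BLOCKS, alloc_block and the return convention are my inventions, and evicting the oldest entry is my reading of "least recently allocated".

```c
#include <assert.h>
#include <stdint.h>

#define POOL_BLOCKS 64            /* pool far smaller than the guest space */

static uint32_t mapped_blocks[POOL_BLOCKS];   /* guest block numbers, in allocation order */
static int pool_used;
static int pool_next;             /* next slot to (re)use = oldest entry when full */

/* Take a pool slot for guest_block.  Returns the guest block number that had
   to be evicted (its indirection_table entry would be re-pointed at the
   inaccessible trap block), or UINT32_MAX if no eviction was needed. */
static uint32_t alloc_block(uint32_t guest_block)
{
    uint32_t victim = UINT32_MAX;

    if (pool_used == POOL_BLOCKS)
        victim = mapped_blocks[pool_next];    /* oldest allocation: unmap it */
    else
        pool_used++;

    mapped_blocks[pool_next] = guest_block;
    pool_next = (pool_next + 1) % POOL_BLOCKS;
    return victim;
}
```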

The memory access is implemented in 4 x86 instructions:
    asm volatile (
        "mov    %3, %%eax\n"
        "shr    %2, %%eax\n"
        "mov    %1(%%ebp,%%eax,4), %%eax\n"
        "movl   (%3,%%eax,1), %0\n"
        : "=r" (result)
        : "m" (*(uint8_t *)offsetof(CPUX86State, indirection_table[0])),
          "I" (MAP_BLOCK_BITS),
          "r" (virtual_addr)
        : "%eax");
Only the last instruction can fault. Whenever the signal handler modifies the 
indirection_table entry, the new value is stored in the EAX register, so on 
instruction restart the new value is used.
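The restart property -- the faulting load simply re-executes once the handler has fixed things up -- can be demonstrated with a stand-alone Linux snippet. This is my sketch, not the patch's handler: it only remaps the page with mprotect and does not rewrite EAX through the ucontext as the real handler would, and a 4096-byte page plus Linux mmap flags are assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096            /* assumed host page size */

static volatile uint8_t *trap_page;

/* Make the trapping page readable and return: the kernel then restarts the
   faulting load, which now succeeds.  Any unexpected fault aborts. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    if (si->si_addr != (void *)trap_page ||
        mprotect((void *)trap_page, PAGE_SIZE, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

static int restart_demo(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    trap_page = mmap(NULL, PAGE_SIZE, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (trap_page == MAP_FAILED)
        return -1;

    return trap_page[0];          /* faults once, restarts, then reads 0 */
}
```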

I believe that MAP_BLOCK_BITS should be set to a value larger than 12 to limit 
the size of indirection_table and the fragmentation of the memory map (mmaped pages).
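To put numbers on that (my arithmetic, assuming 4-byte table entries covering a 4 GiB guest address space):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of indirection_table needed for a 4 GiB guest address space:
   2^(32 - map_block_bits) entries of 4 bytes each. */
static uint64_t table_bytes(unsigned map_block_bits)
{
    return 4ull << (32 - map_block_bits);
}
```

With MAP_BLOCK_BITS = 12 the table alone is 4 MiB; at 16 it drops to 256 KiB, and every extra bit halves it again.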

IIRC, the nbench results were about 20-30% better than with traditional 
qemu-softmmu. Linux also seemed faster. However, Windows 98 seemed much slower. 
The problem with Windows is that it does a lot of writes to the very same pages 
that the code is executing from, and this causes a lot of page faults.

I'm attaching the patch. It's very experimental, so expect bugs.


Piotrek


On Tue, 19 Oct 2004 22:27:57 +0200, Magnus Damm <address@hidden> wrote:
> Hello all,
> 
> Wouldn't it be possible to speed up the softmmu code by using some
> mmap() tricks?
> 
> u_int32_t mem_read(u_int32_t address)
> {
>   u_int8_t entry;
>   u_int32_t a;
> 
>   entry = CPUState->softmmu_lookup[address >> 12];
>   a = CPUState->softmmu_entries[entry].base + (address & 0xfff);
>   return *(u_int32_t *)a;
> }
> 
> The idea is to optimize so that the most common memory accesses become
> faster than today, while the more uncommon ones (crossing a page boundary)
> generate a signal and thus become slower. If I remember correctly, the
> code above will be around 7 x86 instructions long.
> 
> The code above will use 1 MiB of memory for the softmmu_lookup, one byte
> for each entry. A value of 0 means "not mapped" and softmmu_entries[0]
> will always point to a page that generates a signal. The other 255
> entries are used to map one virtual address to a base address of a
> two-page combination somewhere in memory. This two-page combination is
> actually two VMAs, where the first page maps to the correct simulated
> physical address. The second page is mapped as inaccessible and is used
> to generate a signal when a memory access crosses the page boundary.
> 
> And of course, there are many more things that must be done, including a
> complicated signal handler, and I guess that this kind of implementation
> is not really useful for mapping memory-mapped I/O. But maybe it is
> efficient for userspace?
> 
> Any thoughts?
> 
> / magnus



