qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 3/6] kvmvapic: Introduce TPR access optimization


From: Jan Kiszka
Subject: Re: [Qemu-devel] [PATCH 3/6] kvmvapic: Introduce TPR access optimization for Windows guests
Date: Thu, 09 Feb 2012 16:39:45 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); de; rv:1.8.1.12) Gecko/20080226 SUSE/2.0.0.12-1.1 Thunderbird/2.0.0.12 Mnenhy/0.7.5.666

On 2012-02-09 16:18, Avi Kivity wrote:
> On 02/05/2012 02:39 PM, Jan Kiszka wrote:
>> From: Jan Kiszka <address@hidden>
>>
>> This enables acceleration for MMIO-based TPR registers accesses of
>> 32-bit Windows guest systems. It is mostly useful with KVM enabled,
>> either on older Intel CPUs (without flexpriority feature, can also be
>> manually disabled for testing) or any current AMD processor.
>>
>> The approach introduced here is derived from the original version of
>> qemu-kvm. It was refactored, documented, and extended by support for
>> user space APIC emulation, both with and without KVM acceleration. 
> 
> However, it's presented as a rewrite instead of a series of changes, so
> we can't see what the changes are.

Yes, but the original code also depends on interfaces we don't have in
upstream.

> 
> Having said that, the end result is way, way better than the original.
> 
>> The
>> VMState format was kept compatible, so was the ABI to the option ROM
>> that implements the guest-side para-virtualized driver service. This
>> enables seamless migration from qemu-kvm to upstream or, one day,
>> between KVM and TCG mode.
>>
>> The basic concept goes like this:
>>  - VAPIC PV interface consisting of I/O port 0x7e and (for KVM in-kernel
>>    irqchip) a vmcall hypercall is registered
>>  - VAPIC option ROM is loaded into guest
>>  - option ROM activates TPR MMIO access reporting via port 0x7e
>>  - TPR accesses are trapped and patched in the guest to call into option
>>    ROM instead, VAPIC support is enabled
>>  - option ROM TPR helpers track state in memory and invoke hypercall to
>>    poll for pending IRQs if required
>>  
> 
>> +
>> +static int evaluate_tpr_instruction(VAPICROMState *s, CPUState *env,
>> +                                    target_ulong *pip, int access)
>> +{
>> +    target_ulong addr_offset;
>> +    target_ulong ip = *pip;
>> +    uint8_t opcode[2];
>> +    uint32_t real_tpr_addr;
>> +
>> +    if ((ip & 0xf0000000) != 0x80000000 && (ip & 0xf0000000) != 0xe0000000) 
>> {
>> +        return -1;
>> +    }
>> +
>> +    /*
>> +     * Early Windows 2003 SMP initialization contains a
>> +     *
>> +     *   mov imm32, r/m32
>> +     *
>> +     * instruction that is patched by TPR optimization. The problem is that
>> +     * RSP, used by the patched instruction, is zero, so the guest gets a
>> +     * double fault and dies.
>> +     */
>> +    if (env->regs[R_ESP] == 0) {
>> +        return -1;
>> +    }
>> +
>> +    if (access == TPR_ACCESS_WRITE && kvm_enabled() &&
>> +        !kvm_irqchip_in_kernel()) {
>> +        /*
>> +         * KVM without TPR access reporting calls into the user space APIC 
>> on
>> +         * write with IP pointing after the accessing instruction. So we 
>> need
>> +         * to look backward to find the reason.
>> +         */
> 
> Why do we need to do anything at all?

We need to patch the causing instruction, so we have to know where it
starts. Or what do you mean?

> 
> I'm not sure if the ABI guarantees anything about %rip.

That's indeed a point. It's likely coupled to the emulator's internals
and when it calls out to user space for MMIO write. How to deal with it?

> 
>> +        ip -= 5;
>> +        if (cpu_memory_rw_debug(env, ip, opcode, 1, 0) < 0) {
>> +            return -1;
>> +        }
>> +        if (opcode[0] == 0xa3) {
>> +            addr_offset = 1;
>> +            goto instruction_ok;
>> +        }
>> +
>> +        ip--;
>> +        if (cpu_memory_rw_debug(env, ip, opcode, 2, 0) < 0) {
>> +            return -1;
>> +        }
>> +        if (opcode[0] == 0x89 && is_abs_modrm(opcode[1])) {
>> +            addr_offset = 2;
>> +            goto instruction_ok;
>> +        }
>> +
>> +        ip -= 4;
>> +        if (cpu_memory_rw_debug(env, ip, opcode, 2, 0) < 0) {
>> +            return -1;
>> +        }
>> +        if (opcode[0] != 0xc7 || modrm_reg(opcode[1]) != 0 ||
>> +            !is_abs_modrm(opcode[1])) {
>> +            return -1;
>> +        }
>> +        addr_offset = 2;
>> +    } else {
>> +        if (cpu_memory_rw_debug(env, ip, opcode, sizeof(opcode), 0) < 0) {
>> +            return -1;
>> +        }
>> +
>> +        switch (opcode[0]) {
>> +        case 0xc7: /* mov imm32, r/m32 (c7/0) */
>> +            if (modrm_reg(opcode[1]) != 0) {
>> +                return -1;
>> +            }
>> +            /* fall through */
>> +        case 0x89: /* mov r32 to r/m32 */
>> +        case 0x8b: /* mov r/m32 to r32 */
>> +            if (!is_abs_modrm(opcode[1])) {
>> +                return -1;
>> +            }
>> +            addr_offset = 2;
>> +            break;
>> +        case 0xa1: /* mov abs to eax */
>> +        case 0xa3: /* mov eax to abs */
>> +            addr_offset = 1;
>> +            break;
>> +        case 0xff: /* push r/m32 */
>> +            if (modrm_reg(opcode[1]) != 6 || !is_abs_modrm(opcode[1])) {
>> +                return -1;
>> +            }
>> +            addr_offset = 2;
>> +            break;
>> +        default:
>> +            return -1;
>> +        }
> 
> You could code this as a table of predicate functions and insn sizes. 
> When working backward you use the sizes to offset %rip.  This eliminates
> duplication.

OK, will check what it buys us.

> 
>> +    }
>> +
>> +instruction_ok:
>> +    /*
>> +     * Grab the virtual TPR address from the instruction
>> +     * and update the cached values.
>> +     */
>> +    if (cpu_memory_rw_debug(env, ip + addr_offset, (void *)&real_tpr_addr,
>> +                            sizeof(real_tpr_addr), 0) < 0) {
>> +        return -1;
>> +    }
>> +    real_tpr_addr = le32_to_cpu(real_tpr_addr);
>> +    if ((real_tpr_addr & 0xfff) != 0x80) {
>> +        return -1;
>> +    }
>> +    s->real_tpr_addr = real_tpr_addr;
>> +    update_guest_rom_state(s);
>> +
>> +    *pip = ip;
>> +    return 0;
>> +}
>> +
>> +
>> +/*
>> + * Tries to read the unique processor number from the Kernel Processor 
>> Control
>> + * Region (KPCR) of 32-bit Windows. Returns -1 if the KPCR cannot be 
>> accessed
>> + * or is considered invalid.
>> + */
>> +static int get_kpcr_number(CPUState *env)
>> +{
>> +    struct kpcr {
>> +        uint8_t  fill1[0x1c];
>> +        uint32_t self;
>> +        uint8_t  fill2[0x31];
>> +        uint8_t  number;
>> +    } QEMU_PACKED kpcr;
>> +
>> +    if (smp_cpus == 1) {
>> +        return 0;
>> +    }
>> +    if (cpu_memory_rw_debug(env, env->segs[R_FS].base,
>> +                            (void *)&kpcr, sizeof(kpcr), 0) < 0 ||
>> +        kpcr.self != env->segs[R_FS].base) {
> 
> Ah, so it works.  We may want to do it for UP as well, as a way of
> verifying that the guest is compatible with these hacks.

I'm not sure if Windows has this properly set up for the UP HAL. I
rather think this was a bug in the original implementation. The ROM uses
0 as CPU index in UP mode unconditionally, so should we in QEMU.

> 
>> +        return -1;
>> +    }
>> +    return kpcr.number;
>> +}
> 
>> +static void patch_instruction(VAPICROMState *s, CPUState *env, target_ulong 
>> ip)
>> +{
>> +    target_phys_addr_t paddr;
>> +    VAPICHandlers *handlers;
>> +    uint8_t opcode[2];
>> +    uint32_t imm32;
>> +
>> +    if (smp_cpus == 1) {
>> +        handlers = &s->rom_state.up;
>> +    } else {
>> +        handlers = &s->rom_state.mp;
>> +    }
>> +
>> +    cpu_memory_rw_debug(env, ip, opcode, sizeof(opcode), 0);
>> +
>> +    switch (opcode[0]) {
>> +    case 0x89: /* mov r32 to r/m32 */
>> +        patch_byte(env, ip, 0x50 + modrm_reg(opcode[1]));  /* push reg */
>> +        patch_call(s, env, ip + 1, handlers->set_tpr);
>> +        break;
>> +    case 0x8b: /* mov r/m32 to r32 */
>> +        patch_byte(env, ip, 0x90);
>> +        patch_call(s, env, ip + 1, handlers->get_tpr[modrm_reg(opcode[1])]);
>> +        break;
>> +    case 0xa1: /* mov abs to eax */
>> +        patch_call(s, env, ip, handlers->get_tpr[0]);
>> +        break;
>> +    case 0xa3: /* mov eax to abs */
>> +        patch_call(s, env, ip, handlers->set_tpr_eax);
>> +        break;
>> +    case 0xc7: /* mov imm32, r/m32 (c7/0) */
>> +        patch_byte(env, ip, 0x68);  /* push imm32 */
>> +        cpu_memory_rw_debug(env, ip + 6, (void *)&imm32, sizeof(imm32), 0);
>> +        cpu_memory_rw_debug(env, ip + 1, (void *)&imm32, sizeof(imm32), 1);
>> +        patch_call(s, env, ip + 5, handlers->set_tpr);
>> +        break;
>> +    case 0xff: /* push r/m32 */
>> +        patch_byte(env, ip, 0x50); /* push eax */
>> +        patch_call(s, env, ip + 1, handlers->get_tpr_stack);
>> +        break;
> 
> With a table, we could put the patch sequences there as well.

Yep, starting to become really attractive.

> 
>> +    default:
>> +        abort();
>> +    }
>> +
>> +    paddr = cpu_get_phys_page_debug(env, ip);
>> +    paddr += ip & ~TARGET_PAGE_MASK;
>> +    tb_invalidate_phys_page_range(paddr, paddr + 1, 1);
>> +}
>> +
>> +
>> +static void vapic_map_rom_writable(VAPICROMState *s)
>> +{
>> +    target_phys_addr_t rom_paddr = s->rom_state_paddr & ROM_BLOCK_MASK;
>> +    MemoryRegionSection section;
>> +    MemoryRegion *as;
>> +    size_t rom_size;
>> +    uint8_t *ram;
>> +
>> +    as = sysbus_address_space(&s->busdev);
>> +
>> +    if (s->rom_mapped_writable) {
>> +        memory_region_del_subregion(as, &s->rom);
>> +        memory_region_destroy(&s->rom);
>> +    }
>> +
>> +    /* grab RAM memory region (region @rom_paddr may still be pc.rom) */
>> +    section = memory_region_find(as, 0, 1);
>> +
>> +    /* read ROM size from RAM region */
>> +    ram = memory_region_get_ram_ptr(section.mr);
>> +    rom_size = ram[rom_paddr + 2] * ROM_BLOCK_SIZE;
>> +    s->rom_size = rom_size;
>> +
>> +    /* FIXME: round up as everything underneath would fall apart otherwise
>> +     * (subpages are broken) */
>> +    rom_size = TARGET_PAGE_ALIGN(rom_size);
> 
> Ok.  Subpages aren't broken (at least, they can't be expected to work
> here).  Without a page aligned region you can't expect kvm to map this area.
> 
> Wrt tcg, it should work, but I don't think anyone ever tested tcg out of
> subpage.

OK, will rephrase the reason.

> 
>> +
>> +    memory_region_init_alias(&s->rom, "kvmvapic-rom", section.mr, rom_paddr,
>> +                             rom_size);
>> +    memory_region_add_subregion_overlap(as, rom_paddr, &s->rom, 1000);
> 
> This is incredibly hacky, so at least the spirit of the original code is
> preserved.

I know, and it caused some pain to write it (not only to find out how to
solve it technically). We would need to pass the RAM memory region down
to this freaky device, like we do to the i440fx for PAM purposes. But,
well, that is not straightforward right now. Better ideas welcome.

> 
>> +    s->rom_mapped_writable = true;
>> +}
>> +
>>
> 
> Impressive refactoring overall; I thought that code was a basket case.
> 

Thanks,
Jan

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]