Re: [Qemu-devel] Question about the block linking limitation


From: Max Filippov
Subject: Re: [Qemu-devel] Question about the block linking limitation
Date: Sun, 15 Apr 2012 02:54:24 +0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329 Thunderbird/11.0.1

On 04/14/2012 03:44 PM, 陳韋任 wrote:
>> I've made a test from the grub multiboot sample, you may find it here:
>> http://jcmvbkbc.spb.ru/git/?p=dumb/qemu-test-kernel.git;a=summary
>>
>> With it I see that an attempt to execute a TB that spans two pages causes
>> an exception when the second page is unmapped. It happens because both
>> tlb_flush and tlb_flush_page invalidate relevant tb_jmp_cache entries:
>> the former flushes all of them, the latter flushes them for two adjacent pages
>> around the given address. Later tb_find_fast fails to find a TB in the
>> tb_jmp_cache and has to call tb_find_slow which retranslates TB, triggering
>> a pagefault.
>
>  Thanks for the example, Max. But..., I want to repeat the experiment you did
> and cannot figure out how to do that. Would you mind to give me some hints?
> For example, how did you locate the TB spanning pages whose second page
> happened to be unmapped?

The first two patches in the mentioned repository are the grub multiboot kernel
sample; the third patch is my test. It can be built and run like this (you'll
need autotools):

$ git clone git://jcmvbkbc.spb.ru/dumb/qemu-test-kernel.git
$ cd qemu-test-kernel
$ git checkout HEAD~1           # to see how the original kernel works
$ ./autogen.sh
$ ./configure
$ make
$ qemu-system-x86_64 -kernel docs/kernel

According to the multiboot specification [1], a multiboot kernel starts its
execution in protected mode with paging disabled.

The following fragment allocates a properly aligned page directory and one page
table, sets up a 1:1 virtual-to-physical mapping for the first 4MB of memory,
loads the page directory address into CR3 and enables paging (bit 31 in CR0):

uint32_t page_directory[1024] __attribute__((aligned(4096)));
uint32_t page_table[1024] __attribute__((aligned(4096)));

static void start_paging(void)
{
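    unsigned i;
    uint32_t cr0;

    /* Flags 3 = present | writable. */
    for (i = 0; i < ARRAY_SIZE(page_table); ++i)
        page_table[i] = (i << 12) | 3;
    page_directory[0] = ((uint32_t)page_table) | 3;
    /* cr0 is a separate early-clobber output, so the input operand
     * holding the page directory address is never modified. */
    asm __volatile__ (
            "movl %1, %%cr3\n"
            "movl %%cr0, %0\n"
            "orl  $0x80000000, %0\n"
            "movl %0, %%cr0\n"
            : "=&r"(cr0) : "r"(page_directory) : "memory");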
}

The following fragment allocates two adjacent pages and places the test code
across the page boundary between them: 20 'nop' instructions (opcode 0x90),
10 in the first page and 10 in the second, followed by a 'ret' instruction
(opcode 0xc3):

uint8_t code_buf[8192] __attribute__((aligned(4096)));

static void make_test_code(void)
{
    unsigned i;
    for (i = 0; i < 20; ++i)
        code_buf[4096 - 10 + i] = 0x90;
    code_buf[4096 + 10] = 0xc3;

}

The following fragment makes a function pointer f point to the beginning of the
'nop' series and calls it to create a TB (and to check that the code works at
all). If a return is placed right after the first 'f();', the sample kernel
should print a few lines describing the memory map and halt.

Then 'code_pfn' is the page frame number of the second page of the test code.
'page_table[code_pfn] = 0;' marks that page as non-present, and the following
invlpg instruction invalidates its TLB entry. The commented-out code that
reloads the CR3 register may be used instead to invalidate the whole TLB. The
second 'f();' invocation fails, resulting in a machine reset (because the IDT
is not initialized).

static void test_code(void)
{
    void (*f)(void) = (void*)(code_buf + 4096 - 10);
    uint32_t code_pfn = (uint32_t)(code_buf + 4096) >> 12;
    f();

    page_table[code_pfn] = 0;

    //asm __volatile__ (
    //        "movl %%cr3, %%eax\n"
    //        "movl %%eax, %%cr3\n"
    //        ::: "eax", "memory");
    asm __volatile__ (
            "invlpg (%0)\n"
            : : "r"(code_buf + 4096) :"memory");
    f();
}

When the kernel is run with '-d in_asm,cpu,exec,int' I see the following
in the log:

IN: cmain
0x0000000000100272:  movb   $0xc3,0x10900a
0x0000000000100279:  mov    $0x108ff6,%ebx
0x000000000010027e:  call   *%ebx

Trace 0x4191ad90 [0000000000100272] cmain
EAX=00000014 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc4
EIP=00108ff6 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000cca10 00000027
IDT=     00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000014 CCD=00000000 CCO=SUBL
EFER=0000000000000000
----------------
IN:
0x0000000000108ff6:  nop
0x0000000000108ff7:  nop
0x0000000000108ff8:  nop
0x0000000000108ff9:  nop
0x0000000000108ffa:  nop
0x0000000000108ffb:  nop
0x0000000000108ffc:  nop
0x0000000000108ffd:  nop
0x0000000000108ffe:  nop
0x0000000000108fff:  nop
0x0000000000109000:  nop
0x0000000000109001:  nop
0x0000000000109002:  nop
0x0000000000109003:  nop
0x0000000000109004:  nop
0x0000000000109005:  nop
0x0000000000109006:  nop
0x0000000000109007:  nop
0x0000000000109008:  nop
0x0000000000109009:  nop
0x000000000010900a:  ret

Trace 0x4191ae60 [0000000000108ff6]
EAX=00000014 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc8
EIP=00100280 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000cca10 00000027
IDT=     00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000014 CCD=00000000 CCO=SUBL
EFER=0000000000000000
----------------
IN: cmain
0x0000000000100280:  mov    $0x109000,%eax
0x0000000000100285:  shr    $0xc,%eax
0x0000000000100288:  movl   $0x0,0x106000(,%eax,4)
0x0000000000100293:  mov    $0x109000,%eax
0x0000000000100298:  invlpg (%eax)

Trace 0x4191aed0 [0000000000100280] cmain
EAX=00109000 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc8
EIP=0010029b EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000cca10 00000027
IDT=     00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000212 CCD=00000109 CCO=SARL
EFER=0000000000000000
----------------
IN: cmain
0x000000000010029b:  call   *%ebx

Trace 0x4191afa0 [000000000010029b] cmain
EAX=00109000 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc4
EIP=00108ff6 EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000cca10 00000027
IDT=     00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000212 CCD=00000109 CCO=SARL
EFER=0000000000000000
check_exception old: 0xffffffff new 0xe
     0: v=0e e=0000 i=0 cpl=0 IP=0008:0000000000108ff6 pc=0000000000108ff6 SP=0010:0000000000104fc4 CR2=0000000000109000

That's it (:

>  Also, I found something interesting in function cpu_exec (cpu-exec.c). The
> code snip below will do block linking only when the target tb does NOT span
> guest pages. Is it necessary? According to your observation, it seems QEMU
> handle tb spanning pages appropriately, why it still needs to check if the
> target tb spanning guest pages?

Because QEMU's handling of TBs spanning pages happens in
tb_find_fast/tb_find_slow, which wouldn't be called in case of direct linking.
This can be easily verified with the test kernel by adding a direct short jump
(opcode 0xeb, jump target offset +8 bytes) to the test code:

static void make_test_code(void)
{
    unsigned i;
    code_buf[4096 - 20] = 0xeb;
    code_buf[4096 - 19] = 8;
    for (i = 0; i < 20; ++i)
        code_buf[4096 - 10 + i] = 0x90;
    code_buf[4096 + 10] = 0xc3;

}
static void test_code(void)
{
    void (*f)(void) = (void*)(code_buf + 4096 - 20);
...

> ---
>    if (next_tb != 0 && tb->page_addr[1] == -1) {
>                        ^^^^^^^^^^^^^^^^^^^^^^
>        tb_add_jump((TranslationBlock *)(next_tb & ~3), next_tb & 3, tb);
>    }
> ---
>
>  Finally, does the comment on gen_goto_tb (target-i386/translate.c) still
> hold? Maybe we should change it to something like "we handle the case where
> the block linking spans two pages here"?

I'd say that it does: the check is that pc is in the same page as the TB
beginning or the TB ending; the two only differ when the TB spans two pages.

> ---
>    /* NOTE: we handle the case where the TB spans two pages here */
>    if ((pc & TARGET_PAGE_MASK) == (tb->pc & TARGET_PAGE_MASK) ||
>        (pc & TARGET_PAGE_MASK) == ((s->pc - 1) & TARGET_PAGE_MASK))  {
>    }
> ---

[1] http://www.gnu.org/software/grub/manual/multiboot/multiboot.html#Machine-state

--
Thanks.
-- Max


