[Dotgnu-pnet-commits] CVS: pnet/doc unrolling.txt,NONE,1.1


From:	Rhys Weatherley <address@hidden>
Subject:	[Dotgnu-pnet-commits] CVS: pnet/doc unrolling.txt,NONE,1.1
Date:	Sat, 10 May 2003 06:35:56 -0400
Update of /cvsroot/dotgnu-pnet/pnet/doc
In directory subversions:/tmp/cvs-serv19801a/doc

Added Files:
        unrolling.txt 
Log Message:


Description of new unroller system.


--- NEW FILE ---

Introduction
------------

CVM unrolling is a mechanism for speeding up the Portable.NET runtime
engine using some simple JIT techniques.  This document describes what you
need to do to write a CVM unroller for a new CPU architecture.

The process of writing an unroller has been simplified compared to earlier
versions of the runtime engine.  Most of the hard work of instruction
decoding, stack management, register allocation, etc., has already been
done for you, and you just need to supply the CPU specifics.  In particular,
you need to provide the following:

        - CPU-specific modifications to the CVM configuration.
        - Lists of rules for allocating registers, using the FPU, etc.
        - Code generation macros for the CPU in question.

If you need help, then send an e-mail message to the "pnet-developers"
mailing list, or contact Rhys Weatherley directly.  To subscribe to the
mailing list, visit "http://www.dotgnu.org".

Modifying the CVM configuration
-------------------------------

The first thing to do is to modify the CVM configuration so that it
knows that you will be using the unroller.  Edit "pnet/engine/cvm_config.h"
and add some detection logic at the top of the file to detect your
architecture.  There is already logic there for x86, ARM, etc.

For example, the detection logic for a 32-bit architecture called "foo"
with little-endian words and word-aligned longs can be defined as follows:

        #if defined(__foo) || defined(__foo__)
                #define CVM_FOO
                #define CVM_LITTLE_ENDIAN
                #define CVM_LONGS_ALIGNED_WORD
                #define CVM_WORDS_AND_PTRS_SAME_SIZE
        #endif

The "CVM_FOO" macro will be used elsewhere to detect the CPU type.

Now, at the bottom of "pnet/engine/cvm_config.h", you need to add some
additional logic which defines the "IL_CVM_DIRECT_UNROLLED" macro.
For example:

        #if defined(IL_CVM_DIRECT) && defined(CVM_FOO) && \
                defined(__GNUC__) && !defined(IL_NO_ASM) && \
                !defined(IL_CVM_PROFILE_CVM_METHODS) && \
                !defined(IL_CVM_PROFILE_CVM_VAR_USAGE) && \
                defined(IL_CONFIG_UNROLL)
        #define IL_CVM_DIRECT_UNROLLED
        #endif

Finally, we need to add some logic to the top of "pnet/engine/cvm.c" to
perform manual register assignment.  It will look something like this:

        #elif defined(CVM_FOO) && defined(__GNUC__) && !defined(IL_NO_ASM)
                #define REGISTER_ASM_PC(x)      register x asm ("r1")
                #define REGISTER_ASM_STACK(x)   register x asm ("r2")
                #define REGISTER_ASM_FRAME(x)   register x asm ("r3")

The values "r1", "r2", and "r3" will probably be different for your CPU.
Look up your system's documentation to find three registers that are
normally used for local variables and which are saved across function calls.

These three manually-assigned registers will hold the important state
variables "pc", "stacktop", and "frame".

If you don't know which registers to choose, then ask on the pnet-developers
mailing list.  If your compiler cannot assign registers manually, then
there are other ways for the unroller to get the information, but they
are trickier to set up.  Contact pnet-developers for assistance.

You should now be able to recompile the runtime engine.  The compiler
will give you an error if the registers you chose are unsuitable.
The error might be strange, talking about "register spills".  If you
get such an error, go back and try different registers.

At this point, the engine is set up for unrolling but it isn't actually
doing any unrolling yet.  Re-test the engine - you will probably already
see a small performance improvement due to the manual register assignment.

Writing the CPU-specific rules
------------------------------

The next step is to make a file called "pnet/engine/md_foo.h".  This will
contain rules that tell the unroller how to assign registers and generate
code for your architecture.

If you need some extra helper macros, then put them into the file
"pnet/engine/md_foo_macros.h".  If some of your macros are complicated,
you may want to convert them into functions.  Put these functions into
"pnet/engine/md_foo.c" and update the "Makefile.am" file to include it.

We recommend starting with the "md_arm.h" file as a template, since ARM
is the simplest platform out of those that are currently supported.

If you want to make things easier on yourself, don't worry about
floating-point on the first pass - just get the integer operations working.
ARM is a good choice here because its unroller doesn't do floating-point.

The rest of this section describes the rule definitions in "md_foo.h":

MD_REG_<n>

        These macros define the word registers that are used for temporarily
        storing values during integer computations.  You can use up to 16
        registers for temporary work values.

        Even if your CPU has more than 16 registers, it is highly unlikely
        that the unroller will use more than 6 or 7 registers at any one time.
        You can experiment with greater numbers of registers later if you like.

        The registers you choose must not be used for any other purpose in
        the system.  e.g. you probably cannot use the CPU's stack pointer
        register as a temporary register.

        The order of MD_REG_<n> registers determines the order in which the
        unroller will allocate them to temporary values.  Usually the order
        will be unimportant.  The x86 CPU is an exception - more efficient
        code can be obtained for division and shift operations if the order
        starts with EAX, ECX, and then EDX.
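
        As an illustration only, a register list for the imaginary "foo"
        CPU might look like the sketch below.  The register numbers 4 to 9
        are invented, and marking unused slots with -1 is an assumption
        borrowed from the MD_FREG_<n> convention; check "md_arm.h" for the
        convention actually in use.  The helper function is hypothetical
        and exists only to show how many registers the list provides.

```c
/* Hypothetical excerpt from md_foo.h for the imaginary "foo" CPU.
   The register numbers 4 to 9 are invented for illustration. */
#define MD_REG_0        4
#define MD_REG_1        5
#define MD_REG_2        6
#define MD_REG_3        7
#define MD_REG_4        8
#define MD_REG_5        9
/* Assumption: unused slots are marked with -1. */
#define MD_REG_6        (-1)
#define MD_REG_7        (-1)
#define MD_REG_8        (-1)
#define MD_REG_9        (-1)
#define MD_REG_10       (-1)
#define MD_REG_11       (-1)
#define MD_REG_12       (-1)
#define MD_REG_13       (-1)
#define MD_REG_14       (-1)
#define MD_REG_15       (-1)

/* Hypothetical helper: count the slots that name a real register. */
static int md_foo_count_temp_regs(void)
{
    static const int regs[16] = {
        MD_REG_0,  MD_REG_1,  MD_REG_2,  MD_REG_3,
        MD_REG_4,  MD_REG_5,  MD_REG_6,  MD_REG_7,
        MD_REG_8,  MD_REG_9,  MD_REG_10, MD_REG_11,
        MD_REG_12, MD_REG_13, MD_REG_14, MD_REG_15
    };
    int count = 0;
    int index;
    for (index = 0; index < 16; ++index) {
        if (regs[index] != -1) {
            ++count;
        }
    }
    return count;
}
```

        With this list, the unroller would try register 4 first, then 5,
        and so on, since allocation follows the order of the list.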

MD_FREG_<n>

        These macros define the floating-point registers that are used during
        floating-point computations.  If your architecture doesn't have
        floating-point operations, or you don't wish to do floating-point
        at this time, then set all of them to -1.

MD_FP_STACK_SIZE

        Some CPUs (e.g. x86) organise their floating-point registers into a
        stack.  If this applies to you, then set this macro to the maximum
        height of the floating-point stack.  Otherwise set this macro to zero.

MD_REG_PC
MD_REG_STACK
MD_REG_FRAME

        The special registers that contain the CVM interpreter's "pc",
        "stacktop", and "frame" values.  These must be the same as the registers
        you chose when configuring the engine earlier.

        Of these three registers, MD_REG_STACK and MD_REG_FRAME have a fixed
        meaning throughout the unrolled code, but MD_REG_PC can be reused as a
        temporary work register (i.e. one of the MD_REG_<n> values).

MD_STATE_ALREADY_IN_REGS

        This will normally be set to 1 unless you have the misfortune of
        using a compiler without the ability to manually assign registers.
        Contact pnet-developers in this case for assistance.

MD_REGS_TO_BE_SAVED

        This macro is a bitmask, with each bit corresponding to one of the
        registers in the MD_REG_<n> list.  Use this if your architecture
        assigns special meaning to certain registers, but you wish to make
        use of them for temporary values anyway.
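
        A minimal sketch of such a bitmask follows.  It assumes that bit n
        of the mask corresponds to slot MD_REG_<n>, which this document
        does not state explicitly, and the choice of slots 4 and 5 is
        invented.  The helper function is hypothetical.

```c
/* Hypothetical: on the imaginary "foo" CPU, the registers in slots
   MD_REG_4 and MD_REG_5 have a special meaning, so they are flagged.
   Assumption: bit n of the mask corresponds to MD_REG_<n>. */
#define MD_REGS_TO_BE_SAVED     ((1 << 4) | (1 << 5))

/* Hypothetical helper: non-zero if the slot "index" is flagged. */
static int md_foo_slot_must_be_saved(int index)
{
    return (MD_REGS_TO_BE_SAVED & (1 << index)) != 0;
}
```

        If no registers need this treatment, the mask would simply be zero.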

MD_SPECIAL_REGS_TO_BE_SAVED

        This is only useful if MD_STATE_ALREADY_IN_REGS is zero.  It should
        normally be set to zero.

MD_HAS_INT_DIVISION

        Set this to 1 if your CPU has integer division operations.  Some
        CPUs (e.g. ARM) don't have a simple division operator, and so
        the unroller should ignore integer division in this case.

        Note: you don't need to do anything special to handle division
        by zero or arithmetic overflow (MININT / -1).  The unroller will
        check for these cases before performing the division.
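
        The two cases can be expressed in plain C as follows.  This is
        only a model of the conditions named above, not the unroller's
        own code, which emits machine instructions for the checks.  The
        function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of the guard: returns 1 if a signed 32-bit
   division can be executed directly, or 0 if it hits one of the two
   cases that are checked before the division is performed. */
static int division_is_safe(int32_t dividend, int32_t divisor)
{
    if (divisor == 0) {
        return 0;               /* division by zero */
    }
    if (dividend == INT32_MIN && divisor == -1) {
        return 0;               /* MININT / -1 overflows */
    }
    return 1;
}
```

        Your division macros therefore only ever see operands for which
        this guard would return 1.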

md_inst_ptr

        This is a typedef that defines the type of the instruction word.
        On CPUs with byte-aligned instructions, this will be "unsigned char".
        On word-aligned CPUs, this will typically be "unsigned int", or
        perhaps "unsigned long" on 64-bit architectures.

Writing the code generation macros
----------------------------------

The rest of the "md_foo.h" file consists of macros for generating code
for the various instructions used by the unroller.

md_push_reg(inst, reg)
md_pop_reg(inst, reg)

        Push or pop registers from the system stack.  The system stack is
        used to save registers before they are reused for other purposes.

md_discard_freg(inst, reg)

        Discard the contents of a floating-point register.  If the FPU
        is organised as a stack (MD_FP_STACK_SIZE != 0), then this will
        normally pop the top-most item from the stack.

md_load_const_32(inst, reg, value)

        Load a 32-bit constant into a register, sign-extending if the
        register is 64 bits in size.

md_load_const_native(inst, reg, value)

        Load a native (32-bit or 64-bit) constant into a register.  This
        will be the same as "md_load_const_32" on 32-bit platforms.
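
        To give a feel for the shape of such a macro, here is a sketch for
        the imaginary 32-bit "foo" CPU.  Everything about the encoding is
        invented: the opcode value, the bit positions, and the literal word
        that follows the instruction.  The sketch also assumes that "inst"
        is a pointer variable that the macro advances past the emitted
        words, and that "unsigned int" is 32 bits wide.

```c
#include <assert.h>

/* Hypothetical output pointer type for the imaginary "foo" CPU. */
typedef unsigned int *foo_inst_ptr;

/* Invented encoding: opcode in bits 24-31, destination register in
   bits 20-23, and the constant in the following word. */
#define FOO_OP_LOAD_IMM         0x01u

#define md_load_const_32(inst, reg, value) \
    do { \
        assert((reg) >= 0 && (reg) < 16); \
        *(inst)++ = (FOO_OP_LOAD_IMM << 24) | ((unsigned int)(reg) << 20); \
        *(inst)++ = (unsigned int)(value); \
    } while (0)

/* "foo" is a 32-bit platform, so the native version is the same. */
#define md_load_const_native(inst, reg, value) \
    md_load_const_32((inst), (reg), (value))

/* Demonstration buffer and helper (both hypothetical). */
static unsigned int foo_demo_buffer[4];

/* Emit "load 42 into register 3"; return the number of words written. */
static int foo_demo_load_const(void)
{
    foo_inst_ptr inst = foo_demo_buffer;
    md_load_const_32(inst, 3, 42);
    return (int)(inst - foo_demo_buffer);
}
```

        A real port would replace the invented encoding with whatever
        instruction sequence the target CPU uses to materialise a constant.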

md_load_membase_word_32(inst, reg, basereg, offset)

        Loads the contents of the 32-bit memory location "basereg + offset"
        into the register "reg".  On 64-bit systems, this will sign-extend.

        Note: "offset" could be anything.  It isn't limited to any particular
        range.  Some CPUs cannot do a direct load with an arbitrary offset
        in one instruction, and need to load the offset into a scratch
        register first.
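
        The scratch-register fallback can be sketched as follows for the
        imaginary "foo" CPU.  The encoding, the 16-bit signed offset field,
        the opcodes, and the reserved scratch register 15 are all invented
        for illustration, and "inst" is assumed to be a pointer variable
        that the macro advances.  The sketch further assumes the scratch
        register is not one of the MD_REG_<n> registers.

```c
/* Hypothetical output pointer type; assumes 32-bit unsigned int. */
typedef unsigned int *foo_inst_ptr;

/* Invented opcodes for the imaginary "foo" CPU. */
#define FOO_OP_LOAD_IMM   0x01u  /* rd = literal in the next word */
#define FOO_OP_LOAD_WORD  0x02u  /* rd = *(rb + signed 16-bit offset) */
#define FOO_OP_ADD        0x03u  /* rd = rd + rb */
#define FOO_SCRATCH       15     /* invented reserved scratch register */

/* Invented layout: opcode 24-31, rd 20-23, rb 16-19, offset 0-15. */
#define foo_emit(inst, op, rd, rb, imm16) \
    (*(inst)++ = ((op) << 24) | ((unsigned int)(rd) << 20) | \
                 ((unsigned int)(rb) << 16) | \
                 ((unsigned int)(imm16) & 0xFFFFu))

#define md_load_membase_word_32(inst, reg, basereg, offset) \
    do { \
        long md_off_ = (long)(offset); \
        if (md_off_ >= -32768 && md_off_ <= 32767) { \
            /* Offset fits in the instruction: one direct load. */ \
            foo_emit((inst), FOO_OP_LOAD_WORD, (reg), (basereg), md_off_); \
        } else { \
            /* Offset too large: build the address in the scratch reg. */ \
            foo_emit((inst), FOO_OP_LOAD_IMM, FOO_SCRATCH, 0, 0); \
            *(inst)++ = (unsigned int)md_off_; \
            foo_emit((inst), FOO_OP_ADD, FOO_SCRATCH, (basereg), 0); \
            foo_emit((inst), FOO_OP_LOAD_WORD, (reg), FOO_SCRATCH, 0); \
        } \
    } while (0)

/* Demonstration buffer and helper (both hypothetical). */
static unsigned int foo_demo_buffer[8];

/* Emit "r2 = *(r3 + offset)"; return the number of words written. */
static int foo_demo_load(long offset)
{
    foo_inst_ptr inst = foo_demo_buffer;
    md_load_membase_word_32(inst, 2, 3, offset);
    return (int)(inst - foo_demo_buffer);
}
```

        In this sketch a small offset costs one word and a large offset
        costs four.  The caller never needs to know which path was taken.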

md_load_membase_word_native(inst, reg, basereg, offset)

        Load the contents of the native-sized memory location "basereg + offset"
        into the register "reg".  On 32-bit systems, this will be identical
        to "md_load_membase_word_32".

md_load_membase_byte(inst, reg, basereg, offset)
md_load_membase_sbyte(inst, reg, basereg, offset)
md_load_membase_short(inst, reg, basereg, offset)
md_load_membase_ushort(inst, reg, basereg, offset)

        Load 8-bit or 16-bit values from "basereg + offset".

md_load_membase_float_32(inst, reg, basereg, offset)
md_load_membase_float_64(inst, reg, basereg, offset)
md_load_membase_float_native(inst, reg, basereg, offset)

        Load floating-point values into a floating-point register.  The
        values are always extended to the "native" floating-point size.

        If the FPU is organised as a stack, this will load the value onto
        the top of the stack and "reg" is ignored.

md_store_membase_word_32(inst, reg, basereg, offset)

        Store the contents of "reg" to the address "basereg + offset"
        as a 32-bit value.  On 64-bit platforms, the most significant bits
        are discarded.

md_store_membase_word_native(inst, reg, basereg, offset)

        Store the contents of "reg" to the address "basereg + offset"
        as a native-sized word value.

md_store_membase_byte(inst, reg, basereg, offset)
md_store_membase_sbyte(inst, reg, basereg, offset)
md_store_membase_short(inst, reg, basereg, offset)
md_store_membase_ushort(inst, reg, basereg, offset)

        Store 8-bit or 16-bit values from "reg" to "basereg + offset".
        It is OK if the value in "reg" is destroyed during the store
        because it will be immediately forgotten by the unroller afterwards.
        (ARM destroys 16-bit values in the process of storing them).
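
        The plain C model below shows one way a store can destroy its
        source.  It is not taken from the ARM macros; it merely assumes a
        little-endian CPU that can only store one byte at a time, so the
        register has to be shifted between the two stores.  All names are
        hypothetical.

```c
/* Demonstration memory (hypothetical). */
static unsigned char foo_demo_bytes[2];

/* Model of a 16-bit store built from two byte stores on a
   little-endian machine.  Returns what is left in the "register". */
static unsigned int store_short_bytewise(unsigned char *addr,
                                         unsigned int reg)
{
    addr[0] = (unsigned char)(reg & 0xFF);  /* low byte first */
    reg >>= 8;                              /* source is now destroyed */
    addr[1] = (unsigned char)(reg & 0xFF);  /* then the high byte */
    return reg;
}

/* Store "value" into the demonstration memory. */
static unsigned int foo_demo_store_short(unsigned int value)
{
    return store_short_bytewise(foo_demo_bytes, value);
}
```

        After storing 0x1234 this way, the register holds 0x12 rather than
        the original value, which is harmless because the unroller no
        longer relies on it.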

md_store_membase_float_32(inst, reg, basereg, offset)
md_store_membase_float_64(inst, reg, basereg, offset)
md_store_membase_float_native(inst, reg, basereg, offset)

        Store floating-point values from "reg" to "basereg + offset".
        If the FPU is stack based, then this will always store the top-most
        value on the stack, and ignore "reg".

md_add_reg_imm(inst, reg, imm)
md_sub_reg_imm(inst, reg, imm)

        Add or subtract an immediate value to or from a word register.
        The immediate value could be anything - it is not limited to any
        particular range of values.

md_add_reg_reg_word_32(inst, reg1, reg2)
md_sub_reg_reg_word_32(inst, reg1, reg2)
md_mul_reg_reg_word_32(inst, reg1, reg2)
md_div_reg_reg_word_32(inst, reg1, reg2)
md_udiv_reg_reg_word_32(inst, reg1, reg2)
md_rem_reg_reg_word_32(inst, reg1, reg2)
md_urem_reg_reg_word_32(inst, reg1, reg2)
md_neg_reg_word_32(inst, reg)
md_and_reg_reg_word_32(inst, reg1, reg2)
md_or_reg_reg_word_32(inst, reg1, reg2)
md_xor_reg_reg_word_32(inst, reg1, reg2)
md_not_reg_word_32(inst, reg)
md_shl_reg_reg_word_32(inst, reg1, reg2)
md_shr_reg_reg_word_32(inst, reg1, reg2)
md_ushr_reg_reg_word_32(inst, reg1, reg2)

        Perform arithmetic operations on 32-bit integer values.  If the
        CPU is 64-bit, then most of these can be performed as 64-bit
        operations.  Some (e.g. division and right shifts) require the
        operands to be truncated to 32 bits first.

        It is expected that the code generator will be able to handle
        any combination of registers.  If an invalid combination is
        provided, then the code generator must save registers on the
        system stack to make room, perform the operation, and then
        restore everything to its original state.

        In some cases, the macro "md_is_free_reg(reg)" can be used to
        determine if a temporary work register is currently free.  This
        will allow you to avoid saving the register in some circumstances.
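
        The reason why only some operations need truncation on a 64-bit
        CPU can be seen in a small C model.  Here a 64-bit register holds a
        32-bit value whose upper half may contain leftover bits.  The
        function names are hypothetical.

```c
#include <stdint.h>

/* Addition: leftover upper bits never affect the low 32 bits of the
   result, so a 64-bit add gives the right 32-bit answer. */
static uint32_t add32_via_64(uint64_t a, uint64_t b)
{
    return (uint32_t)(a + b);
}

/* Unsigned right shift without truncation: upper bits are shifted
   down into the low 32 bits and corrupt the result. */
static uint32_t ushr32_naive(uint64_t a, unsigned int count)
{
    return (uint32_t)(a >> count);
}

/* Unsigned right shift with the operand truncated to 32 bits first. */
static uint32_t ushr32_truncated(uint64_t a, unsigned int count)
{
    return (uint32_t)((a & 0xFFFFFFFFu) >> count);
}
```

        For a register holding 0x0000000100000010, shifting right by 4
        should give 1 as a 32-bit result.  The naive version gives
        0x10000001 instead, while the truncating version gives 1.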

md_add_reg_reg_word_native(inst, reg1, reg2)
md_sub_reg_reg_word_native(inst, reg1, reg2)
md_mul_reg_reg_word_native(inst, reg1, reg2)
md_div_reg_reg_word_native(inst, reg1, reg2)
md_udiv_reg_reg_word_native(inst, reg1, reg2)
md_rem_reg_reg_word_native(inst, reg1, reg2)
md_urem_reg_reg_word_native(inst, reg1, reg2)
md_neg_reg_word_native(inst, reg)
md_and_reg_reg_word_native(inst, reg1, reg2)
md_or_reg_reg_word_native(inst, reg1, reg2)
md_xor_reg_reg_word_native(inst, reg1, reg2)
md_not_reg_word_native(inst, reg)
md_shl_reg_reg_word_native(inst, reg1, reg2)
md_shr_reg_reg_word_native(inst, reg1, reg2)
md_ushr_reg_reg_word_native(inst, reg1, reg2)

        Similar to above, except that these macros work on native-sized values.
        On 32-bit platforms, they will be identical to the above macros.

md_add_reg_reg_float(inst, reg1, reg2)
md_sub_reg_reg_float(inst, reg1, reg2)
md_mul_reg_reg_float(inst, reg1, reg2)
md_div_reg_reg_float(inst, reg1, reg2)
md_rem_reg_reg_float(inst, reg1, reg2)
md_neg_reg_float(inst, reg)

        Perform arithmetic operations on floating-point values.  If the
        FPU is organised as a stack, then the register arguments are ignored
        and the values at the top of the stack are used.

md_freg_swap(inst)

        Swap the two top-most values on the floating-point register stack.
        Not used if the FPU is not stack-based.

[More to come here]

Debugging
---------

Because debugging the unroller can be difficult, you may want to attack
the problem in stages.  The nice thing about the unroller is that the
interpreter will automatically handle anything that you haven't handled.

As described earlier, don't bother with floating-point on the first pass.
You can also temporarily remove entire instruction categories by commenting
out the #include lines for "unroll_xxx.c" at the bottom of "unroll.c".

For testing, we recommend running "make check" in pnetlib regularly,
and also running the PNetMark benchmark.  If either of these causes the
engine to crash, or to fail a test that works in the regular engine,
then you have probably done something wrong.  You can return to the regular
engine at any time by commenting out "IL_CVM_DIRECT_UNROLLED" in the
"pnet/engine/cvm_config.h" file.

Isolating what went wrong can be difficult.  Try commenting out sections
of "unroll_xxx.c" until the problem disappears.  Whatever you commented
out last might have something to do with the problem.

While you can breakpoint the unroller while it is converting code, it isn't
possible to put breakpoints in the code that it outputs.

You can also uncomment "UNROLL_DEBUG" in "unroll.c".  This will cause
the unroller to disassemble the unrolled code as it executes methods.
By staring at this output, you should hopefully be able to figure out
which instructions are being unrolled incorrectly.

Another problem is the CPU cache.  Most CPUs need to flush the data cache
prior to executing the unrolled code.  Check the "pnet/support/clflush.c"
file to ensure that cache flushing on your architecture is supported.
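
For comparison, the sketch below shows one generic way to request such a
flush when building with GCC or a compatible compiler, using the
"__builtin___clear_cache" builtin.  This is an assumption about your
toolchain and is not a description of what "clflush.c" does.  The
function and buffer names are hypothetical.

```c
#include <stddef.h>

/* Ask the compiler to make freshly written code in the given range
   visible to instruction fetch.  Assumes GCC or a compatible compiler
   that provides this builtin. */
static size_t foo_flush_code(unsigned char *start, size_t length)
{
    __builtin___clear_cache((char *)start, (char *)(start + length));
    return length;
}

/* Demonstration buffer standing in for unrolled code (hypothetical). */
static unsigned char foo_code_buffer[64];

/* Write a placeholder byte, then flush the whole buffer. */
static size_t foo_flush_demo(void)
{
    foo_code_buffer[0] = 0x90;  /* stand-in for a generated instruction */
    return foo_flush_code(foo_code_buffer, sizeof(foo_code_buffer));
}
```

On architectures with coherent instruction and data caches the builtin
may compile to nothing, so a missing flush tends to show up only on
some CPUs.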

If still in doubt, don't hesitate to ask for help on "pnet-developers".