Re: [Qemu-devel] i386 emulation: improved flag handing
Sun, 29 Aug 2004 14:58:58 +0200
The current QEMU eflags handling is not efficient for inc/dec as it must
recompute the C flag which is not modified by inc/dec. I think this is
the most important slowdown due to eflags handling. A simple solution
would just be to save CC_OP/CC_SRC/CC_DST instead of computing 'CF'. A
test is still needed if an inc/dec is followed by inc/dec to avoid
saving CC_OP/CC_SRC/CC_DST again.
So the eflags state would be:
if CC_OP == CC_OP_INC/DEC then all eflags except C are computed from
CC_SRC. 'CF' is computed from CC_OP_C, CC_DST_C and CC_SRC (CC_OP_C must
never be CC_OP_INC/DEC).
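A minimal sketch of this proposal, to make the comparison concrete. It is not a patch: it handles only 32 bit add/inc/dec, the helper names (op_addl, op_incl, op_decl, compute_cf, compute_zf, save_carry_state) are made up for the example, and CC_SRC_C is an assumed third saved variable alongside CC_OP_C and CC_DST_C. As described above, inc/dec keep their result in CC_SRC.

```c
#include <stdint.h>

/* Hypothetical subset of operation types, for illustration only. */
enum { CC_OP_ADDL, CC_OP_INCL, CC_OP_DECL };

static int      CC_OP;               /* last flag-setting operation */
static uint32_t CC_SRC, CC_DST;      /* its operand and result */
static int      CC_OP_C;             /* saved state that still defines CF */
static uint32_t CC_SRC_C, CC_DST_C;  /* CC_SRC_C is an assumption */

static int is_incdec(int op)
{
    return op == CC_OP_INCL || op == CC_OP_DECL;
}

/* Save the carry-defining state, unless a previous inc/dec already did. */
static void save_carry_state(void)
{
    if (!is_incdec(CC_OP)) {
        CC_OP_C  = CC_OP;
        CC_SRC_C = CC_SRC;
        CC_DST_C = CC_DST;
    }
}

static uint32_t op_addl(uint32_t a, uint32_t b)
{
    CC_OP = CC_OP_ADDL;
    CC_SRC = b;
    CC_DST = a + b;
    return CC_DST;
}

static uint32_t op_incl(uint32_t a)
{
    save_carry_state();
    CC_OP = CC_OP_INCL;
    CC_SRC = a + 1;      /* all flags except CF come from CC_SRC */
    return CC_SRC;
}

static uint32_t op_decl(uint32_t a)
{
    save_carry_state();
    CC_OP = CC_OP_DECL;
    CC_SRC = a - 1;
    return CC_SRC;
}

/* CF never comes from inc/dec: fall back to the saved state. */
static int compute_cf(void)
{
    int op = CC_OP;
    uint32_t src = CC_SRC, dst = CC_DST;

    if (is_incdec(op)) {
        op = CC_OP_C;
        src = CC_SRC_C;
        dst = CC_DST_C;
    }
    if (op == CC_OP_ADDL)
        return dst < src;    /* unsigned wrap-around means carry out */
    return 0;
}

static int compute_zf(void)
{
    return (is_incdec(CC_OP) ? CC_SRC : CC_DST) == 0;
}
```

With this, an add that overflows followed by any number of inc/dec still reports CF=1, and no carry is computed at the time the inc/dec executes.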
Your solution seems a little too complicated for the expected gain. Try
to compare it with my proposal.
Just for your information, my next developments will consist in
improving QEMU performance in the x86 on x86 case to match (or exceed
:-)) the VMware or VirtualPC level of performance. The downside is that
some kernel support will be needed. The kernel support will of course
remain optional. This mode of operation will replace 'qemu-fast'.
For the x86 on PowerPC case, better usage of the host registers would
give a performance boost. In particular, CC_SRC and CC_DST should be
saved in host registers too.
Magnus Damm wrote:
Here is something that I've been thinking about the last week. I hope it
can lead to improved performance.
The flag emulation code today:
The implementation today is rather straightforward and simple:
1. Each emulated instruction that modifies any flag will update up to
three variables containing instruction type (CC_OP), source value
(CC_SRC) and destination value (CC_DST). If the instruction does not
modify all flags, the previous flags are calculated first - hopefully
only the carry flag.
2. When an instruction depends on a flag, all flags (or just the carry
flag) are calculated from the stored information.
3. During the opcode to micro operations translation, the last type of
flag instruction (CC_OP) is kept track of and only written if necessary.
4. After the translation between the i386 opcodes and the micro
operations has taken place, an optimization step replaces redundant
micro operations with NOPs.
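Points 1 and 2 can be illustrated roughly as follows. This is a simplified sketch and not the actual QEMU source: it covers only 32 bit add and sub, and the enum values and the helper names op_addl, op_subl, compute_cf and compute_zf are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical subset of operation types; the real list is much longer. */
enum { CC_OP_ADDL, CC_OP_SUBL };

static int      CC_OP;   /* type of the last flag-setting instruction */
static uint32_t CC_SRC;  /* second operand of that instruction */
static uint32_t CC_DST;  /* its result */

/* A flag-setting instruction only records its state; no flag is computed. */
static uint32_t op_addl(uint32_t a, uint32_t b)
{
    CC_OP = CC_OP_ADDL;
    CC_SRC = b;
    CC_DST = a + b;
    return CC_DST;
}

static uint32_t op_subl(uint32_t a, uint32_t b)
{
    CC_OP = CC_OP_SUBL;
    CC_SRC = b;
    CC_DST = a - b;
    return CC_DST;
}

/* Flags are derived on demand from the recorded state. */
static int compute_cf(void)
{
    switch (CC_OP) {
    case CC_OP_ADDL:
        return CC_DST < CC_SRC;                        /* result wrapped */
    case CC_OP_SUBL:
        return (uint32_t)(CC_DST + CC_SRC) < CC_SRC;   /* a < b, a = dst + src */
    }
    return 0;
}

static int compute_zf(void)
{
    return CC_DST == 0;
}
```

The cost discussed in this mail comes from instructions such as inc/dec that overwrite CC_OP but must preserve CF, which forces a call like compute_cf() before the new state is stored.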
Improved flag handling - a more fine grained approach:
By looking at the "status flag summary" in my 486 book I understand that
there are basically three groups of x86 instructions that modify flags.
Note that this does not include rare single-flag modifying instructions.
     OF  SF  ZF  AF  PF  CF
A    x   x   x   x   x   x
B    x   x   x   x   x
C    x                   x
Say hello to group A, group B and group C. Group A contains the most
common flag operations, group B is basically INC and DEC while group C
contains various shift instructions.
Each group is kept track of with two variables, CC_SRC_<group> and
CC_DST_<group>. The current value of the EFLAGS register is stored in a
variable called CC_EFLAGS. A 32 bit variable, CC_CACHE, is used to store
the state of each flag. Six tables, one for each flag (cc_table_<flag>),
are used to look up flag calculating functions.
12 bits flag state       18 bits group info
OF SF ZF AF PF CF        A      B      C
NN NN NN NN NN NN        NNNNNN NNNNNN NNNNNN
Each flag has a two bit field indicating the state:
0 -> flag is up to date, no need to flush cache.
1 -> flag was last modified by group A
2 -> flag was last modified by group B
3 -> flag was last modified by group C
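The CC_CACHE layout could be accessed with helpers along these lines. This is only a sketch: the mail gives the field widths and order but not the bit positions, so placing the flag states in bits 0-11 (OF lowest) and the group info in bits 12-29 is an assumption, as are all the names below.

```c
#include <stdint.h>

/* Flag order follows the layout diagram: OF SF ZF AF PF CF. */
enum cc_flag { CC_F_OF, CC_F_SF, CC_F_ZF, CC_F_AF, CC_F_PF, CC_F_CF, CC_F_COUNT };

/* Two-bit flag states, numbered as in the list above. */
enum cc_state { CC_ST_VALID = 0, CC_ST_GROUP_A = 1, CC_ST_GROUP_B = 2, CC_ST_GROUP_C = 3 };

#define CC_STATE_BITS 2
#define CC_INFO_BITS  6
#define CC_INFO_BASE  (CC_F_COUNT * CC_STATE_BITS)   /* 12: assumed position */

/* Read the two-bit state of one flag. */
static inline unsigned cc_get_state(uint32_t cache, enum cc_flag f)
{
    return (cache >> (f * CC_STATE_BITS)) & 3u;
}

/* Return a copy of cache with the state of one flag replaced. */
static inline uint32_t cc_set_state(uint32_t cache, enum cc_flag f, unsigned st)
{
    unsigned shift = f * CC_STATE_BITS;
    return (cache & ~(3u << shift)) | ((uint32_t)(st & 3u) << shift);
}

/* group: 1 = A, 2 = B, 3 = C, the same numbering as the flag states. */
static inline unsigned cc_get_info(uint32_t cache, unsigned group)
{
    return (cache >> (CC_INFO_BASE + (group - 1) * CC_INFO_BITS)) & 0x3fu;
}

/* Return a copy of cache with the six-bit info field of one group replaced. */
static inline uint32_t cc_set_info(uint32_t cache, unsigned group, unsigned op)
{
    unsigned shift = CC_INFO_BASE + (group - 1) * CC_INFO_BITS;
    return (cache & ~(0x3fu << shift)) | ((uint32_t)(op & 0x3fu) << shift);
}
```

Six bits per group leaves room for 64 instruction types in each group, and the whole word uses 30 of the 32 bits.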
When an instruction that belongs to group A is translated into micro
operations, the last micro operation will perform up to three variable
writes:
1. CC_CACHE is written with all flag states set to 1 (indicating the
flag belongs to group A) and the group info A field is set to the
instruction number (compare with CC_OP today). This is a single 32 bit
write.
2. CC_DST_A is set in the same way as CC_DST today.
3. If required, CC_SRC_A is set too.
When a group B or C instruction is translated, the last micro operation
does the following:
1. CC_CACHE is modified (read-modify-write) to update the flags and
group info field B or C. For group B, all flags except CF are set to 2
(indicating group B). For group C, the OF and CF fields are set to 3
indicating group C.
2. For group B CC_DST_B is written, for group C CC_DST_C is written.
3. If required CC_SRC_B or CC_SRC_C is written.
Because group A instructions are the most common ones, the group A
implementation is faster (no read-modify-write) than group B and C.
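The difference between the two update paths might look like this. It is a sketch under an assumed bit layout (flag states in bits 0-11 in the order OF, SF, ZF, AF, PF, CF; group info A, B, C in bits 12-17, 18-23, 24-29); the function names cc_update_a/b/c and the mask names are invented for the example.

```c
#include <stdint.h>

static uint32_t CC_CACHE;
static uint32_t CC_SRC_A, CC_DST_A;
static uint32_t CC_SRC_B, CC_DST_B;
static uint32_t CC_SRC_C, CC_DST_C;

/* Assumed layout: two bits per flag, OF in bits 0-1 ... CF in bits 10-11. */
#define STATES_ALL_A  0x555u                 /* every flag field = 1 */
#define MASK_B        0x3ffu                 /* OF SF ZF AF PF fields */
#define STATES_B      0x2aau                 /* those five fields = 2 */
#define MASK_C        ((3u << 0) | (3u << 10))   /* OF and CF fields */
#define STATES_C      ((3u << 0) | (3u << 10))   /* both fields = 3 */
#define INFO_A_SHIFT  12
#define INFO_B_SHIFT  18
#define INFO_C_SHIFT  24
#define INFO_MASK     0x3fu

/* Group A: plain store, the old contents of CC_CACHE are irrelevant. */
static void cc_update_a(unsigned op, uint32_t src, uint32_t dst)
{
    CC_CACHE = STATES_ALL_A | ((uint32_t)(op & INFO_MASK) << INFO_A_SHIFT);
    CC_SRC_A = src;
    CC_DST_A = dst;
}

/* Group B (inc/dec): read-modify-write, the CF state must survive. */
static void cc_update_b(unsigned op, uint32_t src, uint32_t dst)
{
    uint32_t keep = CC_CACHE & ~(MASK_B | (INFO_MASK << INFO_B_SHIFT));
    CC_CACHE = keep | STATES_B | ((uint32_t)(op & INFO_MASK) << INFO_B_SHIFT);
    CC_SRC_B = src;
    CC_DST_B = dst;
}

/* Group C (shifts): read-modify-write, only OF and CF change owner. */
static void cc_update_c(unsigned op, uint32_t src, uint32_t dst)
{
    uint32_t keep = CC_CACHE & ~(MASK_C | (INFO_MASK << INFO_C_SHIFT));
    CC_CACHE = keep | STATES_C | ((uint32_t)(op & INFO_MASK) << INFO_C_SHIFT);
    CC_SRC_C = src;
    CC_DST_C = dst;
}
```

Note that in this sketch the group A store also clears the B and C info fields, which is harmless because no flag state refers to them afterwards.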
Question: What happens when an instruction needs to test one or more
flags? Answer: Before a flag can be used to calculate anything, micro
operations that flush the state of each required flag must be performed. One
micro operation per flag. The post-translation optimization step could
probably change more than N flag flush micro operations into one micro
operation flushing all flags if that would be more efficient.
When the cache of one flag is flushed, the corresponding flag state
field in CC_CACHE is read out and used as an index into cc_group_<flag>
to select the function used to flush the flag.
Entry 0 of every cc_group_<flag> table points to a function that just
returns; remember that a flag state of 0 means that the flag is up to date.
The other functions will calculate the flag based on CC_DST_<group> and
CC_SRC_<group>, store the result in CC_EFLAGS and then mark the flag
state in CC_CACHE as 0 to indicate that the flag now is up to date.
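For a single flag, the flush could be sketched as below, using ZF because it depends only on the result. The position of the ZF state field (bits 4-5 of CC_CACHE) is an assumption, and the function names are invented; the EFLAGS bit for ZF is bit 6 (0x40).

```c
#include <stdint.h>

#define EFLAGS_ZF       0x0040u          /* ZF is bit 6 of EFLAGS */
#define ZF_STATE_SHIFT  4                /* assumed position in CC_CACHE */
#define ZF_STATE_MASK   (3u << ZF_STATE_SHIFT)

static uint32_t CC_CACHE, CC_EFLAGS;
static uint32_t CC_DST_A, CC_DST_B;

/* Store the computed flag and mark it up to date (state 0). */
static void set_zf(uint32_t result)
{
    if (result == 0)
        CC_EFLAGS |= EFLAGS_ZF;
    else
        CC_EFLAGS &= ~EFLAGS_ZF;
    CC_CACHE &= ~ZF_STATE_MASK;
}

static void zf_uptodate(void) { }                   /* state 0: nothing to do */
static void zf_from_a(void)   { set_zf(CC_DST_A); } /* state 1: group A owns ZF */
static void zf_from_b(void)   { set_zf(CC_DST_B); } /* state 2: group B owns ZF */

/* Group C never modifies ZF, so state 3 should not occur for this flag;
 * it is mapped to the no-op here. */
static void (*const cc_group_zf[4])(void) = {
    zf_uptodate, zf_from_a, zf_from_b, zf_uptodate
};

/* The flush micro operation: index the table with the flag state. */
static void flush_zf(void)
{
    cc_group_zf[(CC_CACHE >> ZF_STATE_SHIFT) & 3u]();
}
```

A second flush_zf() right after the first one hits the no-op entry, so repeated tests of the same flag cost only the table lookup and an indirect call.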
The actual implementation of the flag calculation code will of course
vary, for some flags the code could be shared between all instruction
types in one group. Example: ZF and PF are probably handled in the same
way for all group A instructions. Other flags will probably need a
second look up dealing with the instruction type.
So, what my improved flag handling scheme basically does is to divide
the load of calculating the flags into several small pieces. Only the
flags required by an instruction must be flushed. I hope that some
cycles could be saved by not calculating all flags. The downside is of
course that it will be less efficient to update all flags compared with
the implementation today. And that it is less efficient to modify group
B/C (read-modify-write) and store CC_DST_B/C + CC_SRC_B/C, than just
store CC_OP, CC_DST and CC_SRC like today.
A good thing though is that it is always possible to set any flag in the
EFLAGS register without recalculating any other flags. And, of course, I
feel that it would be easier to add more advanced optimization code
Should I start hacking on a patch? Or would it be a waste of time?
Please let me know what you think. Thanks!