[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[avr-gcc-list] Speeding [u]ltoa by 35%

From: Georg-Johann Lay
Subject: [avr-gcc-list] Speeding [u]ltoa by 35%
Date: Tue, 19 Jun 2012 09:00:12 +0200
User-agent: Thunderbird (Windows/20100228)

Reviewing the ltoa sources [0] shows that the implementation's
speed is not award winning: it uses 32-bit div+mod.

There was propose [1] to speed ltoa by means of LUT but that
approach tries to use lookup tables for special radices.

The new proposal inlines __divmodsi4 leading to an ~35% tick gain
without increasing code size:

#define VAL     22
#define STR     20
#define RADIX   18

#define REST    26
#define CNT     19

DEFUN my_ltoa
    movw    r30,    STR

    mov     r27,    STR+1
    mov     TMP,    STR

    cpi     RADIX,   36+1
    cpc     RADIX+1, ZERO
    brsh 99f
    cpi     RADIX,  2
    brlo 99f

    cpi     RADIX,  10
    brne 1f
    bst     VAL+3,  7
    brtc 1f
    com     VAL+3
    com     VAL+2
    com     VAL+1
    neg     VAL
    sbci    VAL+1,  -1
    sbci    VAL+2,  -1
    sbci    VAL+3,  -1

1:  clr     REST
    ldi     CNT,    32

2:  lsl     VAL
    rol     VAL+1
    rol     VAL+2
    rol     VAL+3
    rol     REST
    cp      REST,   RADIX
    brlo 3f
    sub     REST,   RADIX
    inc     VAL
3:  dec     CNT
    brne 2b

    subi    REST,   -'0'
    cpi     REST,   '9'+1
    brlo 4f
    subi    REST,   '0'-'a'+10

4:  st      Z+,     REST
    sbiw    VAL+2,  0
    sbci    VAL+1,  0
    sbci    VAL,    0
    brne 1b

    brtc 99f
    ldi     CNT,    '-'
    st      Z+,     CNT

    mov     r24,    TMP
    mov     r25,    r27
    st      Z+,     ZERO
    jmp     strrev

ENDF my_ltoa

Some notes on the code:

* The code is not yet documented or ready to be dropped into
  avr-libc sources (coding style, parametrize hardware)

* Speed gain 34%..39% see [1] and [2] for detail analysis

* Code size is same as old implementation

* Won't drag in __divmodsi4, thus code size decrease of > 100
  bytes if application otherwise does not need 32-bit div

* Less stack usage

* Strict radix checking: original version is not strict, e.g.
  0x100 + 10 will slip through

* Code size can be decreased further by moving the sanity
  check to C like so:

static inline __attribute__((always_inline))
char* ltoa (long x, char *str, int radix)
    if (radix < 2 || radix > 36)
        *str = '\0';
        return str;
        extern char* ltoa_asm (long, char*, unsigned char);
        return ltoa_asm (x, str, radix);

Presumably almost all use cases use radix known at compile time.
Thus, the compiler can optimize away the sanity check. Moreover,
radix can be passed as 8-bit value: no need to pass 10 as 0x000a.

More notes:

* Similar improvement can be used for ultoa

* Code between [u]ltoa could be shared.

* There is ultoa_invert [4] that
    - str[] will NOT 0 terminated.
    - Sequence of digits is inverted.
    - It returns pointer to first byte after a string.

All these is readily available in ltoa, too. Again, code could
be shared.

A similar method could be used to speed to utoa/itoa with a speedup
of > 20%. But it is harder to work against code size increase,
and without the help of C inlining the code increases by ~10-16 bytes.
Again, code increase is no issue if the application does not need
16-bit divmod, as current i/ltoa will then drag that code.



[0] AVR-Libc: ltoa.S

[1] avr-libc: optimization for ltoa/printf

[2] Plot:

[3] Plot Description

[4] AVR-Libc: ultoa_invert.S

reply via email to

[Prev in Thread] Current Thread [Next in Thread]