[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [avr-libc-dev] Printing octal (brain dump)
From: |
George Spelvin |
Subject: |
Re: [avr-libc-dev] Printing octal (brain dump) |
Date: |
19 Dec 2016 02:03:55 -0500 |
tl;dr: Beta code included below; it passes my initial testing. I named
it _itoa_engine, following "ftoa_engine", but suggestions are welcome.
It is unfortunately one instruction longer than ultoa_invert:
text data bss dec hex filename
188 0 0 188 bc ultoa_invert.o
190 0 0 190 be itoa_engine.o
208 0 0 208 d0 itoa_engine.o (+OPTIMIZE_SPEED)
212 0 0 212 d4 itoa_engine.o (without HAVE_MUL)
230 0 0 230 e6 itoa_engine.o (+OPTIMIZE_SPEED)
but I'm reasonably satisfied.
Well, it turned out that my great idea from last time had a nasty
bug in it. There's no really good way to check for termination after
printing each digit.
Recall that my bit buffer only has room for 15 bits of the number, so I
arrange for it to always hold 8 <= n <= 15 bits. When it's about to fall
from 8 to 7, I fetch another byte and decrement it from 16 to 15 instead.
The problem is that there are two ways to handle the length, and
both of them have problems.
One is to count so len=1 after fetching the last byte of input, and
len=0 after fetching the first byte of padding.
In that case, consider printing one byte of zero in hex.
I start with 8 bits in the buffer (msbyte = 0x01), and len = 1. I print a
'0', and check if len == 0. It's not, so I keep going and print a second
zero before realizing that I should stop.
Okay, that doesn't work, so what if I arrange so len counts the bytes
remaining? len=0 after fetching the last byte, and it holds there
after fetching padding.
In that case, there are no extra digits, but there are missing
digits. Consider printing 2^15 = 0x8000. I start out with
len=1 and the last byte in the bit buffer. I print the final 0, then
have to fetch another byte to start shifting bits into the lsbyte.
So the bit buffer holds 800 (msbyte = 0x18, lsbyte = 0x00, len = 0) when
I print the second 0. At this point, len == 0 and lsbyte == 0, so
I should stop printing. But there are more bits in the msbyte.
There are various tricks you can try in hex, but they break down in
octal. Checking for extra bits in the msbyte makes for messy
(= large) code.
So I went back to my original idea of checing for termination when
fetching padding. It's a couple of instructions longer, but it works.
Anyway, here's the current state of the function I intend to rewrite
vfprintf() around. Any comments? I know about a few mechanical things:
I need to integrate it into avr-libc and #include the appropriate headers
rather than #define things myself. Likewise X_movw and so on.
/*
* bum
*
* 1. vt. (obs.) To make highly efficient, either in time or space,
* often at the expense of clarity.
*
* The core integer-to-ASCII code. This takes as input arbitrary-
* precision byte arrays, and outputs ASCII in bases 10 or 2^k.
* (With mostly separate code.)
*
* This code is small and fast, but difficult to understand. It uses
* tricky number theory (dividing by multiplying by the 2-adic inverse)
* and low-level bit hacks to implement it. Sorry about that.
*
* The arguments are:
* char *out:
* Buffer in which the ASCII digits are placed, in little-endian
* order. Must be reversed by the caller for printing. The caller is
* also responsible for ensuring that it is big enough.
* uint8_t *bin:
* Buffer holding the input value. This is NOT const; it is used as
* working storage by the algorithm. More specifically, negation
* is done in place if requested, after which decimal conversion
* leaves the buffer filled with zero, and binary-base conversion
* leaves the buffer unmodified.
* uint8_t len:
* The length of the data at "bin". Must be between 1 and 255;
* if you pass in 0, Bad Things will happen.
* uint8_t flags:
* A variable whise bits determine the form of the output.
* flags bit 0: If set, the input is negated before printing.
* This is a convenience to the caller for printing signed
* numbers.
* flags bit 1: If set, digits greater than 10 (a, b, c, ...)
* are printed in lower case. If clear, they are printed in
* upper case. This has no effect if the base is 10 or less.
* flags bits 2..7: A contiguous bit mask (of the form (1 << k) - 1)
* selecting the base to print. If zero, decimal is output.
* If non-zero, a mask of the binary base to print (1 for
* binary, 7 for octal, 15 for hex). No error-checking
* is done; if you pass in something else, you'll get
* peculiar output.
*
* The return value is a pointer to just past the end of the output.
*
* An overview of the code is as follows:
* - The conditional negate is fairly straightforward. It's convenient
* to do here because it leaves the pointer at the msbyte for the
* next step.
* - Then we strip off most significant zero bytes until we get to a
* non-zero byte or only one byte is left. (The latter ensures we
* print "0" for zero, rather than the null string.)
* - Then the code splits depending on whether the remaining flags are
* zero (decimal) or not.
*
* - The decimal code alternates two phases, finding the last digit (the
* input mod 5), and then dividing the input by 10. Each involves a
* pass over the input, although conveniently in opposite directions.
* - Actually, the first pass (msbyte-to-lsbyte) divides the input by 2,
* and stores the lsbit. It also computes the quotient mod 255,
* which is then reduced mod 5. Combining (x % 2) and (x/2 % 5)
* gives the digit.
* - The second pass (lsbyte-to-msbyte) then divides the input by 5.
* This is done by subtracting the remainder mod 5 to produce an exact
* multiple of 5, then multiplying the result by the 2-adic expansion of
* -1/5 (which has a particularly simple form 0x...33333) and negating.
* - If the second pass produces an msbyte of 0, the length is reduced.
* When the length reaches zero, conversion is done.
*
* - The binary code does one lsbit-to-msbit pass over the input.
* - It mostly counts loop iterations with right shifts rather than
* counters. When the shift produces a zero result (and a set carry,
* since it must have been non-zero before), the count has expired.
* - The input value is kept in a 2-byte shift register which holds
* 8..15 bits of the input, with a 1 bit at the high end delimiting
* the valid data.
* - The main loop shifts the 16-bit buffer right one bit at a time until
* it's time to print a digit, or the buffer is about to underflow to
* 7 bits, meaning it's time to load another byte into the high half.
* - The "digit" value is a mask of "bits already printed". We shift
* it down until it's zero, then print the digit (lsb & mask), then
* reload it with the mask.
* - When the right shift of the msbyte of the buffer produces zero
* (the delimiter bit was just shifted out), we load a new msbyte and
* retry the shift.
* - When we run out of input bytes, there's some tricky code to add
* enough padding msbits to ensure remaining bits in the buffer
* are printed, for any remaining bits in the buffer to be printed,
* stopping when there are no more.
*
* The code often sets the carry bit several instructions before using it.
* When reading, it's important to know that ld, st, mov, inc, dec,
* and logical operations do not modify the carry bit.
*/
#ifndef __tmp_reg__
# define __tmp_reg__ r0
#endif
#ifndef __zero_reg__
# define __zero_reg__ r1
#endif
//#undef __AVR_HAVE_MUL__
#define __AVR_HAVE_MUL__ 1
#define OPTIMIZE_SPEED 1
/* Arguments */
#define out X /* Arrives in r24:r25, but we move it immediately */
#define out_lo r26
#define out_hi r27
#define bin Z /* Arrives in r22:r23, but we move it immediately */
#define bin_lo r30
#define bin_hi r31
#define len r20
#define flags r18 /* Mask, after removing two lsbits */
/* Local variables. The first four overlap with arguments. */
#define msb r25 /* 16-bit accumulator */
#define lsb r24
#define rem r23 /* Remainder mod 255, or mod 5 */
#define digit r22 /* Current digit */
#define tlen r21 /* Loop counter, copy of len or flags */
#define k r19 /* Constant value, usually the multiplier 0x33 */
#if __AVR_HAVE_MUL__
#define zero r18 /* Used instead of r1 to free up multiplier */
#else
#define zero __zero_reg__
#endif
.text
.global _itoa_engine
.type _itoa_engine, @function
_itoa_engine:
movw out_lo, r24
movw bin_lo, r22
lsr flags ; Lsbit is negate flag
#if OPTIMIZE_SPEED
; 4 more instructions, but avoids a pass over the input (saving
; 9*len - 4 cycles) in the common case when no negation is desired.
brcs 1f
add bin_lo, len
adc bin_hi, __zero_reg__
rjmp 3f
#endif
;; Conditional negate, depending on carry. The standard identity
;; -x = ~x + 1 can be extended to do a conditional negate.
;; Given a mask of 0 or -1, (x^mask) - mask returns x or -x.
;; However, we would need the carry bit clear to start this,
;; and forming "mask" from the carry bit in one instruction
;; preserves the carry bit. So instead do the conditional
;; increment using add zero with carry.
;;
;; It turns out there's no faster way to do an unconditional
;; negate. Multi-precision negate is awkward on AVR, which lacks
;; negate-with-carry or any form of dest = src-dest instruction.
;; So it's two instructions minimum (mov+sbc or invert+adc),
;; and either way we need one instruction of setup, to clear
;; the carry or create a constant of 0xff. The latter because
;; the COM instruction overwrites carry, and there's no EORI,
;; so we need to load a constant 0xff into a register.
1: mov tlen, len
sbc k, k ; Set to 0 or -1, carry preserved
2: ld __tmp_reg__, bin
eor __tmp_reg__, k
adc __tmp_reg__, __zero_reg__
st bin+, __tmp_reg__
dec tlen
brne 2b
; Strip redundant trailing (most-significant) zeros from bin.
3: ld __tmp_reg__, -bin
or __tmp_reg__, __tmp_reg__
brne 4f
dec len
brne 3b
inc len
; Split into two branches based on msbits of flag.
4: lsr flags ; lower/upper case flag to carry
brne .L_binary
;; The longest form of the decimal code (OPTIMIZE_SPEED && !HAVE_MUL)
;; is exactly 63 instructions (126 bytes) long, which is the extreme
;; limit of this conditional branch. If it gets any longer, you'll
;; have add an instruction. I suggest swapping the cases and dealing
;; with the fact that the .L_epilogue branch is now out of range by
;; placing it at the end of the binary code and having the decimal
;; code end in a rjmp to it.
.L_decimal:
adiw bin_lo, 1
#if __AVR_HAVE_MUL__
clr zero
ldi k, 0x33
#endif
; The main loop, repeated while len > 0
.L_decimal_loop:
#if OPTIMIZE_SPEED
ldi digit, '0'+10 ; Adjust for later offset on rem
#else
ldi digit, '0' ; Sneakily preload the ASCII offset
#endif
ser rem ; Sum of all bytes, mod 255
mov tlen, len
;; Pass 1, msb-to-lsb: finding the input mod 10.
;;
;; We do two things here: divide by 2 (saving the lsbit), and sum
;; the result mod 255. This is then used to compute the result
;; mod 5, which combined with the digit gives the decimal digit
;; we want. The awkward thing is that we want *two* carry bits
;; in this loop (one for the shift and the other for summing the
;; bytes), so we use the lsbit of digit for one of them.
.L_mod10:
ld __tmp_reg__, -bin
lsr digit ; Lsbit to carry bit
ror __tmp_reg__
st bin, __tmp_reg__
rol digit ; Carry bit to lsbit
add rem, __tmp_reg__
adc rem, zero ; End-around carry for mod 255
dec tlen
brne .L_mod10
#if OPTIMIZE_SPEED
; Reduce rem mod 15 (from 1 <= rem <= 255 to 1 <= rem <= 15)
mov __tmp_reg__, rem
swap __tmp_reg__
andi rem, 0xf0
add rem, __tmp_reg__ ; Add high halves to get carry bit
andi rem, 0xf0
swap rem
adc rem, zero ; End-around carry
; Reduce rem mod 5 (average of 2.2 subtracts)
0: subi rem, 5
brcc 0b
; We are left with rem-5, which can be compensated for elsewhere.
#else
;; Reduce rem mod 35, using a loop.
;;
;; This is a bit slower (22.82 cycles average to the end of the
;; mod-5 loop, vs. 12.6 cycles using the above), but saves 5
;; instructions. Any multiple of 5 will work, but 35 is optimal.
0: subi rem, 35
brcc 0b
; Now add back mod 5 until we go positive again.
0: subi rem, -5
brcs 0b
#endif
; Form and store ASCII digit (2*rem + digit)
add digit, rem
add digit, rem
st out+, digit
;; Pass 2, lsb-to-msb: dividing by 5.
;;
;; Rather than do a general divide by 5, we can subtract rem
;; to produce a multiple of 5, and then do an exact division by
;; multiplying by the 2-adic expansion of 1/5, 0xCCC...CCD.
;;
;; To get this into an even simpler form, we multiply by -1/5 =
;; 0x333...333 and negate. Each byte is multiplied by 0x33 and
;; added to an accumulator to be used for each higher byte.
;;
;; The accumulator has to be 16 bits wide, but after storing
;; each output byte, we can fold the msbyte into the lsbyte.
;;
;; Negating the output can be "complement and add one", but
;; we do it as "subtract one and complement", initializing the
;; accumulator to 0xff, then complementing before storing.
;;
;; To subtract rem without an separate carry propagation pass,
;; subtract 0x33 times rem from the accumulator to start.
;; (Since 0 <= rem <= 4, this is very easy.)
; msb:lsb = 255 - (rem * 0x33). Guaranteed to fit into 8 bits.
#if !OPTIMIZE_SPEED
ldi lsb, 255
#elif __AVR_HAVE_MUL__
ldi lsb, 0 ; Adjust for pre-decrement of rem by 5
#else
ldi lsb, 45 ; Adjust for pre-decrement of rem by 5
#endif
#if __AVR_HAVE_MUL__
mul rem, k
sub lsb, r0
#else
; Multiply by 0x11
mov __tmp_reg__, rem
swap __tmp_reg__ ; rem < 16, so this is rem << 4
add __tmp_reg__, rem
; Multiply by 3
sub lsb, __tmp_reg__
sub lsb, __tmp_reg__
sub lsb, __tmp_reg__
#endif
clr msb
; Here is the actual divide loop
mov tlen, len
.L_divide:
ld __tmp_reg__, bin
; acc += 0x33 * r0. This product could be up to 14 bits long.
#if __AVR_HAVE_MUL__
mul __tmp_reg__, k
add lsb, r0
adc msb, r1
#else
; let rem:digit = __tmp_reg__ << 4 = __tmp_reg__ * 0x10
mov digit, __tmp_reg__
swap digit
mov rem, digit
andi rem, 15 ; Mask off high 4 bits
eor digit, rem ; Mask off low 4 bits
; Add __tmp_reg__ to get __tmp_reg__ * 0x11
add digit, __tmp_reg__
adc rem, zero
; Now add it to the accumulator 3 times (there is no faster way)
add lsb, digit
adc msb, rem
add lsb, digit
adc msb, rem
add lsb, digit
adc msb, rem
#endif
; Store the complemented accumulator (*bin++ = ~lsb)
mov __tmp_reg__, lsb
com __tmp_reg__
st bin+, __tmp_reg__
; Fold the accumulator: msb:lsb = msb + lsb
add lsb, msb
clr msb
adc msb, zero
dec tlen
brne .L_divide
;; End of main loop: check if the new msbyte was zero.
;; If so, drop it (decrement len), and stop when len == 0.
cpse r0, zero
rjmp .L_decimal_loop
sbiw bin_lo, 1
dec len
brne .L_decimal_loop
#if __AVR_HAVE_MUL__
clr __zero_reg__
#endif
; Finally copy the out pointer to r24 and return it.
.L_epilogue:
movw r24, out_lo
ret
.L_binary:
movw bin_lo, r22 ; Reset bin to lsbyte
; Done with args in r22-r25; now allowed to use digit, rem, lsb, msb
; Load startup values.
ldi rem, 'A'-'0'-10 ; ASCII adjustment for hex digits past 9
brcc 0f
ldi rem, 'a'-'0'-10
0: ldi msb, 1
ld lsb, bin+
.L_digit_out: ; Spit out a digit
mov digit, flags
and digit, lsb
cpi digit, 10
brcs 0f
add digit, rem ; Hex digit > 9
0: subi digit, -'0'
st X+, digit
mov digit, flags
.L_bitloop:
lsr msb
brne 6f
; msb is empty; fetch another byte
dec len ; Preserves carry
breq 7f ; No more input
ld msb, Z+
5: ror msb ; Shift carry=1 into msbit
6: ror lsb
lsr digit
brne .L_bitloop ; if ((tlen >>= 1)== 0) {
rjmp .L_digit_out
;; We've run out of bytes. If all 1 bits in lsb have been printed
;; (i.e. lsb <= digit), stop. If not, pad with enough zeros to print
;; one more digit, and so we'll check again after printing it.
7: cp digit, lsb ; if (lsb <= digit) return X
brcc .L_epilogue
inc len
adc msb, digit ; msb = digit + 1, and clear carry
rjmp 5b
.size _itoa_engine, .-_itoa_engine
- [avr-libc-dev] Printing octal (brain dump), George Spelvin, 2016/12/14
- [avr-libc-dev] Working octal code (FYI), George Spelvin, 2016/12/16
- Re: [avr-libc-dev] Printing octal (brain dump),
George Spelvin <=
- Re: [avr-libc-dev] Printing octal (brain dump), Georg-Johann Lay, 2016/12/19
- Re: [avr-libc-dev] Printing octal (brain dump), Georg-Johann Lay, 2016/12/21
- Re: [avr-libc-dev] Printing octal (brain dump), George Spelvin, 2016/12/21
- [avr-libc-dev] Even faster decimal code, George Spelvin, 2016/12/23
- Re: [avr-libc-dev] Even faster decimal code, Georg-Johann Lay, 2016/12/24
- Re: [avr-libc-dev] Even faster decimal code, George Spelvin, 2016/12/24