- Fast 8-bit Squaring
- 28 Jan 2014 09:31:04 am
- Last edited by Xeda112358 on 28 Jan 2014 12:22:04 pm; edited 1 time in total
A few days ago, I came up with a routine to approximate sine. The short, math version is:
-I converted the 8-bit register digits to a degree 7 polynomial where x=2 returns the correct value.
-I then made another polynomial where the coefficients where 1-ai.
-Then I multiplied these, basically performing z*~z
-Using these as fixed point numbers on [0,1), this is basically z(255/256-z)
-Scaling, this gives a nice approximation of sine.
-For speed purposes, I discarded the lower lower 8 terms, keeping the upper 8.
That was cool and all, but last night I was working on it some more when I realised that if I had taken just the lower terms, I wouldn't have propagation issues to worry about, and it would return -x2-x mod 256.
However, the algorithm I was using, if I initialized the accumulator to x instead of 0, I would get -x2. Naturally, I took advantage of this:
Code:
I am sure it can be optimized further, but it is already 159 t-states. It is only 29 cycles faster than my fastest general 8-bit multiplication routine, but can you imagine this as a circuit? It's kind of pretty cool, I think.
EDIT: I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:
Code:
Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.
-I converted the 8-bit register digits to a degree 7 polynomial where x=2 returns the correct value.
-I then made another polynomial where the coefficients where 1-ai.
-Then I multiplied these, basically performing z*~z
-Using these as fixed point numbers on [0,1), this is basically z(255/256-z)
-Scaling, this gives a nice approximation of sine.
-For speed purposes, I discarded the lower lower 8 terms, keeping the upper 8.
That was cool and all, but last night I was working on it some more when I realised that if I had taken just the lower terms, I wouldn't have propagation issues to worry about, and it would return -x2-x mod 256.
However, the algorithm I was using, if I initialized the accumulator to x instead of 0, I would get -x2. Naturally, I took advantage of this:
Code:
L_sqrd:
;Input: L
;Output: L*L->A
;159 t-states
;39 bytes
ld h,l
ld c,l
rr h
sbc a,a
xor l
add a,c
ld c,a
rl l
rr h
sbc a,a
xor l
and %11111000
add a,c
ld c,a
rl l
rr h
sbc a,a
xor l
and %11100000
add a,c
ld c,a
rl l
ld a,h
rrca
xor l
and 128
xor c
neg
ret
I am sure it can be optimized further, but it is already 159 t-states. It is only 29 cycles faster than my fastest general 8-bit multiplication routine, but can you imagine this as a circuit? It's kind of pretty cool, I think.
EDIT: I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:
Code:
L_sqrd:
;Input: L
;Output: L*L->A
;151 t-states
;37 bytes
ld h,l
;First iteration, get the lowest 3 bits
sla l
rrh
sbc a,a
or l
;second iteration, get the next 2 bits
ld c,a
rr h
sbc a,a
xor l
and $F8
add a,c
;third iteration, get the next 2 bits
ld c,a
sla l
rr h
sbc a,a
xor l
and $E0
add a,c
;fourth iteration, get the last bit
ld c,a
ld a,l
add a,a
rrc h
xor h
and $80
xor c
neg
ret
Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.