- Fast 8-bit Squaring
- 28 Jan 2014 09:31:04 am
- Last edited by Xeda112358 on 28 Jan 2014 12:22:04 pm; edited 1 time in total

A few days ago, I came up with a routine to approximate sine. The short, math version is:

-I converted the 8-bit register digits to a degree 7 polynomial where x=2 returns the correct value.

-I then made another polynomial where the coefficients where 1-a

-Then I multiplied these, basically performing z*~z

-Using these as fixed point numbers on [0,1), this is basically z(255/256-z)

-Scaling, this gives a nice approximation of sine.

-For speed purposes, I discarded the lower lower 8 terms, keeping the upper 8.

That was cool and all, but last night I was working on it some more when I realised that if I had taken just the lower terms, I wouldn't have propagation issues to worry about, and it would return -x

However, the algorithm I was using, if I initialized the accumulator to x instead of 0, I would get -x

I am sure it can be optimized further, but it is already 159 t-states. It is only 29 cycles faster than my fastest general 8-bit multiplication routine, but can you imagine this as a circuit? It's kind of pretty cool, I think.

EDIT: I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.

Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:

Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.

-I converted the 8-bit register digits to a degree 7 polynomial where x=2 returns the correct value.

-I then made another polynomial where the coefficients where 1-a

_{i}.-Then I multiplied these, basically performing z*~z

-Using these as fixed point numbers on [0,1), this is basically z(255/256-z)

-Scaling, this gives a nice approximation of sine.

-For speed purposes, I discarded the lower lower 8 terms, keeping the upper 8.

That was cool and all, but last night I was working on it some more when I realised that if I had taken just the lower terms, I wouldn't have propagation issues to worry about, and it would return -x

^{2}-x mod 256.However, the algorithm I was using, if I initialized the accumulator to x instead of 0, I would get -x

^{2}. Naturally, I took advantage of this:**Code:**```
```

L_sqrd:

;Input: L

;Output: L*L->A

;159 t-states

;39 bytes

ld h,l

ld c,l

rr h

sbc a,a

xor l

add a,c

ld c,a

rl l

rr h

sbc a,a

xor l

and %11111000

add a,c

ld c,a

rl l

rr h

sbc a,a

xor l

and %11100000

add a,c

ld c,a

rl l

ld a,h

rrca

xor l

and 128

xor c

neg

ret

I am sure it can be optimized further, but it is already 159 t-states. It is only 29 cycles faster than my fastest general 8-bit multiplication routine, but can you imagine this as a circuit? It's kind of pretty cool, I think.

EDIT: I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.

Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:

**Code:**```
```

L_sqrd:

;Input: L

;Output: L*L->A

;151 t-states

;37 bytes

ld h,l

;First iteration, get the lowest 3 bits

sla l

rrh

sbc a,a

or l

;second iteration, get the next 2 bits

ld c,a

rr h

sbc a,a

xor l

and $F8

add a,c

;third iteration, get the next 2 bits

ld c,a

sla l

rr h

sbc a,a

xor l

and $E0

add a,c

;fourth iteration, get the last bit

ld c,a

ld a,l

add a,a

rrc h

xor h

and $80

xor c

neg

ret

Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.