6502's don't support division directly so the closest thing we've got are
the shift operators. The nasty part is that I'm trying to write a pixel
driver and need to divide the X coord by 7 to get the correct pixel byte. I
thought about using a table, but somehow, having 140 or even 280 different
values stored didn't seem appealing for addressing a meager 40 bytes. Is
this the fastest division by 7 routine possible?
;Divide by 7 Specific routine
;Divisor in A
;Dividend always 7
DIV7 TAX ;clear SAVEA for use
LDA #$0
STA RESULT
TXA
LDIV PHA ;save A onto the stack
LSR ;divide a by 8
LSR
LSR
STA SAVEA
ADC RESULT ;add the results together
STA RESULT
PLA
AND #$7 ;get the remainder
ADC SAVEA ;add back the quotient
CMP #$7 ;are we done yet?
BNE TEST2
INC RESULT
LDA #0
BEQ DONE
TEST2 BCS LDIV
DONE TAX ;the remainder goes in X
LDA RESULT ;the result goes in A
RTS
This would loop at most 3 times. Or is it better to just do this:
;General division routine
;Divisor in A
;Dividend in X
DIV CLC
STX DIVISR
PHA
LDA #0
STA RESULT
PLA
BNE ADJUST
BRK ;division by zero error
ADJUST BMI DIVIDE ;loop till the high bit gets set
ASL DIVISR
BCC ADJUST
DIVIDE CMP DIVISR
BCC NEXT ;skip ahead if A is less
SET PHA
LDA RESULT
ORA #1
STA RESULT
PLA
SBC DIVISR
NEXT ASL RESULT
CLC
ROR DIVISR
BCC DIVIDE
DONE TAX ;put the remainder in X
LDA RESULT ;and the result in A
RTS
In this case, the ADJUST and DIVIDE loops may loop up to 7 times for any
given number. It's definitely more utility, but it's longer and more time
consuming. Is there a faster way to handle the X address of a pixel short of
using a table?
R. King
http://rich12345.tripod.com/srccode/graphic.html
this above page has the whole sourcecode
following this post is the division routine.
it takes the x coordinate, subtracts 7 from
it until it is less than 7, and puts the
result in Y register/xbyte variable and the
remainder in A register/xbit variable.
xbyte refers to which byte to draw to,
xbit refers to which shape in your preshifted
table should be drawn.
Rich
SCX = screen X coordinate
start ldy #0
lda scx
clc
dloop cmp #7 ; this is a cheap division
bcc ddone ; that can be avoided
sec
sbc #7
iny
clc
jmp dloop
ddone sta xbit ; division done
sty xbyte
aiiadict@gmail.com writes:
> http://rich12345.tripod.com/srccode/graphic.html
>
> this above page has the whole sourcecode
>
> following this post is the division routine.
>
> it takes the x coordinate, subtracts 7 from
> it until it is less than 7, and puts the
> result in Y register/xbyte variable and the
> remainder in A register/xbit variable.
>
> xbyte refers to which byte to draw to,
> xbit refers to which shape in your preshifted
> table should be drawn.
>
> Rich
>
> SCX = screen X coordinate
>
> start ldy #0
> lda scx
> clc
> dloop cmp #7 ; this is a cheap division
> bcc ddone ; that can be avoided
> sec
> sbc #7
> iny
> clc
> jmp dloop
> ddone sta xbit ; division done
> sty xbyte
>
The loop in the above routine can be shortened to three instructions:
start ldy #ff
lda scx
sec
dloop iny
sbc #7 ; this is a cheap division
bcs dloop ; that can be avoided
adc #7
sta xbit ; division done
sty xbyte
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
Ok. Just finished cycle counting.
In the "Divide by 7 specific" routine I gave, it measures out to around 135
cycles for calculating 255/7. For the super-short subtraction version given
below it comes up to 234 given the same assumptions and calculation. Is
there any way to get a result as fast as the routine I listed while making
it smaller like the one below?
"Scott Hemphill" wrote in message
news:m3zmybq9ey.fsf@pearl.local...
> aiiadict@gmail.com writes:
>
>
> start ldy #ff
> lda scx
> sec
> dloop iny
> sbc #7 ; this is a cheap division
> bcs dloop ; that can be avoided
> adc #7
> sta xbit ; division done
> sty xbyte
"Ranando King" writes:
> Ok. Just finished cycle counting.
>
> In the "Divide by 7 specific" routine I gave, it measures out to around 135
> cycles for calculating 255/7. For the super-short subtraction version given
> below it comes up to 234 given the same assumptions and calculation. Is
> there any way to get a result as fast as the routine I listed while making
> it smaller like the one below?
Do you really need the whole range 0-255 ? Is it worst case performance
that needs to be minimized or average case?
> "Scott Hemphill" wrote in message
> news:m3zmybq9ey.fsf@pearl.local...
> > aiiadict@gmail.com writes:
> >
>
> >
> > start ldy #ff
> > lda scx
> > sec
> > dloop iny
> > sbc #7 ; this is a cheap division
> > bcs dloop ; that can be avoided
> > adc #7
> > sta xbit ; division done
> > sty xbyte
You can split the range in half (roughly) with a few more instructions:
lda scx
cmp #126 ; decimal 7*18
bcc dentry1 ; branch if less than 126
ldy #17 ; start with quotient 18-1
sbc #126
bcs dloop
dentry1 ldy #-1
sec
dloop iny
sbc #7
bcs dloop
adc #7
sta xbit
sty xbyte
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
> With the function I wrote, for 35 of the possible 255 values, the time is
> 53-61 cycles.
Here is a 22-byte routine that takes between 32 and 67 cycles
(about 50 average), not counting the final RTS.
* Integer division by seven
* Input: an unsigned byte (0..255) in A
* Output: Computes A/7
* quotient (0..36) in X
* remainder (0..6) in A
* I use (8a+b)/7 = a + (a+b)/7
* where a = higher five bits and b = lower three bits
DIVIDE7:
tax
and #$7
sta TMP ; low bits (b)
txa
lsr ; replace with ror to use carry as 9th bit of A
lsr
lsr
tax ; high bits, divided by 8 (a)
clc
adc TMP ; a + b. here we know carry is low.
.1 inx ; the loop is executed between 1 and 6 times
sbc #7 ; since A is at most 38
bcs .1
adc #8 ; not 7 because of low carry at top of loop
dex
rts
Paul Guertin
pg@sff.net
Ranando King wrote:
> 6502's don't support division directly so the closest thing we've got are
> the shift operators. The nasty part is that I'm trying to write a pixel
> driver and need to divide the X coord by 7 to get the correct pixel byte. I
> thought about using a table, but somehow, having 140 or even 280 different
> values stored didn't seem appealing for addressing a meager 40 bytes. Is
> this the fastest division by 7 routine possible?
For speed, you would be better off using lookup tables. A while back I
put out a sample block routine. If you want to avoid generating the
tables yourself try this.
ftp://ftp.apple.asimov.net/pub/apple_II/documentation/zblock.source.txt
Of course you might be able to use the whole thing with a modification
or two.
Cheers,
Mike T.
"Ranando King" writes:
> 6502's don't support division directly so the closest thing we've got are
> the shift operators. The nasty part is that I'm trying to write a pixel
> driver and need to divide the X coord by 7 to get the correct pixel byte. I
> thought about using a table, but somehow, having 140 or even 280 different
> values stored didn't seem appealing for addressing a meager 40 bytes. Is
> this the fastest division by 7 routine possible?
>
>
> ;Divide by 7 Specific routine
> ;Divisor in A
> ;Dividend always 7
> DIV7 TAX ;clear SAVEA for use
> LDA #$0
> STA RESULT
> TXA
> LDIV PHA ;save A onto the stack
> LSR ;divide a by 8
> LSR
> LSR
> STA SAVEA
> ADC RESULT ;add the results together
> STA RESULT
> PLA
> AND #$7 ;get the remainder
> ADC SAVEA ;add back the quotient
> CMP #$7 ;are we done yet?
> BNE TEST2
> INC RESULT
> LDA #0
> BEQ DONE
> TEST2 BCS LDIV
> DONE TAX ;the remainder goes in X
> LDA RESULT ;the result goes in A
> RTS
It seems to me that there must be room for improvement, because the
following code successfully calculates division by 7 with no loops.
The drawback is that it doesn't produce a remainder.
div7 sta dvdend
lsr
lsr
lsr
adc dvdend
ror
lsr
lsr
adc dvdend
ror
lsr
lsr
sta result
rts
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
Scott Hemphill writes:
> It seems to me that there must be room for improvement, because the
> following code successfully calculates division by 7 with no loops.
> The drawback is that it doesn't produce a remainder.
>
> div7 sta dvdend
> lsr
> lsr
> lsr
> adc dvdend
> ror
> lsr
> lsr
> adc dvdend
> ror
> lsr
> lsr
> sta result
> rts
One way to get the remainder is to just multiply the result by 7 and
subtrace from the dividend:
div7 sta dvdend
lsr
lsr
lsr
adc dvdend
ror
lsr
lsr
adc dvdend
ror
lsr
lsr
sta result
asl
asl
asl
eor #-1
sec
adc result
clc
adc dvdend
tax
lda result
rts
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
"Ranando King" writes:
> ... and that would be the type of answer I was looking for.
> Consistently fast, small code. Always comes out to about 61 cycles or less.
Although this code can be modified to work with carry as a 9th bit, Paul
Guertin's code looks like it will then be faster in the worst case.
It's interesting to compare the theories of operation of the posted division
methods.
Your code divides by 8 to get an approximate quotient, then subtracts seven
times that from the dividend to get a remainder, which needs to be divided
by seven. You then repeat this operation in a loop, needing no more than
three tries to get the remainder down below 7.
Rich (aiiadict@gmail.com) and my first posted code divide by seven by
subtracting seven repeatedly, counting the number of times you can do this.
Paul Guertin's code is a hybrid between these two methods, dividing by 8
to get an approximate quotient and a remainder, then subtracting off 7
repeatedly to finalize the answer.
This last posting of mine divides by seven by multiplying by one seventh.
One seventh has the binary representation 0.00100100,10010010,01....
I use just the first three one bits, using 1/7 ~ (1/8+1/64+1/512),
calculating the low bits of the answer first. There's also some magic
in using the carries from the LSR instructions in the ADCs to cause
rounding up in all the right places.
[snip]
> There is one more question though. Does anyone know where in memory the
> Applesoft math functions are?
Try $E7A0 through $F1D4. I don't have the entry points handy.
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
On Tue, 15 Feb 2005 14:33:45 -0600, "Ranando King"
wrote:
> > Although this code can be modified to work with carry as a 9th bit, Paul
> > Guertin's code looks like it will then be faster in the worst case.
>
> True. That's why I'm using his division routine in the graphics module.
Glad to be of help.
If you're going to use the carry as a 9th bit, it may be faster
yet to use (64a+b+c)/7 = 9a+b + (a+b+c)/7. That is, split the
9-bit dividend into three 3-bit slices, form 9a+b and a+b+c using
three-position shifts, and count how many times you can decrement
a+b+c by 7, staying positive. Either with a loop as I did, or by
manually unrolling the loop (decision tree) if every cycle counts.
I'm a little busy right now and don't have to time to write a
proper implementation, but my guess is it should be significantly
faster if the dividend is often >255 (since in this case,
a+b+c <= 7+7+7 = 21 instead of a+b <= 63+7 =70. So three loops
maximum (or two if you exclude 511, the only number which makes
a+b+c=21) instead of ten.
I don't know if it is worth it if your numbers are always going
to be in 0..279, though. 279 only loops 6 times.
Come to think of it, the simplest way might be to test the state
of the carry bit at the very beginning of the routine. If C=0, use
my routine, and if C=1, use a modified version of my routine which
adds 36 to X and 4 to A before the loop (since 256 = 36*7 + 4).
Sorry for the unstructured rambling and no code.
> > It's interesting to compare the theories of operation of the posted
> division
> > methods.
Just for fun, here's a different method, which works but doesn't
seem to have any significant advantage over the ones already posted.
It takes between 67 and 73 cycles. The idea is quite simple.
cmp #$e0 ; divide by 7*2^n, for n=6 down to 1
bcc .1
sbc #$e0
.1 rol tmp
cmp #$70
bcc .2
sbc #$70
.2 rol tmp
cmp #$38
bcc .3
sbc #$38
.3 rol tmp
cmp #$1c
bcc .4
sbc #$1c
.4 rol tmp
cmp #$e
bcc .5
sbc #$e
.5 rol tmp
cmp #$7
bcc .6
sbc #$7
.6 rol tmp
tax ; remainder in x
lda tmp
and #$3f ; quotient in a
Takes about 75 cycles worst case, 68 cycles average.
> > > There is one more question though. Does anyone know where in memory the
> > > Applesoft math functions are?
See http://cosmicwolf.com/AppleII/AppleSoft_Commented.htm
Paul Guertin
pg@sff.net
Scott Hemphill writes:
> "Ranando King" writes:
>
> > ... and that would be the type of answer I was looking for.
> > Consistently fast, small code. Always comes out to about 61 cycles or less.
OK, I've got another improvement. It divides A by seven, returning the
quotient in A and the remainder in the lowest three bits of X.
div7 sta dvdend
lsr
lsr
lsr
adc dvdend
ror
lsr
lsr
adc dvdend
ror
lsr
lsr
adc dvdend
tax ; save for remainder
ror
lsr
lsr ; quotient is now in A
bcc end ; subtract one from X if carry set
dex ; remainder is in lowest 3 bits of X
end: rts
It can be extended to work for (256*carry+A) < 448, but it loses its
elegance. When you add the dividend back in each time, you need to keep
the carry bit if it resulted from the adc, but also set carry if it was set
upon entry to the routine. 448 is the upper limit because that's when
A + (A/7) overflows 9 bits.
div7 php
sta dvdend
ror
lsr
lsr
adc dvdend
bcs :1
plp
php
:1 ror
lsr
lsr
adc dvdend
bcs :2
plp
php
:2 ror
lsr
lsr
adc dvdend
bcs :3
plp
php
:3 tax ; save for remainder
ror
lsr
lsr ; quotient is now in A
bcc end ; subtract one from X if carry set
dex ; remainder is in lowest 3 bits of X
end: plp
rts
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
On 15 Feb 2005 13:21:33 -0500, Scott Hemphill wrote:
> This last posting of mine divides by seven by multiplying by one seventh.
> One seventh has the binary representation 0.00100100,10010010,01....
> I use just the first three one bits, using 1/7 ~ (1/8+1/64+1/512),
> calculating the low bits of the answer first. There's also some magic
> in using the carries from the LSR instructions in the ADCs to cause
> rounding up in all the right places.
Indeed, good magic is happening with the carry bit in your routine.
It's very nice!
The same idea can be used to divide by 3, 15, or in general
2**k-1 for any k.
Paul Guertin
pg@sff.net
Paul Guertin writes:
> On 15 Feb 2005 13:21:33 -0500, Scott Hemphill wrote:
>
> > This last posting of mine divides by seven by multiplying by one seventh.
> > One seventh has the binary representation 0.00100100,10010010,01....
> > I use just the first three one bits, using 1/7 ~ (1/8+1/64+1/512),
> > calculating the low bits of the answer first. There's also some magic
> > in using the carries from the LSR instructions in the ADCs to cause
> > rounding up in all the right places.
>
> Indeed, good magic is happening with the carry bit in your routine.
> It's very nice!
>
> The same idea can be used to divide by 3, 15, or in general
> 2**k-1 for any k.
>
> Paul Guertin
> pg@sff.net
But by extension, you can divide by any integer. For example, 1/5 doesn't
have a denominator that fits the pattern 2**k-1, but 3/15 does. One fifth
has the binary representation 0.00110011,00110011,.... The following
routine works:
div5 sta dvdend
lsr
adc dvdend
ror
lsr
lsr
adc dvdend
ror
adc dvdend
ror
lsr
lsr
adc dvdend
ror
adc dvdend
tax ; save for remainder
ror
lsr
lsr
; if we didn't need the remainder
; we could return here
tay ; save quotient
txa ; retrieve last 3 bits shifted out
and #7 ; calculate remainder
ina
lsr
tax ; store remainder
tya ; retrieve quotient
rts
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear
Paul Guertin writes:
> On 15 Feb 2005 13:21:33 -0500, Scott Hemphill wrote:
>
> > This last posting of mine divides by seven by multiplying by one seventh.
> > One seventh has the binary representation 0.00100100,10010010,01....
> > I use just the first three one bits, using 1/7 ~ (1/8+1/64+1/512),
> > calculating the low bits of the answer first. There's also some magic
> > in using the carries from the LSR instructions in the ADCs to cause
> > rounding up in all the right places.
>
> Indeed, good magic is happening with the carry bit in your routine.
> It's very nice!
It's hard to prove that it all works, but it's easier to show that it works.
I used Mathematica to operate on all possible dividends simultaneously,
using the following code:
------------------------------------------------------------------------
adc[m_] := Block[{}, a+=m+c; c=Floor[a/256]; a=Mod[a,256]];
and[m_] := a=BitAnd[a,m];
asl := Block[{}, a*=2; c=Floor[a/256]; a=Mod[a,256]];
bit[m_] := BitAnd[a,m];
clc := c=0&/@c;
cmp[m_] := c=Floor[1+(a-m)/256];
cpx[m_] := c=Floor[1+(x-m)/256];
cpy[m_] := c=Floor[1+(y-m)/256];
dea := a=Mod[a-1,256];
dex := x=Mod[x-1,256];
dey := y=Mod[y-1,256];
eor[m_] := a=BitXor[a,m];
ina := a=Mod[a+1,256];
inx := x=Mod[x+1,256];
iny := y=Mod[y+1,256];
lda[m_] := a=m;
ldx[m_] := x=m;
ldy[m_] := y=m;
lsr := Block[{}, c=Mod[a,2]; a=Floor[a/2]];
ora[m_] := BitOr[a,m];
rol := Block[{}, a=2*a+c; c=Floor[a/256]; a=Mod[a,256]];
ror := Block[{}, a+=256*c; c=Mod[a,2]; a=Floor[a/2]];
sbc[m_] := Block[{}, a+=255-m+c; c=Floor[a/256]; a=Mod[a,256]];
sec := c=1&/@c;
tax := x=a;
tay := y=a;
txa := a=x;
tya := a=y;
div7 := Block[{dvdend},
dvdend = a;
lsr;
lsr;
lsr;
adc[dvdend];
ror;
lsr;
lsr;
adc[dvdend];
ror;
lsr;
lsr;
adc[dvdend];
tax;
ror;
lsr;
lsr;
x -= c;
a];
a = Range[256]-1; (* a is a list from 0..255 *)
q = Floor[a/7]; (* this is the quotient for each element in list *)
r = Mod[a,7]; (* this is the remainder for each element in list *)
div7; (* list of carries will be intialized by first lsr *)
If[ a == q, Print["quotient is correct"]];
If[ BitAnd[x,7] == r, Print["remainder is correct"]];
------------------------------------------------------------------------
Scott
--
Scott Hemphill hemphill@alumni.caltech.edu
"This isn't flying. This is falling, with style." -- Buzz Lightyear