Newsgroups: comp.sys.apple2.programmer
Path: blue.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!gatech!newsfeed.pitt.edu!nntp.club.cc.cmu.edu!news.mic.ucla.edu!unixg.ubc.ca!quartz.ucs.ualberta.ca!acs.ucalgary.ca!cpsc.ucalgary.ca!debug!griffin!dockery
From: dockery@griffin.cuc.ab.ca (Sean Dockery)
Subject: Re: ASM multiply/divide
Message-ID:
Organization: Griffin Software Development
References: <2uhrag$ptp@jadzia.CSOS.ORST.EDU>
Date: Thu, 30 Jun 1994 22:35:23 GMT
Lines: 267
daye@CSOS.ORST.EDU wrote the following article:
| Does anyone have source for a relatively fast 65816 multiply/divide they
| could either post or email to me? I'm in the middle of a project and have
| come to realize I need this, and rather than do it myselft, I thought I'd
| check if someone already had it...
Following this message are a few routines in both 6502 and 65816 that have
floated about in the past. They can be easily converted to accommodate any
size of integer data by adding a few more ROLs and ASLs as well as a few
more variables.
As for fast, base two isn't really fast at all.
There is a short cut to the multiplication routine for larger order numbers
which I don't believe any of the routines use. It is: For each byte in
the multiplier, multiply it by the multiplicand, shift it for the order of
the byte of the multiplier, and add all those products.
Example for binary numbers:
1011
x 0110
--------
0 x 1011 << 0 == 0000
1 x 1011 << 1 == + 1011
1 x 1011 << 2 == + 1011
0 x 1011 << 3 == +0000
--------
1000010
When dealing with eight bit or even sixteen bit implementations, this can
cut down on multiplication times by up to a factor of ten or more over a
base two (binary) implementation.
If you think that this resembles "long hand" multiplication, you are
correct. It may be long hand to us, but it dramatically cuts down on CPU
load as you are only doing one shift (or eight or sixteen bits) for every
eight or sixteen shifts associated with the base two routine.
I've written an arithmetic package for both signed and unsigned 64 bit
integers in C, if you're interested in that. It works under the assumption
that sizeof (short int) == 2. I'm not sure if that is true under ORCA/C as
I don't have a IIGS, but I'm fairly certain that it is.
| Thanks
You're welcome.
| |Evan Day | "Plans made in the nursery can change the
| |daye@CSOS.ORST.EDU | course of history. Remember that."
-------------------------------------------------------------------------------
divide1.s (65816)
-------------------------------------------------------------------------------
mx %00
--------------------------
_Divide
--------------------------
stx Div1
sty Div2
stz Result
ldx #16
lda #0
clc
]Loop rol Result
dex
bmi :End
asl Div1
rol
cmp Div2
bcc ]Loop
sbc Div2
sec
bra ]Loop
:End sta Remain
rts
Result ds 2
Remain ds 2
Div1 ds 2
Div2 ds 2
-------------------------------------------------------------------------------
divide2.s (65816)
-------------------------------------------------------------------------------
stx divisor
sty dividend
ldx #16
lda #0
lp asl dividend
rol
cmp divisor
bcc nextbit
sbc divisor
inc dividend
nextbit dex
bne lp
;"dividend" now contains the result of the division.
sta remainder
It is rather sneaky, and yes, it is correct. What it does is essentially the
same computation, but it shifts the result into the dividend location instead
of a seperate result location. However, I like the carry flag technique used
by Shadow's code, so let's add it in:
stx divisor
sty dividend
ldx #16
lda #0
lp rol dividend
rol
cmp divisor
bcc nextbit
sbc divisor
nextbit dex
bne lp
rol dividend
;"dividend" now contains the result of the division.
sta remainder
Note that now when a 1 bit is entered into the result, we save 7 cycles. We
do perform an extra ROL at the end however. The relative performance is
determined by what kind of answers we expect: result of 0, new code is 7
cycles slower; result is power of two, new code is the same speed; otherwise
the new code is faster (and this is most of the time so we judge this a win).
Note that the first ROL to dividend shifts in a garbage bit, which is shifted
out by the final one. That is what we are spending in order to make part of
the main loop faster, and in this case we expect to save time overall so it's
a good thing. What's interesting is that the effect is that of PRESERVING the
carry flag of the caller. This is sometimes a big concern for assembly routines
because the carry flag is a great place to pass booleans.
Lastly, here's a slightly rearranged version for clarity:
stx divisor
sty dividend
ldx #16
lda #0
rol dividend
lp rol
cmp divisor
bcc nextbit
sbc divisor
nextbit rol dividend
dex
bne lp
;"dividend" now contains the result of the division.
sta remainder
We are now communicating the dividend bit instead of the result bit in the
carry flag when we branch back to lp. Note that we could also change the
initial ROL to an ASL, making the garbage bit a 0 and thereby setting the
carry flag to 0 at the end of the routine.
Todd Whitesel
toddpw @ tybalt.caltech.edu
-------------------------------------------------------------------------------
divide3.s (6502)
-------------------------------------------------------------------------------
In article shack@pro-ict.cts.com (Randy Shackelford) writes:
>jmk3@crux3.cit.cornell.edu (Jay Krell) writes:
>>Anyone got a 16-bit divide/remainder routine in 6502?
>>Specifically, I'd need to divide by something between
>>91 and 94, the most optimizable being 92, 4 * 23.
>
>Think there's one in Eyes' and Lichty's Programming the 65816 I'll give it a
>look - yep pp. 273-4. Should I post it?
Actually, I think he's looking for 8-bit (6502) code to divide 16-bit
numbers. Here's something I cobbled together a few years ago that
does the job. With a few modifications, you could also do 32-bit
division. The routine assumes that numbers are unsigned. The four
parameters would probably be best placed in page 0 (you'll get
smaller, faster code), but you can use any address you want. The last
5 lines create space for the parameters at the end of the program and
can be deleted if you set aside space for the numbers somewhere else.
The pseudo-ops used here are those used by the MicroSPARC/MindCraft
Assembler; those of you who use Merlin, ORCA/M, or something else will
want to change them to whatever your assembler uses. (Only one
pseudo-op is used: DFS xx, which sets aside xx bytes of memory for
data storage.)
*16-Bit Integer Division Routine
*---------------------------------------------
*Parameter list:
* DIVIDEND (in, gets scrambled),
* DIVISOR (in),
* QUOTIENT (out),
* REMAINDER (out): Self-explanatory
*Length: $4F (79) bytes
DIVIDE LDA #0 ;Zero the remainder
STA REMAINDER
STA REMAINDER+1
LDY #16 ;We're doing 16-bit division here
]LOOP ASL DIVIDEND ;Push a value off the top of the dividend into
ROL DIVIDEND+1 ;the remainder
ROL REMAINDER
ROL REMAINDER+1
SEC ;Subtract the divisor from the remainder
LDA REMAINDER
SBC DIVISOR
STA ]TEMP
LDA REMAINDER+1
SBC DIVISOR+1
STA ]TEMP+1
BMI ]ZERO ;Is the result negative?
]ONE LDA ]TEMP ;No, so the result of the subtraction becomes
STA REMAINDER ;the remainder and 1 gets pushed into the
LDA ]TEMP+1 ;quotient
STA REMAINDER+1
SEC
BCS ]PUSH
]ZERO CLC ;Yes, so just push 0 into the quotient
]PUSH ROL QUOTIENT ;Push it in
ROL QUOTIENT+1
DEY ;Done yet?
BNE [LOOP ;Y-reg is zero when finished
RTS
]TEMP DFS 2 ;Subtraction initially goes here
DIVIDEND DFS 2 ;Next four are self-explanatory
DIVISOR DFS 2
QUOTIENT DFS 2
REMAINDER DFS 2
_/_ Scott Alfter (sknkwrks@cs.unlv.edu) Ask me about SoftDAC--digital
/ v \ Skunk Works BBS offline for maintenance! audio for your Apple IIe/IIc!
(IIGS( (702) 894-9619 300-14400 V.32bis 1:209/263 Apple II, IBM, Trek, & more!
\_^_/ ---===#### Why be politically correct when you can be RIGHT? ####===---
-------------------------------------------------------------------------------
multiply.s (65816)
-------------------------------------------------------------------------------
; want to compute m*n
lda #0
bra m0
lp asl n
m0 lsr m
bcc m1
clc
adc n
m1 bne lp
; result is in A
if you can store m inverted, you can avoid the clc in the loop:
lda m
eor #$ffff
sta m
lda #0
bra m0
lp asl n
m0 lsr m
bcs m1
adc n
m1 bne lp
It's not clear which is going to be faster, as the cutoff is about 6 bis set
in m. If you can arrange to have the input code do the EOR then the cutoff is
two bits set and that's a pretty clear win.
Todd Whitesel
toddpw @ tybalt.caltech.edu
-------------------------------------------------------------------------------
--
Sean Dockery | Tickle us, do we not laugh?
Griffin Software Development Group | Prick us, do we not bleed?
dockery@griffin.cuc.ab.ca | Wrong us, shall we not revenge?