Newsgroups: comp.sys.apple2.programmer
Path: blue.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!gatech!newsfeed.pitt.edu!nntp.club.cc.cmu.edu!news.mic.ucla.edu!unixg.ubc.ca!quartz.ucs.ualberta.ca!acs.ucalgary.ca!cpsc.ucalgary.ca!debug!griffin!dockery
From: dockery@griffin.cuc.ab.ca (Sean Dockery)
Subject: Re: ASM multiply/divide
Message-ID: <Cs8Er0.8I4@griffin.cuc.ab.ca>
Organization: Griffin Software Development
References: <2uhrag$ptp@jadzia.CSOS.ORST.EDU>
Date: Thu, 30 Jun 1994 22:35:23 GMT
Lines: 267

daye@CSOS.ORST.EDU wrote the following article:

| Does anyone have source for a relatively fast 65816 multiply/divide they
| could either post or email to me?  I'm in the middle of a project and have
| come to realize I need this, and rather than do it myselft, I thought I'd
| check if someone already had it...

Following this message are a few routines in both 6502 and 65816 that have
floated about in the past.  They can be easily converted to accommodate any
size of integer data by adding a few more ROLs and ASLs as well as a few
more variables.

As for fast, base two isn't really fast at all.

There is a short cut to the multiplication routine for larger order numbers
which I don't believe any of the routines use.  It is:  For each byte in
the multiplier, multiply it by the multiplicand, shift it for the order of
the byte of the multiplier, and add all those products.

Example for binary numbers:

			    1011
			  x 0110
	        	--------
0 x 1011 << 0 ==	    0000
1 x 1011 << 1 ==	+  1011
1 x 1011 << 2 ==	+ 1011
0 x 1011 << 3 ==	+0000
			--------
			 1000010

When dealing with eight bit or even sixteen bit implementations, this can
cut down on multiplication times by up to a factor of ten or more over a
base two (binary) implementation.

If you think that this resembles "long hand" multiplication, you are
correct.  It may be long hand to us, but it dramatically cuts down on CPU
load as you are only doing one shift (or eight or sixteen bits) for every
eight or sixteen shifts associated with the base two routine.

I've written an arithmetic package for both signed and unsigned 64 bit
integers in C, if you're interested in that.  It works under the assumption
that sizeof (short int) == 2.  I'm not sure if that is true under ORCA/C as
I don't have a IIGS, but I'm fairly certain that it is.

|  Thanks

You're welcome.

| |Evan Day                | "Plans made in the nursery can change the
| |daye@CSOS.ORST.EDU      |  course of history.  Remember that."
-------------------------------------------------------------------------------
divide1.s (65816)
-------------------------------------------------------------------------------
       mx  %00
--------------------------
_Divide
--------------------------
       stx Div1
       sty Div2
       stz Result
       ldx #16
       lda #0
       clc
]Loop  rol Result
       dex
       bmi :End
       asl Div1
       rol
       cmp Div2
       bcc ]Loop
       sbc Div2
       sec                             
       bra ]Loop
:End   sta Remain
       rts

Result ds  2
Remain ds  2
Div1   ds  2
Div2   ds  2
-------------------------------------------------------------------------------
divide2.s (65816)
-------------------------------------------------------------------------------
	stx	divisor
	sty	dividend
	ldx	#16
	lda	#0
lp	asl	dividend
	rol
	cmp	divisor
	bcc	nextbit
	sbc	divisor
	inc	dividend
nextbit	dex
	bne	lp
;"dividend" now contains the result of the division.
	sta	remainder

It is rather sneaky, and yes, it is correct. What it does is essentially the
same computation, but it shifts the result into the dividend location instead
of a seperate result location. However, I like the carry flag technique used
by Shadow's code, so let's add it in:

	stx	divisor
	sty	dividend
	ldx	#16
	lda	#0
lp	rol	dividend
	rol
	cmp	divisor
	bcc	nextbit
	sbc	divisor
nextbit	dex
	bne	lp
	rol	dividend
;"dividend" now contains the result of the division.
	sta	remainder

Note that now when a 1 bit is entered into the result, we save 7 cycles. We
do perform an extra ROL at the end however. The relative performance is
determined by what kind of answers we expect: result of 0, new code is 7
cycles slower; result is power of two, new code is the same speed; otherwise
the new code is faster (and this is most of the time so we judge this a win).

Note that the first ROL to dividend shifts in a garbage bit, which is shifted
out by the final one. That is what we are spending in order to make part of
the main loop faster, and in this case we expect to save time overall so it's
a good thing. What's interesting is that the effect is that of PRESERVING the
carry flag of the caller. This is sometimes a big concern for assembly routines
because the carry flag is a great place to pass booleans.

Lastly, here's a slightly rearranged version for clarity:

	stx	divisor
	sty	dividend
	ldx	#16
	lda	#0
	rol	dividend
lp	rol
	cmp	divisor
	bcc	nextbit
	sbc	divisor
nextbit	rol	dividend
	dex
	bne	lp
;"dividend" now contains the result of the division.
	sta	remainder

We are now communicating the dividend bit instead of the result bit in the
carry flag when we branch back to lp. Note that we could also change the
initial ROL to an ASL, making the garbage bit a 0 and thereby setting the
carry flag to 0 at the end of the routine.

Todd Whitesel
toddpw @ tybalt.caltech.edu
-------------------------------------------------------------------------------
divide3.s (6502)
-------------------------------------------------------------------------------
In article <ar56046@pro-ict.cts.com> shack@pro-ict.cts.com (Randy Shackelford) writes:
>jmk3@crux3.cit.cornell.edu (Jay Krell) writes:
>>Anyone got a 16-bit divide/remainder routine in 6502?
>>Specifically, I'd need to divide by something between
>>91 and 94, the most optimizable being 92, 4 * 23.
>
>Think there's one in Eyes' and Lichty's Programming the 65816 I'll give it a
>look - yep pp. 273-4. Should I post it?

Actually, I think he's looking for 8-bit (6502) code to divide 16-bit
numbers.  Here's something I cobbled together a few years ago that
does the job.  With a few modifications, you could also do 32-bit
division.  The routine assumes that numbers are unsigned.  The four
parameters would probably be best placed in page 0 (you'll get
smaller, faster code), but you can use any address you want.  The last
5 lines create space for the parameters at the end of the program and
can be deleted if you set aside space for the numbers somewhere else.

The pseudo-ops used here are those used by the MicroSPARC/MindCraft
Assembler; those of you who use Merlin, ORCA/M, or something else will
want to change them to whatever your assembler uses.  (Only one
pseudo-op is used: DFS xx, which sets aside xx bytes of memory for
data storage.)

*16-Bit Integer Division Routine
*---------------------------------------------
*Parameter list:
*  DIVIDEND (in, gets scrambled),
*  DIVISOR (in),
*  QUOTIENT (out),
*  REMAINDER (out): Self-explanatory
*Length: $4F (79) bytes
DIVIDE	LDA	#0	;Zero the remainder
	STA	REMAINDER
	STA	REMAINDER+1
	LDY	#16	;We're doing 16-bit division here
]LOOP	ASL	DIVIDEND ;Push a value off the top of the dividend into
	ROL	DIVIDEND+1 ;the remainder
	ROL	REMAINDER
	ROL	REMAINDER+1
	SEC		;Subtract the divisor from the remainder
	LDA	REMAINDER
	SBC	DIVISOR
	STA	]TEMP
	LDA	REMAINDER+1
	SBC	DIVISOR+1
	STA	]TEMP+1
	BMI	]ZERO	;Is the result negative?
]ONE	LDA	]TEMP	;No, so the result of the subtraction becomes
	STA	REMAINDER ;the remainder and 1 gets pushed into the
	LDA	]TEMP+1	;quotient
	STA	REMAINDER+1
	SEC
	BCS	]PUSH
]ZERO	CLC		;Yes, so just push 0 into the quotient
]PUSH	ROL	QUOTIENT ;Push it in
	ROL	QUOTIENT+1
	DEY		;Done yet?
	BNE	[LOOP	;Y-reg is zero when finished
	RTS
]TEMP	DFS	2	;Subtraction initially goes here
DIVIDEND DFS	2	;Next four are self-explanatory
DIVISOR	DFS	2
QUOTIENT DFS	2
REMAINDER DFS	2

  _/_   Scott Alfter (sknkwrks@cs.unlv.edu)       Ask me about SoftDAC--digital
 / v \  Skunk Works BBS offline for maintenance!  audio for your Apple IIe/IIc!
(IIGS(  (702) 894-9619 300-14400 V.32bis 1:209/263 Apple II, IBM, Trek, & more!
 \_^_/  ---===#### Why be politically correct when you can be RIGHT? ####===---
-------------------------------------------------------------------------------
multiply.s (65816)
-------------------------------------------------------------------------------
; want to compute m*n
	lda	#0
	bra	m0
lp	asl	n
m0	lsr	m
	bcc	m1
	clc
	adc	n
m1	bne	lp
; result is in A

if you can store m inverted, you can avoid the clc in the loop:

	lda	m
	eor	#$ffff
	sta	m
	lda	#0
	bra	m0
lp	asl	n
m0	lsr	m
	bcs	m1
	adc	n
m1	bne	lp

It's not clear which is going to be faster, as the cutoff is about 6 bis set
in m. If you can arrange to have the input code do the EOR then the cutoff is
two bits set and that's a pretty clear win.

Todd Whitesel
toddpw @ tybalt.caltech.edu
-------------------------------------------------------------------------------
-- 
Sean Dockery                        |  Tickle us, do we not laugh?
Griffin Software Development Group  |  Prick us, do we not bleed?
dockery@griffin.cuc.ab.ca           |  Wrong us, shall we not revenge?