Newsgroups: comp.sys.apple2.programmer Path: blue.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!gatech!newsfeed.pitt.edu!nntp.club.cc.cmu.edu!news.mic.ucla.edu!unixg.ubc.ca!quartz.ucs.ualberta.ca!acs.ucalgary.ca!cpsc.ucalgary.ca!debug!griffin!dockery From: dockery@griffin.cuc.ab.ca (Sean Dockery) Subject: Re: ASM multiply/divide Message-ID: Organization: Griffin Software Development References: <2uhrag$ptp@jadzia.CSOS.ORST.EDU> Date: Thu, 30 Jun 1994 22:35:23 GMT Lines: 267 daye@CSOS.ORST.EDU wrote the following article: | Does anyone have source for a relatively fast 65816 multiply/divide they | could either post or email to me? I'm in the middle of a project and have | come to realize I need this, and rather than do it myselft, I thought I'd | check if someone already had it... Following this message are a few routines in both 6502 and 65816 that have floated about in the past. They can be easily converted to accommodate any size of integer data by adding a few more ROLs and ASLs as well as a few more variables. As for fast, base two isn't really fast at all. There is a short cut to the multiplication routine for larger order numbers which I don't believe any of the routines use. It is: For each byte in the multiplier, multiply it by the multiplicand, shift it for the order of the byte of the multiplier, and add all those products. Example for binary numbers: 1011 x 0110 -------- 0 x 1011 << 0 == 0000 1 x 1011 << 1 == + 1011 1 x 1011 << 2 == + 1011 0 x 1011 << 3 == +0000 -------- 1000010 When dealing with eight bit or even sixteen bit implementations, this can cut down on multiplication times by up to a factor of ten or more over a base two (binary) implementation. If you think that this resembles "long hand" multiplication, you are correct. It may be long hand to us, but it dramatically cuts down on CPU load as you are only doing one shift (or eight or sixteen bits) for every eight or sixteen shifts associated with the base two routine. I've written an arithmetic package for both signed and unsigned 64 bit integers in C, if you're interested in that. It works under the assumption that sizeof (short int) == 2. I'm not sure if that is true under ORCA/C as I don't have a IIGS, but I'm fairly certain that it is. | Thanks You're welcome. | |Evan Day | "Plans made in the nursery can change the | |daye@CSOS.ORST.EDU | course of history. Remember that." ------------------------------------------------------------------------------- divide1.s (65816) ------------------------------------------------------------------------------- mx %00 -------------------------- _Divide -------------------------- stx Div1 sty Div2 stz Result ldx #16 lda #0 clc ]Loop rol Result dex bmi :End asl Div1 rol cmp Div2 bcc ]Loop sbc Div2 sec bra ]Loop :End sta Remain rts Result ds 2 Remain ds 2 Div1 ds 2 Div2 ds 2 ------------------------------------------------------------------------------- divide2.s (65816) ------------------------------------------------------------------------------- stx divisor sty dividend ldx #16 lda #0 lp asl dividend rol cmp divisor bcc nextbit sbc divisor inc dividend nextbit dex bne lp ;"dividend" now contains the result of the division. sta remainder It is rather sneaky, and yes, it is correct. What it does is essentially the same computation, but it shifts the result into the dividend location instead of a seperate result location. However, I like the carry flag technique used by Shadow's code, so let's add it in: stx divisor sty dividend ldx #16 lda #0 lp rol dividend rol cmp divisor bcc nextbit sbc divisor nextbit dex bne lp rol dividend ;"dividend" now contains the result of the division. sta remainder Note that now when a 1 bit is entered into the result, we save 7 cycles. We do perform an extra ROL at the end however. The relative performance is determined by what kind of answers we expect: result of 0, new code is 7 cycles slower; result is power of two, new code is the same speed; otherwise the new code is faster (and this is most of the time so we judge this a win). Note that the first ROL to dividend shifts in a garbage bit, which is shifted out by the final one. That is what we are spending in order to make part of the main loop faster, and in this case we expect to save time overall so it's a good thing. What's interesting is that the effect is that of PRESERVING the carry flag of the caller. This is sometimes a big concern for assembly routines because the carry flag is a great place to pass booleans. Lastly, here's a slightly rearranged version for clarity: stx divisor sty dividend ldx #16 lda #0 rol dividend lp rol cmp divisor bcc nextbit sbc divisor nextbit rol dividend dex bne lp ;"dividend" now contains the result of the division. sta remainder We are now communicating the dividend bit instead of the result bit in the carry flag when we branch back to lp. Note that we could also change the initial ROL to an ASL, making the garbage bit a 0 and thereby setting the carry flag to 0 at the end of the routine. Todd Whitesel toddpw @ tybalt.caltech.edu ------------------------------------------------------------------------------- divide3.s (6502) ------------------------------------------------------------------------------- In article shack@pro-ict.cts.com (Randy Shackelford) writes: >jmk3@crux3.cit.cornell.edu (Jay Krell) writes: >>Anyone got a 16-bit divide/remainder routine in 6502? >>Specifically, I'd need to divide by something between >>91 and 94, the most optimizable being 92, 4 * 23. > >Think there's one in Eyes' and Lichty's Programming the 65816 I'll give it a >look - yep pp. 273-4. Should I post it? Actually, I think he's looking for 8-bit (6502) code to divide 16-bit numbers. Here's something I cobbled together a few years ago that does the job. With a few modifications, you could also do 32-bit division. The routine assumes that numbers are unsigned. The four parameters would probably be best placed in page 0 (you'll get smaller, faster code), but you can use any address you want. The last 5 lines create space for the parameters at the end of the program and can be deleted if you set aside space for the numbers somewhere else. The pseudo-ops used here are those used by the MicroSPARC/MindCraft Assembler; those of you who use Merlin, ORCA/M, or something else will want to change them to whatever your assembler uses. (Only one pseudo-op is used: DFS xx, which sets aside xx bytes of memory for data storage.) *16-Bit Integer Division Routine *--------------------------------------------- *Parameter list: * DIVIDEND (in, gets scrambled), * DIVISOR (in), * QUOTIENT (out), * REMAINDER (out): Self-explanatory *Length: $4F (79) bytes DIVIDE LDA #0 ;Zero the remainder STA REMAINDER STA REMAINDER+1 LDY #16 ;We're doing 16-bit division here ]LOOP ASL DIVIDEND ;Push a value off the top of the dividend into ROL DIVIDEND+1 ;the remainder ROL REMAINDER ROL REMAINDER+1 SEC ;Subtract the divisor from the remainder LDA REMAINDER SBC DIVISOR STA ]TEMP LDA REMAINDER+1 SBC DIVISOR+1 STA ]TEMP+1 BMI ]ZERO ;Is the result negative? ]ONE LDA ]TEMP ;No, so the result of the subtraction becomes STA REMAINDER ;the remainder and 1 gets pushed into the LDA ]TEMP+1 ;quotient STA REMAINDER+1 SEC BCS ]PUSH ]ZERO CLC ;Yes, so just push 0 into the quotient ]PUSH ROL QUOTIENT ;Push it in ROL QUOTIENT+1 DEY ;Done yet? BNE [LOOP ;Y-reg is zero when finished RTS ]TEMP DFS 2 ;Subtraction initially goes here DIVIDEND DFS 2 ;Next four are self-explanatory DIVISOR DFS 2 QUOTIENT DFS 2 REMAINDER DFS 2 _/_ Scott Alfter (sknkwrks@cs.unlv.edu) Ask me about SoftDAC--digital / v \ Skunk Works BBS offline for maintenance! audio for your Apple IIe/IIc! (IIGS( (702) 894-9619 300-14400 V.32bis 1:209/263 Apple II, IBM, Trek, & more! \_^_/ ---===#### Why be politically correct when you can be RIGHT? ####===--- ------------------------------------------------------------------------------- multiply.s (65816) ------------------------------------------------------------------------------- ; want to compute m*n lda #0 bra m0 lp asl n m0 lsr m bcc m1 clc adc n m1 bne lp ; result is in A if you can store m inverted, you can avoid the clc in the loop: lda m eor #$ffff sta m lda #0 bra m0 lp asl n m0 lsr m bcs m1 adc n m1 bne lp It's not clear which is going to be faster, as the cutoff is about 6 bis set in m. If you can arrange to have the input code do the EOR then the cutoff is two bits set and that's a pretty clear win. Todd Whitesel toddpw @ tybalt.caltech.edu ------------------------------------------------------------------------------- -- Sean Dockery | Tickle us, do we not laugh? Griffin Software Development Group | Prick us, do we not bleed? dockery@griffin.cuc.ab.ca | Wrong us, shall we not revenge?