In article <nospam-4C3750.23050311052004@news-server.woh.rr.com>, John B. Matthews <nospam@nospam.com> wrote:
>In article ><nospam-AAC5F1.22351411052004@news-server-fe-02.columbus.rr.com> , "John B. Matthews" <nospam@nospam.com> wrote:
>
>> In article <370265cd.0405111045.7cd359b7@posting.google.com>, aiiadict@hotmail.com (Rich J.) wrote:
>> >
>> > 16bit/16 bit division with 16 bit result (or 8?), with remainder
>> > 16 bit to 16 bit compare (greater than, less than, equal)
>> > 
>> > all with 6502 instructions.
>> > 
>> > Rich
>> 
>
>Egad! I'd have sworn Sweet 16 does multiply and divide. Instead, 
>see lines 522 & 543 of the system monitor listing
>
>http://members.buckeye-express.com/marksm/6502/MON.TXT
>

The Apple II monitor listing above is a good reference, but those routines
are kinda oddball in that they've been over-optimized for space, not speed or
simplicity.  Just a warning.

Lots of folks have posted 16-bit compare/add/subtract, and you should make
sure you understand those routines before moving on to multiply and
divide.

I'm going to give three multiply routines of increasing complexity.
(I realize you didn't ask for this).  I'll do divide in a few days.

The routines will assume some zero-page locations:

# Multiply VAL1 by VAL2, put result in VAL3
# Assume all numbers are positive (or just treat as unsigned)
VAL1    equ     $00
VAL2    equ     $02
VAL3    equ     $04

1.  Simple shift and add algorithm

Algorithm 1: look at low bit of VAL1.  If set, add VAL2 to VAL3.
Then, always shift VAL2 left and VAL1 right.
Routine does MULADD: VAL3 = VAL3 + VAL1*VAL2.  Set VAL3=0 before starting
to just get MUL.
This is just a 16x16=16 bit multiplier.  The upper 16-bits of the true
32-bit result are lost

MUL     LDY     #$10
MULLOOP
        LSR     VAL1+1
        ROR     VAL1
        BCC     MULSHFT
        CLC
        LDA     VAL2
        ADC     VAL3
        STA     VAL3
        LDA     VAL2+1
        ADe     VAL3+1
        STA     VAL3+1
MULSHFT
        ASL     VAL2
        ROL     VAL2+1
        DEY
        BNE     MULLOOP
        RTS

The basic idea is to do multiply like the following 4x4-bit sample:

      0101	(5)
    x 1101	(13)
     -----
      0101
     0000
    0101
  +0101
  ---------
  01000001	(65)

Make sure you understand how the above works.

2.  16x16 = 32 bit multiply

Here's a version that is a little better--it does a 16x16=32 bit multiply
and leaves the low 16 bit result in VAL1 (upper part in VAL3):


MUL     LDY     #$10
MULLOOP LDA     VAL1
        LSR
        BCC     MULSHFT
        CLC
        LDA     VAL2
        ADC     VAL3
        STA     VAL3
        LDA     VAL2+1
        ADC     VAL3+1
        STA     VAL3+1
MULSHFT
        LSR     VAL3+1
        ROR     VAL3
        ROR     VAL1+1
        ROR     VAL1
        DEY
        BNE     MULLOOP
	RTS

This is only a little slower than the previous algorithm and you get
16x16=32 bit multiply.  The result is placed with low 16 bits in VAL1
and upper 16 bits in VAL3.  It still adds in VAL3.  This algorithm is
similar to the Apple II monitor routine, but not as convoluted (that
algorithm is complicated by an effort to save a few bytes).

3. Difference of squares.

I haven't written this one out in 6502 assembly, but here's the
sample C code:

// input: x, y 8-bit numbers
// vars: a, b, a2, b2 are 16-bit numbers
mul8x8=16 bit:
	a = x + y
	b = x - y;
	if(b < 0) [
		b = -b;
	}
	a2 = table[a >> 1];
	if(a & 1) {
		a2 = a2 + ((2*a - 1) >> 2);
	}
	b2 = table[b >> 1];
	if(b & 1) {
		b2 = b2 + ((2*b - 1) >> 2);
	}

	result = (a2 - b2);


The above produces an 8x8=16 bit multiply using two table lookups into
a 256-entry table of 16-bit entries.  The 512-byte table contains 256 16-bit
entries.  Initialize the table with:

	for(i = 0; i < 256; i++) {
		table[i] = i*i;
	}

How this works is a little tricky, but the basic idea is that instead
of multiplying x*y, instead let a = x+y and b = x-y.  Then a^2 - b^2 =
(x+y)*(x+y) - (x-y)*(x-y) = x^2 + 2xy + y^2 - (x^2 - 2xy + y^2) = 4xy.
So by doing an add and a subtract, and then two table lookups, we have
our product (times 4).

As an example, let x=4, y = 9.  Then a = 13 and b = 5.  13^2 = 169, and
5^2=25.  Then 169-25 = 144.  Divide by 4 to get 36.

However, if x and y are 8-bit numbers, then x+y is a 9-bit number, requiring
512 entries of 16-bits each.  But the table contains squares, and its
easy to figure out the next entry based on the previous entry:
	(n-1)^2 = n^2 - 2n + 1.
	So the delta between n^2 and (n-1)^2 is 2n - 1.

The table is then shrunk by an index bit to be just 256 entries by calculating
the odd squares from the even squares.

Removing the factor of 4 from each entry is the final step that makes
all the numbers work out right.  It's an exercise for the reader to show
why this works.  (Hint: we're only storing even squares).

The code can be further optimized since ((2*b - 1) >> 2) = ((b - 1) >> 1).
And b must always be 8-bit, allowing some optimizations.

In general, this algorithm probably doesn't help 6502 all that much
since you need to call this routine 4 times to get 16x16=32 bit multiply,
or at least 3 times to get 16x16=16 bit multiply.  And all the little
fixups needed don't map well to 8-bit operations on a 6502.  It's a lot of
overhead, but I throw it out to show a different multiply algorithm.

Kent Dickey