In article , John B. Matthews wrote:
>In article > , "John B. Matthews" wrote:
>
>> In article <370265cd.0405111045.7cd359b7@posting.google.com>, aiiadict@hotmail.com (Rich J.) wrote:
>> >
>> > 16bit/16 bit division with 16 bit result (or 8?), with remainder
>> > 16 bit to 16 bit compare (greater than, less than, equal)
>> >
>> > all with 6502 instructions.
>> >
>> > Rich
>>
>
>Egad! I'd have sworn Sweet 16 does multiply and divide. Instead,
>see lines 522 & 543 of the system monitor listing
>
>http://members.buckeye-express.com/marksm/6502/MON.TXT
>
The Apple II monitor listing above is a good reference, but those routines
are kinda oddball in that they've been over-optimized for space, not speed or
simplicity. Just a warning.
Lots of folks have posted 16-bit compare/add/subtract, and you should make
sure you understand those routines before moving on to multiply and
divide.
I'm going to give three multiply routines of increasing complexity.
(I realize you didn't ask for this). I'll do divide in a few days.
The routines will assume some zero-page locations:
# Multiply VAL1 by VAL2, put result in VAL3
# Assume all numbers are positive (or just treat as unsigned)
VAL1 equ $00
VAL2 equ $02
VAL3 equ $04
1. Simple shift and add algorithm
Algorithm 1: look at low bit of VAL1. If set, add VAL2 to VAL3.
Then, always shift VAL2 left and VAL1 right.
Routine does MULADD: VAL3 = VAL3 + VAL1*VAL2. Set VAL3=0 before starting
to just get MUL.
This is just a 16x16=16 bit multiplier. The upper 16-bits of the true
32-bit result are lost
MUL LDY #$10
MULLOOP
LSR VAL1+1
ROR VAL1
BCC MULSHFT
CLC
LDA VAL2
ADC VAL3
STA VAL3
LDA VAL2+1
ADe VAL3+1
STA VAL3+1
MULSHFT
ASL VAL2
ROL VAL2+1
DEY
BNE MULLOOP
RTS
The basic idea is to do multiply like the following 4x4-bit sample:
0101 (5)
x 1101 (13)
-----
0101
0000
0101
+0101
---------
01000001 (65)
Make sure you understand how the above works.
2. 16x16 = 32 bit multiply
Here's a version that is a little better--it does a 16x16=32 bit multiply
and leaves the low 16 bit result in VAL1 (upper part in VAL3):
MUL LDY #$10
MULLOOP LDA VAL1
LSR
BCC MULSHFT
CLC
LDA VAL2
ADC VAL3
STA VAL3
LDA VAL2+1
ADC VAL3+1
STA VAL3+1
MULSHFT
LSR VAL3+1
ROR VAL3
ROR VAL1+1
ROR VAL1
DEY
BNE MULLOOP
RTS
This is only a little slower than the previous algorithm and you get
16x16=32 bit multiply. The result is placed with low 16 bits in VAL1
and upper 16 bits in VAL3. It still adds in VAL3. This algorithm is
similar to the Apple II monitor routine, but not as convoluted (that
algorithm is complicated by an effort to save a few bytes).
3. Difference of squares.
I haven't written this one out in 6502 assembly, but here's the
sample C code:
// input: x, y 8-bit numbers
// vars: a, b, a2, b2 are 16-bit numbers
mul8x8=16 bit:
a = x + y
b = x - y;
if(b < 0) [
b = -b;
}
a2 = table[a >> 1];
if(a & 1) {
a2 = a2 + ((2*a - 1) >> 2);
}
b2 = table[b >> 1];
if(b & 1) {
b2 = b2 + ((2*b - 1) >> 2);
}
result = (a2 - b2);
The above produces an 8x8=16 bit multiply using two table lookups into
a 256-entry table of 16-bit entries. The 512-byte table contains 256 16-bit
entries. Initialize the table with:
for(i = 0; i < 256; i++) {
table[i] = i*i;
}
How this works is a little tricky, but the basic idea is that instead
of multiplying x*y, instead let a = x+y and b = x-y. Then a^2 - b^2 =
(x+y)*(x+y) - (x-y)*(x-y) = x^2 + 2xy + y^2 - (x^2 - 2xy + y^2) = 4xy.
So by doing an add and a subtract, and then two table lookups, we have
our product (times 4).
As an example, let x=4, y = 9. Then a = 13 and b = 5. 13^2 = 169, and
5^2=25. Then 169-25 = 144. Divide by 4 to get 36.
However, if x and y are 8-bit numbers, then x+y is a 9-bit number, requiring
512 entries of 16-bits each. But the table contains squares, and its
easy to figure out the next entry based on the previous entry:
(n-1)^2 = n^2 - 2n + 1.
So the delta between n^2 and (n-1)^2 is 2n - 1.
The table is then shrunk by an index bit to be just 256 entries by calculating
the odd squares from the even squares.
Removing the factor of 4 from each entry is the final step that makes
all the numbers work out right. It's an exercise for the reader to show
why this works. (Hint: we're only storing even squares).
The code can be further optimized since ((2*b - 1) >> 2) = ((b - 1) >> 1).
And b must always be 8-bit, allowing some optimizations.
In general, this algorithm probably doesn't help 6502 all that much
since you need to call this routine 4 times to get 16x16=32 bit multiply,
or at least 3 times to get 16x16=16 bit multiply. And all the little
fixups needed don't map well to 8-bit operations on a 6502. It's a lot of
overhead, but I throw it out to show a different multiply algorithm.
Kent Dickey