Apple II Technical Notes _____________________________________________________________________________ Developer Technical Support Apple IIGS #70: Fast Graphics Hints Written by: Don Marsh & Jim Luther September 1989 This Technical Note discusses techniques for fast animation on the Apple IIGS. _____________________________________________________________________________ QuickDraw II gives programmers a very generalized way to draw something to the Super Hi-Res screen or to other parts of Apple IIGS memory. Unfortunately, the overhead in QuickDraw II makes it an unacceptable tool for all but simple animations. If you bypass QuickDraw II, your application has to write pixel data directly to the Super Hi-Res graphics display buffer. It also has to control the New-Video register at $C029, and set up the scan-line control bytes and color palettes in the graphics display buffer. Chapter 4 of the Apple IIGS Hardware Reference documents where you can find the graphics display buffer in memory and how the scan-line control bytes, color palettes, and pixel data bytes are used in Super Hi-Res graphics mode. The techniques described in this Note should be used with discretion--we do not recommend bypassing the Apple IIGS Toolbox unless it is absolutely necessary. Map the Stack Onto Video Memory To achieve the fastest screen updates possible, you must remove all unnecessary overhead from the instructions that perform graphics memory writes. The obvious method for achieving sequential writes to the graphics memory uses an index register, which must be incremented or decremented between writes. These operations can be avoided by using the stack. Each time a byte or word is pushed onto the stack, the stack pointer is automatically decremented by the appropriate amount. This is faster than doing an indexed store followed by a decrement instruction. But how is the stack mapped onto the graphics memory? The stack can be located in bank $01 instead of bank $00 by writing to the WrCardRAM auxiliary- memory select switch at $C005. Bank $01 is shadowed into $E1 by clearing bit 3 of the Shadow register at $C035. Under these conditions, if the stack pointer is set to $3000, the next byte pushed onto the stack is written to $013000, then shadowed into $E13000. The stack pointer is automatically decremented so the stage is set for another byte to be written at $E12FFF. Warning: While the stack is mapped into bank $01, you may not call any firmware, toolbox or operating system routines (ProDOS 8 or GS/OS). Don't even think about it. Unroll All Loops Another source of overhead is branching instructions in loops. By "straight- lining" the code to move up a scan-line's worth of memory at one time, branch instructions are avoided. Following is an example of this technique. lda |164,y ; accumulator is 16 bits for pha ; best efficiency lda |162,y pha lda |160,y pha In this example, the Y register is used to point to data to be moved to the graphics memory, and hard-coded offsets from the Y register are used to avoid register operations between writes. Hard-Code Instructions and Data In desperate circumstances, it is necessary to remove overhead from the previous code example. This can be accomplished by hard-coding pixel data into your code instead of loading pixel values from a separate data space and transferring them to the graphics memory (as in the example). If you are writing an arbitrary pattern of three or fewer constant values to the screen, for example, the following method is the fastest known: lda #val1 ldx #val2 ldy #val3 pha ; arbitrary pattern of pushes phx phy phy phx In cases where many different values must be written to the screen, pixel data can be written to the screen using immediate push instructions: pea $5389 ; some arbitrary pixel values pea $2378 pea $A3C1 pea $39AF Your program can generate this mixture of PEA instructions and pixel data itself, or it could load pixel data that already has PEA instructions intermixed (thus increasing the data size by one half). Be Aware of Slow-Side and Fast-Side Synchronization Estimating execution speed by counting instruction cycles is always a challenging task on the IIGS, but it is particularly tricky when one is writing to the graphics memory. The graphics memory resides in the side of the IIGS system controlled by the 1 MHz Mega II chip, which means that during all writes to this memory, the fast side of the system controlled by the Fast Processor Interface (FPI) chip must be synchronized with slow side of the system controlled by the Mega II, even if the system is running code at full native speed. This synchronization is performed automatically and transparently by the FPI in the IIGS, and it isn't normally of concern to the programmer. Animation programmers must worry about synchronization delays, however, because slight changes in graphics update code may change the frequency of these delays, and hence the speed of the program. In practical terms, this means that one loop writing data to the graphics memory may run at the same speed as a second loop with a higher cycle count. A careful analysis of the synchronization problem leads to the following tables, which are useful as a rough estimate of the speed attained by different pieces of code. Each entry is based on the number of cycles consumed during consecutive write instructions. For example, a series of PEA instructions requires five cycles for each 16-bit write. A short PHA instruction followed by a branch requires six cycles for each 8-bit write. Fast Cycles per Write (byte) Actual Speed (microseconds/byte) ________________________________________________________________ 3 to 5 2.0 6 to 8 3.0 9 to 11 4.0 ________________________________________________________________ Fast Cycles per Write (word) Actual Speed (microseconds/word) ________________________________________________________________ 4 to 6 3.0 7 to 8 4.0 9 to 11 5.0 ________________________________________________________________ The times given in the tables apply only if the same number of fast cycles separate each consecutive write operation. The first write operation in a set of write instructions usually takes longer than subsequent writes, because the potentially long synchronization operation is accomplished at that time. Unpredictable delays caused by memory refresh slow things down further, although refresh delays byte-wide writes more often than word-wide writes. Therefore, it is usually preferable from a speed standpoint to use word-wide writes to the graphics memory. For more information on synchronization cycle timing within the IIGS, see Chapter 2 of the Apple IIGS Hardware Reference and Apple IIGS Technical Note #68, Tips for I/O Expansion Slot Card Design. Use Change Lists The timing data given in the preceding section shows that it is not possible to perform full-screen updates in the time it takes the IIGS to scan the entire screen. In fact, it would be difficult to update more than one-sixth of the screen in one scan time. Therefore, it is necessary to update only those pixels which have actually changed from the previous frame of animation. One method of doing this is to precalculate the pixels which change by comparing each frame against the preceding frame. For interactive animation, fast methods must be developed for predicting which areas of the screen must be updated (a determination of the exact pixels might require more computation than the actual update would require). Using the Video Counters To achieve "tear-free" screen updates, it is necessary to monitor the location of the scan-line beam when writing to graphics memory. As described in Apple IIGS Technical Note #39, Mega II Video Counters, the VertCnt and HorizCnt Mega II video counter registers at $C02E-C02F allow you to determine which scan line is currently being drawn. By using only the VertCnt register and ignoring the low bit of the 9-bit vertical counter stored in HorizCnt, you can determine within 2 scan lines which scan line is currently being drawn. The VertCnt video counter contains the number of the current scan line divided by two, offset by $80. For example, if the scan-line beam was currently refreshing either scan line four or five, VertCnt would contain $82 (4/2 + $80 or 5/2 + $80). Vertical blanking happens during VertCnt values $7D through $7F and $E4 through $FF. Clever updates can modify twice as many pixels on the screen by sacrificing some smoothness, running at 30 frames per second instead of 60. The technique is as follows: 1. Wait for the scan line beam to reach the first scan line. 2. Start updates from the top of the screen, being careful not to pass the scan line beam. 3. Continue updates while the scan line beam progresses toward the bottom of the screen, then goes into vertical blanking, then restarts at the top of the screen. 4. Finish the update before the scan line beam catches the update point. Careful use of this method allows a frame to be updated during two scans of the screen instead of just one. If you are not sufficiently careful, tearing results. Note: The Apple IIGS main logic board Mega II-VGC registers and interrupts are not synchronous to the Apple II Video Overlay Card video and therefore should not be used for time synchronization with the Apple II Video Overlay Card video output. However, they can be used for time synchronization with the Apple IIGS video output. See the Apple II Video Overlay Card Development Kit for more information. Interrupts It is not possible to support interrupts while sustaining a high graphics update rate, unless jerkiness or tearing is acceptable. Be aware that many system activities such as GS/OS and AppleTalk depend on interrupts and do not function if interrupts are disabled. Further Reference _____________________________________________________________________________ o Apple IIGS Firmware Reference o Apple IIGS Hardware Reference o Apple II Video Overlay Card Development Kit o Apple IIGS Technical Note #39, Mega II Video Counters o Apple IIGS Technical Note #40, VBL Signal o Apple IIGS Technical Note #68, Tips for I/O Expansion Slot Card Design