In article <MqHhb.3861$3A6.1607@twister.austin.rr.com>,
Bryan Parkoff <nospam@nospam.com> wrote:
>    I have read Darek's website at http://www.emulators.com/.  Darek is an
>emulated programmer and he told me that he claims that C/C++ language is the
>WRONG languge for Emulator Project.  He says that he focuses at assembly
>language for best optimization and performance.  It is almost impossible for
>me to tell since he has been programming for over 17 years.  He always
>disagrees with many programmers for general Emulation.

I'm not sure where to start with this, so (speaking as the author of
KEGS) let's jump to:

>    I can tell that KEGS32 Project has a huge source code and have a very
>long routines.  It is not good for writing emulated practices.

I'm not sure what you mean here.  KEGS is about twice as fast as
any other non-assembly emulator, so it's doing something right.
Note that KEGS's main interpreter loop is a little convoluted since
the code includes PA-RISC assembly right along with the C code.
On PA-RISC machines of 5 years ago, the assembly version was about
twice the speed of the C.  But I knew the instruction timings and
gotchas of PA-RISC down to an insane level and KEGS was optimized
to take all of that into account.  In fact, KEGS uses floating
point to track time, since on PA-RISC FP operations bundled with
other instruction types better than integer operations did, so a
little FP is basically "free."

The length of a subroutine means nothing in terms of performance.  An
emulator spends 99%+ of its time doing a loop of:

	1. Some overhead (e.g., "are we done yet?")
	2. Decode and dispatch an instruction
	3. Do the instruction
	4. Jump back to step 1.

About 30-70% of that time is spent in steps 1 and 2.
This is why straight interpreters take an immediate 2-3X performance
hit.  But straight interpreters are the only kind that are fully
portable, so that's what KEGS is.

KEGS is designed so that a C compiler should minimize the work done in
steps 1, 2, and 4.

KEGS's main loop is (basically; the real thing's always more complex):
int
enter_engine()
{
	int	pc, a, x, y, sp, dp, k;
	int	opcode;
	float	fcycles;

	while(1) {
		if(fcycles < 0) {
			break;
		}
		opcode = get_byte(pc);
		pc++;
		fcycles -= f_twoclks;
		switch(opcode) {
		case 0x00:	/* BRK */
			/* etc. */
			break;
		case 0xff:	/* SBC Long,X */
			/* etc. */
			break;
		}
	}
}

Since I wrote the PA-RISC version in assembly, I knew exactly which
variables should go into registers on a RISC machine.  On RISC
machines, accessing global variables is slower than accessing
registers.  So KEGS's main loop basically tries to keep everything
in local variables so the compiler will make them native registers.
If you put things in a struct or in global variables, then you have
to go through memory at a pretty big penalty.

To keep all this CPU state in hardware registers requires everything
to be one big routine--all the instructions have to see the same
local variables.  So the routine is big.  But it should be faster than
any other style since the native jump back to the while(1) will
be faster than a subroutine call (in general...).

This adds complexity to KEGS's handling of instructions since subroutines
called from enter_engine() cannot see any 65816 hardware state (like A,X,Y,
etc).  But it clearly works.
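
A minimal sketch of the locals-in-one-big-routine idea follows.  It is
not the KEGS source: struct cpu_state, run_engine(), and the toy
three-opcode machine are all invented for illustration.  The point is
only the shape: load the state into locals once, loop on the locals,
and store them back once on the way out.

```c
#include <stdint.h>

/* Hypothetical saved CPU state; KEGS's real structure is not shown here. */
struct cpu_state {
	uint32_t pc, a, x;
	int32_t  cycles;
};

/* Toy machine invented for this sketch (not 65816 opcodes):
 *   0x01 = increment A, 0x02 = increment X, anything else = stop. */
int
run_engine(struct cpu_state *st, const uint8_t *mem, uint32_t memsize)
{
	/* Copy the state into locals once so the compiler can keep it
	 * in registers for the whole loop. */
	uint32_t pc = st->pc, a = st->a, x = st->x;
	int32_t  cycles = st->cycles;
	int      stopped = 0;

	while(cycles > 0 && pc < memsize && !stopped) {
		uint32_t opcode = mem[pc++];
		cycles--;
		switch(opcode) {
		case 0x01: a = (a + 1) & 0xffff; break;
		case 0x02: x = (x + 1) & 0xffff; break;
		default:   stopped = 1; break;
		}
	}

	/* Write the locals back exactly once, on the way out. */
	st->pc = pc;
	st->a = a;
	st->x = x;
	st->cycles = cycles;
	return stopped;
}
```

Any helper called from inside the loop would need the values it uses
passed in explicitly, which is the extra complexity mentioned above.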

KEGS was tuned in size so that the assembly version on PA-RISC fits
easily into a 32K instruction cache in its entirety.  So it's not
outrageously large.  This is why KEGS has a separate dispatch loop
for 8-bit ACC and 16-bit ACC, but not for the X and Y sizes or for
emulation mode.  All the special code for X and Y sizes and for
emulation mode is just done all the time.

KEGS is also pretty careful about how it handles the PSR bits--the
most-frequently changing ones (Z and N) are kept in separate variables
so the native machine can handle those updates very fast.  But any
time the PSR is accessed as a byte, the pieces need to be assembled
and disassembled.
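
One way to picture the split PSR, as a hedged sketch rather than the
actual KEGS code: struct flags, set_nz8(), pack_psr(), and
unpack_psr() are hypothetical names.  The bit positions are the
standard 65816 ones (N is 0x80, Z is 0x02).  The fast path touches
only two plain variables; the packing cost is paid only when the
status register is needed as a byte.

```c
#include <stdint.h>

#define PSR_N	0x80	/* negative flag bit in the status byte */
#define PSR_Z	0x02	/* zero flag bit in the status byte */

/* Hypothetical split representation of the status register. */
struct flags {
	uint32_t neg;	/* nonzero if N is set */
	uint32_t zero;	/* nonzero if Z is set */
	uint32_t rest;	/* the other PSR bits, with N and Z cleared */
};

/* Fast path: after an 8-bit result, update N and Z with no packing. */
void
set_nz8(struct flags *f, uint32_t result)
{
	f->neg = result & 0x80;
	f->zero = ((result & 0xff) == 0);
}

/* Slow path: assemble the full status byte from the pieces. */
uint8_t
pack_psr(const struct flags *f)
{
	uint32_t psr = f->rest & ~(uint32_t)(PSR_N | PSR_Z);

	if(f->neg) {
		psr |= PSR_N;
	}
	if(f->zero) {
		psr |= PSR_Z;
	}
	return (uint8_t)psr;
}

/* Slow path: split a status byte back into the pieces. */
void
unpack_psr(struct flags *f, uint8_t psr)
{
	f->neg = psr & PSR_N;
	f->zero = psr & PSR_Z;
	f->rest = psr & ~(uint32_t)(PSR_N | PSR_Z);
}
```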

Other complexities about the Apple II memory map cause KEGS to treat
the entire memory range with an indirection level that's just like
virtual memory.  Plus there's memory shadowing, where writes to one
region need to be reflected into another.  This adds more complexity.
KEGS isn't optimal in how it does all of this (some memory-copying
techniques are faster for some usage patterns, or some lazy updates),
but KEGS has very good worst-case performance.
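
The indirection can be sketched as a small page table.  This is an
illustration under assumptions, not how KEGS actually lays it out: the
256-byte page size, the tiny 16-page address space, struct page_entry,
map_page(), and set_byte() are all invented here, and this get_byte()
is a stand-in for the one used in the main loop above.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	8	/* assumed 256-byte pages */
#define NUM_PAGES	16	/* tiny address space, for the sketch only */

/* Hypothetical per-page mapping, much like a virtual memory entry. */
struct page_entry {
	uint8_t *rd;		/* where reads of this page come from */
	uint8_t *wr;		/* where writes to this page go */
	uint8_t *shadow;	/* NULL, or a second page mirroring writes */
};

static struct page_entry page_table[NUM_PAGES];

/* Point one emulated page at host buffers.  Pages must be mapped
 * before they are accessed; there is no fault handling here. */
void
map_page(uint32_t page, uint8_t *rd, uint8_t *wr, uint8_t *shadow)
{
	struct page_entry *pe = &page_table[page % NUM_PAGES];

	pe->rd = rd;
	pe->wr = wr;
	pe->shadow = shadow;
}

uint8_t
get_byte(uint32_t addr)
{
	const struct page_entry *pe;

	pe = &page_table[(addr >> PAGE_SHIFT) % NUM_PAGES];
	return pe->rd[addr & 0xff];
}

void
set_byte(uint32_t addr, uint8_t val)
{
	const struct page_entry *pe;

	pe = &page_table[(addr >> PAGE_SHIFT) % NUM_PAGES];
	pe->wr[addr & 0xff] = val;
	if(pe->shadow != NULL) {
		/* Shadowing: reflect the write into the second region. */
		pe->shadow[addr & 0xff] = val;
	}
}
```

Separate read and write pointers also cover ROM-like pages, where
reads come from one buffer and writes land somewhere harmless.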

>    If you want to avoid using classes, I recommend to write global
>variables that are much easier to read that looks like classes.  Lets says,

Global variables have similar performance to local variables on an x86
platform (since both will be in memory in general), but they will
definitely be slower on any RISC computer.  If you're going to just go for
x86, then go ahead and write the main loop in assembly.  A good look
at KEGS should give you an idea of what should go in assembly and
what should stay in C.

I'm not planning to write any more assembly for KEGS since it's already
too fast--tracking 16K cycles in a float gets into some serious precision
issues above an emulated 200MHz, so KEGS actually has a speed cap built-in.
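
The general float problem can be shown in a few lines.  This is only
an illustration of single-precision spacing, assuming an IEEE float
with a 24-bit significand; the function names and the step sizes are
made up and are not taken from KEGS.  Near 16K, adjacent floats are
1/512 apart, so a per-instruction decrement that is a small fraction
of a cycle gets rounded noticeably or lost outright.

```c
#include <float.h>

/* Gap between adjacent floats just above a power of two. */
float
float_spacing_at_pow2(float pow2)
{
	return pow2 * FLT_EPSILON;
}

/* How much a float counter really drops when `step` is subtracted.
 * The volatile forces the result to be rounded to float precision. */
float
actual_step(float count, float step)
{
	volatile float after = count - step;

	return count - after;
}
```

For example, with a counter around 20000, asking for a step of 0.005
actually moves the counter by 0.005859375, and a step of 0.0005 does
not move it at all.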

>    All 256 6502's OpCode routines stay in CPU_Run() by using goto statement
>rather than calling functions.  Three local variables are inside classes so
>it won't show global variables.

Computed gotos are a gcc-ism that is non-portable.  A switch() statement
is portable and achieves pretty much the same effect.  Not calling
functions is a good idea.
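
For comparison, here is the same toy interpreter written both ways.
The bytecode (0 = halt, 1 = increment, 2 = double) and both function
names are invented for this sketch.  The computed-goto version relies
on the GNU "labels as values" extension, so it is guarded and falls
back to the switch() version on other compilers.

```c
/* Portable dispatch: one switch() inside the loop. */
int
run_switch(const unsigned char *code)
{
	int acc = 0;

	for(;;) {
		switch(*code++ & 3) {
		case 1:  acc += 1; break;
		case 2:  acc *= 2; break;
		default: return acc;	/* 0 and 3 both halt */
		}
	}
}

#if defined(__GNUC__)
/* GNU C only: jump straight to the handler through a label table. */
int
run_goto(const unsigned char *code)
{
	static void *const table[4] = {
		&&op_halt, &&op_inc, &&op_dbl, &&op_halt
	};
	int acc = 0;

	goto *table[*code++ & 3];
op_inc:
	acc += 1;
	goto *table[*code++ & 3];
op_dbl:
	acc *= 2;
	goto *table[*code++ & 3];
op_halt:
	return acc;
}
#else
/* Without the extension, the switch() version does the same job. */
int
run_goto(const unsigned char *code)
{
	return run_switch(code);
}
#endif
```

Both versions keep every handler inside one function, so neither one
pays for a subroutine call per instruction.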

>    If you choose not to use classes, there is an alternative.
>
>DWORD CPU_Accumulator = 0x00;
>DWORD CPU_XIndex = 0x00;
>DWORD CPU_YIndex = 0x00;
>
>void CPU_Run();
>void CPU_Initialize();
>void CPU_Terminate();
>
>    It is global variable that begins with CPU_ instead of g_ for easy
>reading.  All functions are global functions.  It will reduce minor bugs.
>Use BYTE and WORD keywords are bad practices for Emulator because C/C++
>compiler always adds "AND" instruction to clear 32-Bits, 24-Bits, and
>16_Bits.  DWORD is chosen so "AND" instruction is removed.  It can reduce
>x86's cycles and improve performance.

Avoiding sub-native-word-sized operations is a good idea.  But don't
forget to mask X and Y to the correct precision before using them,
especially if you're doing 65816 (which is a lot more complex than 6502).
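
A small sketch of that masking, with hypothetical helper names: the
index register lives in a native 32-bit variable, and the result is
masked to 8 or 16 bits depending on the 65816 X flag (index register
size) before it is used.

```c
#include <stdint.h>

/* Mask for an index register: 8 bits when the X flag is set, else 16. */
uint32_t
index_mask(int x_is_8bit)
{
	return x_is_8bit ? 0xffu : 0xffffu;
}

/* INX on a register held in a 32-bit variable. */
uint32_t
do_inx(uint32_t x, int x_is_8bit)
{
	return (x + 1) & index_mask(x_is_8bit);
}

/* DEX on a register held in a 32-bit variable. */
uint32_t
do_dex(uint32_t x, int x_is_8bit)
{
	return (x - 1) & index_mask(x_is_8bit);
}
```

Without the mask, an 8-bit X of 0xff would increment to 0x100 and
index the wrong address.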

>    I would like to discuss with some Apple II Emulator programmers via
>e-mails if it is possible.  It is not good to post Emulator Project
>Programming Practices on newsgroups because nobody is interested in writing
>Emulator Project until they are interested to play released Apple II
>Emulator software.
>    Please advise.
>
>
>-- 
>Bryan Parkoff

Well, you've started a discussion here.

Kent Dickey