Well Parky, you asked, so I'm going to try to answer!
The way I think about PAE is that it kinda works a bit like a stonking great in-memory pagefile might. It doesn't change the game for 32-bit applications, but it does give the OS more headroom to manage them.
Without PAE, any memory over 4GB can't be "seen" by the OS itself, so it can't be used.
With PAE, the memory manager can see all the installed memory, but it doesn't change the per-process or kernel limits.
So if, for example, you ran 3 database programs at once, each of which used its entire 2GB user address space, then on a PAE box with 6+GB of RAM the whole lot would potentially fit into memory (assuming your kernel didn't mind getting squeezy).
So, in really short form: almost the same architecture, but more RAM!
And you're right on the other front - after a certain point on a 32-bit Terminal Server, the limiting factor is likely to be kernel address space, so if you're eyeing PAE as a possible answer and you haven't yet deployed the box, consider going x64 instead.
Now, here's the other attempt I was working on, but un-fact-checked and likely subtly (or grossly) misleading - think of it as a work-in-progress lie. For the actual story, hit "Windows Internals, 4th Edition", by Mark Russinovich and David Solomon.
I'm unlikely ever to finish this, so I figured I might as well post it for... um, well, fun, and to show I care enough to try* (for a while) 🙂
Trial #1: Intro to PAE using Sheep as an accessible metaphor
32 Bit Addressing = 386
The CPUs we know and love today are all descendants of the i386. The '386 was a chip that had 32 address lines, which is a techy way of saying that it could talk to up to 4GB of RAM. 32 bits = 2^32 = a bit over 4 billion possible individual memory locations.
36 Bit Addressing = PAE
Physical Address Extension is a 36-bit addressing thingamabob that got tacked onto Pentium Pro and later CPUs.
PAE is cool, because the extra 4 bits mean that the processor can talk to a whopping 64GB of RAM, instead of the paltry 4GB that seemed so cool just ten short years ago. Heck, in ten years' time, my phone will probably have 2GB onboard!
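If you'd like to check the sums yourself, here's a tiny sketch. The function name is just something I made up for illustration; all it does is raise 2 to the number of address bits and express the result in GB.

```python
GB = 2 ** 30  # one gigabyte, counted the binary way (1,073,741,824 bytes)


def addressable_bytes(address_bits):
    """How many individual byte addresses fit in this many address bits."""
    return 2 ** address_bits


# 32 bits is the plain '386 case, 36 bits is PAE.
for bits in (32, 36):
    total = addressable_bytes(bits)
    print(f"{bits} address bits -> {total:,} bytes = {total // GB}GB")
```

Four extra bits means 2^4 = 16 times the addresses, which is where 4GB x 16 = 64GB comes from.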
Enter The Sheep Metaphor
In short: each process gets a 4GB virtual address space, and the lower 2GB is a play area unique to that process. The upper 2GB is the kernel, and every process sees that same kernel in the top half of its address space.
Let's say that Sheep is our 32-bit Windows program. When Sheep is started by the OS, it'll be plonked into a 2GB field in which it can play, with a small amount of nasty barbed wire near the very bottom, and a big wall with a tiny window near the top.
The Kernel memory area is another 2GB field beyond the wall, with a sign tacked to the front "All that 2GB field is yours, except this 2GB here. Attempt no grazing here."
Any other programs you run get put in their own totally separate 2GB field (say, SkyscraperBuilder.exe) but they all see the same Kernel field. It's a bit like the dead people in the Sixth Sense (honestly, if you haven't seen it yet, you need to stay in more) - they can't see each other, but the kid (Kernel Kid™?) can see them.
With me so far?
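If fields aren't doing it for you, here's the same split as a rough sketch in code. It assumes the plain default 2GB/2GB layout, with the wall sitting at address 0x80000000; the names are made up for illustration, and it ignores the barbed wire and the window entirely.

```python
KERNEL_BASE = 0x80000000   # assumed position of the wall: start of the upper 2GB
ADDRESS_TOP = 0xFFFFFFFF   # highest address a 32-bit pointer can hold


def which_field(address):
    """Say whether a 32-bit virtual address is in the process's own field or the Kernel field."""
    if not 0 <= address <= ADDRESS_TOP:
        raise ValueError("not a 32-bit address")
    return "kernel" if address >= KERNEL_BASE else "user"


print(which_field(0x00400000))  # somewhere low down, in the Sheep's own field
print(which_field(0x7FFFFFFF))  # the last byte before the wall
print(which_field(0x80000000))  # the first byte beyond it
```

Every process would get the same answers from this function, which is rather the point: the layout is identical for Sheep and SkyscraperBuilder, it's only what's behind the lower 2GB that differs.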
All This Could One Day Be Yours (but you have to allocate it)
Windows doesn't just hand each process a fully allocated 2GB field of memory (count the number of processes running on your computer at startup; now imagine having to install 2GB of RAM for each one!). It gives a process just enough to get it loaded, and then the process has to actually ask for what it needs.
A Sheep might use less than 1% of its field while it's wandering in a small area and grazing, whereas the SkyscraperBuilder is likely to try to use all the space it has available, and subsequently harangue, harass and attempt to blackmail the planning authority for more.
But at the beginning, they both believe that the field is empty, and they just start asking for memory.
Virtual memory - to cut a very, very involved story of lies and deception rather short - is how the OS manages to allow each application to request and use memory that all seems nice and contiguous to the application, but is actually "backed" by memory in a physical location elsewhere - and that "elsewhere" can be somewhere else in RAM, or on the hard disk, in the pagefile.
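Here's a toy version of that deception, very much a sketch and nothing like how the real memory manager is built. Everything in it (the class, the 4KB page size, the "ram" and "pagefile" labels) is an assumption made up for illustration. The idea is just that the program's neat addresses are looked up in a table, and the table says where each page really lives.

```python
PAGE_SIZE = 4096  # assumed page size for this toy


class ToyAddressSpace:
    """One process's view: virtual page number -> where that page really lives."""

    def __init__(self):
        # Maps virtual page number to ("ram", frame number) or ("pagefile", slot number).
        self.page_table = {}

    def map_page(self, virtual_page, where, index):
        self.page_table[virtual_page] = (where, index)

    def translate(self, virtual_address):
        virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
        where, index = self.page_table[virtual_page]
        if where == "ram":
            # Backed by physical memory: same offset, different page.
            return ("ram", index * PAGE_SIZE + offset)
        # Backed by disk: a real OS would fault the page into RAM first.
        return ("pagefile", index)


sheep = ToyAddressSpace()
sheep.map_page(0x10, "ram", 7)       # virtual page 0x10 lives in physical frame 7
sheep.map_page(0x11, "pagefile", 3)  # its neighbour is currently out on disk

print(sheep.translate(0x10005))  # contiguous to the Sheep...
print(sheep.translate(0x11000))  # ...but backed by somewhere else entirely
```

The two virtual pages sit right next to each other from the Sheep's point of view, but one is in RAM and the other is in the pagefile, and the Sheep is none the wiser.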
The Kernel address space itself is virtualized, though kernel-mode components are able to "look behind the curtain" if really necessary.
In a situation where you've got less memory than 4GB, VM means that everything gets to actually run, while having this wonderfully neat-seeming memory area to play in, and room to grow.
If you've got the whole 4GB (which is our theoretical maximum at this point in the discussion) and a bunch of tiny programs, everything's going to go swimmingly.
But just flip that on its head for a second - just say your requirements were greater than 4GB. Say that the amount of memory actively used by all the programs on your computer (the sum of their "Working Sets") exceeds 4GB. Say that all up, you really need 6GB in RAM at one time, across a bunch of processes.
Without PAE, the CPU can only use 4GB total, and the OS can only keep track of so much memory… so even if you somehow drop 8GB into the machine, you're in for some paging (hitting the hard disk to swap memory in and out of physical RAM).
But enable PAE, and whop! The CPU can now use however much RAM you've got in the box (up to 64GB), so less paging happens. The kernel/user split is still the same - we're still talking 2GB user space per application and 2GB kernel space, so it's business as usual to each process on the machine - it's just that the virtual memory manager can now use all the RAM in the box to satisfy demand before having to go to the page file.
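To put rough numbers on that, here's a back-of-envelope sketch. The function names are made up, and it cheerfully ignores everything that makes real life messier (what the kernel needs for itself, how well the working sets behave, and so on). It only compares how much RAM can be used against how much is wanted.

```python
GB = 2 ** 30


def usable_ram(installed, pae):
    """RAM the OS can actually see: capped at 4GB without PAE, 64GB with it."""
    limit = 64 * GB if pae else 4 * GB
    return min(installed, limit)


def spills_to_pagefile(demand, installed, pae):
    """How much of the demand can't fit in usable RAM and has to live on disk."""
    return max(0, demand - usable_ram(installed, pae))


demand = 6 * GB     # what all the processes want in RAM at once
installed = 8 * GB  # what you somehow dropped into the machine

for pae in (False, True):
    spill = spills_to_pagefile(demand, installed, pae)
    print(f"PAE {'on' if pae else 'off'}: {spill // GB}GB has to go to the pagefile")
```

With PAE off, 2GB of that 6GB has nowhere to live but the disk, even though there's 8GB sitting in the box. With PAE on, the lot fits. The same sums cover the three-databases example: 3 x 2GB against 6+GB of RAM.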