
Errata:

Page 37 says that pipelining is the reason we need to jmp after setting 32bit mode in CR0, but that's not true. The execution mode is part of the CS descriptor, and setting CR0 merely means that on the next far (CS:EIP) jump we should treat the segment part not as a 16bit segment, but as a segment descriptor selector.

Otherwise, dude, that would be a race against the pipelining with undefined behaviour.

So no, pipelining is an implementation detail you STILL don't have to care about.

Like, do you know the instruction pipeline length on a 386SX? No, because you don't have to. There's no rush to jump. (though what else are you going to do?)

Actually, you could continue executing 16bit code, loading only DS/ES with 32bit selectors. Why, I don't know, but you could.

> a near jump[…] may not be sufficient to flush the pipeline

Nope, that's not it. You are literally jumping into a 32bit address space by loading the descriptor. The descriptor is not loaded until you load a new selector into CS by using a far jump.
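
In code, the switch described here would look something like this (a sketch in NASM syntax; the GDT layout and the 0x08/0x10 selector values are assumptions for illustration):

```nasm
[bits 16]
    cli
    lgdt [gdt_descriptor]   ; GDT contains a 32bit code descriptor at selector 0x08
    mov eax, cr0
    or  eax, 1              ; set PE; CS still holds its real-mode base/limit
    mov cr0, eax
    jmp 0x08:pm_entry       ; far jump: CS is reloaded from the descriptor table,
                            ; and only now does execution become 32bit

[bits 32]
pm_entry:
    mov ax, 0x10            ; assumed 32bit data selector
    mov ds, ax
    mov es, ax
```

The point being: it's the far jump's descriptor load, not the CR0 write itself, that changes what CS means.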

Mind you, this is from memory when I made my OS in high school. Back when there was a lot of moving floppies back and forth to iterate. :-/

For more on this (and the resource I wish I had back then), see https://wiki.osdev.org/



This probably no longer applies in any way to modern processors, but the 286 and 386 did have a "prefetch queue" (containing opcode bytes fetched from CS:eIP), as well as a "decoded instruction queue".

Up to 3 instructions could be pre-decoded and stored in that queue, while the execution unit was still busy with some other instruction (keep in mind that back in the day it was multiple clock cycles per instruction, not multiple instructions per cycle!)

Each decoded instruction included a microcode entry point address, depending on the opcode and the state of CR0 at the time of decoding. On the 486 this was further optimized to actually store the first micro-op in that queue.

If you set the "Protect Enable" bit in CR0, it does not affect instructions that have already been through the decode stage, so they would not run with protected mode semantics. A far jump would actually not work in that case, because it would load the CS base with (selector << 4) instead of going through the descriptor table. So you should first do a near jump to flush the pipeline, and only then jump to the protected mode CS.
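
On a 286/386 the sequence described above would be something like this (a sketch, NASM-style syntax; the 0x08 selector is assumed):

```nasm
    smsw ax                 ; read the machine status word (or: mov eax, cr0)
    or  ax, 1               ; set the Protect Enable bit
    lmsw ax                 ; instructions already decoded keep real-mode semantics
    jmp  flush              ; near jump: discards the prefetch and decode queues
flush:
    ; from here on, everything is decoded with PE=1,
    ; so this far jump goes through the descriptor table
    jmp 0x08:pm_entry       ; assumed protected-mode code selector
```

Without the near jump, a far jump that had already been pre-decoded would still execute with real-mode semantics.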

In practice, you could get away with not doing this step, depending on the exact timing of the code used to switch to protected mode. A lot of instructions are short enough that the processor never has enough time to pre-decode anything following.

The Intel manuals did (and still do?) document you have to do this near jump first, though not the exact details of why.


Oh yeah, thanks for the context. Checking my old OS it seems I did do that near jump too. I'd forgotten about that.

It looks like the author of this PDF confused the two jumps.

Looks like Linux nowadays does the far jump immediately after setting cr0, thus doing two in one:

https://github.com/torvalds/linux/blob/master/arch/x86/boot/...

But the near and far jump (if you use both) still have VERY different purposes. And the CPU is not really in 32bit mode until all the segment descriptors have been loaded manually.

Linux didn't use to combine them:

https://elixir.bootlin.com/linux/2.2.26/source/arch/i386/boo...

Here we see the near instruction-flushing jump, then a bunch of code (that is still in 16bit mode!), and then the long jump that switches the instruction set to 32bit.

The comments in that last link explain it well, too.


The point is that the CPU might still be in real mode at the point in time when the far jump is decoded, and thus execute it with real mode semantics. Coding a seemingly useless near jump was the way to prevent that.

Whoever wrote the modern Linux code doesn't seem to understand any of this. There is even a jump before the switch to protected mode, with the comment "Short jump to serialize on 386/486". Cargo-cult programming at its finest :)

As I said before, it may have always worked in practice without the jump, and modern Linux doesn't even support the 386 anymore.

Actually, with all the crazy out-of-order speculative execution going on, one could expect newer CPUs to have similar requirements, but I guess they also had to get better at providing the illusion that such internal details don't matter. Making LMSW / MOV CR0 a serializing instruction seems to be an easy way to do it.


> the CPU might still be in real mode at the point in time when the far jump is decoded,

It definitely is in real mode when the far jump is decoded. Or is that not the right technical definition of "real mode"?

Do we leave real mode as soon as CR0 bit 0 (PE) is set, even though nothing has changed until the code puts a new value into a segment register? I've always thought of it as when CS becomes 32bit, meaning the long jump.

Anyway, that's just terminology.

> "Short jump to serialize on 386/486". Cargo-cult programming at its finest :)

That's weird. I tracked down the commit:

https://github.com/torvalds/linux/commit/2ee2394b682c0ee99b0...

> Actually, with all the crazy out-of-order speculative execution going on

I would expect those to detect that dependency and not let the speculated instructions take effect. But I could be wrong.

Edit:

Interesting. The new Intel docs explicitly say to do the far jump "Immediately" after setting the PE bit in CR0.

https://www.intel.co.uk/content/www/uk/en/architecture-and-t... section 9.9.1.

Whereas the example code for the original 386 programming manual has the short jump, many other instructions, and then the far jump.

Haha, but then the actual example code on page 9-20 (new manual) still has the "clear prefetch queue" near jump.


>Do we leave real mode as soon as CR0 bit 0 is set

Yes, that's what the bit is defined as doing. Protected mode was introduced on the 16-bit 80286, and 16-bit code segment descriptors are still supported to this day.

Even in real mode, the segment registers have hidden fields containing base, limit and access rights. The difference is that when they get loaded in real mode, the base will be set to the segment shifted left by 4 bits, with the limit¹ and access rights² generally left unchanged. What the PE bit actually affects is how segment load instructions operate, how interrupts/exceptions are handled, and a few other details.
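
Concretely, a real-mode segment load just fills the hidden base field arithmetically (0xB800, the VGA text segment, used for illustration):

```nasm
[bits 16]
    mov ax, 0xB800
    mov ds, ax              ; real mode: hidden DS.base = 0xB800 << 4 = 0xB8000
                            ; limit and access rights are left unchanged
    mov byte [0x0000], 'A'  ; memory access: linear = base + offset = 0xB8000
```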

Most instructions run exactly the same microcode in either mode: any memory access will form the address and do protection checks based on whatever is currently in those hidden fields. But a segment load (or far jump) decoded while PE=0 will execute different microcode than one with PE=1.

>That's weird. I tracked down the commit

Shuffling the code around may have fixed some alignment bug? The jump could likely be replaced by two NOPs; in any case, the comment is completely wrong.

2.6.22 seems to be the last version using LMSW followed by a near jump, and presumably worked on that CPU (at least there is a comment mentioning bugfixes for Elan), so it isn't likely to be the cause of the problem.

¹ the limit on power-on/reset is 64K, but it is possible to change it to 4G, allowing access to all memory ("unreal mode")

² CS will always be made a writable data segment, something not possible to set up in protected mode without the use of LOADALL

edit: >Interesting. The new Intel docs explicitly say to do the far jump "Immediately" after setting the PE bit in CR0.

Well, it's probably not required to do any kind of jump immediately on modern CPUs, but it wouldn't be the first time Intel got something completely wrong: https://www.os2museum.com/wp/sgdtsidt-fiction-and-reality/


> Actually, you could continue executing 16bit code, loading only DS/ES with 32bit selectors. Why, I don't know, but you could.

That's almost how flat real mode (aka "unreal mode") works, except it's returning from protmode with 4GB selector limits.
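
A sketch of that unreal-mode setup (assuming a GDT whose selector 0x08 is a flat data descriptor with a 4GB limit; error/interrupt handling omitted):

```nasm
[bits 16]
    cli
    lgdt [gdt_descriptor]
    mov eax, cr0
    or  al, 1
    mov cr0, eax            ; enter protected mode
    mov bx, 0x08
    mov ds, bx              ; hidden DS.limit becomes 4GB
    and al, 0xFE
    mov cr0, eax            ; back to real mode; DS keeps the 4GB limit
    sti
    mov ebx, 0x100000
    mov al, [ebx]           ; real-mode access above 1MB ("flat real mode")
```

The trick works because dropping back to real mode doesn't reload the hidden fields; only the next segment register load does.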



