
platform(x86): our timer implementation is basically the worst thing ever #349

Open
hawkw opened this issue Dec 27, 2024 · 8 comments
Labels: platform: x86_64 Specific to the x86_64 hardware platform


hawkw commented Dec 27, 2024

see also #343 (comment)

the x86 kernel's current clocksource for the kernel timer wheel is an ungodly abomination which i fittingly named CLOCK_IDIOTIC. essentially, it works like this:

  1. set the 8253 PIT to fire every 10 ms.

    the PIT is x86's Worst clock source, essentially an NTSC colorburst oscillator from the 1970s with some extra bullshit nobody uses jammed onto it and bundled into one IC, which is now just a tiny bit of silicon stuck on modern PC chipsets. this thing sucks So Bad. its input clock is about 1.193182 MHz (one third of the NTSC colorburst frequency), and with the maximum 16-bit divisor the slowest it can fire interrupts is approximately 18.2 Hz, which is for TV Reasons that only made sense in 1970.

    it's basically impossible to get any kind of reasonable accuracy or precision out of this thing, and pretty much no operating system from this millennium actually uses it for any purpose besides calibrating other, better timers. but, it's really easy to configure and you can use it without doing annoying calibration stuff, since it has a fixed frequency that's unrelated to the CPU's bus frequency. which is why we use it currently.1

  2. when the PIT interval timer interrupt fires, increment an AtomicU64

  3. when you want a timestamp, read that AtomicU64. you now have a timestamp with the granularity of garbage.
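the three steps above, as a sketch (the static and handler names here are invented for illustration; the real kernel's code differs):

```rust
use core::sync::atomic::{AtomicU64, Ordering};
use core::time::Duration;

/// ticks since boot, bumped by the PIT interrupt handler (step 2).
static PIT_TICKS: AtomicU64 = AtomicU64::new(0);

/// the interval the PIT is programmed for (step 1).
const PIT_PERIOD_MS: u64 = 10;

/// what the PIT interrupt handler does: whang the atomic.
fn pit_isr() {
    PIT_TICKS.fetch_add(1, Ordering::Relaxed);
}

/// step 3: a "timestamp" is just ticks * 10 ms, so every event between two
/// interrupts gets the same 10 ms-granularity timestamp.
fn now() -> Duration {
    Duration::from_millis(PIT_TICKS.load(Ordering::Relaxed) * PIT_PERIOD_MS)
}
```

note that `now()` can never resolve anything finer than the PIT interval, which is where "the granularity of garbage" comes from.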

we should really use RDTSC instead of "the worst thing ever". unfortunately, before you can use this timer, you have to calibrate the timestamps it gives you against a known-frequency clock like the PIT.2 i started trying to bring up the TSC as a clock source in Mycelium ages ago, and gave up on finishing the TSC calibration routine because it was annoying and i couldn't get good timestamps out of it. this might actually be because i was doing all my development in a QEMU VM, and it turns out that when your VM's host threads get preempted, the guest sees the clock jump around weirdly.

alternatives could include implementing a paravirtualized clock and using that when we're in QEMU, but that's a bunch of work that's only valuable in VMs, not on real hardware. and we still wouldn't have an easy way of testing the clock source used on real hardware, because i don't really want to install mnemOS on my laptop and turn it into a $2000 paperweight.

another option is the High Precision Event Timer, or HPET. this is some microsoft bullshit that i think is mainly only used by windows? i don't know very much about it. it has a MAIN_COUNTER_VAL register which can be used as a timestamp, i think. i'm not sure whether calibration is required for this thing, but if it's not, it could be a nice option that works the same in a VM as it does in Real Life, and isn't just "whang an atomic every time an interrupt fires lol lmao". i'd need to look into that some more.

Footnotes

  1. what i'm saying here is that i'm lazy and stupid.

  2. or the ACPI PMTimer, which is also a fixed-frequency clock like the PIT. pros of this thing include that it has a much higher fixed frequency (3.579545 MHz). cons of it include "you can only talk to it using ACPI", which means it's basically never used by anyone.

@hawkw hawkw added the platform: x86_64 Specific to the x86_64 hardware platform label Dec 27, 2024
@hawkw hawkw self-assigned this Dec 27, 2024

hawkw commented Dec 27, 2024

once again i am saved by @iximeow's near-encyclopedic knowledge of x86 arcana: apparently there are CPUID leaves that "just tell you what the TSC frequency is".1

quoth ixi:

no you're going to be able to sidestep all this bullshit on processors from the last decade and a half
cpuid leaf 80000007 bit 8 in edx is TscInvariant which promises the tsc never stops. and then on both amd and intel the tsc is documented to tick at the P0 frequency, but you can check this with cpuid leaf 15 which tells you the ratio between the cpu frequency and tsc frequency
so as long as you have all of those i think you can actually skip "TSC calibration" and it will only be wrong if the hardware is severely fucked
(in your case i'd detect the virtual machine bit in whichever cpuid leaf, to know if i'm in qemu, and if the tsc calibration tells me the tsc is bad you can use that as an informative message or just ignore it outright)
but literally every cpu since like 2015 is "fine"

so i guess we can just do that. we can always fall back to the stupid thing we do now, if the CPU is not from this decade and those CPUID leaves aren't present.

Footnotes

  1. provided that your computer is not prehistoric
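the two checks ixi describes boil down to single-bit tests on raw CPUID register values. a sketch (the helper names are made up, and actually reading the registers, e.g. via `core::arch::x86_64::__cpuid`, is left out so this stays hardware-independent):

```rust
/// CPUID leaf 8000_0007h, edx bit 8 ("TscInvariant"): the TSC ticks at a
/// constant rate across all P- and C-states.
fn tsc_invariant(leaf_8000_0007_edx: u32) -> bool {
    leaf_8000_0007_edx & (1 << 8) != 0
}

/// CPUID leaf 1, ecx bit 31: set when running under a hypervisor, which is
/// when PIT-based TSC calibration is likely to produce garbage.
fn is_hypervisor_guest(leaf_1_ecx: u32) -> bool {
    leaf_1_ecx & (1 << 31) != 0
}
```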


hawkw commented Dec 27, 2024

peeking at CPUID seems a lot less life-ruining than trying to figure out why my TSC calibration code is kinda wrong, so this could be a fun weekend project to just go do.


hawkw commented Dec 27, 2024

pub enum TscInitError {
    ComputerAncient,
    ComputerJustBroken,
}


iximeow commented Dec 27, 2024

HELLO I BRING TIDINGS OF TIMER NEWS

if you're just measuring times rdtsc is great, and immediately useful in ways that docs and the osdev wiki don't talk about (yet?) - at least on processors since like 2015. there are a few bits of information in cpuid that are relevant here:

  • TscInvariant, bit 8 in CPUID leaf 8000_0007h edx
    • this is a declaration of love for constant time by the processor, promising that the TSC will tick always and forever at exactly one frequency in all performance (P-) and power (C-) states
    • the last AMD processor in InstLatx64's AMD CPUID collection that did not have an invariant TSC was the Athlon 64 X2 6400+, and my read of this is that really anything K10 (~2007) or later has an invariant TSC.
    • the last Intel processor in InstLatx64's Intel CPUID collection that did not have an invariant TSC would almost look like the Xeon E7450 (2008, Penryn), and i think what happened is that in Merom aka Core around 2006 they started always having an invariant TSC. but a few years later (2011) the Atom Z670 was launched and also didn't have it, so TSC invariance wasn't certain on the lower-power parts for a few more years. still, that's about 20 years of TSC invariance from everyone.

Intel

on Intel, the other part of sidestepping "TSC calibration" can be backed up by two CPUID leaves:

  • leaf 15h reports relevant frequency information: eax/ebx gives you a ratio from the processor's "base" frequency to the TSC frequency, and if the TSC is invariant, it'll always be ticking at that rate. the wiki page here has some notes on interpreting some zeroes in these fields. more on that in a moment.
  • leaf 16h has base/max frequency info that's useful here.

for Intel there are some notes in the SDM that are relevant here too:

Processor timestamp counter
This counter increments at the max non-turbo or P1 frequency, and its value is returned on a RDTSC. Its frequency is fixed.

which suggests the TSC is, well, constant frequency with an easily determinable number. this should be eax in leaf 16h. "should", because leaf 16h does not appear to be in the SDM, though from InstLatx64's Intel CPUID collection it's there and reasonable on everything Skylake (2015) and later.
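the arithmetic for the Intel side might look like this (register meanings are per the SDM as described above; the function and its fallback policy are my assumptions, not real code):

```rust
/// Compute the TSC frequency in Hz from CPUID leaf 15h, falling back on
/// leaf 16h's base-frequency field when the crystal frequency isn't
/// enumerated. Per the SDM: 15h eax/ebx are the crystal-to-TSC ratio
/// denominator/numerator, 15h ecx is the crystal frequency in Hz, and
/// 16h eax is the base ("max non-turbo") frequency in MHz.
fn tsc_hz(leaf15_eax: u32, leaf15_ebx: u32, leaf15_ecx: u32, leaf16_eax: u32) -> Option<u64> {
    if leaf15_eax != 0 && leaf15_ebx != 0 && leaf15_ecx != 0 {
        // crystal frequency and ratio both enumerated: exact answer
        return Some(leaf15_ecx as u64 * leaf15_ebx as u64 / leaf15_eax as u64);
    }
    // crystal (or ratio) not enumerated: the TSC ticks at the max
    // non-turbo frequency, so leaf 16h's base frequency is the best guess
    (leaf16_eax != 0).then(|| leaf16_eax as u64 * 1_000_000)
}
```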

AMD

AMD doesn't enumerate CPUID leaves 15h or 16h. the TSC frequency on AMD is less easily found from the APM, but is somewhat simpler… kinda. the APM's rdtsc entry tells you to consult the BIOS and Kernel Developer's Guide for your processor to learn about the effect of power management on the TSC. but the last BKDG was for Bulldozer-Excavator (family 15h), and in Zen it's all Processor Programming Reference all day.

for family 17h model 31h, for example:

• Core::X86::Msr::TSC; the TSC increments at the rate specified by the P0 Pstate. See Core::X86::Msr::PStateDef.

there isn't a CPUID bit to say that this is exactly what the TSC ticks at, but back in the APM there are more general statements that substantiate the TSC ticking at P0 generally.

on mwaitx there's:

When set, EBX contains the maximum wait time expressed in Software P0 clocks, the same clocks counted by the TSC

and on "Time-Stamp Counter" there's:

The TSC in a 1-GHz processor counts for almost 600 years before it wraps.

which suggests the TSC in an X-GHz processor ticks at or close to X (1GHz tick rate gives you 584.9 years to overflow)

finally, the Performance Timestamp Counter section mentions "The PTSC can be correlated to the architectural TSC that runs at the P0 frequency".

OK. P0 frequency. AMD reports P-states in MSRs [C001_0064,C001_006B] aka "PStateDef". CpuDfsId tells you the clock divider to use when calculating the frequency for a P-state, P0 usually has this set to 08h for an 8/8 ratio. CpuFid picks a core frequency. unfortunately, the format of PStateDef changed when going from Bulldozer to Zen. the field names are the same, but their meanings and bit locations are a bit different.

CpuFid

on Zen and later this is in units of 25MHz (from the handful of public PPRs that are available). on Bulldozer (2011) through Excavator ("family 15h") it is in units of 100MHz.

on Zen it is bits [0..7] of the first PStateDef register, so on my 7950x that's 0x8c for a 3500MHz P0 frequency. reasonable number.
on Bulldozer it is bits [0..5] of… one of the PStateDef registers. before Zen the boosted frequencies were architecturally-visible P-state definitions, so the first few definitions are faster than P0. then 0x10 is added to CpuFid here for a real frequency (in 100MHz units). InstLatx64 has MSR readings in the CPUID dumps as well, and looking at a Piledriver case this seems about right: 0x16 as the low six bits, plus 0x10, gives 38 * 100MHz for a base frequency. the CPU this was captured from was an AMD A10-5800K with base clock advertised at 3.8GHz, so that seems right.

family 15h and "max non-boosted frequency"

the way Bulldozer-through-Excavator cores declared which P-state is P0 is by declaring the number of P-states that reflect a boosted state, and P0 is the next one after that. that number of boosted states is part of a register in one of the FCH PCI devices: D18F4x15c[NumBoostStates], which is "device 18h, function 4, register 0x15c". i don't blame you if you want to pretend that AMD before Zen doesn't exist. but if this reads "2", the first two PStateDef MSRs are boosted states, so PStateDef[2] aka 0xc0010066 would be the definition for P0 where the P0 frequency can be determined.

CpuDfsId

on Zen this is bits [8..13], but again is almost certainly 8 for the P0 definition

on Bulldozer this is bits [6..8] where 0 means "divide by 1", and seems to be in line with the checking i did against a few InstLatx64 dumps.
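the decode described above can be sketched as follows. this is my reading of the public PPRs/BKDG as summarized here, not code validated against real hardware:

```rust
/// Zen and later: CpuFid is bits [7:0] in 25 MHz units, CpuDfsId is
/// bits [13:8] as a divider in eighths (8 == divide-by-1), so the core
/// frequency is fid * 25 MHz * (8 / dfs), i.e. fid * 200 / dfs in MHz.
fn zen_pstate_mhz(pstatedef: u64) -> Option<u64> {
    let fid = pstatedef & 0xff;
    let dfs = (pstatedef >> 8) & 0x3f;
    if dfs == 0 {
        // a zero divider field means the definition is invalid
        return None;
    }
    Some(fid * 200 / dfs)
}

/// Bulldozer through Excavator (family 15h): CpuFid is bits [5:0], the
/// frequency is (CpuFid + 0x10) * 100 MHz, and CpuDid in bits [8:6]
/// divides by 2^did (0 == divide-by-1).
fn family15h_pstate_mhz(pstatedef: u64) -> u64 {
    let fid = pstatedef & 0x3f;
    let did = (pstatedef >> 6) & 0x7;
    ((fid + 0x10) * 100) >> did
}
```

with the 7950x example above (CpuFid = 0x8c, CpuDfsId = 8), `zen_pstate_mhz` gives 3500 MHz, and with the Piledriver example (CpuFid = 0x16), `family15h_pstate_mhz` gives 3800 MHz.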

timers

note that TSC readings don't produce interrupts so if you don't want to spin as part of waiting on timers you'll want to use the LAPIC and that is safe to do without Nonsense on anything that has CPUID leaf 6 eax[2] set ("ARAT"). on AMD this seems to be Zen and later but special shouts to Excavator mk2 APUs in Bristol Ridge that seemed to get it too. on Intel this feature is present in ... Clarkdale (2011) maybe, but definitely Sandy Bridge and later.

this only matters if you're doing C-states and might actually stop the CPU with mwait or however ACPI asks you to, where the LAPIC might stop ticking when the core powers off lol

so wtf about the TSC right

my advice would be that you could have codepaths for blessed CPUs whose behavior you can find docs for (hopefully stuff i've linked above more or less) that let you skip TSC calibration, or use TSC calibration as a cross-check that you've read the bits right and the math checks out. that gives you a way to work decently in a VM too, where you can check for CPUID leaf 1h ecx[31] ("hypervisor") to see if the "TSC calibration" is doomed by being in a guest anyway. then if you can't calibrate the TSC and the processor isn't one you understand, that's a good time to give up and use the PIT or something

(though all the virt extensions let you scale guest clock speeds, so if you don't calibrate the TSC and your VMM fiddles with the TSC ratio you'll end up with Bad Behavior. that's more for migrating between physical processors that have different TSC rates though, so it should generally be more "for good, guest doesn't care" than "for evil, guest will self-immolate now")
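putting that advice together, the startup decision might look something like this (all names are invented; `calibration_ok` stands in for whatever PIT cross-check ends up existing):

```rust
/// Which clocksource the kernel ends up with. A sketch of the strategy
/// described above, not real mycelium types.
#[derive(Debug, PartialEq)]
enum ClockSource {
    /// invariant TSC, frequency read straight out of CPUID/MSRs
    TscFromCpuid,
    /// invariant TSC, frequency measured against the PIT
    TscCalibrated,
    /// the 10 ms PIT tick counter we have today
    ClockIdiotic,
}

fn choose_clocksource(tsc_invariant: bool, freq_from_cpuid: bool, calibration_ok: bool) -> ClockSource {
    if tsc_invariant && freq_from_cpuid {
        // blessed CPU: skip calibration (or run it anyway as a cross-check)
        return ClockSource::TscFromCpuid;
    }
    if tsc_invariant && calibration_ok {
        // invariant TSC but no frequency info: measure it against the PIT
        return ClockSource::TscCalibrated;
    }
    // prehistoric or severely fucked hardware: the worst thing ever
    ClockSource::ClockIdiotic
}
```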


hawkw commented Dec 27, 2024

thanks @iximeow, this is really lovely! really appreciate all the research.

unfortunately i don't have a collection of "every x86 CPU made since 2007" to test against, but i'm gonna see how much i can get working...


hawkw commented Dec 28, 2024

poking around a little, it appears that Linux does attempt to use CPUID to determine the TSC frequency, but will just immediately give up on AMD CPUs:

if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
        return 0;

i suspect that this probably either means that finding the appropriate AMD documentation was more annoying, or whoever contributed this to linux just didn't care. or both.

Also interesting is that on Denverton (Atom C3xxxx) SoCs, the CPUID leaves are missing and linux just hardcodes something, which we could borrow i guess.


hawkw commented Dec 28, 2024

Some interesting learnings about QEMU behaviors:

  • QEMU seems not to provide the hypervisor info leaf (0x40000010) that's supposed to be used to provide the guest with the TSC and local APIC bus frequencies
  • AFAIK, QEMU seems not to set TscInvariant, whether using a generic CPU (-cpu qemu64), the host CPU via KVM (-cpu host -machine accel=kvm), or a random real CPU (in this case, -cpu 'EPYC-Milan-v2').

these behaviors seem to be the case regardless of whether QEMU is KVM-accelerated or fully emulating the guest. it's possible there's flags i should be passing to make it give me something useful


hawkw commented Dec 29, 2024

for those of you following along at home (or, for Future Elizas, when i forget this), the way you get QEMU to pass through the host CPU's InvariantTsc flag when running with KVM is, apparently:

$ qemu-system-x86_64 -machine accel=kvm -cpu host,migratable=no,+invtsc

as per https://wiki.qemu.org/ChangeLog/2.1:

KVM
New "invtsc" (Invariant TSC) CPU feature. When enabled, this will block migration and savevm, so it is not enabled by default on any CPU model. To enable invtsc, the migratable=no flag (supported only by -cpu host, by now) is required. So, invtsc is available only if using: -cpu host,migratable=no,+invtsc.
