Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kern: rework ARMv8-M context switch, fixing a bug? #1963

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cbiffle
Copy link
Collaborator

@cbiffle cbiffle commented Dec 26, 2024

I was playing with reducing ARMv8-M context switch time, by looking into the same sort of changes to MPU loading that I recently did on v6/7.

I noticed that we only ever bitwise-OR values into the MAIR registers. In other words, the first task that gets activated loads its exact attributes into MAIR for each region; the second task combines its attributes using bitwise OR; and so on until the registers contain the bitwise OR of all regions in all tasks. (Note that these are the cacheability/sharability attributes, not the access permissions, which are set correctly. The main implication of this would be accidentally caching device memory or breaking DMA.)

That seemed bad, so I removed it and reworked the MPU loading loop. Now we build up the MAIR contents in a local before writing them -- no more read-modify-write.

Halting the processor in task code, I was able to observe a meaningful distinction in MAIR contents: previously tasks were unable to set memory as device (which involves more 1 bits than not-device), and now they can. The fact that this hasn't bitten us speaks to the relative simplicity of our current ARMv8-M parts!

I don't have detailed measurements of the code, but this knocks 32 bytes off the routine, which is likely a performance win too.

@cbiffle cbiffle requested a review from labbott December 26, 2024 04:24
I was playing with reducing ARMv8-M context switch time, by looking into
the same sort of changes to MPU loading that I recently did on v6/7.

I noticed that we only ever bitwise-OR values into the MAIR registers.
In other words, the first task that gets activated loads its exact
attributes into MAIR for each region; the _second_ task combines its
attributes using bitwise OR; and so on until the registers contain the
bitwise OR of all regions in all tasks. (Note that these are the
cacheability/sharability attributes, _not_ the access permissions, which
are set correctly. The main implication of this would be accidentally
caching device memory or breaking DMA.)

That seemed bad, so I removed it and reworked the MPU loading loop. Now
we build up the MAIR contents in a local before writing them -- no more
read-modify-write.

Halting the processor in task code, I was able to observe a meaningful
distinction in MAIR contents: previously tasks were unable to set memory
as device (which involves more 1 bits than not-device), and now they
can. The fact that this hasn't bitten us speaks to the relative
simplicity of our current ARMv8-M parts!

I don't have detailed measurements of the code, but this knocks 32 bytes
off the routine, which is likely a performance win too.
@cbiffle cbiffle force-pushed the cbiffle/armv8m-mair-bug branch from da02fc1 to 69bebe8 Compare December 26, 2024 04:27
// the same index as the region. This lets us treat MAIR as an array
// corresponding to the regions.
//
// We unfortunately can't do this at compile time, because regions can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: This seems to call for a TaskDescExt kind of thing, possibly stored ouf-of-band in a separate array, which could then contain the MAIR0 and MAIR1 values precomputed.

Right now, the mair field for each region takes up 4 bytes (due to padding) while storing only a single byte of data. With an unrealistic assumption that each task only uses, on average, two unique regions that already means that the mair fields add 8 bytes of data to the flash. I assume something like 4-5 unique regions might be a more reasonable average, meaning 16-20 bytes are added for each task to, eventually, store only the 4-5 bytes of data. Moving this data to be task-specific would duplicate data but still lead to a reduction in flash size due to the removal of the padding.

Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it deserves a review from @labbott or someone else who knows the deep Cortex-M lore, but I was happy to give it a look anyway. Commented on a few things, but nothing major --- I'll leave that to someone who actually knows more about what these registers do :)

Comment on lines +499 to +500
/// the hardware, which improves code generation. We do not actually rely on the
/// in-memory representation of this struct otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for documenting that!

Comment on lines +507 to +509
/// This is the contents of the RLAR register with the enable bit set. We
/// can write it with the enable bit set directly, since we disable the MPU
/// before doing so.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky, feel free to ignore me: this comment kinda feels like it's written in dialogue with the comment that was there previously, which stated that we set the enable bit separately, and now it's saying that we don't do that any more. Which is great, but...it feels like there's some missing context around how the MPU works on v8-M. Is the context here that the enable bit in RLAR cannot be set while the MPU is enabled? It could be nice to say that a little more explicitly.

Edit: Oh, I see that this is discussed later on in a comment in compute_region_extension_data. That rationale makes sense...but, I feel like it might be worth moving some of that discussion here? Otherwise, this comment doesn't stand up that well on its own, and it's the one a reader is likelier to encounter first.

This is, admittedly, a nitpick, since the rationale is clearly documented, just...a little bit later.

Comment on lines +608 to +609
mpu.mair[0].write(u32::from_le_bytes(mairs[..4].try_into().unwrap()));
mpu.mair[1].write(u32::from_le_bytes(mairs[4..].try_into().unwrap()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, the try_into() is a bit of a bummer here, since these are always going to be fixed-size arrays, and it would be nice to be able to reason about that statically and avoid checking that they actually are 4 bytes at runtime. but, i guess the alternative is to declare two 4-byte arrays and make the for loop above do some extra logic to reason about which one it's in, which also feels kind of sad. i dunno...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I don't love unwrap but I also don't have great suggestions.

Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this

Comment on lines +608 to +609
mpu.mair[0].write(u32::from_le_bytes(mairs[..4].try_into().unwrap()));
mpu.mair[1].write(u32::from_le_bytes(mairs[4..].try_into().unwrap()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I don't love unwrap but I also don't have great suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants