-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kern: rework ARMv8-M context switch, fixing a bug? #1963
base: master
Are you sure you want to change the base?
Conversation
I was playing with reducing ARMv8-M context switch time, by looking into the same sort of changes to MPU loading that I recently did on v6/7. I noticed that we only ever bitwise-OR values into the MAIR registers. In other words, the first task that gets activated loads its exact attributes into MAIR for each region; the _second_ task combines its attributes using bitwise OR; and so on until the registers contain the bitwise OR of all regions in all tasks. (Note that these are the cacheability/sharability attributes, _not_ the access permissions, which are set correctly. The main implication of this would be accidentally caching device memory or breaking DMA.) That seemed bad, so I removed it and reworked the MPU loading loop. Now we build up the MAIR contents in a local before writing them -- no more read-modify-write. Halting the processor in task code, I was able to observe a meaningful distinction in MAIR contents: previously tasks were unable to set memory as device (which involves more 1 bits than not-device), and now they can. The fact that this hasn't bitten us speaks to the relative simplicity of our current ARMv8-M parts! I don't have detailed measurements of the code, but this knocks 32 bytes off the routine, which is likely a performance win too.
da02fc1
to
69bebe8
Compare
// the same index as the region. This lets us treat MAIR as an array | ||
// corresponding to the regions. | ||
// | ||
// We unfortunately can't do this at compile time, because regions can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: This seems to call for a TaskDescExt
kind of thing, possibly stored ouf-of-band in a separate array, which could then contain the MAIR0 and MAIR1 values precomputed.
Right now, the mair
field for each region takes up 4 bytes (due to padding) while storing only a single byte of data. With an unrealistic assumption that each task only uses, on average, two unique regions that already means that the mair fields add 8 bytes of data to the flash. I assume something like 4-5 unique regions might be a more reasonable average, meaning 16-20 bytes are added for each task to, eventually, store only the 4-5 bytes of data. Moving this data to be task-specific would duplicate data but still lead to a reduction in flash size due to the removal of the padding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it deserves a review from @labbott or someone else who knows the deep Cortex-M lore, but I was happy to give it a look anyway. Commented on a few things, but nothing major --- I'll leave that to someone who actually knows more about what these registers do :)
/// the hardware, which improves code generation. We do not actually rely on the | ||
/// in-memory representation of this struct otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for documenting that!
/// This is the contents of the RLAR register with the enable bit set. We | ||
/// can write it with the enable bit set directly, since we disable the MPU | ||
/// before doing so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicky, feel free to ignore me: this comment kinda feels like it's written in dialogue with the comment that was there previously, which stated that we set the enable bit separately, and now it's saying that we don't do that any more. Which is great, but...it feels like there's some missing context around how the MPU works on v8-M. Is the context here that the enable bit in RLAR cannot be set while the MPU is enabled? It could be nice to say that a little more explicitly.
Edit: Oh, I see that this is discussed later on in a comment in compute_region_extension_data
. That rationale makes sense...but, I feel like it might be worth moving some of that discussion here? Otherwise, this comment doesn't stand up that well on its own, and it's the one a reader is likelier to encounter first.
This is, admittedly, a nitpick, since the rationale is clearly documented, just...a little bit later.
mpu.mair[0].write(u32::from_le_bytes(mairs[..4].try_into().unwrap())); | ||
mpu.mair[1].write(u32::from_le_bytes(mairs[4..].try_into().unwrap())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, the try_into()
is a bit of a bummer here, since these are always going to be fixed-size arrays, and it would be nice to be able to reason about that statically and avoid checking that they actually are 4 bytes at runtime. but, i guess the alternative is to declare two 4-byte arrays and make the for
loop above do some extra logic to reason about which one it's in, which also feels kind of sad. i dunno...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I don't love unwrap
but I also don't have great suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for fixing this
mpu.mair[0].write(u32::from_le_bytes(mairs[..4].try_into().unwrap())); | ||
mpu.mair[1].write(u32::from_le_bytes(mairs[4..].try_into().unwrap())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I don't love unwrap
but I also don't have great suggestions.
I was playing with reducing ARMv8-M context switch time, by looking into the same sort of changes to MPU loading that I recently did on v6/7.
I noticed that we only ever bitwise-OR values into the MAIR registers. In other words, the first task that gets activated loads its exact attributes into MAIR for each region; the second task combines its attributes using bitwise OR; and so on until the registers contain the bitwise OR of all regions in all tasks. (Note that these are the cacheability/sharability attributes, not the access permissions, which are set correctly. The main implication of this would be accidentally caching device memory or breaking DMA.)
That seemed bad, so I removed it and reworked the MPU loading loop. Now we build up the MAIR contents in a local before writing them -- no more read-modify-write.
Halting the processor in task code, I was able to observe a meaningful distinction in MAIR contents: previously tasks were unable to set memory as device (which involves more 1 bits than not-device), and now they can. The fact that this hasn't bitten us speaks to the relative simplicity of our current ARMv8-M parts!
I don't have detailed measurements of the code, but this knocks 32 bytes off the routine, which is likely a performance win too.