
Another batch of syscall perf enhancements #1964

Open · wants to merge 5 commits into master from cbiffle/more-syscall-perf
Conversation

@cbiffle (Collaborator) commented Dec 26, 2024

This is unspooling more commits from my performance fix branch in my fork. As with the previous performance PR, this is a collection of semi-related commits that may be easier to review commit-by-commit than as a big diff. Boy I miss Gerrit.

This is only a ~10% overall improvement because most of the truly low-hanging fruit has been eaten. But this significantly improves the performance of kernel memory copies and SEND with an empty message (unusual but not unheard of). This also knocks a few hundred bytes off a typical kernel by removing some panic sites.

This was generating code for the `..copy_len` slicing _and_ the equality
check implied by `copy_from_slice`. Now it just does the latter.

This only takes about 3% off `safe_copy`, but produces smaller binaries by
removing an additional panic site.
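A minimal sketch of the pattern, using made-up helper names rather than the actual kernel code: trim both slices to their common length where they are produced, so the copy site needs no further slicing and only `copy_from_slice`'s single length-equality check remains.

```rust
/// Illustrative helper (not the actual kernel code): shorten `a` and `b` to
/// their common length so callers can copy without slicing again.
fn common_prefix<'s>(a: &'s mut [u8], b: &'s [u8]) -> (&'s mut [u8], &'s [u8]) {
    let n = a.len().min(b.len());
    (&mut a[..n], &b[..n])
}

fn copy_between(dst: &mut [u8], src: &[u8]) -> usize {
    // Before: `dst[..copy_len].copy_from_slice(&src[..copy_len])` emitted
    // checks for the slicing *and* copy_from_slice's own length-equality
    // check.
    let (dst, src) = common_prefix(dst, src);
    // After: the slices already have equal length, so only the single check
    // inside copy_from_slice remains.
    dst.copy_from_slice(src);
    dst.len()
}
```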
I had gone to some lengths in the original implementation to only use
safe code, and in general it worked pretty well. However, it was
generating more panic sites (and checks) than I'd like.

This new version is shorter, clearer (IMO), and generates somewhat
faster code.

By shortening the various routines involved, this reduces the mean cost
of `safe_copy` during IPC syscalls by about 3%.
Empty copies are somewhat unusual, but appear in IPC interfaces that
follow certain patterns, or in lease accesses that hit the end of the lease.
Previously they were expensive nops. This makes them cheap nops.

This reduces the cost of a zero-byte send by 24% (to about 928 cycles on
a Cortex-M4, if you're curious).
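As a rough sketch (the function name and shape below are hypothetical, not the kernel's actual `safe_copy` signature), the fast path amounts to bailing out before any per-region access checks when the transfer length is zero:

```rust
/// Hypothetical shape of the fast path; the real kernel routine works on
/// task memory regions rather than plain slices.
fn copy_len_between(from: &[u8], to: &mut [u8]) -> usize {
    let n = from.len().min(to.len());
    if n == 0 {
        // Zero-byte transfers show up in real IPC patterns; return before
        // doing any of the comparatively expensive access checks or copying.
        return 0;
    }
    // ... access checks and the actual copy happen here ...
    to[..n].copy_from_slice(&from[..n]);
    n
}
```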
@cbiffle requested a review from hawkw on December 26, 2024 04:39
`TaskState` is a nontrivial enum. It turns out we were comparing it often
enough that the compiler thought it'd be good to out-line the comparison
into a callable function.

That function is big and slow because it covers every case. Being
out-lined, it has to be generic and can't use any context to optimize.

So, I've gone through and replaced all `TaskState` comparisons with
pattern matching. This knocks about 256 bytes off a typical kernel
and speeds up syscall-heavy code by 3% on average.
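Here is a minimal illustration of the idea, using a toy enum rather than the kernel's actual `TaskState` definition: the derived `PartialEq` has to cover every variant pair and tends to become a shared out-of-line routine, while matching on just the variant of interest can be specialized in place.

```rust
#[derive(PartialEq, Eq)]
enum Sched {
    Stopped,
    InSend(u16),
}

#[derive(PartialEq, Eq)]
enum State {
    Healthy(Sched),
    Faulted { code: u32 },
}

impl State {
    /// Comparison via `==` routes through the derived PartialEq, which must
    /// handle every combination of variants.
    fn is_sending_to_eq(&self, target: u16) -> bool {
        self == &State::Healthy(Sched::InSend(target))
    }

    /// Matching only the variant we care about lets the compiler generate a
    /// small, specialized check at each call site.
    fn is_sending_to_match(&self, target: u16) -> bool {
        matches!(self, State::Healthy(Sched::InSend(t)) if *t == target)
    }
}
```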
The missing `#[must_use]` was just an oversight -- access-check functions
that return flags, instead of faulting or returning `Result`, should
always be `must_use`.

`#[inline(always)]` turns out to significantly improve the performance of
`Task::can_access` by specializing the generated code for the predicate it
uses.
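A sketch of the two attributes together, on stand-in types rather than the real `Task::can_access` signature:

```rust
/// Stand-in for the kernel's task type; purely illustrative.
struct Task {
    region_attrs: u32,
}

impl Task {
    /// Access-check style function: it reports flags rather than faulting or
    /// returning a Result, so silently dropping the result would skip a
    /// check -- hence `must_use`.
    #[must_use]
    #[inline(always)]
    fn can_access(&self, wanted: u32) -> u32 {
        // With inline(always), each call site gets a copy of this body
        // specialized for its particular `wanted` predicate, instead of a
        // generic out-of-line check.
        self.region_attrs & wanted
    }
}
```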
@cbiffle force-pushed the cbiffle/more-syscall-perf branch from aa584cb to af2f519 on December 26, 2024 04:40
@hawkw (Member) left a comment:
very nice! nothing too complicated in this one.

Comment on lines +177 to +179
/// Shortening `a` and `b` ensures that the slices returned from `try_read`
/// / `try_write` have the same length, without the need for further
/// slicing.
Neat! Love to see panic sites disappear!

@@ -252,7 +252,10 @@ impl TaskState {
     /// Checks if a task in this state is trying to deliver a message to
     /// `target`.
     pub fn is_sending_to(&self, target: TaskId) -> bool {
-        self == &TaskState::Healthy(SchedState::InSend(target))
+        match self {
+            TaskState::Healthy(SchedState::InSend(t)) => *t == target,
Using `match` instead of generating whole `PartialEq`/`Eq` implementations is a smart move when optimizing for binary size; I'll have to remember that.
