Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stackmargin results are misleading #1872

Open
cbiffle opened this issue Sep 19, 2024 · 10 comments
Open

stackmargin results are misleading #1872

cbiffle opened this issue Sep 19, 2024 · 10 comments
Labels
affects-humility Will have debug interface impact and may require corresponding changes in the debugger 🤔 design Forward-looking discussions and proposals developer-experience Fixing this would have a positive impact on developer experience robustness Fixing this would improve robustness of deployed firmware

Comments

@cbiffle
Copy link
Collaborator

cbiffle commented Sep 19, 2024

(This could arguably be in the Humility repo, but I suspect that any changes we make to fix the problem will be in Hubris, so I'm filing it here; YOLO.)

tl;dr: The humility stackmargin results can lull you into a false sense of security. We should figure out how to fix this.

What stackmargin does

When a task is (re)initialized, we fill the stack memory with a recognizable pattern of bits. humility stackmargin scans the task's stack area from low addresses to high, and reports the first point where this recognizable pattern of bits is altered. This is roughly the highest watermark of used stack memory in this incarnation of the task.

What programs do

The issue is that programs don't completely write their stack frames. There are usually locals allocated in the stack that are only used on conditional paths. This is not theoretical -- we've been bitten by this in a couple of situations:

  • A function in a task allocates like 2048 bytes of stack. This bumps the value of the SP register outside of the task's allocated stack region. (On ARMv7-M the hardware does not detect this. On ARMv8-M it can.)
  • The function only uses locals in, say, the high 32 bytes of that region. This means there are no actual writes outside of the allocated stack region, so no memory management fault occurs.
  • This situation is actually fine... until an interrupt occurs with the SP out of range. At that point, the hardware notices that it's not able to deposit the exception frame on the stack, and the task is killed.
  • Because the upper bytes of the stack frame are not written, this is invisible to stackmargin.

This means that an "out of stack" condition can hide, unnoticed by either the kernel or stackmargin, for an arbitrary period of time until it happens to be noticed by an interrupt. (At that point, it remains invisible to stackmargin, since stackmargin doesn't consider crashed tasks.)

And so

So stackmargin right now is always an over-estimate. In some cases, it's an extreme over-estimate. This is bad, because we use stackmargin to check whether e.g. a compiler upgrade will cause problems by changing inlining decisions. Currently, we could observe a higher stackmargin but actually have a new stack overflow problem (if the leaf function in the deepest stack has a very large partially used frame).

We should figure out how to fix this.

I'd like to open the floor to brainstorming on this. Here are my initial thoughts.

  • Hardware stack limits are only a partial solution: The ARMv8-M stack limit registers are the right way to handle this "hidden overflow" condition, but --- counter-intuitively! --- don't actually improve stackmargin output. The SP can be bumped up to the stack limit, but if the final frame is only partially written, stackmargin will still insist we have plenty of stack free. (Hardware stack limits do at least limit us to not going below 0 stackmargin.)
  • Any kernel entry or interrupt already affects stack margin. Since a syscall or interrupt begins by writing the task's state to its stack, stackmargin will be able to see the deepest stack at the time of kernel entry. This implies that any method for measuring transient stack pointers that involves a kernel entry --- such as sampling SP from a timer interrupt --- will crash a task that has silently run out of stack, which is arguably good. However, note that we already take regular interrupts, and tasks have been able to overflow their stacks without taking one. We could crank up interrupt frequency to defend against this, but at the cost of CPU utilization and latency. (I don't like this class of approach.)
  • Toolchain support for stack probing and the like might help. Rust has existing support for emitting stack probes as stack frames get bigger, which prevents this class of problem on virtually addressed machines. These are not enabled by default on our targets, but we could potentially fix that. We'd need to look carefully at the stack probes --- if they're written by Big Computer People who assume paged memory, they may only probe the lowest address in each page, whereas we want them to probe the lowest address, period, and not round to any page boundary (since we don't have pages). This would produce a proactive crash like a stack limit register, but would not help stackmargin. If the stack probe could be made to do a write, however... that would get us exact results!
@cbiffle cbiffle added developer-experience Fixing this would have a positive impact on developer experience affects-humility Will have debug interface impact and may require corresponding changes in the debugger robustness Fixing this would improve robustness of deployed firmware 🤔 design Forward-looking discussions and proposals labels Sep 19, 2024
@cbiffle
Copy link
Collaborator Author

cbiffle commented Sep 19, 2024

In an ideal hardware world, there'd be a register that records the numerically-lowest (i.e. deepest stack) value of the SP as a high watermark, and the kernel could read that out and replace it on context switch. None of our target processors have such a feature.

Systems like FreeRTOS attempt to indicate a stack high water mark by recording SP values any time a task is preempted. As noted above on the bullet about kernel entry, we effectively already get this for free --- FreeRTOS's behavior mostly makes sense because it's a non-memory-protected system vulnerable to stack clash, so you could overflow a stack and still take an interrupt/syscall.

@aapoalas
Copy link
Contributor

aapoalas commented Sep 19, 2024

In an ideal hardware world, there'd be a register that records the numerically-lowest (i.e. deepest stack) value of the SP as a high watermark, and the kernel could read that out and replace it on context switch. None of our target processors have such a feature.

What about doing this manually? Create a similar kind of system as ringbuf, only it's sized to 4 bytes and tracing into it reads the current stack pointer, and does a compare and write if greater. Then, pepper this trace to any and all functions even remotely suspected of having a chance of exhausting the stack (both in userland and kernel land?). cfg out the traces and the "stackbuf" itself, or possibly setup the system itself to stamp out no code whatsoever if the stackbuf feature is not enabled.

Turn on the feature in lab builds. Done.

The downsides are obviously that placing these traces is manual and thus won't catch everything, and running the traces may be cheap but they're not free, and you may not enjoy the increase in testing time.

@cbiffle
Copy link
Collaborator Author

cbiffle commented Sep 19, 2024

The downsides are obviously that placing these traces is manual and thus won't catch everything, and running the traces may be cheap but they're not free, and you may not enjoy the increase in testing time.

Yeah, these feel like pretty significant drawbacks to me -- a protection mechanism that only works when people remember to copy-paste some code won't be reliable for long.

@aapoalas
Copy link
Contributor

Yeah, these feel like pretty significant drawbacks to me -- a protection mechanism that only works when people remember to copy-paste some code won't be reliable for long.

Yeah, that's true. If it turned out that doing the "tracing" was fast enough to not really affect performance then this might be somewhat mitigated by making this an attribute macro that is applied to all functions: It would be harder to forget a #[hubris_function] attribute from a function definition than it would be to forget a stackref!(); macro from the function body.

But it would still definitely be possible.

@cbrune
Copy link

cbrune commented Sep 19, 2024

Systems like FreeRTOS attempt to indicate a stack high water mark by recording SP values any time a task is preempted. As noted above on the bullet about kernel entry, we effectively already get this for free

Sampling the SP any time a task is preempted sounds interesting. The kernel could note the SP high water mark for the task. That accounting does not sound too costly. The humility stackmargin command could do what it does today, plus read out the SP high water mark for the task.

As you said the SP high water mark does not indicate that data was actually written, so there is a gap in knowledge there. However, the stackmargin command could also report that the reported SP high water mark exceeds the max stack value specified in the app.toml for the task which should give one pause to ponder.

@cbrune
Copy link

cbrune commented Sep 19, 2024

Perhaps the SP high water mark accounting could be a tracepoint that is a NO-OP by default, so it does not impact production. Humility could poke some bits and activate the accounting only when profiling.

@aapoalas
Copy link
Contributor

aapoalas commented Sep 19, 2024

So I ended up looking into and thinking about this a bit. As I see it, there are two main options:

  • Stack probing
  • Interrupt based sampling

Interrupt based sampling has been kind of well talked above, I feel: It's fairly easy to add on top of what Hubris already does, but has the downside of not detecting everything. So, stack probing!

Best as I can tell, LLVM probably doesn't natively support stack-probing for 32-bit ARM currently, or at least I was unable to find code in llvm-project that would point to that being a thing. Additionally, this patch is unreviewed and unmerged [1]. See also the closed rust-embedded issue by the same author, referencing this LLVM patch [2]. But on the other hand there exists this merged patch about changing the default max page size on ARM32 so I'm a bit confused: [3].

Still, since Hubris has no paging any max page size stuff is meaningless. Best as I can tell, you'd need to insert the stack probe call (or inline it) to the beginning of every call (unless you can statically prove it cannot overflow the stack). ... Would you also have different stack base address for each task? Or an otherwise statically knowable SP max value? That would make a stack probe easier to implement. Though that's a side point.

I have a feeling that getting stack probes to work "natively" will be an uphill battle. Overriding Rust's __rust_probestack might be possible somehow? But the question of how to get Rust to insert the probestack call to every function would remain. As a result, having a function attribute (eg. #[hubris_stackprobe]) might end up being the easiest way to actually do this, maybe even the only possible way.

P.S. This may also be interesting [4].

@cbrune
Copy link

cbrune commented Sep 20, 2024

Best as I can tell, you'd need to insert the stack probe call (or inline it) to the beginning of every call

That does seem to be the only way to get the information in the absence of hardware support (a hardware maintained SP high water mark register mentioned above). It would cost software, in both code footprint and execution cycles.

warning crazy talk ahead

For sake of argument, let's say the problem of plunking down some code in every function prologue is solved -- it's a tool in the toolbox.

As embedded folks know, having debug/profiling code conditionally included in images is problematic -- when you want it, it is not in the image you are running. You enable it, recompile and now the image is too big. Blah. It would be nice to have the debug code compiled in all the time (paying a code foot print penalty of course), but have it remain dormant or inactive (no runtime penalty) until activated.

Would it be possible to have "dtrace like" probes (nops and trampolines) in hubris that humility could activate for profiling?

Thinking about the stack probe problem and calling something in every function prologue (or perhaps only functions suitably attributed). The probe might be 4 bytes? 8 bytes? And the trampoline "real code" would be 100 bytes (wag), which would be common for the task, only one copy per task.

Clearly you wouldn't want too many latent probes since code footprint is often at a premium. Trade offs abound.

Just a thought.

On the general "investigating stack usage" topic I have used cargo-call-stack to investigate. While you cannot just throw a hubris task at it, I did take some troublesome sections of code from a project to make a toy example and feed that to cargo-call-stack. It did provide some insight, but was an awkward process.

@cbiffle
Copy link
Collaborator Author

cbiffle commented Dec 12, 2024

Would it be possible to have "dtrace like" probes (nops and trampolines) in hubris that humility could activate for profiling?

There are a bunch of cases where it'd be nice to have something like this, @cbrune, but there's a problem: unlike on Unix, our code is in ROM.

Certain ARM microcontrollers implement a "Flash Patch Unit" that can be used to interpose on flash instruction reads and substitute different values --- this was used for breakpoints and hooks. This facility is gradually going away on newer parts, however, because manufacturers have realized that people (hi) can use it to defeat their DRM and un-protect their ROMs. I always saw that as a feature, but, to each their own I suppose.

Without the ability to flash-patch, we can't really replace instructions in a task on Hubris. We could potentially use the breakpoint unit --- set a breakpoint at the location you want to intercept, and have Humility notice that and emulate the tracepoint. There are two implementation difficulties there:

  1. Humility has avoided having anything to do with breakpoints so far, and isn't really structured to run in the background while you do other things.
  2. The breakpoint units on these parts have a limited number of breakpoints available (typically 4-8). So that sets a maximum number of tracepoints you could have across the entire application.

If anyone sees other ways of implementing this sort of thing, I'm definitely here for it. I may have missed opportunities.

@cbiffle
Copy link
Collaborator Author

cbiffle commented Dec 12, 2024

Sampling the SP any time a task is preempted sounds interesting. The kernel could note the SP high water mark for the task. That accounting does not sound too costly.

That accounting would be suuuuper cheap, actually. It'd be lost in the noise relative to the current syscall entry/exit sequences. Roughly three machine cycles, I think? I'd need to implement it.

But I think something like this would be worth doing, personally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-humility Will have debug interface impact and may require corresponding changes in the debugger 🤔 design Forward-looking discussions and proposals developer-experience Fixing this would have a positive impact on developer experience robustness Fixing this would improve robustness of deployed firmware
Projects
None yet
Development

No branches or pull requests

3 participants