Expose instance auto-restart status in the console #2469

Open
hawkw opened this issue Sep 24, 2024 · 7 comments

@hawkw
Member

hawkw commented Sep 24, 2024

PR oxidecomputer/omicron#6503 implemented automatic restarts of instances in the Failed state. This change introduced some additional instance state that should be exposed to users. In particular:

  • When a Failed instance is automatically restarted, a cooldown timer is started for that instance. If that instance fails again while the cooldown period is still active, it will not be automatically restarted again until the cooldown period has elapsed.
  • Some instances may be configured with auto-restart policies that do not permit them to be restarted when they are Failed.

New fields were added to the external-API instance message to report state related to automatic restarts. Instances now have an auto_restart_enabled: boolean field that indicates if their auto-restart policy permits restarting the instance, and an auto_restart_cooldown_expiration: string representing the date and time at which the cooldown period will have completed (allowing the instance to be restarted again). See: https://github.com/oxidecomputer/omicron/blob/45813be40b62167eff75333c410515e8bee24211/openapi/nexus.json#L15094-L15104

This data should probably be exposed to users: if an instance is in the Failed state, the user will want to know why it has not yet been automatically restarted, whether it will ever be automatically restarted, and if it will, when that will happen. We probably only need to display this information for instances which are Failed. If a Failed instance has auto_restart_enabled set to false, we should tell the user that auto-restart is disabled for that instance. Otherwise, if there is an auto_restart_cooldown_expiration timestamp, we should tell the user that the instance will be restarted only after that time. If auto_restart_enabled is not false and there is no auto_restart_cooldown_expiration timestamp, then the instance will be automatically restarted --- we might want to indicate that as well.
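A minimal sketch of the decision logic in the paragraph above. Only `auto_restart_enabled` and `auto_restart_cooldown_expiration` come from the API description; the `run_state` field name, the `"failed"` value, the type and function names, and the message strings are assumptions made for illustration.

```typescript
// Hypothetical shape: only the two auto_restart_* fields are taken from the
// issue text. `run_state` and its "failed" value are assumed.
type InstanceAutoRestart = {
  run_state: string;
  auto_restart_enabled: boolean;
  auto_restart_cooldown_expiration?: string | null; // date-time string
};

// Returns a user-facing message, or null when there is nothing to show.
function autoRestartMessage(instance: InstanceAutoRestart): string | null {
  // The information is only relevant for Failed instances.
  if (instance.run_state !== "failed") return null;

  // The policy does not permit restarting this instance.
  if (!instance.auto_restart_enabled) {
    return "Auto-restart is disabled for this instance";
  }

  // A cooldown timestamp is present: restart happens only after that time.
  const expiration = instance.auto_restart_cooldown_expiration;
  if (expiration) {
    return `Will be restarted after ${expiration}`;
  }

  // Enabled and no cooldown: the instance will be restarted automatically.
  return "Will be restarted automatically";
}
```

This sketch does not compare the timestamp against the current time; whether the API still reports a cooldown expiration that has already passed is not stated here, so a real implementation may need that check.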

@askfongjojo askfongjojo added this to the 13 milestone Dec 14, 2024
@askfongjojo

Tagged this for v13 as this is not gaining the visibility it deserves.

@benjaminleonard
Contributor

We can probably slip this into the properties table state and perhaps the instance list table too if we figure out an elegant popover.

I need to wrap my head around the state flow of that a little – thank you for your documentation on this!

Regarding the policy itself – we should also be adding the ability to manage auto_restart_policy to:

a. Instance create form (we can probably tuck into the advanced accordion)
b. Instance view page – perhaps a settings tab, with the idea other items might eventually be there also

@benjaminleonard
Contributor

Few initial questions @hawkw

  1. You had mentioned we might want to show that an instance is starting as a result of auto-restart. Is there a way to discern that from the current API? E.g. auto_restart_cooldown_expiration is present and instance is Starting
  2. Does the cooldown reset each time the instance state is Failed?
  3. Do you anticipate a significant time between the cooldown expiration and attempting to start the instance? Does it get queued?
  4. Do we anticipate ever needing to bubble up some logs as to why an instance has failed? I suppose there's some complicated stuff there around permissions; is it likely to just be a system issue, or can a dodgy image / bad configuration of something cause it?

@benjaminleonard
Contributor

Accidentally closed!

@hawkw
Member Author

hawkw commented Dec 19, 2024

  • You had mentioned we might want to show that an instance is starting as a result of auto-restart. Is there a way to discern that from the current API? E.g. auto_restart_cooldown_expiration is present and instance is Starting

The control plane internally tracks why an instance is being started in the instance_start saga, but that information isn't currently stored in the database outside of the saga, and it's not exposed in the API for viewing instance states, so I don't think you currently have any way to determine that. Wiring that through probably won't require too much additional work, but we've not done it yet.

I'd definitely like to get that into the console (and CLI etc) eventually, but I'd file it under "future work" for now.

  • Does the cooldown reset each time the instance state is Failed?

The cooldown period starts when an instance is automatically restarted. It does not reset if the instance is automatically restarted and then fails again: the intention behind the cooldown is primarily to reduce the impact on the rest of the system when an instance crashes every time it's restarted.

We restart the instance immediately the first time it crashes. To keep an instance from restarting and then immediately crashing again in a hot loop, potentially impacting other instances, we start tracking the cooldown period once we restart it. If the instance fails again before the cooldown has elapsed, it will be restarted once the cooldown has elapsed. If it fails after the cooldown has elapsed, it will be restarted immediately. Each automatic restart begins a new cooldown period. This way, we will not immediately restart the instance multiple times in short succession, but if it fails today and then fails again a few weeks later, it will be automatically restarted immediately both times.
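The timing rule described above can be sketched as a small function. This is an illustration, not the control plane's implementation: the function name, its parameters, and the cooldown duration passed in are all hypothetical. It computes when a failed instance becomes eligible for restart, not the moment the restart actually runs.

```typescript
// Hypothetical helper illustrating the cooldown rule; none of these names
// come from the control plane.
function earliestAutoRestart(
  failedAt: Date,
  lastAutoRestartAt: Date | null,
  cooldownMs: number,
): Date {
  // Never auto-restarted before: eligible as soon as it fails.
  if (lastAutoRestartAt === null) return failedAt;

  // The cooldown is measured from the last automatic restart.
  const cooldownEnds = lastAutoRestartAt.getTime() + cooldownMs;

  // Failed during the cooldown: wait until the cooldown ends.
  // Failed after the cooldown: eligible immediately (at failure time).
  return new Date(Math.max(failedAt.getTime(), cooldownEnds));
}
```

For example, with an assumed one-hour cooldown, an instance that fails ten minutes after an automatic restart becomes eligible fifty minutes later, while one that fails weeks later is eligible at once.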

  • Do you anticipate a significant time between the cooldown expiration and attempting to start the instance? Does it get queued?

It shouldn't take too long. I believe the task responsible for automatically restarting failed instances will run about once every minute, and there's some internal bookkeeping that must be done before an instance can be restarted in order to clean up any resources left behind by its past incarnation. So, there's some delay. I would generally expect a Failed instance that's eligible to be restarted to transition back to Starting within 2-5 minutes of it going to Failed.

  • Do we anticipate ever needing to bubble up some logs as to why an instance has failed? I suppose there's some complicated stuff there around permissions; is it likely to just be a system issue, or can a dodgy image / bad configuration of something cause it?

I think this falls under the purview of the ongoing fault management work --- we'll definitely want to generate more detailed reports of why an instance has failed within FMA. At present, the control plane doesn't really know anything about why an instance has failed.

@benjaminleonard benjaminleonard self-assigned this Jan 7, 2025
@benjaminleonard
Contributor

Proposed design treatment for auto-restart status in the console:

In cases where auto-restart is relevant we show the auto-restart icon alongside the state badge. Having it inline here means we can also show it on the instances table list.

Design States

1. Failed State with Auto-restart Enabled

Image

  • Shows auto-restart icon + state badge
  • Displays policy details from API (showing "Default" when null)
  • Dynamic state handling:
    • Countdown timer + "Waiting" when in cooldown
    • Spinner when cooldown passed but not yet starting
  • Includes link to documentation
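One way the dynamic state handling in this list could be derived from `auto_restart_cooldown_expiration` is sketched below. The `CooldownDisplay` type, the `cooldownDisplay` function, and the label format are hypothetical, and the sketch assumes the timestamp is a date-time string that `Date` can parse.

```typescript
// Hypothetical view model for the popover: either a countdown or a spinner.
type CooldownDisplay =
  | { kind: "waiting"; label: string } // cooldown active: show countdown
  | { kind: "pending" };               // cooldown passed: show spinner

function cooldownDisplay(expiration: string, now: Date): CooldownDisplay {
  const remainingMs = new Date(expiration).getTime() - now.getTime();

  // Covers an elapsed cooldown and an unparsable timestamp (NaN).
  if (!(remainingMs > 0)) return { kind: "pending" };

  // Round up so the countdown never shows 0m 00s while still waiting.
  const totalSeconds = Math.ceil(remainingMs / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  const paddedSeconds = seconds < 10 ? `0${seconds}` : `${seconds}`;
  return { kind: "waiting", label: `Waiting ${minutes}m ${paddedSeconds}s` };
}
```

Passing `now` in explicitly keeps the function pure, so a component can recompute it on a timer tick and tests can use a fixed clock.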

2. Failed State with Auto-restart Disabled

Image

  • Similar to enabled state but with N/A status
  • Consider: Should this state be conditionally rendered?

3. Starting State

Image

  • Pending API support?

4. Failed State (Restart Queued)

Image

  • Indicates when cooldown has passed
  • Shows "Queued for restart" status
  • Includes direct link to policy settings
  • Helps users understand transition state

5. Edit Policy View

Image

  • Auto-restart policy management
  • Also shows when the instance will restart, and the inherited auto-restart behavior if the policy is null
  • Under "Settings" as I expect we'll have more items to go in here, better than bloating the tab bar too much

Alternative Consideration

We could surface auto-restart state outside the popover. That would help with viewing it at a glance, but it would not be present on the instance list view.

Image

cc: @david-crespo @hawkw @charliepark

@david-crespo
Collaborator

Overall, looks good to me. I think we might want to tweak some details of the wording to be closer to the API language to make it easier to match up to the API docs, for example using the word "cooldown" somewhere when we're in the waiting state. For state 3, when the instance is in the starting state after an auto-restart, I'm not sure we have a way to distinguish that from a regular start. So we may want to cut that one, i.e., starting is just starting.
